Lecture Notes on Sources of Unwanted Variability

LECTURE 2A: SOURCES OF UNWANTED VARIABILITY

SLIDE 1

Welcome to lecture block two in our online lecture series. In block two, we are going to talk about how to design and implement an expression array experiment so as to identify the genetic underpinnings of human disease.

SLIDE 2

The first thing we need to discuss, before we get into the details of how to design an array experiment, is how to eliminate sources of unwanted variability in expression profiles. You can imagine this as noise in the system that we are trying to rise above so that we can identify disease-specific correlates. There are three primary sources of unwanted noise. The first is what we will term "SNP noise": the normal inter-individual variability caused by roughly one SNP every 1,300 bases, which leads to differences in the way we look, differences in intelligence, and so on, all within the normal range of human variation. The next source of unwanted variability is in the tissue itself. Imagine you have two patients, each with the same disease, and you take a biopsy from each. You might have a different proportion of unaffected and affected cells in the first biopsy compared to the second. This tissue heterogeneity can add noise to your experiment. Finally, the disease itself might arise from completely different genetic bases. This is a fundamental problem in much of human genetics, whether positional cloning or complex genetic studies like array analysis. You might have two families linked to completely different chromosomal loci and caused by different genes in a Mendelian study, or you might have different genetic backgrounds coming together to produce the same disease, for example diabetes; that genetic background can differ between patients and yet still present as the same disease.
Genetic heterogeneity is a significant confounding factor. Defining this noise will be the subject of the first two mini-lectures in this block. How to get around this noise will be the subject of module three. How we obtain a candidate gene list from the arrays will be the subject of mini-lecture four, and how to statistically validate these results, so that you can take them forward into a translational setting for diagnostics or therapeutics, will be the subject of mini-lecture five.

SLIDE 3

How do we design an array experiment? How do we assay the specific clinical variable under study without getting noise from the SNPs, the disease heterogeneity, or the tissue heterogeneity? Again, the sources of unwanted variability are primarily of these three types, and each can obscure the interpretation of expression profiling results. This prompts an interesting discussion of what is normal versus what is disease. To a large extent, disease is societally defined.

SLIDE 4

There is a whole spectrum of normal variation, and most of it is caused by different genetic backgrounds. When we say different genetic backgrounds, we basically mean the one base pair out of 1,300 at which we all differ. So some people have blue eyes, some have brown eyes; some are a little taller, some a little shorter. All of these fall within the spectrum of what society defines as normal, but at some point along this spectrum you reach a point at which society deems you to have a disease, which again is a label imposed on you because of what society identifies as a compromise to your body.
And, to some degree, intelligence can be an interesting cut point that is easily recognized, and at the other end of the spectrum you go all the way down to complex genetic diseases like diabetes, cancer, and asthma, which you inherit as part of your genetic background and which are clearly disease states because they cause your body direct harm.

SLIDE 5

So what causes this spectrum? On the left side of the slide you see that normal variation is caused by SNPs, and as you move toward the right end of the spectrum the term "mutation" applies; in general, a mutation is a much more devastating event to genomic DNA.

SLIDE 6

Mutations can be gaping holes, huge deletions in your genetic code, or they can be point mutations which, just looking at the sequence, look exactly like SNPs but actually change the amino acid coding sequence. They can be out-of-frame mutations, which turn the resultant proteins into a garbled mess of amino acids. Thus, in general, mutations are more severe events and SNPs are more benign, but when one pattern of SNPs comes together with another pattern of SNPs to form a disease state, that combination is classified as mutation and falls in the middle of this spectrum. Moving to the right of this spectrum, you have SNPs which change the amount of RNA that is produced; these can be in regulatory regions. SNPs can also change the composition of the transcripts produced, meaning they change amino acids; these are generally termed non-synonymous SNPs. These are the SNPs which will obscure our expression profiling results. They are normal variants: they change transcript levels within a normal range that differs from person to person and in general do not cause disease but cause variation between people. Now say you have a mutation which changes the level of a transcript as well. That change will become buried in all of the normal variation.
So how do you get around that? Again, this reinforces the point that these regulatory SNPs can come together in one combination and form a pattern of normal variation, but when they come together in a non-advantageous combination, say in an offspring, they can cause a disease phenotype. This becomes very difficult to parse out because it lies along the spectrum of normal human variation. So, what we are really interested in doing is identifying those transcript changes which are consistently found in people with a disease state and may sometimes be found in people without it, but which, looked at across a large number of individuals, show no consistent theme of SNP noise tracking with the phenotype of a normal, unaffected person. You will have random fluctuations in the unaffected population, but no fluctuation in the mutation. Thus, the take-away message is to identify expression correlates of disease in sample cohorts that are consistent across the disease state, and thus cannot be attributed to random "SNP noise".

LECTURE 2C: METHODS TO OVERCOME UNWANTED VARIABLES

SLIDE 1

In this session we are going to go into quite a bit more detail on experimental design. We are going to talk about methods to overcome unwanted variables, or isolating variables, and particularly address whether you want to do many individual profiles or mixed samples on a limited number of profiles. We will go over the statistical analysis of the different options within this consortium and show how your experimental design ends up influencing how you can interpret your data in the end.

SLIDE 2

So, let's talk about a couple more sources of variability. In the last two mini-lectures, you heard about "SNP noise", what it means, and how it depends on whether you are studying humans, rats, or mice. Here let's address a couple more variables.
One is tissue heterogeneity and the noise due to it. Let's compare and contrast tissues versus cell cultures. Tissues have mixed cell populations, and obviously, between different regions of the same tissue, or the same tissue from different individuals, you can have different ratios of those cell populations. Your expression profiles will, of course, reflect that. With patients you have the additional complication of pathology. Take cystic fibrosis again: you might want to compare the lungs of cystic fibrosis patients to normal lungs, but recognize that there is a lot going on in that tissue, many cellular changes in addition to just the CFTR mutation and the loss of the cystic fibrosis transmembrane conductance regulator protein. You have inflammatory cells. You even have bacteria invading the lung. So, when you are looking at pathological tissue, keep track of what is primary, what is secondary, and what you want to study in that experiment. Now, it is tempting to say that cell cultures are the best system to expression profile, and indeed you often have a single cell type on the dish. But there is a whole other plethora of variables, and in fact, looking at these variables, we typically discourage people from cell culture experiments, both from our own experience and because of a number of variables that are difficult to control. Just take serum, for example. Different serum lots, fetal or even neonatal, different types of serum from different organisms, are all a tremendous variable. There are different amounts of growth factors even if you take the same fetal bovine serum; fetal bovines are outbred, so you have all sorts of variability within that serum. You have problems with cell density, and problems with maintaining temperature and CO2.
And in general, we find that maintaining all those extrinsic variables in cell culture can be quite difficult, so we often go back to the tissue to look at primary problems instead of secondary ones that you cannot control.

SLIDE 3

Let's talk about one other thing, which is noise due to the disease. For example, do the patients and animals under study share the same primary problem? An obvious example, brought up a number of times, is patients with CFTR mutations. They can have the same mutation in the same gene, so that is controlled and you should not have much noise there, though we went over environmental noise and SNP noise before. An extreme example, on the other hand, is chronic obstructive pulmonary disease (COPD), where many different etiologies and many different environmental effects can cause the disease, so you have a very heterogeneous population. COPD is considered genetically heterogeneous, so you have a lot of noise due to the different disease mechanisms, whereas CF patients are considered genetically homogeneous, with all patients carrying mutations of the same gene. So, in this paradigm it can be acceptable to mix CF patients to normalize tissue noise and SNP noise, but it can be very problematic to mix COPD patient tissues, because you do not even know whether they share the same primary etiology. Often in profiling experiments we search for the underlying genetic heterogeneity; these are termed subclassification experiments.

SLIDE 4

So show me the noise. Where is it? Give me an example. Show me how you design an experiment and how you account for different sources of noise. Here we are going to use the example of muscular dystrophy. You can also see a written example, downloadable from this site as the "Sample Proposal Form". Muscular dystrophy patients have defects in muscle tissue.
There is usually a single genetic mutation leading to a single biochemical defect. Now, muscle has some advantages. It is relatively homogeneous; it is about 30-50% of your body mass, and, as you can tell from going to the supermarket, when you buy different cuts of meat it is still recognizable as meat. So it is somewhat homogeneous, at least. It is also generally flash frozen, which is nice: the standard pathological preparation for muscle is to take it from the patient and, as soon as possible, put it in isopentane-cooled liquid nitrogen and flash freeze it very quickly, which is ideal for RNA preparation. Now let's take a specific example, Duchenne muscular dystrophy, the first positionally cloned gene. All patients have mutations in the dystrophin gene and are lacking dystrophin in their muscles. So, like cystic fibrosis, we have a genetically homogeneous population under study. In this case, we would like to match for age of the patient, because the disease is progressive and we want to limit our studies to one stage of the disease. We want to control for sex; Duchenne dystrophy generally affects only males, so we make sure everyone is male. And then we want to make sure the same muscle is sampled.

SLIDE 5

So, how much noise is there in this system, and where does it come from? What we have done is taken a single muscle biopsy from a patient, divided it into two parts, and expression profiled each part individually. The graph here is a scatter graph: on one axis is the profile from one piece of the muscle biopsy, and on the other axis, the y-axis, is the profile from the second piece of the same biopsy. You can see that things line up pretty well. So in this case, two different regions of the same biopsy give similar expression profiles. However, let's take a different patient and do the same thing.
Here is the same experimental design, simply done with a different patient. We again have two different regions of another biopsy, but in this case look how different the profiles are: you see considerable scatter. So, what this analysis tells us is that tissue heterogeneity may or may not be a substantial source of noise in the interpretation. If you are lucky, you can take two regions of the same biopsy and find very similar expression profiles, implying that the cellular content of the two pieces is very similar. In a second patient, we did the same thing and found very different profiles. So, tissue heterogeneity can be a significant source of variation.

SLIDE 6

So, how do you visualize these sources of variability? What you typically do is cluster analysis, or hierarchical clustering, which defines the extent of sharing between different profiles. In this case I am going to show you unsupervised clustering, where we are not telling the program about any of the variables we know distinguish the samples. We are just saying, "Here are all the profiles from a bunch of different Duchenne muscular dystrophy patients. Some of them are mixed, some are individual, some are just duplicates; tell me what is related to what." The software itself, in unsupervised clustering, determines which profiles are most closely related. So, here is a large series of profiles from Duchenne muscular dystrophy patient muscle biopsies. The first thing I want to point out is that in one instance we took the same RNA sample, the same hybridization cocktail, and simply put it on two different arrays; that is these, 3A-D for duplicate and 3A. Now, in the branches of this dendrogram, the lower the branch point, the more closely related the samples are. And if you look across all these profiles, the arrow here is pointing to two profiles that are very highly related. And this is generally what we find.
We find that the actual procedure, the hybridization of the chip, the scanning of the chip, and the use of the Affymetrix algorithms to determine absolute intensity, has a very low level of variability; based on hybridization intensity, it is extremely reproducible. In fact, it has gotten quite a bit better over the last couple of years: the Affymetrix production facilities have been upgraded, and the chips have become more and more consistent. So in general, we find that experimental variability, as far as the process of the experiment goes, is very low, and that is shown here in these particular duplicate arrays, which is what we find in general.

SLIDE 7

Okay, what other sources? As I mentioned previously, we have taken two different regions of the same biopsy, as shown in the scatter graphs. Here we are seeing profiles 6A and 6B, which is patient six, one region of the biopsy versus the second region of the same biopsy, and we find that the branch point is relatively low. The program looks at all these profiles and says, "Yes, these two profiles are highly related," and we say, "Good, because they are from the same patient, so we were hoping they would be highly related." This also suggests that, at least in this patient, there is not much tissue heterogeneity. However, we can look at two different regions of the same biopsy of a different patient, and in fact, as shown here, these 1A and 1B profiles, again from different regions of the same biopsy, show up in entirely different regions of the dendrogram. They branch all the way at the top, even before the normal control individuals branch off. So, this is the same sample corresponding to the scatter graph shown previously, and it shows that, depending on your patient and your tissue, tissue heterogeneity can be either a huge source or a very small source of variability in your expression profile.
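The unsupervised clustering described above can be sketched with standard tools. This is a minimal illustration with simulated profiles, not the lecture's actual data: one simulated patient has two concordant biopsy regions, the other has a heterogeneous second region, and the noise levels and cut height are illustrative assumptions.

```python
# Sketch of unsupervised hierarchical clustering of expression profiles.
# All values are simulated: patient "6" has two concordant biopsy regions,
# patient "1" has a heterogeneous second region.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n_genes = 500
base6 = rng.lognormal(5, 1, n_genes)   # patient 6 expression program
base1 = rng.lognormal(5, 1, n_genes)   # patient 1 expression program

profiles = {
    "6A": base6 * np.exp(rng.normal(0, 0.05, n_genes)),  # little heterogeneity
    "6B": base6 * np.exp(rng.normal(0, 0.05, n_genes)),
    "1A": base1 * np.exp(rng.normal(0, 0.05, n_genes)),
    "1B": base1 * np.exp(rng.normal(0, 0.60, n_genes)),  # heterogeneous region
}
names = list(profiles)
data = np.log2(np.vstack([profiles[n] for n in names]))

# Correlation distance + average linkage; low branch points = related samples.
tree = linkage(pdist(data, metric="correlation"), method="average")
clusters = dict(zip(names, fcluster(tree, t=0.1, criterion="distance")))
print(clusters)   # 6A/6B cluster together; 1A/1B do not
```

Cutting the dendrogram at a low correlation distance mimics reading branch points: the concordant replicates merge early, while the heterogeneous region branches far away, as described for patient one above.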
SLIDE 8 So, let’s continue this analysis and continue looking at this dendrogram. Now, if we divide five biopsies from five different patients into two parts each. We mix equal amounts of the five biopsies into two pools. Keep in mind that the two pools here are from the same patients, but the two pools are derived from different regions of the biopsies. So, there is actually no RNA or tissue shared between the two pools. So then we hybridize each pool to an individual chip in this case. So, where do those show up? Well that is the arrow here and in this case it is control individuals. You can see that even though we used different regions of the biopsies the branch point is extremely low. It is almost all the way on top of the profiles, and this suggests that when you’ve mixed you have in fact normalized SNP noise, because there are multiple individuals here, and you have normalized tissue heterogeneity. And, when you think about it that makes sense. You are now taking many different regions of different biopsies and mixing them all together so any variability due to tissue heterogeneity should be normalized. You are basically averaging it all out and the same with the multiple individuals. Any SNP noise inherent in this population of patients with Duchenne muscular dystrophy patients is also normalized out. You can see that that has normalized out all that SNP noise. Mixing does normalize out all these different sources of variability. Of course, a critical point here is that if you are focusing your study on tissue heterogeneity or if you are studying SNP noise you definitely don’t want to mix patients because you are effectively normalizing that all out. SLIDE 9 Okay so what is appropriate? The bottom line is it depends. It depends on what you want to study. 
If you know the primary variable, as we do knowing that mutations in the CFTR gene cause CF or mutations in the dystrophin gene cause Duchenne dystrophy, and we only care about what is shared, not what differs between these genetically homogeneous patients, then it is perfectly appropriate to mix samples. It saves cost, time, and samples. If you are interested in variation between patients, then mixing is certainly not appropriate; it is statistically much more powerful to do many individual profiles. There are also analytical considerations to keep in mind: once we generate either mixed profiles or individual profiles, how we can analyze that data statistically and functionally in the end will differ depending on the approach.

SLIDE 10

Now, one method that we have used quite extensively is iterative, or pair-wise, comparison between small numbers of profiles. We have a very extensive published description of this protocol, and we refer you to the website, which has an extensive amount of text, examples, and data showing the output: microarray.cnmcresearch.org\PGA.hgm. Basically, we take a low number of profiles, say two mixed control profiles and two mixed Duchenne muscular dystrophy profiles. With those two controls and two Duchenne profiles it is possible to do four iterative comparisons: control one versus Duchenne one, control one versus Duchenne two, and so on. You end up with four different comparisons, and we simply say, "Show us all genes that consistently have fold-changes greater than two", that is, two-fold or greater changes in gene expression level in all four iterative, pair-wise comparisons. The output, then, is average fold-changes. You do not get P values.
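The four iterative comparisons can be sketched as follows. The expression values and the background floor of 50 units are illustrative assumptions, not values from the lecture; flooring low signals is one common guard against near-background denominators inflating ratios.

```python
# Sketch of the iterative (pair-wise) comparison strategy: every control
# profile is compared against every disease profile, and genes showing a
# >= 2-fold change in the SAME direction in all comparisons are kept.
# Expression values and the background floor of 50 units are illustrative.
import itertools
import numpy as np

BACKGROUND = 50.0   # floor low signals so they cannot inflate ratios

def fold_changes(control, disease, floor=BACKGROUND):
    """Per-gene disease/control ratios for one pair, floored at background."""
    return np.maximum(disease, floor) / np.maximum(control, floor)

def consistent_candidates(controls, diseases, cutoff=2.0):
    """Genes >= cutoff-fold up (or down) in every iterative comparison."""
    ratios = np.array([fold_changes(c, d)
                       for c, d in itertools.product(controls, diseases)])
    up = (ratios >= cutoff).all(axis=0)
    down = (ratios <= 1.0 / cutoff).all(axis=0)
    avg = ratios.mean(axis=0)   # output is average fold-change, no P value
    return np.flatnonzero(up | down), avg

# Two mixed control and two mixed disease profiles, 4 genes each.
controls = [np.array([100., 100., 400., 60.]),
            np.array([120., 110., 380., 55.])]
diseases = [np.array([500., 100., 150., 10.]),
            np.array([450., 105., 160., 20.])]

idx, avg = consistent_candidates(controls, diseases)
print(idx)   # genes 0 (up) and 2 (down) are consistent candidates;
             # gene 3's apparent drop is suppressed by the background floor
```

With 2 controls and 2 disease profiles, `itertools.product` yields exactly the four comparisons described above, and only genes consistent across all four survive.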
An important thing to recognize when you do analysis by fold-changes is that you are creating a ratio: the ratio of the expression level of a specific gene in a control profile to its level in a muscular dystrophy, or experimental, profile. If a gene is not expressed significantly above background in one or the other profile, that low level of gene expression becomes the denominator of the ratio. Say you have a five-fold increased expression in your experimental profile; your control value becomes the denominator. When your denominator is down at background levels, small absolute fluctuations can produce spuriously large fold-changes, so such calls must be treated with caution.

LECTURE 2D: SCREENING STRATEGY TO ID

SLIDE 1

Based on the large amount of inter-individual variability caused by SNP noise, or transcriptional flux, that we talked about in the first and second mini-lectures of this block, in the third module we have put forth a strategy to identify candidate genes which may be correlated with disease-specific processes. Remember, that is what we are trying to get to rise above the noise caused by SNPs. In the fourth mini-lecture, we will talk about validation techniques: how do we validate these candidates at the protein or functional levels, the ultimate test of whether your experiment was a success? We can use a pooling strategy to eliminate noise and let disease-specific correlates shine through without using, really, any statistical analyses.

SLIDE 2

So, the first phase of this screening strategy is to use pooled samples on very large Affymetrix arrays to identify preliminary candidate genes. In the second phase, those candidate genes can be printed onto glass spotted arrays, and a large number of samples are screened with this inexpensive array carrying a subset of the tens of thousands of genes in the preliminary screen. Alternatively, we can screen this independent sample set on larger arrays if we have the funds.
Now that costs have dropped dramatically for all large arrays, this becomes a feasible (and in fact encouraged) approach.

SLIDE 3

So again, phase one is candidate gene identification. We take five individuals with a given phenotype and five matched individuals without that phenotype, isolate RNA from each individual, synthesize cDNA and cRNA from each sample, and pool the members of each group of five in equimolar amounts. Each pool is then hybridized to an individual Affymetrix array. In humans we are using the U133A and B chip set, representing ~45,000 transcripts; for mouse, the complete U74 chip set; and for rat, the complete U34 chip set. Each of these screens tens of thousands of different transcripts in a single hybridization. We routinely identify at least 80% true positives through this strategy. There are certain confounding factors, which we have talked a little bit about, that make this strategy suboptimal, primarily heterogeneous disease-causing processes. For example, if one primary mutation results in the given phenotype in three of the five samples, and the other two have a different mutation resulting in a similar but non-identical phenotype, and the two operate through independent processes, you will not have a consolidated pathway rising above the noise; in fact, the two may negate each other. So, whichever way this falls out, whether or not you obtain a nice list of candidate genes that all make sense, you know that if you do not, there may be heterogeneity going on, and having hybridized only four array sets you can go back, assume there may be heterogeneous processes, and flesh this out with individual arrays. So, as a screening approach, we feel this is a very valid one.
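As a toy illustration of why equimolar pooling normalizes "SNP noise": the pool behaves like an average of the individual profiles, so inter-individual fluctuations shrink while a shared disease signal survives. All numbers below are simulated, not lecture data.

```python
# Simulated demonstration that pooling averages out inter-individual noise.
# An equimolar pool of five profiles behaves like their mean, so random
# per-individual variation shrinks while a shared 3-fold signal remains.
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_patients = 1000, 5

base = rng.lognormal(5, 1, n_genes)                   # shared expression program
snp_noise = rng.lognormal(0, 0.4, (n_patients, n_genes))
disease = np.ones(n_genes)
disease[:10] = 3.0                                    # 10 genes truly 3-fold up

individuals = base * snp_noise * disease              # five disease profiles
pool = individuals.mean(axis=0)                       # equimolar pool ~= average

# Spread of each profile around the shared program:
cv_single = np.std(individuals[0] / (base * disease))
cv_pool = np.std(pool / (base * disease))
print(cv_single, cv_pool)   # pooling reduces the spread
```

The flip side, as noted above, also falls out of this sketch: anything you care about that varies between individuals is averaged away in `pool`, so mixing is only appropriate when the shared signal is the target.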
SLIDE 4

The schematic of what this looks like is presented here: five individuals, illustrated at the top as one through five, have biopsies taken. Those biopsies are split in half in this example; half of each biopsy goes into one pool and the other half into another pool. What this does is account for experimental variability, really isolating the biological noise from the experimental noise. All of the labeling reactions are done independently on each sample, and finally the samples are pooled and hybridized onto the GeneChips.

SLIDE 5

We have discussed the iterative comparison strategy in the previous mini-lecture.

SLIDE 6

The second phase is validation. One option is to take these lists of change calls, illustrated by steps 1, 2, 3, and 4, and print all those gene probes onto glass slides. We have talked a little bit about where those clone sets are available from; for example, through Research Genetics you can just send a list of clones and they will send the clones back. We can use our cDNA array printer to print those onto glass arrays in larger volumes, meaning we can make a hundred slides at a time, and we can screen a very large sample or patient set on a large number of arrays, thereby statistically validating that these outliers are true outliers over a large number of samples. That is done through standard correlation analysis, which we will talk about in a subsequent lecture. Again, cost permitting, validation should be done on the largest arrays possible with these new samples, so that the data may be archived for additional use in other projects.

SLIDE 7

This is an example of an array currently used in the lab; it happens to be a 22,000-element cDNA array.
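The statistical validation on the larger sample set can be sketched as a per-gene two-class test. This is a minimal sketch with simulated log-intensities, not the lecture's data; real analyses typically also apply multiple-testing correction, which is not shown here.

```python
# Sketch of per-gene statistical validation on a larger sample set:
# a standard two-class t-test yields a P value for each candidate gene.
# Log2 intensities below are simulated; 5 genes carry a true 2-unit shift.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_genes, n_per_group = 100, 20

control = rng.normal(8.0, 1.0, (n_per_group, n_genes))
disease = rng.normal(8.0, 1.0, (n_per_group, n_genes))
disease[:, :5] += 2.0                       # 5 genes truly upregulated

# One t statistic and P value per gene (columns are genes).
t, p = stats.ttest_ind(disease, control, axis=0)
validated = np.flatnonzero(p < 0.05)
print(validated[:10])                       # the 5 true genes should appear
```

With a P < 0.05 cutoff on 100 genes you also expect a handful of chance positives, which is exactly why the lecture's phase-two screen uses a large independent sample set.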
SLIDE 8

The phase two statistics are relatively standard correlation statistics, which will be addressed in a subsequent lecture. There are many different ways you can look at this: you can look at standard differences in means between two classes, at differences in the ratio of variances between two classes, at inter-gene interactions, for example using relevance networks, and so on. But the end goal of phase two is to get a P value for each of the genes on these subset arrays, saying whether or not it is really correlated with the disease process.

SLIDE 9

So, in conclusion, there are two phases to our experimental design, and this is really based on all of the SNP noise discussion we have already gone through. A very large screen using large arrays can be done in phase one on two pools, a disease pool and a control pool, as a training set. The candidate genes can either be printed onto custom spotted arrays, with much larger sample sets screened for statistical validation (the validation set), or this can be done on large arrays as well. Finally, those genes which show statistical correlations, i.e. P values less than 0.05 with whatever correlation statistic we are using, will be shunted into protein and functional validation, as discussed next.

LECTURE 2E: VALIDATION TECHNIQUES

SLIDE 1

Welcome to mini-lecture five in block two. In this mini-lecture we are going to talk about how to validate the results you have gotten through your screening strategy. So how do we do that? There are three levels at which we can validate. We can validate at the RNA level. We can validate at the protein level, which takes things a step further. And finally, the gold standard, we can validate through a functional assay, to see whether the protein causes the disease or has something to do with it.
SLIDE 2

If we look at the RNA level, there are three standard techniques available today to validate the array results. Northern blots are the standard low-tech lab technique, which every molecular biology lab has access to and can do very quickly and easily. Quantitative reverse transcription followed by PCR is a more robust technique in that it requires lower RNA input amounts, but it is also more expensive and more difficult to do. And finally, we can use arrays themselves to screen a large number of independent samples for every gene in parallel; you then get validation results on every single one of your change calls in a single hybridization over many samples. All of these validation techniques at the RNA level get at the question of whether your array results are real and hold up over a large number of samples in the two clinical groups. The next step, which is almost mandatory these days, is to validate those results at the protein level as well. Remember, RNA is translated into protein, and if the protein is not concordantly dysregulated with its transcript, then the RNA results are really meaningless. So, either by Western blot or by immunohistochemistry, we want to say, "Yes, this upregulated transcript results in an upregulated protein which has something to do with the disease." Finally, we want to look at the dysregulated protein and ask, "If we tone it down or tone it up, does it mediate the disease?" That is the final gold standard for a validation technique.

SLIDE 3

So, validation at the RNA level. The first thing we will talk about is the Northern blot. Again, this is a standard low-tech procedure in which total RNA is electrophoresed through an agarose gel, size fractionated, transferred onto a nitrocellulose or nylon membrane, and probed with the gene you are interested in validating.
The pros are that it is very easy, very cheap, and very straightforward to do; the protocols have been optimized. The cons are that you need a fairly large amount of starting RNA: each well of your Northern blot needs at least 10 micrograms of total RNA, and if you are working from a very small chunk of flash-frozen human tissue, that could be more than you can get. In addition, that could be extremely valuable tissue that you do not want to chew up needlessly, and this is where the technique falters; it is very difficult to decide whether to use your entire tissue on this validation technique. Finally, it is a low-throughput assay, meaning you can only look at one transcript per probing of a Northern. It requires multiple strippings and rehybridizations of the same blot if RNA quantity is limiting, and this is really not going to work for hundreds or thousands of different probes.

SLIDE 4

To get around the limiting amount of RNA input required for a Northern blot, you can do quantitative reverse transcription followed by PCR. In general, you isolate a very small amount of total RNA, for example one microgram, reverse transcribe it into cDNA (a copied DNA strand), and use that as the input for a PCR reaction which is quantitative, meaning that, given a set input amount of cDNA, you know how much amplification you get out on the back end and can compare that across tissues. This is usually done with an internal standard for amplification which does not vary across the different tissues, and it is a really nice, robust technique requiring very little RNA input. The cons are that it is very expensive: you may need up to 100,000 dollars of hardware, and each single PCR probe set is fairly expensive to design and use, primarily because fluorescence is involved.
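The internal-standard normalization described above is commonly implemented as the comparative Ct (delta-delta Ct) method, where each PCR cycle corresponds to an approximately two-fold difference in template. This is a sketch of that standard calculation, not necessarily the lecture's exact protocol, and the Ct values below are hypothetical.

```python
# Comparative Ct (delta-delta Ct) quantification for qRT-PCR: the target
# gene's Ct is normalized to a reference gene that does not vary across
# tissues, then compared between experimental and control samples.
# Assumes ~100% PCR efficiency (one cycle = 2-fold); Ct values are made up.
def fold_change_ddct(ct_target_exp, ct_ref_exp, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression of the target gene, experimental vs control."""
    d_ct_exp = ct_target_exp - ct_ref_exp      # normalize experimental sample
    d_ct_ctrl = ct_target_ctrl - ct_ref_ctrl   # normalize control sample
    dd_ct = d_ct_exp - d_ct_ctrl
    return 2.0 ** (-dd_ct)

# Target crosses threshold 2 cycles earlier in disease tissue, with the
# reference gene unchanged: a 4-fold upregulation.
print(fold_change_ddct(22.0, 18.0, 24.0, 18.0))   # → 4.0
```

This also shows why a stable internal standard matters: any drift in the reference gene's Ct propagates directly into the reported fold-change.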
So these assays can be prohibitively expensive to run for hundreds of genes.

LECTURE 2F: QUALITY CONTROL

In this mini-lecture we will talk about the generation of high-quality microarray data. As you know, microarrays are powerful tools for the study of gene expression. The number of laboratories relying on this technology is constantly increasing, and to avoid experiment redundancy it is becoming common practice to share microarray profiles on the worldwide web through the implementation of expression profile databases. In many circumstances, however, the quality of data in these databases is poor due to the lack of quality control. In this lecture, I will list some of the quality control standards that each Affymetrix GeneChip microarray here at Children's National Medical Center in Washington, D.C. needs to meet before it is made public to the scientific community. Quality control checkpoints are applied at different levels during the generation of microarray data, and the ones we mostly take care of are shown in this slide. Now let's see them one by one. Regarding RNA extraction, we generally prefer to extract total RNA instead of messenger RNA to minimize the loss of transcripts during the procedure. The minimal amount of total RNA that we work with is six micrograms when doing a single round of amplification/labeling, and its integrity is generally analyzed on a 1% agarose gel. The intensity and size of the two ribosomal bands are the main characteristics that we look at. If the sample does not show good ribosomal bands, it is discarded. The next checkpoint is at the level of the cRNA fold amplification. During this reaction, using a T7 RNA polymerase promoter, the cDNA is biotin labeled and amplified into cRNA. The cRNA amplification rate is measured, and samples showing amplification rates below four are not accepted. In addition, what this slide shows is the use of replicates. 
In our lab, generating replicates means that each tissue is divided into two parts and each part is analyzed on a microarray. As you can see in the slide, the first biological sample, 1A, is divided into two parts, 1Aa and 1Ab. These two are the replicates of the same biological sample. The reason for doing so is to build statistical strength, so that we can compute variances and P values based on real biological replicates rather than using the error model. The error model is described in section three of the GeneSpring tutorial. When replicates are generated, we also calculate the correlation coefficient, or R2 value. This is done by loading into Excel the values derived from each metrics file, and based on the results we decide to keep or discard the microarray. Acceptable R2 values vary depending on the tissue used. For microarrays derived, for example, from inbred mice, the minimal acceptable value between replicates is 0.98. For any other tissue we don't accept values below 0.95. Another checkpoint is to look for saturated probe sets. Each microarray, according to protocol, is stained with streptavidin twice: staining one and staining two. The second staining, staining two, is done to enhance the signal and detect low-abundance transcripts. However, it can lead to saturation of the probe sets. The saturation is generated by what we call the ceiling effect, that is, the situation in which the signal intensity measured by the laser is no longer correlated with the abundance of the real biological transcript, and this happens when the signal reaches its maximal intensity. The only way to check for saturated probe sets is to plot a scatter graph with the data deriving from the first and second staining. As you can see in this slide, in the upper right corner, each dot represents a probe set, or a gene, and for some of these the similarity between the first and second staining is lost. Some dots are in fact outside the boundaries, and this indicates the saturation effect. 
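Both checkpoints described here, the replicate R2 threshold and the first-versus-second-staining comparison, are simple computations. The sketch below is illustrative only: it is not the center's actual script, every intensity value is made up, and the two-fold boundary used to flag saturated probe sets is a hypothetical choice, since the lecture does not quote one.

```python
# Illustrative replicate and saturation checks. All intensity values are
# made-up examples; the 0.98/0.95 R2 thresholds are the ones quoted in
# the lecture, while the two-fold saturation boundary is hypothetical.
import math

def r_squared(x, y):
    """Square of the Pearson correlation between two replicate arrays."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return (cov / (sx * sy)) ** 2

def accept_replicates(rep1, rep2, inbred_mouse=False):
    """Keep the arrays only if R2 meets the tissue-specific threshold."""
    threshold = 0.98 if inbred_mouse else 0.95
    return r_squared(rep1, rep2) >= threshold

def saturated_probes(stain1, stain2, fold=2.0):
    """Flag probe sets whose second-staining signal no longer tracks the
    first staining (here: outside a fold-change boundary on the scatter)."""
    flagged = []
    for probe, s1 in stain1.items():
        s2 = stain2[probe]
        if s1 > 0 and not (1 / fold) <= s2 / s1 <= fold:
            flagged.append(probe)
    return flagged

rep_a = [120.0, 340.0, 56.0, 980.0, 410.0]
rep_b = [118.0, 352.0, 60.0, 990.0, 400.0]
print(accept_replicates(rep_a, rep_b))            # -> True (R2 ~ 0.999)

stain1 = {"probe_A": 200.0, "probe_B": 300.0}
stain2 = {"probe_A": 210.0, "probe_B": 46000.0}   # probe_B off the diagonal
print(saturated_probes(stain1, stain2))           # -> ['probe_B']
```

This is the same decision the scatter plot encodes visually: probe sets whose two stainings stay near the diagonal pass, while dots outside the boundaries are flagged.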
The first and second staining are represented respectively on the x and y axes of this graph. Identification of saturated probe sets is very tedious work, and currently an automated script has been developed in our center for fast and efficient detection, which can be downloaded from this portal site. Let's talk about scaling factors. Scaling factors are computed by the Affymetrix software and are based on the average intensity and global scaling values. This is one form of normalization, linear normalization to be specific. The total intensity of the entire array is determined, and then each gene is multiplied by the same scalar to bring every chip to the same total intensity. Another excellent algorithm for scaling is available in the dChip suite of software, which is downloadable through this portal. In each microarray, the signal of each probe is increased or decreased to a target intensity, and this is done to get microarrays that are reciprocally comparable. The decrease or increase doesn't alter the differences across genes in a microarray, but simply adjusts them, multiplying by a scaling factor to reach the target intensity set by the user. Experiments done here at Children's have demonstrated that expression profiles are reproducible only with scaling factors between 0.5 and 5. The last checkpoint I am going to talk about is scrubbing of data. Scrubbing the data means removing all the probe sets that did not change across time points in your experiment, or that are not expressed across all profiles. This can be done in GeneSpring, and we have also written a script, downloadable here, to do this. It is entitled "Array Data Manipulation". After doing all of these quality controls for Affymetrix arrays, we have found the technology to be extremely stable, reproducible, and transportable. Well, that is all for quality control procedures so far. 
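The linear scaling and the scrubbing step just described can be sketched in a few lines. This is not the Affymetrix software, the dChip algorithm, or the downloadable "Array Data Manipulation" script; it is a minimal illustration in which the target intensity, the minimum fold change, and all probe values are hypothetical, while the 0.5-5 scaling-factor window is the acceptance range quoted in the lecture.

```python
# Illustrative global (linear) scaling plus scrubbing. Intensities and
# the target value are examples; 0.5-5 is the lecture's acceptance range.

def scale_array(intensities, target_intensity=500.0):
    """Multiply every probe set by one scalar so the array's mean
    intensity matches the target. Returns (scaled values, scaling factor)."""
    mean = sum(intensities) / len(intensities)
    factor = target_intensity / mean
    if not 0.5 <= factor <= 5.0:
        raise ValueError(f"scaling factor {factor:.2f} outside 0.5-5; "
                         "expression profile may not be reproducible")
    return [v * factor for v in intensities], factor

def scrub(profiles, min_fold_change=1.5):
    """Drop probe sets whose expression never changes by at least
    min_fold_change across the profiles (e.g. time points)."""
    kept = {}
    for probe, values in profiles.items():
        lo, hi = min(values), max(values)
        if lo > 0 and hi / lo >= min_fold_change:
            kept[probe] = values
    return kept

array = [100.0, 900.0, 250.0, 750.0]          # mean 500 -> factor 1.0
scaled, factor = scale_array(array)
profiles = {"probe_A": [100.0, 400.0, 120.0],  # changes 4-fold: kept
            "probe_B": [200.0, 210.0, 205.0]}  # essentially flat: removed
print(factor, sorted(scrub(profiles)))         # -> 1.0 ['probe_A']
```

Note that multiplying every gene by one common factor preserves the relative differences between genes within a chip, which is exactly why this form of normalization is safe to apply before comparing chips.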
Please consider that the above-mentioned quality control checkpoints are based on our experience, and other laboratories might compile different ones to best fit their purposes.