Microarrays have become a popular tool among life-science researchers. They provide a practical means of measuring gene expression, whether from fresh tissue samples or from archived, preserved tissues.
The most common issue is the quality of the data. The data from a two-color microarray experiment are normalized ratios of Cy3 to Cy5 dye fluorescence, an indirect measure of how much labeled cDNA has hybridized to each spot and therefore of the similarity of the sequences. Significant deviations from the neutral value of 1 (no change) are measured and analyzed: values above 1 reflect up-regulation and values below 1 reflect down-regulation. Experimental designs can be built with or without a reference sample. Data mining tools are also used to analyze data from multiple experiments.
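As a rough illustration of how a ratio is read, the sketch below computes the log2 ratio for one spot and labels the gene as up-regulated, down-regulated, or unchanged. The intensities, the gene names, and the two-fold cutoff are assumptions made for the example, not values from any experiment.

```python
import math

def log2_ratio(cy3, cy5):
    """Return log2(Cy3 / Cy5) for one spot, using the ratio convention in the text."""
    if cy3 <= 0 or cy5 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy3 / cy5)

def classify(ratio, fold=2.0):
    """Label a normalized ratio; the two-fold cutoff is an illustrative assumption."""
    if ratio >= fold:
        return "up"
    if ratio <= 1.0 / fold:
        return "down"
    return "unchanged"

# Hypothetical background-corrected intensities per gene: (Cy3, Cy5)
spots = {
    "geneA": (1600.0, 400.0),
    "geneB": (500.0, 500.0),
    "geneC": (250.0, 1000.0),
}

calls = {name: classify(cy3 / cy5) for name, (cy3, cy5) in spots.items()}
log_ratios = {name: log2_ratio(cy3, cy5) for name, (cy3, cy5) in spots.items()}
```

Working on the log scale makes up- and down-regulation symmetric around 0: a four-fold increase gives +2 and a four-fold decrease gives -2.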
It is natural for such a multi-step procedure to raise issues at each stage. General experimental issues affect isolation of mRNA from the source, target preparation, probe design, preparation of the microarrays, hybridization, and analysis.
The general sources of error at the different steps of a microarray experiment are summarized below.
Sample collection: The common issues at this step are tissue variation and inadequate RNA. They can be addressed by carefully isolating RNA from the best available source and by using biological replicates. When archived tissues are used and RNA degradation over time is a concern, additional replicates can increase precision. RNA quality can also be compromised during the extraction procedure. The RNA integrity number (RIN) serves as a check on the integrity of an RNA sample.
Dye-swap methods are used to correct errors caused by differences in how the two dyes bind. Robotic printing improves the consistency of spotting and therefore the spot quality. Maintaining well-controlled conditions promotes uniform hybridization and helps avoid cross-hybridization. Technical replicates can also be used to reduce such errors. Standard scanners with built-in background correction ease the task, and most scanners offer automatic gridding.
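The dye-swap idea can be sketched in a few lines. The example assumes that each gene carries an additive dye bias on the log scale that has the same sign on both arrays of the swapped pair; the function name and all numbers are hypothetical.

```python
def dye_swap_average(m_forward, m_swapped):
    """Combine log2 ratios from a dye-swap pair of arrays.

    m_forward: log2(Cy3/Cy5) with sample A labeled Cy3 and sample B labeled Cy5
    m_swapped: log2(Cy3/Cy5) with the labels exchanged (B in Cy3, A in Cy5)
    An additive dye bias enters both values with the same sign,
    so it cancels in the difference.
    """
    if len(m_forward) != len(m_swapped):
        raise ValueError("both arrays must report the same genes")
    return [(f - s) / 2.0 for f, s in zip(m_forward, m_swapped)]

# Hypothetical data: true log2(A/B) per gene, plus a dye bias of 0.5
true_m = [1.0, 0.0, -2.0]
bias = 0.5
forward = [m + bias for m in true_m]    # A in Cy3, B in Cy5
swapped = [-m + bias for m in true_m]   # labels exchanged

corrected = dye_swap_average(forward, swapped)
```

In this noise-free example the corrected values equal the true log ratios. With real data the bias cancels only on average.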
Once the data are collected, they are preprocessed, or normalized, to reduce experimental and other types of bias. Randomization can also limit the effect of confounding variables in the experiment. An appropriate statistical test with the right correction factors is more likely to yield biologically meaningful results. Statistical software can be used to account for these sources of variation so that the conclusions remain statistically sound despite such issues.
The log-transformed expression value can therefore be modeled as the sum of any or all of these effects:
log(y) = μ + A + D + G + S + A(G) + S(G) + n
where
μ = average expression value
A = array effect
D = dye effect
G = gene effect
S = sample variety effect
A(G) = combined effect of array and gene
S(G) = combined effect of sample variety and gene
n = independent experimental noise
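To make the additive model concrete, the sketch below builds log(y) from the terms above for a two-array dye-swap layout and shows which terms survive when treated and control samples are compared. Every effect size, gene name, and sample label is a hypothetical value chosen for illustration, and the noise term is set to zero.

```python
# Hypothetical effect sizes on the log scale
mu = 8.0
A = {1: 0.25, 2: -0.25}                      # array effect
D = {"Cy3": 0.5, "Cy5": -0.5}                # dye effect
G = {"g1": 1.0, "g2": -1.0}                  # gene effect
S = {"treated": 0.25, "control": -0.25}      # sample variety effect
AG = {(1, "g1"): 0.125, (1, "g2"): -0.125,
      (2, "g1"): -0.125, (2, "g2"): 0.125}   # array-by-gene effect
SG = {("treated", "g1"): 0.75, ("control", "g1"): -0.75,
      ("treated", "g2"): -0.75, ("control", "g2"): 0.75}  # sample-by-gene effect

def log_y(array, dye, gene, sample, noise=0.0):
    """log(y) = mu + A + D + G + S + A(G) + S(G) + n for one measurement."""
    return (mu + A[array] + D[dye] + G[gene] + S[sample]
            + AG[(array, gene)] + SG[(sample, gene)] + noise)

def treated_vs_control(gene):
    """Average the within-array difference over a dye-swap pair of arrays."""
    d1 = log_y(1, "Cy3", gene, "treated") - log_y(1, "Cy5", gene, "control")
    d2 = log_y(2, "Cy5", gene, "treated") - log_y(2, "Cy3", gene, "control")
    return (d1 + d2) / 2.0

estimates = {g: treated_vs_control(g) for g in G}
```

The array and A(G) terms cancel within each array and the dye term cancels across the swap, so each estimate equals the difference in S plus the difference in S(G) for that gene. S(G) is the term that carries the differential expression of interest.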
Apart from the issues cited above, technical variation in the data can arise from many factors:
Replicate spots within an array or slide may differ in measured expression because of sampling or scanning errors.
Variation can be introduced by dust, scratches on slides, or local hybridization conditions such as moisture, humidity, and temperature.
Error can also be introduced by improper subtraction of background from spot signal intensities.
There can be errors in scanning, such as misalignment of spots and gridding errors. These systematic errors are corrected using local data normalization.
There are specific normalization methods to correct different errors. For example, print-tip group normalization accounts for the effects of multiple pins, whereas global LOWESS regression accounts for general errors in sampling.
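The grouping idea behind print-tip normalization can be sketched as follows. This is a deliberately simplified stand-in: it subtracts the median log ratio of each pin group, whereas print-tip normalization as usually described fits a LOWESS curve within each group. The function name and the data are hypothetical.

```python
from collections import defaultdict
from statistics import median

def median_center_by_group(log_ratios, groups):
    """Subtract each group's median log ratio from the spots in that group."""
    if len(log_ratios) != len(groups):
        raise ValueError("need one group label per spot")
    by_group = defaultdict(list)
    for m, g in zip(log_ratios, groups):
        by_group[g].append(m)
    centers = {g: median(values) for g, values in by_group.items()}
    return [m - centers[g] for m, g in zip(log_ratios, groups)]

# Hypothetical log2 ratios; spots printed by pin 2 read 0.5 too high
log_ratios = [-1.0, 0.0, 1.0, -0.5, 0.5, 1.5]
pins = [1, 1, 1, 2, 2, 2]

normalized = median_center_by_group(log_ratios, pins)
```

Giving every spot the same group label turns this into a global median centering of the whole array, the simplest form of global normalization.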
With the advent of statistical software and other tools for handling large data sets, microarray data analysis now includes comparisons across multiple data sets, statistical algorithms, and complex tests. Adequate data are needed for the experiment to have sufficient statistical power.
DNA microarray performance is also affected by the type of robotic printing, the pin type, humidity and temperature at spotting, the spotting buffer, probe concentration, the type of immobilization used, the hybridization procedure and conditions (such as diffusion), the gridding technology, and the preparation of target and probe. Contamination, misplaced spots, and variable spot density can shift the mean expression value significantly.
Among microarray fabrication procedures, spotted arrays are more variable than Affymetrix probe arrays. The design of the experiment also affects data quality. Reference designs are more susceptible to technical variation than balanced designs. For the same number of slides, a balanced design can accommodate twice as many samples, which increases statistical power and improves the precision of the data.
The analysis tools are equally important. Larger variations within or between slides call for more complex statistical tests and hence more capable statistical software. The best choice depends on the test, the correction factors, and affordability.
The normalization process can be adapted to include the necessary corrections. Preprocessing can remove systematic bias due to technical artifacts while preserving the biologically relevant changes in gene expression.