AffyRNADegradation: control and correction of RNA quality effects in GeneChip expression data

Motivation: Gene expression experiments aim to accurately quantify thousands of transcripts in parallel. Factors posterior to RNA extraction can, however, impair their accurate representation. RNA degradation and differences in the efficiency of amplification affect raw intensity measurements using Affymetrix expression arrays. The positional intensity decay of specifically hybridized probes along the transcript they intend to interrogate is used to estimate the RNA quality in a sample and to correct probe intensities for the degradation bias. This functionality, for which no previous software solution is available, is implemented in the R/Bioconductor package AffyRNADegradation presented here. Availability: The package is available via Bioconductor at the URL http://bioconductor.org/packages/release/bioc/html/AffyRNA Degradation.html Contact: Fasold@izbi.uni-Leipzig.de Supplementary information: Supplementary data are available at Bioinformatics online.


Tongs plot and degradation hook
We present two graphical representations that allow assessing the degradation of RNA-transcripts in a chip-specific fashion. These so-called 'degradation hook' and 'tongs plot' estimate the 3'-enrichment of the probes and thus their degradation level in dependence on the expression degree. They depict the mean intensity difference between two selected subsets of perfect-match probes taken from the 3'-and 5'-ends of each probe set, respectively, as a function of the average logged intensity p pset y = Σ ≡ Σ of all probes within the probe sets which estimates the expression degree to a rough approximation. The subscript s= s3', s5' assigns the respective subsets of three consecutive probes from either the 3' or the 5' end of the probe set. Three probes are chosen from each side of the probe sets to ensure robust averaging over their intensities and to ensure sufficiently large differences between the averaged intensities of both subsets, and thus a proper tradeoff between robustness and sensitivity. Note that the number of three probes refers to about 1/3 to 1/4 of the size of each probe set of typically 11 probes for most array types. On the other hand, our choice is not crucial: Selecting subsets of two or four probes from both ends of the probe sets only marginally alters the results (data not shown). Figure S 1 shows the tongs plot and degradation hook for the same two samples that were used also in the main paper. In general, the degradation hook is more suited to compare the degradation between different samples. The tongs plot reveals additional information such as an asymmetrical behavior of the 3' and 5' subsets.

Probe positional intensity decays
Two main factors related to RNA quality potentially affect the intensities of the probes : (i) the distance of a probe relative to the 3'-end of the transcript, L (or, alternatively, the probe index in the probe set, k, which counts the probes in direction away from the 3'-end of the transcript) and (ii), the hybridization mode (Fasold and Binder, 2012). In the specific (S-) hybridization mode the probes bind amplified RNA (aRNA) fragments of complementary sequence originating from the mRNA transcripts which they intend to detect. In the N-hybridization mode the probes bind to aRNA fragments of partly complementary sequence originating however from other mRNA transcripts not referring to the interrogated gene. We normalize the intensity of probes at position x= k, L with respect to the intensity of probes located at the 3'-end to obtain the degradation index The degradation index due to non-specific hybridization is virtually constant d N (x)≈ 1. The degradation index due to specific hybridization shown in Figure 1 of the main paper is described using an exponential decay of the form

Correcting the probe positional intensity bias
The raw probe intensities of each sample are corrected as follows: 1. The degradation hook 3'/5' ∆Σ -vs-Σ is calculated using all perfect-match probe intensities for the current array as described in section 1. 5. The correction function is calculated as weighted sum of the decay functions due to specific and non-specific hybridization where the latter one is simply set to unity, d N (x)= 1, i.e.
The biased probe intensities are then corrected using the inverse of the correction function, P,x corr P p p Note that each probe intensity is rescaled according to value of the mean intensity decay at its position (x= k or L) and according to its hybridization mode as indicated by the abscissa-value of its probe set y. Consequently, probe intensities taken from the non-specific hybridization range remain uncorrected. With increasing degree of specific hybridization the probes are progressively scaled up with increasing distance from the 3'-end of the transcript. The maximum correction applies to probe sets in the Shybridization range. MM probe intensities are scaled using the mean logged MM-intensity of the probe set as argument.

Positional information of the probes
The distances of the probes to the 3' end are obtained by aligning their sequences to the respective transcript sequence serving as target for the respective probe set as provided by Affymetrix. We have computed files containing probe positional information for a large number of GeneChip expression arrays. They are available via the website http://www.izbi.uni-leipzig.de /downloads_links/programs/rna_integrity.php. These files are stored in R binary file format. The package documentation contains a section describing the contents of this file and explaining how the user can easily create and use custom probe location files, for example if he uses custom microarrays.

Choosing between absolute and relative probe positions
In the supplementary text of ref. (Fasold and Binder, 2012) we compare the two correction metrics based either on the absolute probe position ('L-correction') or on the relative probe position (indexbased, 'k-correction') relative to the 3′ transcript end. The k-correction applies the same positional factor to all probe sets. In consequence, the probe set-specificity of the correction is solely determined by the degree of specific hybridization. Contrarily, the L-correction applies a specific factor to each probe-set depending on the particular location of its probes. Comparison of both correction methods shows that probe sets located on the average nearer to 3′-end of the transcript are corrected to a less degree using their absolute position than probe sets located more distant from the 3′-transcript end. Hence, the L-correction is more specific with respect to each particular probe set. On the other hand, the k-correction is more robust with respect to outliers. We recommend use of absolute probe positions to cope with the effect of differently distributed probes. In practice the intensity changes due to index-based and position-based correction differ only slightly with, in general, small differences in the resulting expression values.