RAX2: genome-wide detection of condition-associated transcription variation

Almost all of mammalian genes have mRNA variants due to alternative promoters, alternative splice sites, and alternative cleavage and polyadenylation sites. In most cases, change in transcript due to choosing alternative cleavage and polyadenylation sites does not lead to change in protein sequence, while selection of alternative promoters and alternative splice sites would alter protein sequence. Nevertheless, all these alternations would give rise to different RNA isoforms. Selection of alternative RNA isoforms has been found to be associated with change in condition. For example, many studies have revealed that alternative cleavage and polyadenylation (poly(A)) are correlated to proliferation, differentiation, and cellular transformation. Thus, unlike gene expression in microarray, change in usage of splice sites or poly(A) sites associated with conditions does not involve differential expression and hence cannot be detected by differential analysis methods but can be done by association methods. Traditional association methods such as Pearson chi-square test and Fisher Exact test are single test methods and do not work on the RNA count data derived from replicate libraries. For this reason, we here developed a large-scale association method, called ranking analysis of chi-squares (RAX2). Simulations demonstrated that RAX2 worked well for finding association of changes in usage of poly(A) sites with condition change. We applied our RAX2 to our primary T-cell transcriptomic data of over 9899 tags scattered in 3812 genes and found that 1610 (16.3%) tags were associated in transcription with immune stimulation at FDR < 0.05 and most of these tags associated with stimulation also had differential expression. Analysis of two and three tags within genes revealed that under immune stimulation, short RNA isoforms were significantly preferably used. Like cell proliferation and division, short RNA isoforms are highly prioritized to be used for cell growth.

at 3'end may protect the mRNA from unregulated degradation, trigger export of the mRNA to cytoplasm, and assist recognition by translation machinery (Danckwardt et al. 2008;Shen et al. 2011;Zhao et al. 1999). A poly(A) signal at which the pre-mRNA is cleaved and referred to as poly(A) site is recognized and activated by a balk of protein factors called polyadenylation factors (Hunt et al. 2008;Shen et al. 2011;Shi et al. 2009;Zhao et al. 1999). Alternative splice and poly(A) sites significantly increase complexity of transcriptomes and proteomes because they lead to multiple isoforms or variants of a mRNA. Using next generation sequencing (NGS) techniques, it has been observed that over 50% of the transcriptome in the mammalian genome have alternative cleavage and polyadenylation (Ji et al. 2009;Lee et al. 2007;Meyers et al. 2004;Shen et al. 2008;Tian et al. 2007;Wu et al. 2011). As alternative splice and poly(A) sites can uncover variation of transcription at subgene level, more and more investigators attempted to find transcription variation associated with tissue types or cell states and explore etiologic mechanism of disease of interest in transcription variation (Flavell et al. 2008;Ji et al. 2009;MacDonald and McMahon 2011;Mayr and Bartel 2009;Sandberg et al. 2008;Wang et al. 2008; variation of an individual tag is associated with a change in cell state, we need to compare count of tag z to that of all the other tags (denoted by z ) ) within gene g. Therefore, a tag or poly(A) site has two states: z and z ) . We set i=1 for z and i=2 for z ) . Thus count gzjv n in the raw data table can be converted into two-by-two table count gzijv n where gzijv n is count of reads of tag z within gene g in tag state i in condition j in replicate v (v =1, 2, ..., r) and there are r Z g two-by-two table datasets ( (2)

Chi-square statistics
For poly(A) site s, the count of reads due to association effect sij n between site state i and cell

Null chi-squares
Similarly to using within-group variance to estimate null variance, we also employ within-group chi-square to estimate null chi-square. To this end, we need to construct another contingent

FDR estimation
In large-scale statistical analysis, we do not need to concern about so-called type 1 error but we must consider how to control false discovery rate ( Therefore, FDR is expected as ) / ( As F Δ is unknown in experimental data sets, FDR must be estimated.
The existing methods for estimation of FDR are not suitable to our current chi-square statistics because, as seen above, chi-square statistics are remarkably biased against standard chisquare statistics such that p-values obtained from standard chi-square distribution are also biased.
For this reason, we here propose a novel method. The principle of the method is shown in Figure   Δ 1 0 4. As seen in Figure 4, observed chi-square ) (  , null tag at has smaller chance to be chosen as a false positive than at varies with sample size. In other words, the probability that a null tag at position t* is chosen as false positive is determined by replicate number r, chi-square space, tag space S*, null chi-square values, and t*. The logic relationship is that position t*in the ranking space is positively related to the probability that the null tag at position t*is chosen as a false positive tag and null chi-square value is also positively related to the probability that the null chi-square is chosen as false positive chi-square. As position t* is fixed in a given tag space, the ratio ( is positively related to the probability. We therefore replace 0.5 with ( ) Thus, probability that tag at position * t is chosen as a false positive tag is given by Number of false positives in Δ N findings is estimated by . This is an extreme case in which FDR is very conservatively estimated (see Supplemental Fig.S1A1 . This is agreeable with the fact that larger sample sizes would have a smaller FDR.
With Δ F , FDR for Δ N findings declared by chi-square tests at threshold Δ is estimated as

Simulation evaluation of RAX2
We use the following steps to generate the null data of S tags with r replicates based on the real data in a group: Step1: Calculate variance and mean of each cell in two-by-two tables over r replicates.
Step2: Choose randomly one of four variances and one of four means.
Step 3: Generate a set of two-by-two random null count data ( sijv η ) with r replicates using negative binomial pseudorandom generator in R with the mean as mu and the variance as dispersion (size).
We then adopted uniform and normal pseudorandom generators to generate count data with association effects: and 2, s =1, …, S. The association effect sij n is randomly assigned to 10%, 20%, and 30% of null tags (the null data above). Using these simulated datasets, we compared the estimated FDR given by a statistical method to its true FDR across a set of given thresholds ( Supplemental Figure S1 summarizes these results in the simulated scenarios where 10% (A1 and B1), 20%, (A2 and B2), and 30% (A3 and B3) of tags have association effects on transcription with cell states and FDR was significantly overestimated in all three given scenarios such that many tags with true association effects would be missed at FDR <0.05 level. In contrast, with ( )

S1B1
-3), the estimated FDR curve is slightly higher than the true one in each of these three scenarios, finding tags with a true association effect at a given threshold with higher power.
Supplemental Figures S1C1-3 are obtained with ( ) from the simulated data with 3 = r and 10%, 20% and 30% of tags having association effect, respectively. The fact that these FDR curves are very similar to those in Supplemental Figure S1B1, B2, and B3 indicates that our estimate of FDR with ( ) is absolutely conservative but close to its true value across a set of given thresholds in these scenarios.

Comparison with Pearson chi-square, Fisher exact test and Cochran Mantel Haenszel (CMH) chi-square test approaches
In adjusted by HB-procedure and q-value method. Supplemental Figure S2 displays the results of plotting estimated against true FDR values across all cutoff points. One can find from Figure S2 that FDRs are completely overestimated by BH-procedure and q-value approach. The same case was also found in the simulated data with 5 replicates (data not shown). As indicated above, Pearson chi-square based on mean counts over replicates is not unbiased and the p-value for such chi-square obtained from a standard chi-square distribution is greatly biased against its true value (See Supplemental Table S2). So BH and qvalue procedures are not available to adjust such biased p-value profiles.

RAX2 analysis of the primary T-cell data
We then assessed the performance of RAX2 on 3p-seq datasets derived from resting and stimulated primary human CD4 + T cells (Supplemental Table S1, see Data Collection). After normalizing and filtering, our dataset contained 16247 tags scattered in 10160 transcription units or genes. We omitted transcription units with a single tag and the remaindered 9899 tags scattered in 3812 genes were available for RAX2 analysis. We used classical chi-square distribution to calculate p-value for each tag chosen and applied the BH-procedure to adjust α values. By comparing the ordered p-values to the ordered adjusted alpha values, as expected in Supplemental Figures S2, 747 tags were found to be associated with cell states at FDR<0.05 (see Supplemental Table S4). We performed RAX2 on our primary T-cell transcriptomic data. The estimated null chi-square value falls in a range of 0 to 6.7 (Fig. 5A). The treatment chi-square falls in interval between 0 and 363, which is larger than that in the estimated null chi-square distribution. The observed linear dots are deviated from the expected linear dots when the null chi-square is larger than 2.5 ( Fig. 5B).

5
The results obtained by application of RAX2 to our real transcriptomic data sets are illustrated in Supplemental Tables S3-S5. At FDR < 0.05, 1610 (16.3%) tags were found to be associated in transcription with antigen receptor stimulation (Supplemental Table S3). Among 1610 tags, 1228 tags have fold change >1.5, meaning that most of tags found by RAX2 also displayed differential expression between rest and stimulation. We chose genes with two tags identified at FDR<0.05 (see Supplemental Table S6 p  r  o  l  i  f  e  r  a  t  i  o  n  r  a  t  i  o  b  u  t  t  i  s  s  u  e  c  e  l  l  s  a  r  e  h  i  g  h  l  y  d  i  f  f  e  r  e  n  t  i  a  t  e  d  .  C  D  3  /  C  D  2  8  c  o  s  t  i  m  u  l  a  t  i  o  n  t  r  i  g  g  e  d  a  s  e  r  i  e  s  o  f  p  h  y  s  i  o  l  o  g  i  c  a  l  a  c  t  i  v  a  t  i  o  n  s  a  n  d  e  x  p  a  n  s  i  o  n  (  g  r  o  In our result, 162 genes were found to have three poly(A) sites whose usage were associated with stimulation (Supplemental Table S8). Figure  sites, positive tags were more than negative tags but in genes with distal poly(A) sites, negative tags were more than positive tags(Supplemental Fig.3SC2-3). These results strongly indicate that, as seen in genes with two tags, in genes with three tags, CD3/28cosimulation resulted in more usage of shorter tags.

Discussion
Transcriptional profile via next generation sequencing technologies has underscored the complexity of transcript isoform variation in mammalian cells. The extent of this variation, whether the identity and function of a protein product (e.g. alternative splicing) or the visibility of the gene product is altered due to the posttranscriptional regulatory machinery (e.g. ACP), has been broadly appreciated. It is of great interest to leverage newer technologies to globally assess the impact of these processes on human disease, particularly since several individual examples in which dysregulation of either process contributes to human disease exist (Danckwardt et al. 2008).
The identification of relative expression changes of individual mRNA isoforms derived from a single transcript unit can in theory be performed by considering each of these isoforms from the transcript unit as a single entity and applying statistical approaches DESeq(Anders and   2006;Tan et al. 2008) are designed to be applicable to continues data sets or cannot otherwise be used to this context. To accurately estimate FDR for the chi-square findings, we here proposed a novel approach. Our approach is based on a simple principle: given a threshold, tags were declared to be associated with conditions by comparing treatment ranking chi-square profile to null ranking chi-square profile and number of false tags associated with conditions was determined by comparing null chi-square at low ranking position to that at higher ranking position. For a given threshold ∆ , a tag at k b t + = Δ * in a linear space of the null chi-squares has growth of probability to be declared as a false positive with increment of k. As seen in Figure 4, if a treated chi-square distribution falls into the null chi-square distribution, then we cannot find Δ a value except for Δ a = 1 across a set of given thresholds, and hence we also cannot find Δ b value except for Δ b = S. Thus, 0 = Δ N , Δ F = 0, and FDR cannot be defined in such a case.
2 0 Supplemental Figure S1 shows that estimated FDR is larger than its true FDR when threshold is small but it tends to be very close to its true value as threshold increases. This property guarantees that estimation of FDR is conservative and reliable at any threshold level.

Cell Lines and Stimulation
Human primary CD4 T cells were obtained from buffy coats derived from the Gulf Coast with the exception that "barcoded" linkers were used to facilitate multiplexing. Libraries were sequenced via 50 bp paired-end sequencing on an Illumina GAIIx.

Library processing and mapping
Paired end reads were mapped to the mm9 build of the mouse genome using the paired-end mapping module of bwa 0.5.9 (Li and Durbin 2009), default alignment stringency, and requiring that each read be mapped in a proper pair. To rescue reads crossing splicing junctions, nonmapping reads were remapped to the UCSC KnownGene reference and projected back to the mm9. Individual reads were condensed to tags based on their 3' coordinate using a sliding window of 20 nucleotides, using the frequency-weighted median 3' coordinate as the tag identifier. Tags were then filtered for false priming using a progressive filtering strategy assessing adenosine and guanine composition in the five, ten, and fifteen bases followed the tagmapping site. Tags were assigned to individual transcription units based on UCSC KnownGene annotations. For each transcription unit, the aggregate tags mapping to the unit were ranked based on frequency. Tags were extracted from highest to lowest frequency until the extracted tags represented greater than 90% of the aggregate frequency for the gene. The remaining tags were discarded. Libraries were normalized using a negative binomial model within the DEseq package (Anders and Huber 2010). For "gene" analysis, the summed frequency of all tags mapping to each transcription unit was considered as a single entity.

RAX2 Packages and statistical analysis
The methods described in this paper, including estimations of null chi-square distribution and     Alternative polyadenylation changes the length of the 3'UTR and it also can change microRNA binding sites in 3'UTR. While microRNAs tend to repress translation and promote degradation of the mRNAs they bind to. Since transcription products are one-to-one corresponding to poly(A) sites, we define the transcription products as tags.