Detection of recurrent alternative splicing switches in tumor samples reveals novel signatures of cancer

The determination of the alternative splicing isoforms expressed in cancer is fundamental for the development of tumor-specific molecular targets for prognosis and therapy, but it is hindered by the heterogeneity of tumors and the variability across patients. We developed a new computational method, robust to biological and technical variability, which identifies significant transcript isoform changes across multiple samples. We applied this method to more than 4000 samples from the The Cancer Genome Atlas project to obtain novel splicing signatures that are predictive for nine different cancer types, and find a specific signature for basal-like breast tumors involving the tumor-driver CTNND1. Additionally, our method identifies 244 isoform switches, for which the change occurs in the most abundant transcript. Some of these switches occur in known tumor drivers, including PPARG, CCND3, RALGDS, MITF, PRDM1, ABI1 and MYH11, for which the switch implies a change in the protein product. Moreover, some of the switches cannot be described with simple splicing events. Surprisingly, isoform switches are independent of somatic mutations, except for the tumor-suppressor FBLN2 and the oncogene MYH11. Our method reveals novel signatures of cancer in terms of transcript isoforms specifically expressed in tumors, providing novel potential molecular targets for prognosis and therapy. Data and software are available at: http://dx.doi.org/10.6084/m9.figshare.1061917 and https://bitbucket.org/regulatorygenomicsupf/iso-ktsp.


Association of switches and mutations
Supplementary, Figure,31.,Soma@c!muta@ons!found!in!COAD!and!BRCA!samples!on! the! alterna@ve! ! exon! of! MYH11! involved! in! the! isoform! switch.! The! frameshim! inser@ons! and! dele@ons! fall! in! a! region! of! low! conserva@on.! Right! next! to! these! indels! we! find! a! puta@ve! SRSF1! binding! site,! with! score! 7.901! and! pPvalue! =! 0.000903425.!The!binding!site!was!predicted!using!the!the!program!FIMO!from!the! MEME!suite!and!the!mo@f!matrix!from!the!RNAcompete!datasets.!!   To assess sample quality, TCGA-provided estimated read-counts per gene were used to predict tissue type patterns with URSA (Lee et al. 2013) (Supplementary Figure 1) and we removed samples from the paired dataset that did not cluster with the rest of the samples from the same class (either normal or tumor). We also analysed the nonpaired tumor samples, but did not remove any of the samples.
We selected all those cancer types that were not under embargo at the time of performing the analysis and for which we had paired normal samples for the same patients. From those we kept only those that had sufficient number of paired samples.
As described in the article (Supplementary Figure 10), it is necessary to have at least 12-13 samples per class (e.g. tumor and normal) to obtain significant isoform-pairs. This discarded uterine cancer (UCEC), for which we only had 6 paired-samples available. For BLCA we had 16 paired-samples. However, from the URSA analysis we noticed that 10 of the normal BLCA samples were predicted to be similar to smooth muscle, whereas the other 6 were predicted to be similar to epithelial tissue and more similar to the tumor samples. Moreover, the smooth muscle like samples showed overexpression of mesenchymal markers, whereas the others had overexpression of epithelial markers, the latter being more similar to the tumor paired samples (data not shown). The analysis of splicing changes between tumor and normal samples including all paired-samples was thus yielding mostly changes related to epithelial-mesenchymal transition (EMT) (Warzecha et al. 2010). These mesenchymal-like samples may be indicative of a contamination from smooth muscle so we removed them from the analysis. Accordingly, we were left with just 6 patients; thus we decided to discard BLCA from further analysis.
For isoform expression, the abundance of every isoform in each sample was calculated in terms of transcripts per million (TPM) (Li et al. 2010) from the isoformestimated read counts provided by TCGA and the isoform lengths from their annotation (UCSC genes -June 2011). No further normalization on the TPM values was performed. For each isoform i in gene G, the PSI was calculated as Genes with one single isoform, or no HUGO ID, were removed from further analyses.

The iso-kTSP algorithm
The iso-kTSP algorithm is implemented in Java. Software and documentation are available at https://bitbucket.org/regulatorygenomicsupf/iso-ktsp. The algorithm is based on the relative expression reversals of isoforms. It implements the principles of the kTSP algorithm (Geman et al. 2004, Tan et al. 2005 applied to isoforms from the same gene, but with a different scoring function (see below). It reads the TPM values for isoform expression and it stores the rankings in each sample for a number of samples from two possible classes, C m , m=1,2. For every pair of isoforms I g,i and I g,j in each gene g, iso-kTSP calculates a score based on the frequencies of the two possible relative orders in both classes: S 1 (I g,i , I g, j ) = P(I g,i > I g, j | C 1 ) + P(I g,i < I g, j | C 2 ) −1 Where P(I g,i > I g,j |C 1 ) and P(I g,i < I g,j |C 2 ) are the frequencies at which the isoform I g,i appears later than, or before, I g,j in the expression ranking of class C 1 or C 2 , respectively. Score S 1 provides an estimate of how probable the two isoforms are to change relative order between the two classes. Accordingly, the higher the S 1 score, the more consistent the isoform change between classes. Our definition of S 1 differs from the one used in (Tan et al. 2005) to account for the fact that in RNA-Seq there are many transcripts with zero reads, hence the expression ranking is not always strictly monotonic. Furthermore, to avoid possible ties, a second score is used, S 2 , which is based on the average rank difference per class C m for each isoform pair, as defined previously (Tan et al. 2005). Defining R(I g,i |S a ,C m ) as the rank of isoform I g,i in sample S a and class C m , the average rank difference between two isoforms in a given class is calculated as where |C m | denotes the number of samples in class C m . The score S 2 for an isoform pair is then defined as previously (Tan et al. 2005): S 2 (I g,i , I g, j ) = g(I g,i , I g, j | C 1 ) − g(I g,i , I g, j | C 2 ) S 2 provides an estimate of the magnitude of the expression reversal for a pair of isoforms from the same gene between the two classes. All possible isoform pairs are sorted by the S 1 score and in the case of a tie, by the S 2 score. Even though the isoform expression ranking is global, only pairs of isoforms from the same gene are considered. Additionally, only one isoform-pair per gene is chosen, hence a gene can only be listed once in the ranking of isoform pairs.
The classification is given in terms of k isoform-pairs, where each isoform-pair provides a rule determined by the order that maximizes the score S 1 . For instance, a pair of isoforms I g,i and I g,j for which P(I g,i > I g,j |C 1 ) > P(I g,i < I g,j |C 1 ) and P(I g,i < I g,j |C 2 ) > P(I g,i > I g,j |C 2 ) would define the following rule The semantics of an isoform-pair rule is that if the first isoform is lower in expression than the second, the prediction is C 1 (the first class label in the input file), and in other cases the prediction is C 2 . In general, we are using C 1 ="normal" and C 2 ="tumor", hence: rule: I g,1 , I g,2 I g,1 < I g,2 ⇒ normal else ⇒ tumor Accordingly, we will call I g,1 the "tumor isoform" and I g,2 the "normal isoform". This semantics applies similarly when comparing a tumor subtype with the rest of tumor types. For instance, in the case of the comparison of basal-like breast tumors with the rest, one rule would read as: rule: I g,1 , I g,2 I g,1 < I g,2 ⇒ basal else ⇒ non-basal And as before, we will call I g,2 the "basal isoform" and I g,1 the "non-basal isoform".
The classification of a new sample is performed by evaluating each isoform-pair rule against the ranking of isoform expression of this new sample. Given k rules, the classifier selects for each isoform-pair rule the class for which the data fulfills the rule.
The final decision for classification is established by simple majority voting, by selecting the most voted class from the k rules. In order to avoid ties in this voting, k is always odd. For instance, for k=3, we could have:

IG(A) = H (S) − H (S, t, A)
Where H is Shannon's entropy according to the normal and tumor classes: where p N and p T are the proportion of normal and tumor samples. Since we are always using S to be a balanced set according to the classes, we always have H(S) = 1,. In our case, the attribute is an isoform-pair and the partitioning is whether the isoform-pair rule is fulfilled. Hence: where where PPV, FDR, NPV and FNR are the positive predictive value, false discovery rate, negative predictive value and false negative rate, respectively, the information gain defined above can be thus rewritten as: Thus, provided the measures of TP, FP, FN and TN for a given isoform-pair, we can calculate the associated information gain for that isoform pair using this formula. This provides in a single number a measure of the separation power of this isoform-pair rule, which can be then compared to the other isoformpairs.

Z-score calculation for score S 1 and Information Gain
To calculate a Z-score for score S 1 and the information gain (IG), we carried out an iso-kTSP analysis, where we calculated the single pair performance for all genes used in the analysis. That is, asking the algorithm to provide a ranking of isoform pairs for all genes, which provides the best isoform-pair per gene, with the corresponding individual IG and S 1 values. This resulted in 14630 values for S 1 and IG, which were used to calculate mean values and standard deviation of the population. We then subtracted the mean from each individual value and divided it by the standard deviation, resulting in a standardized Z-score.

Permutation analysis
Significance of the computed isoform switches was evaluated with iso-kTSP program using permutation analysis. Namely, labels from the two classes (e.g. tumor vs normal) were shuffled 1000 times. For each shuffling step, the iso-kTSP algorithm was re-run and the top-scoring isoform-pair was selected. The distribution of the 1000 top-scoring pairs from the shufflings provides an estimate of the expected recurrence of isoform changes obtained by chance (Supplementary Figure 10). An isoform-pair is significant if the information gain and Score S 1 are both larger than any of the values obtained from the 1000 shufflings.

Blind tests
A blind test of each model was carried out on the set of samples that were not used for cross-validation. These are all the unpaired tumor samples. For the analysis of the cancer subtypes, however, the overall accuracy is estimated on the complete set of tumor samples. In all these tests, we measure the accuracy as described above, calculating the proportion of samples correctly labeled by the classifier, as well as counting the number of correct votes per sample.

Anticorrelation
For each isoform pair in a rule we calculated the Spearman correlation of the PSI values for the paired tumor and normal samples. Those pairs with correlation value Spearman R < -0.8, i.e. anticorrelation, were considered proper isoform switches where the most abundant isoform in each sample is one of the isoforms in the pair.

Analysis of Subtypes
We applied the iso-kTSP algorithm to study the four BRCA subtypes as classified by were computed using permutation analysis as described above. Finally, for each subtype and for each isoform pair, the proportion of iterations in which it is used in a model and the proportion of iterations in which it is significant is reported.
We also analysed the four LUSC subtypes: basal, classical, primitive and secretory (Wilkerson et al. 2010 Similarly as before, both COAD subtypes were compared by sampling 100 times 40 samples from either subtype, and by building the iso-kTSP model at each iteration and computing the significance by permutation analysis.

Analysis of somatic mutations
We downloaded somatic mutation data from TCGA. Download locations: ( where, as before, (1) = https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor The number of samples with mutation data, and with both mutation data and RNA-Seq data are given in Supplementary Table 2 for each cancer type. For the purpose of finding associations, we simply considered whether a gene or the isoform-pair of the gene in a given sample either has or does not have mutations. Although most of the mutational data in TCGA comes from exome sequencing data, there are cases where it comes from whole genome sequencing. Accordingly, in order to homogenize the mutation sampling, we considered mutations that fall on exon regions or in the 100 nt upstream or downstream flanking regions. We carried out several different tests for the association of somatic mutations and isoform switches, which are described below.

Jaccard index
Given the samples with one or more mutations M on the transcripts involved in the switch, and the samples with the isoform switch S, a Jaccard index J for the association of these two variables was calculated: This index was calculated using the clujaccard function from the R package fpc (http://cran.r-project.org/web/packages/fpc) (Hennig 2014). A Z-score from the Jaccard index was calculated by comparison to genes with alternative splicing isoforms and with similar median isoform length. Genes used in the iso-kTSP analysis were ranked according to the absolute length difference of the median isoform length with the given gene. Lengths were calculated as the sum of the component exons from the TCGA annotation (UCSC genes, June 2011), which is equivalent to the length of the projection of the exons from the gene to the genome. The top 100 with the smallest length difference were selected, and two of its isoforms were chosen randomly. For each of these matched controls the Jaccard index was calculated as before. A Z-score was then calculated based on the mean and standard deviation of the Jaccard index from the 100 matched controls. The above analysis was also repeated using only mutations that fall on exons and that affect the protein sequence (Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Nonstop_Mutation) with very similar results.

Mutual information
The mutual information was calculated using the presence or absence of mutations and switches in each sample, using the mi.empirical function from the entropy package of R (Hausser et al. 2009). Using mutations falling on the isoforms undergoing the switch, the mutual information values correlate with the Jaccard index (Spearman R = 0.9) and with the Jaccard z-scores (Spearman R=0.7) calculated above.
This analysis was also performed using mutations falling on the entire gene locus with very similar results.

Wilcox-test of delta-PSI values!
For each isoform-pair we calculated a delta-PSI value for each tumor sample by subtracting the PSI of the tumor isoform from the PSI of the normal isoform. We then separated the samples into two classes, based on the presence or absence of any mutation in the isoform pair. We carried out a Mann-Whitney test to check if there are significant differences in the distribution of delta-PSI values between the two populations, and used the Benjamini-Hochberg method to correct the p-values for multiple testing. The same analysis was also performed using mutations on the entire gene locus with very similar results. !

Fisher test
A Fisher test was performed per isoform-pair by building a 2x2 contingency table with the number of tumor samples with or without mutations and the number of tumor samples with or without the corresponding switch. None of the isoform switches had a significant association according to this test after correcting for multiple testing with the Benjamini-Hochberg method. The same analysis was also performed using mutations on the entire gene locus with very similar results.

Mutual exclusion analysis of functional mutations and switches
A mutual-exclusion between isoform switches and protein-affecting mutations was measured as follows: given the number of samples having an isoform switch and no mutation (n 10 ), and those having a mutation but no isoform switch (n 01 ), a mutualexclusion score was defined to be: 2 min(n 10 , n 01 ) N where N is the total number of samples. A z-score was calculated similarly as above for the Jaccard index.

Reactome analysis!
For the Reactome (Croft et al. 2014) enrichment analysis we used the ReactomePA package from Bioconductor (Yu 2014) with default values, only changing the "universe" parameter, where we used the 14630 Entrez gene IDs that were used for the iso-kTSP analysis (all the genes present in the annotation with multiple alternative splicing isoforms).

GO enrichment analysis
GO-terms were tested for enrichment in the three ontologies: Biological Process, Cellular Component and Molecular Function. The GOstats package (Falcon and Gentleman 2007) was used for the test, with a P-value cutoff of 0.05, odds-ratio > 2 and at least 5 genes. The background set was the gene set used for the iso-kTSP analysis and the "conditional" option was used in the test to calculate the significance of a parent term, if its children were significant.

Supplementary Files description:
Supplementary File 1 (SupplementaryFile1.zip) This is a compressed file with the lists of patient-samples used for the analyses. Each sample is described by a short ID, which was also used in the data tables. The sample ID is the participant ID part of the TCGA barcode, with an additional N or T character at the end, meaning normal or tumor sample, respectively. We include the patients that had RNA-Seq data and the patients for which we had mutation data.