Recurrent alternative splicing isoform switches in tumor samples provide novel signatures of cancer

Cancer genomics has been instrumental to determine the genetic alterations that are predictive of various tumor conditions. However, the majority of these alterations occur at low frequencies, motivating the need to expand the catalogue of cancer signatures. Alternative pre-mRNA splicing alterations, which bear major importance for the understanding of cancer, have not been exhaustively studied yet in the context of recent cancer genome projects. In this article we analyze RNA sequencing data for more than 4000 samples from The Cancer Genome Atlas (TCGA) project, including paired normal samples, to detect recurrent alternative splicing isoform switches in 9 different cancer types. We first investigate whether alternative splicing isoform changes are predictive of tumors by applying a rank-based algorithm based on the reversal of the relative expression of transcript isoforms. We find that consistent alternative splicing isoform changes can separate with high accuracy tumor and normal samples, as well as some cancer subtypes. We then searched for those changes that occur in the most abundant isoform, i.e isoform switches, and are therefore more likely to have a functional impact. In total we detected 244 isoform switches, which are associated to functional pathways that are frequently altered in cancer and also separate tumor and normal samples accurately. We further assessed whether these isoform changes are associated to somatic mutations. Surprisingly, only a few cases appear to have association, including the putative tumor suppressor FBLN2 and the tumor driver MYH11, which show association of an isoform switch to mutations and indels on the alternatively spliced exon. However, the number of observed mutations is in general not sufficient to explain the frequency of the found isoform switches, suggesting that recurrent isoform switching in cancer is mostly independent of somatic mutations. In summary, we present an effective approach to detect novel alternative splicing signatures that are predictive of tumors. Moreover, the same methodology has led to uncover recurrent isoform switches in tumors, which may provide novel prognostic and therapeutic targets. Software and data are available at: https://bitbucket.org/regulatorygenomicsupf/iso-ktsp and http://dx.doi.org/10.6084/m9.figshare.1061917


Introduction
Cancer genome projects are instrumental to describe the genetic heterogeneity of tumors and to uncover recurrent alterations that may serve as new biomarkers for prognosis and therapeutic targets (TCGA 2012a-d, TCGA 2013. However, known actionable alterations tend to occur at low frequency or are often absent in an individual tumor sample, which hinders the choice of appropriate therapeutic strategies (Vogelstein et al. 2013, Hudson 2013. There is therefore a need to expand the catalogue of molecular signatures in cancer. Alternative splicing alterations, which bear major importance in terms of the understanding and treatment of cancer (Bonomi et al. 2013), have not been exhaustively studied yet in the context of recent cancer genomics efforts. Alternative splicing may confer a selective advantage to the tumor, such as angiogenesis (Amin et al. 2011), proliferation (Bechara et al. 2014), cell invasion (Venables et al. 2013) and avoidance of apoptosis (Izquierdo et al. 2005).
Some of these alterations may be caused by somatic mutations (Ward and Cooper 2010), but can also take place as a result of changes in expression, amplifications and deletions in splicing factors (Karni et al. 2007, Furney et al. 2013. This suggests that similar splicing alterations may have different genetic origins and still confer equivalent tumorigenic properties to cells. Accordingly, to uncover potential markers of prognosis and targets of therapy, it is of utmost relevance to describe the patterns of alternative splicing in tumors. Numerous genome wide surveys have highlighted the role of alternative splicing patterns in tumors. These have mostly been based on the measurement of local patterns of splicing, encoded as events, and studied using microarrays (Thorsen et al. 2008, Lapuk et al. 2010, Misquitta-Ali et al. 2011, RT-PCR platforms (Klinck et al. 2008), or RNA sequencing (Liu et al. 2012). The description of alternative splicing in terms of simple events facilitates the validation using PCR methods and the characterization of regulatory mechanisms using sequence analysis and biochemical approaches. However, alternative splicing takes place through a change in the relative abundance of the transcript isoforms expressed by a gene, and splicing alterations important for tumor progression may involve complex patterns that are not easily described in terms of simple events. Accordingly, to ultimately determine the functional impact of splicing alterations, it is important to describe these in terms of transcript isoforms changes. This has been shown to be relevant for TP53, which produces multiple isoforms with complex variations and with different roles in tumors (Bourdon et al. 2005). Similarly, recent analyses have shown that there are transcript isoforms specific of lung and breast cancers (Kalari et al. 2012, Eswaran et al. 2013, and that transcript analysis can improve expression-based tumor classification (Zhang et al. 2013, Pal et al. 2014. Additionally, transcript isoform changes can be essential to detect resistance to anti-tumor drugs (Mitra et al. 2009, Poulikakos et al. 2011).
Thus, the detection of transcript isoform changes characterizing specific tumor types can provide new cancer signatures and could be crucial for the development of diagnostic, prognostic and therapeutic strategies.
With the aim to describe the transcript isoform changes that are characteristic of tumors, we have analyzed more than 4000 RNA sequencing (RNA-Seq) samples available from the Cancer Genome Atlas (TCGA) project. In order to perform an analysis that is robust to the variability between samples from different individuals, we have applied a new rank-based algorithm that searches for consistent reversals of relative isoform expression. This algorithm provides the minimal set of isoform-pairs with relative expression changes that can accurately separate tumor from normal samples. Moreover, the obtained isoform-pairs can accurately classify unseen tumor data. We have applied the same algorithm to breast, lung and colon cancer subtypes to obtain isoform signatures that separate subtypes from each other. In particular, we found a highly significant signature for basal-like breast tumors that distinguish them from other breast cancer subtypes. We also found that a number of the identified significant isoform changes correspond to transcript isoform switches, for which the relative expression change occurs in the most abundant isoform, and are therefore more likely to have a functional impact. We found a total of 244 isoform switches in all cancer types. These switches can also accurately separate tumor and normal samples, and occur in genes belonging functional pathways frequently altered in cancer. Interestingly, only a few of these switches can be explained by somatic mutations in the gene locus, suggesting that recurrent isoform switching in cancer is mostly independent of somatic mutations. On the other hand, we found that for at least one case the isoform switch is mutually exclusive with mutations affecting the protein-coding region of the transcripts, suggesting that splicing alterations represent an alternative route towards cellular transformation. Our analyses show that recurrent transcript isoform switches represent important novel signatures in cancer that can serve as molecular markers and could lead to the development of new therapeutic targets.

Systematic analysis of splicing isoform changes in cancer
Changes in the relative abundance of the alternative transcripts from a gene translate into a variation in their relative order in the ranking of transcript expression.
Accordingly, the problem of finding alternative splicing changes in cancer can be reformulated in terms of the consistency of the reversals of their relative expression of transcript isoforms from the same gene. For this purpose we developed Iso-kTSP, which extends the principle of consistent expression reversals for gene expression (Geman et al. 2004, Tan et al. 2005, Price et al. 2007 to alternative splicing isoforms. The Iso-kTSP algorithm stores the ranking of isoform expression from multiple samples separated into two classes ( Figure 1A). All possible isoform pairs from the same gene are sorted according to the sum of frequencies of the two possible relative orders occurring separately in each class, defined as score S 1 (Methods). This score provides an estimate of the probability for the isoforms to change relative order between the two classes. The top scoring isoform pairs are therefore the most consistent changes in isoform relative abundance for a gene between two classes, tumor and normal, or between two tumor subtypes. Each one of these isoform pairs provides a classification rule based on the relative expression order, with a discrimination power related to the consistency of this reversal across samples.
Using cross-validation, the ranking of isoform-pairs is calculated at each iteration step on a balanced set leaving out one sample from each class, which are used for testing ( Figure 1B). The prediction class for a new sample is obtained by evaluating the expression ranking in the new sample against the isoform pair rules. At each iteration step in the cross-validation, the top k-pairs (k=1… k max , with k odd) are evaluated on the test set. Each isoform-pair from the same gene defines a rule, where each gene is only used once. The rule is defined such that for a pair, if the first isoform has lower expression than the second the sample is predicted to be normal, otherwise is predicted to be a tumor. Accordingly, the first isoform is defined as the tumor isoform, and the second as the normal isoform. Significance of the isoform-pair rules is measured by performing 1000 permutations of the sample labels ( Figure 1C). At each permutation, the algorithm is run as before keeping only the pair with the highest score S 1 . An isoform-pair is defined as significant if its score S 1 and information gain (IG) are larger than the maximum ones obtained from the permutation analysis. The global ranking of isoform-pairs and permutation analysis yields the list of significant isoform changes ( Figure 1D). From this set, we derive a minimal classification model with the smallest odd number of isoform-pairs with the highest average performance ( Figure 1E). The final classification is based on simple majority voting with an odd number of isoform pairs. On the other hand, some of the isoform-pairs from the list of significant cases are in fact isoform switches, for which the relative expression change occurs in the most abundant isoform of the gene, and which we detect by the anticorrelation of the relative inclusion levels or PSIs of the isoforms ( Figure F). Finally, to further assess the accuracy of the minimal classification model, or that of a set of isoform switches, a blind test is carried out on samples that were not used for crossvalidation ( Figure 1G). On this set we measure the proportion of samples correctly labeled by the classifier, as well as the number of correct votes for each prediction.  The isoform changes detected include FBLN2, which appears as a single gene model for COAD and is part of the BRCA model ( Figure 3A). FBLN2 has been proposed before to be a tumor suppressor (Law et al. 2012) and its cancer-related function seems to be specific of the protein produced in tumor cells (Baird et al. 2013). We found that FBLN2 undergoes an isoform change related to the skipping of a protein coding exon (Supplementary Figure 11). Additionally, this isoform switch occurs in more than 98% of the unpaired tumor samples in BRCA and COAD ( Figure 1A). In the case of LUAD, surprisingly, we found that the most informative isoform change does not occur in NUMB, known to have a splicing switch related to proliferation (Misquitta-Ali et al. 2011, Bechara et al. 2014), but in QKI. In fact, the isoform change in QKI cannot be described in terms of a simple alternative splicing event ( Figure 2C). In contrast, LUSC model involves a different set of genes compared to LUAD, including the gene ZNF385A ( Figure 2), whose protein product interacts with TP53 thereby promoting growth arrest (Das et al. 2007). The isoform change found is related to the use of an alternative first exon and alternative 3' splice-site (Supplementary Figure 12). Similarly to COAD, THCA and KIRC have single-gene models (Supplementary Figure 2). In particular, for THCA the model involves S100A13, which codes for a calcium binding protein and which has been proposed to be a new marker of angiogenesis in various cancer types (Massi et al. 2010). The isoform change involves an alternative first exon and classifies correctly 84.5% of the unpaired tumor samples (Supplementary Figure 13). Interestingly, S100A13 and another member of the S100 family, S100A16, also have isoform changes in KICH, even though they were not included in the KICH model (Supplementary Figure 13).
The single isoform change in KIRC involves the production of an intron-retention transcript, annotated as non-coding in the gene CPAMD8 (Supplementary Figure 14).
A similar case occurs in the gene NAGS, which is part of the KICH model (Supplementary Figure 2) and is related to an autosomal recessive urea cycle disorder (Häberle et al. 2003). We predict that a protein coding isoform changes in tumors into an isoform with a retained-intron that is annotated as non-coding (Supplementary Figure 15). Importantly, the loss of the protein coding isoform is predictive of 100% of the KICH tumor samples (Supplementary Figure 15). Other isoform changes are discussed in the supplementary material ( Supplementary Figures 16-18). GFF tracks for the isoform-pairs in all derived models can be found in Supplementary File 3.

Changes in alternative splicing isoforms can discriminate tumor subtypes
Cancers are generally classified into subtypes to facilitate patient stratification for more precise prognosis and selection of therapeutic strategy. In particular, breast cancer classification has been recently refined based on molecular information from multiple sources (TCGA 2012c). We thus decided to investigate whether breast cancer subtypes are associated with consistent isoform changes when compared to each other. We separated the BRCA tumor samples into luminal A, luminal B, Her2+ and basal-like as labeled by TCGA (TCGA 2012c) (Supplementary File 1) and run the Iso-kTSP algorithm comparing each subtype against a pool from the rest. In order to maintain balanced sets for the comparison and avoid biases due to sample selection, we subsampled 100 times 45 arbitrary samples for a given subtype and a pool of 15 from each of the other three subtypes together. At each iteration step, we performed permutation analysis of the labels to determine the significance of the detected isoform changes. We found that only basal-like tumors showed isoform changes that were significant in more than 80% of the sampling iterations ( Figure 4A and Supplementary Figure 19). Among the most significant cases we found KIF1B, which has been implicated in apoptosis (Schlisio et al. 2008); ATP1A1, proposed to have tumor suppressor activity (Cao et al. 1997); ITGA6, found to be required for the growth and survival of a stem cell like subpopulation of MCF7 cells (Cariati et al. 2008); and CTNND1, whose alternative splicing was previously related to cell

A catalogue of alternative isoform switches in cancer
The alternative isoform changes described above can separate tumor and normal samples, and in some cases specific cancer subtypes. However, these models are optimized to have the minimum number of isoform-pairs and maximum average accuracy, which is convenient for defining biomarkers with potential clinical applications. On the other hand, the frequency of these isoform changes does not imply functional relevance. However, among the recurrent isoform-changes, those more likely to be functionally relevant will be the ones for which the change occurs in the most abundant isoform, i.e. isoform switches ( Figure 1F). Accordingly, in order to obtain all the significant isoform changes with a possible functional relevance in cancer, we decided to calculate all the significant isoform switches between tumor and normal samples. We first retrieved all those genes with significant isoform changes according to our permutation analysis ( Figure 1D). This yielded a total 1178 genes for the 9 cancer types. We further filtered these genes by imposing a score S 1 > 0.5, which corresponds to selecting isoform-pairs with a change in more than 75% of the samples. To select for switches, we kept those cases for which the relative inclusion levels of the isoforms anti-correlate ( Figure 1F), as observed for FBLN2, QKI and other genes ( Figure 3 and Supplementary Figures 12-18). We thus selected those isoform-pairs having an anti-correlation of PSI values of R < -0.8 (Spearman).
Finally, we kept only those with average expression per isoform of > 1 TPMs in either tumor or normal samples.
These criteria gave rise to a total of 244 isoform switches, with 59 of them appearing in more than one cancer type ( Figure 5) (Supplementary File 4). The most common of the switches is the one described above for FBLN2. From the total 244 switches, 10 occur in known cancer drivers ( Figure 5). Moreover, we also found switches in genes whose splicing has been associated before with cancer, like CD44, which has been observed to be relevant in colon cancer initiation (Du et al. 2008), and SLC39A14, whose alternative splicing is regulated by WNT in colon cancer ( Figure 6C). The location of these indels coincides with a region of low conservation and is next to a putative binding site for the splicing factor SRSF1 (Supplementary Figure 28). Nonetheless, the number of found mutations cannot explain either the frequency of the switches observed.
Somatic mutations could also affect the magnitude of the splicing change in specific samples. We therefore tested, for each isoform-switch, whether the presence of mutations is associated with a larger difference of PSI between the tumor and normal isoforms involved in the switch (Methods). Among the four most significant cases, we found FBLN2 and EHBP1 (Supplementary Figure 29). These two cases show some differences in the distributions of samples with and without mutations (Supplementary Figure 29). However, the proportion of mutated samples is very small to make a reliable comparison and after multiple-testing correction, none of the found cases remained significant. This suggests that, except for a limited number of cases, mutations may not be the main cause of the recurrent splicing switches we have found in cancer.
We thus hypothesized that mutations and isoform switches may occur independently as two alternative mechanisms of functional transformation in cancer. To test this possibility, we measured how frequently mutations that affect the protein-coding region occur in tumor samples without the isoform switch in the same gene by defining a mutual-exclusion score based on the number of samples with no switch but with protein-affecting mutations (Methods). We found that in general the mutual exclusion score correlates with the overall proportion of mutated samples (Supplementary Figure 30). However, the number of samples with switch and mutation is generally comparable or higher, except for the gene Tenascin C (TNC), for which we find more samples with a protein-affecting mutation and no switch than with switch and protein-affecting mutation (Supplementary Figure 30). This indicates that there is no strong bias towards this mutual exclusion. We conclude that there are currently not a sufficient number of mutations that can provide an explanation of the described recurrent isoform switches. Nonetheless, there are a few cases for which this association may exist, as described for the genes FBLN2, MYH11 and TNC.

Discussion
We have applied the principle of relative expression reversals (Geman et al. 2009, Tan et al. 2005 to the search of recurrent alternative splicing isoform changes in tumors using available RNA-Seq data from the TCGA project for 9 different cancer types. In our implementation of this algorithm for isoforms (available at: https://bitbucket.org/regulatorygenomicsupf/iso-ktsp) each classification rule is described in terms of the relative expression of a single pair of isoforms per gene. In this context, the principle of reversals has a natural interpretation as an alternative splicing change between two conditions. This algorithm provides robust classifiers despite of the variability of isoform expression across tumor samples, as the models are not dependent on parameterizations or on any normalization that would maintain the order of the isoform expression. This rank-base method is especially useful for isoform expression from RNA-Seq data, since between-sample normalization methods are not yet fully established. In fact, our approach is applicable to integrate data from heterogeneous platforms, as long as they provide a meaningful ranking of expression. Moreover, the method produces significant isoform changes from a relatively small number of samples, which makes it useful for medium-sized sequencing projects.
We have derived classification rules based on isoform changes that can distinguish Our initial analysis shows that isoform changes hold sufficient information to separate tumor and normal samples, and specific tumor subtypes. This suggests they could serve as effective molecular markers, as they would only require measuring the expression of two isoforms per gene, for a small number of genes. Furthermore, the application of this method to separate tumors according to clinical information will provide a useful prognostic tool. On the other hand, these signatures are not necessarily related to a biological effect specifically relevant for the tumor. To investigate this aspect, we selected all those significant isoform changes that are also isoform switches, i.e. the change occurs in the most abundant isoform, and therefore more likely to have a functional impact. We found 244 such isoform switches, which can also accurately separate tumor and normal samples, and some of which appear in multiple cancer types. Interestingly, many of the found isoform switches occur in pathways that are often altered in tumors, and some of them occur in known cancer driver genes, including CDKN2C, CTNNB1, ABI1 and MYH11. Additionally, we found that of the splicing switches cannot be described in terms of simple events. This is the case for QKI, for which we predict an isoform switch specific to lung adenocarcinoma. The found isoform switches provide an opportunity to develop experimental strategies based on the detection of specific protein isoforms. In particular, we found genes with isoform switches involved in cell communication pathways, including DST and FLNA, which could be potentially used for diagnostic or prognostic applications, or even for developing tumor-specific therapeutic targets with reduced cross-reactivity to other proteins.
Surprisingly, we did not find strong associations of somatic mutations with the isoform switches. It has been recently proposed that synonymous mutations in known cancer drivers may contribute to the oncogenic process (Supek et al. 2014). However, a direct link was not made between the observed somatic mutations and specific splicing changes measured in the same tumor samples. We found only a handful of genes with significant association between the isoform change and somatic mutations occurring in the same samples. These include the putative tumor suppressor FBLN2 and the cancer driver MYH11, the latter showing a recurrent indel in the alternative exon in samples where it is skipped. Still, mutations alone cannot explain the splicing changes observed, as 99% of the transcripts analyzed appear mutated in less than 5% of the tumor samples, whereas the majority of switches occur in at least 50% of the samples. This could mean that there are many intronic mutations not represented in the currently available exome-based data, which could explain the observed variations.
Alternatively, the recurrent switches could be explained by alterations in splicing factors. Although point mutations and indels on splicing factors do not occur with sufficient frequency to explain the switches (Furney et al. 2013), some splicing factors show frequent amplifications, deletions and changes in expression in tumors (Karni et al. 2007). Another interesting hypothesis is whether alterations in chromatin modifications or DNA methylation may be responsible for the observed changes.
These alterations are frequent in cancers (Esteller 2007, Ellis et al. 2008) and can induce changes in splicing (Luco et al. 2010, Maunakea et al. 2013. Interestingly, the gene FBLN2, which presents a switch in various cancers, has been observed frequently methylated in breast and other epithelial tumors (Hill et al. 2010).
Consistent isoform switching in tumors thus seems generally independent of somatic mutations; and moreover, only 10 of the detected 244 switches occur in known cancer drivers. This raised the question of whether these two alterations could be actually mutually exclusive. We found that only TNC, linked to cell invasion in tumors (Hancox et al. 2009), has this mutual exclusion pattern between the isoform switch and somatic mutations affecting the coding regions, suggesting that, albeit to a limited extent, splicing switches may provide an alternative mechanism towards functional transformation in cancer. In summary, we have detected recurrent alternative splicing isoform changes that are predictive of various tumoral conditions, and which may have potential applications for diagnostic and prognostic purposes. The same methodology has allowed us to uncover recurrent isoform switches in tumors, which are likely to have a functional impact, and which may be useful to explore novel therapeutic strategies. Further research will be necessary to determine the functional impact produced by the described isoform changes and how these may actually contribute to the tumor. We hypothesize that the observed recurrent changes in splicing, regardless of their cause, may contribute together with mutations and other alterations to explain tumor formation; hence, providing novel signatures for cancer.

Methods
Available processed RNA-Seq data for tumor and normal samples was downloaded from the TCGA data portal (https://tcga-data.nci.nih.gov/tcga/) for all cancer types together with the UCSC gene annotation from June 2011 (assembly hg19) and the somatic mutation data. To assess sample quality, the provided estimated read-counts per gene were analyzed using URSA (Lee et al. 2013) and sample pairs that did not cluster with the rest of the samples were removed (Supplementary Figure 1). The Iso-kTSP algorithm is based on the kTSP algorithm (Geman et al. 2004, Tan et al. 2005. It stores the isoform expression rankings in samples from two groups, C m , m=1,2. For every pair of isoforms I g,i and I g,j in each gene g, Iso-kTSP calculates a score based on the frequencies of the two possible relative orders in both classes: S 1 (I g,i , I g, j ) = P(I g,i > I g, j | C 1 ) + P(I g,i < I g, j | C 2 ) −1 Where P(I g,i > I g,j |C 1 ) and P(I g,i < I g,j |C 2 ) are the frequencies at which the isoform I g,i appears later than, or before, I g,j in the expression ranking of classes C 1 or C 2 , respectively. To avoid possible ties, a second score S 2 is used, which is based on the average rank difference per class C m for each isoform pair, as proposed previously (Tan et al. 2005). Defining R(I g,i |S a ,C m ) as the rank of isoform I g,i in sample S a and class C m , the average rank difference between two isoforms in a given class is calculated as Significance of the computed isoform changes was evaluated by shuffling labels from the two classes 1000 times. For each shuffling step, the Iso-kTSP algorithm was rerun and the top-scoring isoform-pair was selected. An isoform-pair is defined as significant if its Information Gain and Score S 1 are larger than any of the values obtained from the 1000 shufflings of the same cancer type. The Iso-kTSP is implemented in Java. Software and documentation are available at https://bitbucket.org/regulatorygenomicsupf/iso-ktsp.
For the purpose of finding associations, the number of samples for which a gene has a switch was compared with the number of samples for which the transcripts involved overlap mutations. Given the samples with one or more mutations M, and the samples with the isoform switch S, a Jaccard index J for the association of these two variables was calculated: A Z-score from the Jaccard index was calculated by comparing each value J for an isoform switch to 100 genes with similar median isoform length. The above analysis was also repeated using only mutations that affect the protein sequence or considering the overlap with genes rather than transcript regions with similar results (see S 2 (I g,i , I g, j ) = g(I g,i , I g, j | C 1 ) − g(I g,i , I g, j | C 2 ) Supplementary Methods). The mutual information for the association of isoform switches and mutations, and corresponding z-score was also computed (see Supplementary Methods). The distribution of the differences between tumor and normal isoform PSIs was compared between mutated and non-mutated samples using a Mann-Whitney test. A mutual-exclusion between isoform switches and proteinaffecting mutations was measured as follows: given the number of samples having an isoform switch and no mutation (n 10 ), and those having a mutation but no isoform switch (n 01 ), a mutual-exclusion score was defined to be:       We also indicate the presence of a significant difference (p-value < 0.05) of the relative inclusion (PSI) difference between tumor and normal isoforms in mutated and nonmutated tumor samples before multiple-testing correction (brown color). We further show the Reactome Pathway annotation for those genes for which this was available.