- Split View
-
Views
-
Cite
Cite
Yue Wang, Kunqi Chen, Zhen Wei, Frans Coenen, Jionglong Su, Jia Meng, MetaTX: deciphering the distribution of mRNA-related features in the presence of isoform ambiguity, with applications in epitranscriptome analysis, Bioinformatics, Volume 37, Issue 9, May 2021, Pages 1285–1291, https://doi.org/10.1093/bioinformatics/btaa938
- Share Icon Share
Abstract
The distribution of biological features strongly indicates their functional relevance. Compared to DNA-related features, deciphering the distribution of mRNA-related features is non-trivial due to the existence of isoform ambiguity and compositional diversity of mRNAs.
We propose here a rigorous statistical framework, MetaTX, for deciphering the distribution of mRNA-related features. Through a standardized mRNA model, MetaTX firstly unifies various mRNA transcripts of diverse compositions, and then corrects the isoform ambiguity by incorporating the overall distribution pattern of the features through an EM algorithm. MetaTX was tested on both simulated and real data. Results suggested that MetaTX substantially outperformed existing direct methods on simulated datasets, and that a more informative distribution pattern was produced for all the three datasets tested, which contain N6-Methyladenosine sites generated by different technologies. MetaTX should make a useful tool for studying the distribution and functions of mRNA-related biological features, especially for mRNA modifications such as N6-Methyladenosine.
The MetaTX R package is freely available at GitHub: https://github.com/yue-wang-biomath/MetaTX.1.0.
Supplementary data are available at Bioinformatics online.
1 Introduction
Recent development of high-throughput sequencing technologies has enabled the transcriptome-wide profiling of RNA modification sites (Dominissini et al., 2012, 2013; Meyer et al., 2012; Schaefer et al., 2008). To date, more than 170 different types of RNA modification have been identified in all three kingdoms of life, many of which have been found to play important roles in various biological processes. For example, N6-Methyladenosine (m6A) can regulate the stability and translation efficiency of mRNA (Mauer et al., 2017; Slobodin et al., 2017; Wang et al., 2014, 2015), and affect the circadian clock, cell differentiation, neuron production, alternative splicing and RNA–protein interaction (Fustin et al., 2013; Geula et al., 2015; Liu et al., 2015; Pendleton et al., 2017).
One basic way to characterize a biological feature is to see how it is distributed with respect to a gene, which may be shown in the form of a metagene plot (Beauparlant et al., 2016), also referred to as a meta-gene (Shin et al., 2009) or aggregation plot (Kundaje et al., 2012). The distribution of a biological feature strongly indicates the potential functional relevance of the feature of interests, although such association may not be direct or causal. For example, the enrichment of histone modification H3K4me3 near to transcription start sites is clearly linked to its transcription initialization function (Barski et al., 2007). However, compared to DNA-related features (such as histone modification and DNA methylation), deciphering the distribution of mRNA-related features (such as mRNA modifications) is non-trivial due to the following reasons:
Isoform ambiguity. Although actually located in the heterogeneous transcriptome, mRNA-related biological features are often denoted only by genome-based coordinates in bioinformatics databases. The isoform-specific belongings of mRNA-related features may be unavailable in the presence of multiple isoform transcription of the same gene due to technical limitations. For example, most of the existing epitranscriptome profiling approaches, such as MeRIP-seq and miCLIP, suffer from the isoform ambiguity issue, and instead report only the genome-based coordinates of RNA modifications. When an RNA modification site overlaps with multiple transcripts according to the transcriptome annotation, it is not clear which transcript is associated with the site. This creates obvious difficulty when characterizing the distribution of the feature of interest.
More complex landscape of mRNAs. The distribution analysis for DNA-related features is usually based on two landmarks only, i.e. the transcription start and transcription end positions, while the analysis of mRNA-related features should involve four landmarks of the mRNAs molecule, i.e. 5′ end, start codon, stop codon and 3′ end of mRNAs, making it more complex than the case of DNA-related features.
Compositional diversity of mRNAs. In addition, it is important to note the existence of compositional diversity among different mRNAs. For example, some mRNAs may have short CDS but a super long 3′UTR, while some others may have very short 3′UTR, or even no 3′UTR at all for some cases, according to the transcriptome annotation database. The vast compositional diversity among different mRNAs makes it difficult to compare across multiple mRNAs (or genes), and may bring concerns about the validity of the overall distribution pattern necessary for characterizing mRNA-related features.
The reasons provided above (see Fig. 1), together compound the difficulty and complexity of distribution characterization for mRNA-related features.
To date, a number of software tools have been developed for deciphering the distribution of mRNA-related features. Guitar was the first method dedicated to sketching the transcriptomic view of RNA-related genomic features, and provided an open source R/Bioconductor package (Cui et al., 2016). The Perl/R pipeline MetaPlotR was invented for plotting metagenes of various modified sites (Olarerin-George and Jaffrey, 2017). A Shiny web framework-based web server txCoords was invented for transcriptomic peak re-mapping (Yan et al., 2017). The epitranscriptome database MeTDB (Liu et al., 2018) provided a web-based graphical user interface for the Guitar R package (Cui et al., 2016). The RNA modification annotation database, RNAmod, also supported metagene plot functionality along with various annotations of diverse mRNA modifications in different species (Liu and Gregory, 2019). Despite the efforts that have been made, only simple heuristic strategies have been taken by the above approaches to resolve the aforementioned difficulties, e.g. retaining only the longest transcript of a gene in the analysis to avoid isoform ambiguity, keeping only the mRNAs with both 5′UTR and 3UTR longer than 100 nt to ensure they are relatively comparable. None of the above approaches quantitatively formulated the problem of concern, and a general and rigorous solution has yet to be available.
We propose here a rigorous statistical framework, MetaTX, for deciphering the distribution of mRNA-related features in the presence of isoform ambiguity and compositional diversity of mRNAs. Through a standardized mRNA model, MetaTX first unifies various mRNA transcripts, some of which may have vastly different compositions, and then corrects the isoform ambiguity by incorporating the overall distribution pattern of the feature through an EM algorithm via a latent variable. MetaTX was tested on both simulation data and real RNA N6-methyladenosine data generated from three different epitranscriptome profiling approaches (Chen et al., 2015; Olarerin-George and Jaffrey, 2017; Schwartz et al., 2014). Results suggested that MetaTX consistently exhibited an improved performance with higher accuracy on simulated data compared to the Guitar (Cui et al., 2016) and MetaPlotR (Olarerin-George and Jaffrey, 2017), which did not consider biases from isoform ambiguity, and reported more prominent distribution patterns for all the three datasets tested. MetaTX is available as an open source R package, and is a useful tool for studying the distribution and functions of mRNA-related biological features, especially for mRNA modifications, such as N6-Methyladenosine and Pseudouridylation.
2 Materials and methods
In this section, we first introduce the standard mRNA model, through which mRNAs of diverse compositions may be unified, and then propose our overall formulation for the distribution analysis of mRNA-related features. An EM solution is then provided to resolve the isoform ambiguity problem via the overall distribution pattern inferred.
2.1 Coordinate standardization
To unify the mRNAs with diverse composition, we considered a standard mRNA model, in which the three main components of mRNA (5′UTR, CDS and 3′UTR) were each divided into the same number of bins of equal width, for every individual mRNA. Figure 2 illustrates the process of coordinate standardization, and we refer to each bin of mRNA by its coordinate on the standardized mRNA model. Conceivably, as the corresponding coordinates on different mRNAs are located on biologically similar regions, they are likely to be associated with the same type of biological features, or regulated by the same type of signal (such as the same type of RNA modification). The coordinates of different mRNA were then made comparable. It is worth noting that, although not explicitly stated, many existing approaches will have assumed a similar standardized mRNA model in their analysis.
In practice, we also considered the flanking regions of the mRNAs, including 1 kb promoter regions before the 5′ end and 1 kb tail region after the 3′ end. These two regions are also independently divided into the same number of bins with equal width. Although theoretically there should be no mRNA-related features originating from these two regions, there are always quite a few mRNA-related features that fall into these regions due to incomplete transcriptome annotation, isoform ambiguity, noise, etc. These two regions are used as the negative control regions in our analysis; the signal within these two regions directly reflects analysis bias.
2.2 The MetaTX model
2.2.1 Basic formulation
We refer mRNA features to biological features that reside on mRNAs, such as, m6A RNA methylation sites, microRNA binding site, etc. Of interests here is the distribution of mRNA-related features over the standard mRNA model, as is shown in a metagene plot. We denote the total number of mRNA-related features by S. As previously stated, we independently divided the 5′UTR, CDS and 3′UTR of mRNAs into a number of bins with equal width, respectively. We assume there are a total of C bins (or coordinates) on each mRNA. The parameter C is essentially the resolution of our distribution analysis. A greater C may result in a distribution analysis with high-resolution, but also increases the computation load.
With the coordinates of mRNA standardized, it is now possible to calculate the standardized coordinate of each mRNA-related feature in the mRNA that overlapped with it. Let T represent the total number of transcripts. We denote the overlap between features and mRNAs with a three dimensional matrix with indicating the sth feature overlap with the cth coordinate on the tth mRNAs on the genome, suggesting a possible association between the two; and otherwise. It should be noted that a feature may overlap with more than one mRNAs due to isoform ambiguity, and does not necessarily mean that the sth feature is actually associated with the tth mRNA.
In addition, we denote the width of the cth coordinate on the tth transcript by with and This parameter is important in penalizing the varying width of the corresponding bins on different mRNAs. When the corresponding component is not available, e.g. an mRNA without 3′UTR, the width of the corresponding bin (coordinate) should be set to 0.
2.2.2 Maximum a likelihood estimation (MLE)
We define a parameter set where is the probability that a site is located on the cth coordinate of mRNA, and with . Due to the alternative splicing, a site may be located on several distinct transcript isoforms. So, we denote the probability that the sth site in the observed data resides on the tth transcript by a variable . We should have .
2.2.3 Weight assignment
Due to isoform ambiguity, a given mRNA-feature may overlap with multiple transcripts, and the significance of different overlapping events may not be the same. There exist two ways to resolve the isoform ambiguity problem. One way is to pick the highest expressed isoform when gene expression data is available. Considering the limited detectability (or sensibility) of biotechnology, it is often safe to assume the observed phenomenon is associated with the most abundant molecule rather than one with a lower abundance; Alternatively, an equally popular but more convenient method, is to consider only the primary transcript, which is usually regarded as the longest one (Olarerin-George and Jaffrey, 2017).
2.3 The MetaTX model
The MLE problem in (5) may be solved using the Expectation–Maximization framework (Dempster et al., 1977). We introduced the latent variables where means the sth feature is physically located at the cth coordinate in the tth transcript; and otherwise. Note that is different from , i.e. when , we should have ; but the other way around may not be true. Let be the conditional probability of the sth feature located at the cth bin of the tth transcript conditioned on parameters and the observed data . For each feature s, we should have . In addition, is equivalent to . Particularly, takes the same value of where .
The EM algorithm of our model may then be summarized as follows:
Given estimate by E step (7).
Given estimate by M step (12).
Iteratively perform the above two steps until convergence. The initial value of each is set to 1/C.
2.4 Absolutely abundance of mRNA features
It is important to note that mRNA (rather than DNA) is used as the background during the calculation of feature abundance. The shared exons of multiple isoform mRNAs were counted multiple times for each individual transcript and with all the introns removed. The absolute and average density and estimated from our model is thus likely to be different from those returned from existing genome-based methods. With the help of ggplot2 (Ginestet, 2011) and other tools, our MetaTX R package provides a visualization of the distribution of mRNA-features alongside the standard mRNA model. Inclusion of the promoter and tail regions are also optional.
3 Results
3.1 Testing on simulated data
We firstly validated the proposed method on simulated datasets, which contained the 5′UTR, CDS and 3′UTR regions respectively. When generating the simulated datasets using 1000 transcripts randomly selected from the UCSC gene database, 10 sites were then randomly picked from each transcript within the relative mRNA component. As a result, there were a total of 10 000 sites, chosen from each mRNA component. After remapping, it may be expected that these sites are arranged evenly within the corresponding mRNA component, but not in other regions.
We then drew the distribution of the three simulated datasets, corresponding to 5′UTR, CDS and 3′UTR, via the Guitar, MetaPlotR, filter method and MetaTX. Note none of the other methods except MetaTX consider biases from isoform ambiguity problem. Guitar counts the mRNA-features multiple times for all transcripts when isoform ambiguity exists; while MetaPlotR by default retrains only the primary (or longest) isoform transcript of a gene to avoid such ambiguity. The filter method discards short mRNA components (less than 100 nt) to keep only the information located on long components, which should more informative. We also included the promoter and tail regions as negative control regions, which do not correspond to mRNA regions and thus in theory should not contain signals from mRNA-related features. As shown in Figure 3a–c, stronger bias was observed in the results of the direct estimation method and the filter method. After correcting this isoform ambiguity via the MetaTX model, the accuracy of estimates increased notably.
To quantitatively measure the accuracy of our model, we calculated the consistency between the estimated distribution and the ground truth distribution with the Kullback–Leibler (KL) divergence (Table 1). Results suggested that MetaTX substantially outperformed the competing methods (Guitar and MetaPlotR), and the filter strategy can slightly boost the quality of results by removing less informative datasets.
Methods . | The 3 tests (and KL divergence) . | ||
---|---|---|---|
3′UTR . | CDS . | 5′UTR . | |
MetaTX | 0.07 | 0.12 | 0.06 |
Guitar | 0.29 | 1.02 | 0.22 |
MetaPlotR | 0.44 | 1.09 | 0.32 |
Filtered | 0.25 | 1.01 | 0.26 |
Methods . | The 3 tests (and KL divergence) . | ||
---|---|---|---|
3′UTR . | CDS . | 5′UTR . | |
MetaTX | 0.07 | 0.12 | 0.06 |
Guitar | 0.29 | 1.02 | 0.22 |
MetaPlotR | 0.44 | 1.09 | 0.32 |
Filtered | 0.25 | 1.01 | 0.26 |
Methods . | The 3 tests (and KL divergence) . | ||
---|---|---|---|
3′UTR . | CDS . | 5′UTR . | |
MetaTX | 0.07 | 0.12 | 0.06 |
Guitar | 0.29 | 1.02 | 0.22 |
MetaPlotR | 0.44 | 1.09 | 0.32 |
Filtered | 0.25 | 1.01 | 0.26 |
Methods . | The 3 tests (and KL divergence) . | ||
---|---|---|---|
3′UTR . | CDS . | 5′UTR . | |
MetaTX | 0.07 | 0.12 | 0.06 |
Guitar | 0.29 | 1.02 | 0.22 |
MetaPlotR | 0.44 | 1.09 | 0.32 |
Filtered | 0.25 | 1.01 | 0.26 |
3.2 Testing on real data
Next we analyzed N6-methyladenosine (m6A) datasets derived from different high-throughput sequencing approaches, including an miCLIP-seq dataset (Linder et al., 2015; Olarerin-George and Jaffrey, 2017), a PA-m6A-seq dataset (Chen et al., 2015) and an m6A-seq dataset (Schwartz et al., 2014). N6-methyladenosine is the most abundant RNA modification on mRNA, and has been regarded to be enriched near the stop codon of mRNAs (Dominissini et al., 2012; Meyer et al., 2012).
As is shown in Figure 3d–f, all three methods reported an enrichment of m6A RNA methylation near the stop codon of mRNAs. Although the ground truth distribution of m6A on mRNAs is unavailable, it is evident that MetaTX reported a more prominent and reasonable pattern for all the three datasets tested, which is reflected by the reduced signal at the negative control regions (promoter and tail DNA regions). The negative control regions do not correspond to mRNA transcripts and thus should not carry m6A signals. It is worth noting that the promoter and tail regions were defined in a transcript-specific manner, i.e. each transcript has a different set of promoter and tail. Including all isoforms does not necessarily reduce the number of sites being assigned to the negative control region. Because the MetaTX model considers all isoform transcripts including the very short ones, it actually has larger proportion of negative control regions compared to the method which considers only the longest transcript.
Meanwhile, consistent of our knowledge (Dominissini et al., 2012; Meyer et al., 2012), a strong enrichment pattern was also observed around the stop codon compared to the methods without correction of the isoform ambiguity. The correction seems particularly effective for the PA-m6A-seq dataset shown in Figure 3e. Importantly, MetaTX method achieved very stable performance even when we discard all the features without isoform ambiguity and keep only those overlap with multiple isoform transcripts, suggesting the capability of MetaTX in dealing with features with high degree of isoform ambiguity (see Supplementary Fig. S3). In another case study, we showed a different RNA modification mark, m5C RNA methylation, is enriched at the 5′UTRs in human (Supplementary Fig. S4).
3.3 MetaTX package
An R package implementing the proposed MetaTX model was developed for estimating and visualizing the distribution density of mRNA-related features () along the standard mRNA model. MetaTX requires genome-based locations of mRNA-related features and the associated transcriptome annotations in TxDb format (Lawrence et al., 2013). The standardized coordinates of mRNA-related features are calculated for the 5′UTR, CDS and 3′UTR regions, with each region is divided into a user-defined number of bins (Default 10) with equal length. Including the promoter or the tail DNA regions is optional.
The MetaTX package has a few useful functions. The remapCoord function calculates which bin and which component a particular feature overlaps with according to the genome-based coordinates (the matrix). The metaTXplot function returns a density plot of the input feature set on a standard mRNA model, which supports customized relative length of different mRNA components during visualization, e.g. a shorter 5′UTR and longer CDS (see Supplementary Figs S5 and S6). The package also provides an isoformProb function that can return the probabilities of a particular feature being located on different isoforms (). The newly developed MetaTX R package is freely available at the GitHub repository(https://github.com/yue-wang-biomath/MetaTX.1.0) with examples and detailed documentation.
4 Conclusions
The distribution of a biological feature strongly indicates its functions, and is often the first step when characterizing its functional relevance. To date, a number of studies have been proposed for distribution analysis of mRNA-related features, but none of them provided a quantitative formulation for the problem of concern.
We proposed here the first rigorous statistical model, MetaTX, together with its EM solution for estimating the distribution of mRNA-related features in the presence of isoform ambiguity and differential composition among mRNAs. MetaTX was tested on both simulated data and RNA N6-methyladenosine data derived from different high-throughput sequencing approaches, and demonstrated stable performance with more prominent and reasonable distribution patterns for all the datasets tested. An open source R package was developed for estimating and sketching a global view of mRNA-related features along the standard mRNA model. We believe that MetaTX should make a useful tool for studying the distribution and functions of mRNA-related biological features, especially for mRNA modifications such as N6-methyladenosine.
From modeling perspective, MetaTX model is substantially different from existing approaches for resolving the ambiguity in RNA data, which usually relies on the fact that the read distribution is uniform on a transcript. On the contrary, MetaTX model relied on the non-uniform distribution of mRNA-related features on the entire transcripts, i.e. the tendency of the features to be enriched or depleted at different transcript coordinates.
For the work reported here we have not discussed: the possibility of an mRNA feature overlapping with multiple bins of the same transcript, the filtering of highly noisy features mapped to too many transcripts to improve data quality, the possibility of multiple features located on different transcripts being mapped to the same genome-based coordinate, or the possibility of incomplete or even incorrect transcriptome annotation. All of these will have a profound impact on the analysis results obtained using the MetaTX method. Meanwhile, the EM algorithms do not guarantee to converge to the global optimum, especially on a small dataset.
It is worth noting that there may be other interesting landmarks on transcripts. For example, it was reported previously that 70% of m6A sites were found around the 3′UTR’s last exons (Ke et al., 2015), making the last exon also of interests. However, because the starting position of the last exon can appear before or after the stop codon of the corresponding transcripts, the last codon cannot be integrated into the current MetaTX model. Nevertheless, it should be fairly straightforward to capture the distribution patterns with respect to new landmarks by altering the transcriptome annotations, e.g. for the MetaTX R package to acknowledge the last exon as the new landmark of interests, we may simply alter the transcriptome annotation by labeling the last exon–exon junctions as the ‘pseudo stop codons’.
We discussed in this work the application of MetaTX model to base-resolution features only. It should be fairly straightforward to extend the model to cover wider features such as microRNA binding sites. However, more complicated weighting strategy will be necessary for dealing cases, e.g. an mRNA feature of 250 nt width has 100 nt overlap with one transcript and 150 nt overlaps with another transcripts. Furthermore, although the MetaTX model is fairly efficient and can handle transcriptome-wide feature sets, it cannot be applied directly to millions of features due to the computational complicity of the model. Additional work will be needed to allow the model correct raw sequence alignments generated from high throughput RNA sequencing experiment.
In addition, although our model was originally aimed at deciphering the overall distribution pattern of mRNA-related features, it is noteworthy that it has explicitly resolved the isoform-specific belongings of mRNA-related features through the calculation of the parameter , i.e. a by-product of the MetaTX model is the probabilities of a particular feature being located on different isoforms; however, the accuracy and reliability of these estimates remains to be examined and tested.
Funding
National Natural Science Foundation of China [31671373]; XJTLU Key Program Special Fund [KSF-T-01]. This work was partially supported by the AI University Research Centre through XJTLU Key Programme Special Fund (KSF-P-02).
Conflict of Interest: none declared.