Defining data-driven primary transcript annotations with primaryTranscriptAnnotation in R

Abstract Summary Nascent transcript measurements derived from run-on sequencing experiments are critical for the investigation of transcriptional mechanisms and regulatory networks. However, conventional mRNA gene annotations significantly differ from the boundaries of primary transcripts. New primary transcript annotations are needed to accurately interpret run-on data. We developed the primaryTranscriptAnnotation R package to infer the transcriptional start and termination sites of primary transcripts from genomic run-on data. We then used these inferred coordinates to annotate transcriptional units identified de novo. This package provides the novel utility to integrate data-driven primary transcript annotations with transcriptional unit coordinates identified in an unbiased manner. Highlighting the importance of using accurate primary transcript coordinates, we demonstrate that this new methodology increases the detection of differentially expressed transcripts and provides more accurate quantification of RNA polymerase pause indices. Availability and implementation https://github.com/WarrenDavidAnderson/genomicsRpackage/tree/master/primaryTranscriptAnnotation. Supplementary information Supplementary data are available at Bioinformatics online.

: Evaluation of TTS inferences Figure S2: TTS inference methodology Figure S3: TTS inference results Figure S4: Differences in transcript length between inferred and largest interval annotations Figure S5: Annotating TUs that overlap with a single transcript Figure S6: Annotating TUs that overlap with multiple transcripts Figure S7: Principal component analysis of adipogenesis for inferred and conventional annotations Figure S8: Inferred coordinates confer enhanced sensitivity for detecting differential expression Figure S9: Inferred coordinates confer enhanced sensitivity for detecting polymerase pausing

Supplementary references
identified from a given Hidden Markov Model (HMM) parameterization. The evaluateHMMInAnnotations() function uses the inferred gene annotations to document 'merge errors' and 'dissociation errors'. Merge errors occur when a given TU overlaps multiple gene annotations. Dissociation errors occur when multiple TUs overlap a given gene annotation. Distinct groHMM parameterizations produced varying degrees of merge and dissociation error. In addition to evaluating merge and dissociation errors, we evaluated HMM sensitivity by looking at how many reads mapped to regions of the genome that were not identified as transcribed. We selected the HMM with the lowest read count outside of HMM-defined transcribed regions. This HMM had relatively low merge error and dissociation error (both within the lowest error quartile). The code for these analyses is publicly available: https: //github.com/WarrenDavidAnderson/genomeAnalysis/tree/master/groHMMcode.
Principal component analysis and differential expression analysis: We used DESeq2 to identify 'size factors' to normalize the individual PRO-seq sample read data based on sequencing depth [14]. We then applied a variance stabilizing logarithmic transformation using rlogTransformation(). We applied principal component analysis (PCA) using the singular value decomposition-based R function prcomp().
For differential expression analysis, we identified genes with statistically significant temporally varying expression levels using a likelihood ratio test with the DESeq2 function DESeq(). This test compares the likelihood of a model incorporating time to the likelihood of a null model in which time is not considered.
Pause region analysis: We quantified promoter-proximal polymerase pausing by defining a pause index as the ratio of the PRO-seq read signal in the vicinity of the TSS to that signal within the gene body [15]. We defined the pause region between 20 bp and 80 bp downstream of the TSS. We defined the gene body region from 500 bp downstream of the TSS to the gene end. Within each region, we computed read densities and we took the pause/body density ratio as the pause index.
Code availability: Code for all analyses associated with this manuscript can be found here: https:// github.com/WarrenDavidAnderson/manuscriptCode/tree/master/primaryTranscriptAnnotation_ code.

Package details
Here we describe methodological details pertaining to our implementation of analyses performed by the primaryTranscriptAnnotation R package. This supplemental section is focused on methodological details, whereas implementation details are described extensively in the package vignette https:// github.com/WarrenDavidAnderson/genomicsRpackage/blob/master/primaryTranscriptAnnotation/ primaryTranscriptAnnotation-vignette.pdf.

Data-driven primary transcript annotation
The first problem addressed by the primaryTranscriptAnnotation R package is the data-driven inference of primary transcript coordinates. This problem is addressed by separately identifying TSSs and TTSs with the use of conventional annotations for constraining the regions within which to examine the presence of TTSs/TTSs.
Filtering for unexpressed transcripts: To identify transcripts with arbitrarily low expression, we used the read.count.transcript() function. This function determines the transcript with the highest read count for each gene and returns that read count along with the associated read density. Based on analyses of transcript read count and read density distributions, the user can select thresholds below which expression appears to be negligible.
Data-driven inference of TSSs: To empirically identify TSSs, we assumed that TSSs are proximal to sites at which polymerase pausing is observed. This assumption is based on a wealth of genome-scale data documenting promoter-proximal pausing [16]. These data indicate that pausing occurs 20-80 bp downstream of transcriptional initiation. We also assumed that the TSS for each gene is at an annotated first exon for that gene. Accounting for the strand specificity of gene annotations, we defined the TSS as the 5 end of the exon 1 isoform with the highest density in the region between 20 bp and 120 bp downstream. This analysis is implemented by get.TSS().
Filtering for overlapping genes: Annotated genes occasionally overlap with the coordinates defined for other annotated genes. For instance, the end of an upstream gene can be annotated to a coordinate beyond the start of a downstream gene. This can result in confounding gene expression quantification by erroneously counting the same reads for multiple genes. Our package includes a function for identifying such overlaps (gene.overlaps()). The user can then manually evaluate the gene annotations using a visualization tool such as a genome browser [17].
Data-driven inference of TTSs: To empirically identify TTSs, we assumed that transcription termination occurs within a region including the most downstream annotated end of all gene isoforms and extending kilobases beyond. We addressed two tasks related to TTS identification: (1) defining the region within which to search for transcriptional termination, and (2) identifying termination within this region.
To define the search region for TTS evaluation, we started by selecting a percentage of the gene end (e.g., 20%). We defined the start of the search region as the difference between the annotated gene end and the gene length multiplied by the specified percentage (see Figure S2c, term d e ): search region start = gene end − f raction.end * gene length The next step is to specify an upper limit on the total distance from the annotated gene end that could be considered for the TTS search region. This is user specified number of bases ( Figure S2c, term d t ). We examined whether there were any other gene TSSs in this region between the search region start and the user defined limit. If there was a TSS within the d t region, we documented the distance between the limit and the TSS. We restricted the search space for TTS evaluation by this value, thus the term d c is referred to as the clip distance ( Figure S2c). Note that the clip distance is dependent on the user defined upper limit.
Experimental data showed that attenuated rates of transcription at gene ends are associated with elevated levels of polymerase density [18]. Based on this finding, we identified peaks of polymerase density at gene ends for determining where transcription termination is likely to occur. We defined the peak search region d s as a sub-region of d a = d t − d c ( Figure S2c). This is because the identification of gene end polymerase peaks could be affected by distal enhancers that exhibit bidirectional transcription within very large search regions. We assumed that the peak search region should be a large fraction of d a for genes with the greatest numbers of clipped bases, because such cases occur when the conventional gene ends are proximal to downstream identified TSSs. Thus, in such cases, the entire d a region should be included for analysis of the gene end peak. The same logic applies for genes with substantially fewer clipped bases, and correspondingly larger d a regions; in such cases the peak search regions should be smaller proportions of d a . Hence, we defined the gene end peak search region d s as a function of the clip distance, d c : . The term f (d c ) gives a value for weighting d a that is dependent on the clip distance: x ∈ [0, d t ] where τ is a distance constant defining the rate of exponential decay, f max is the maximum weight, and f min is the minimum weight. The terms f max and f min should be between zero and one so that the outcome is a fraction. Figure S2b shows the form of the function with dashed lines corresponding to d c = τ = 50 kb, f min = 0.3, and f max = 1. This method is used to define the gene end peak search region, d s . Given the coordinates of the TTS search region and the peak search region, the next task is to identify the TTS within this region. We operationally defined the TTSs by binning the search regions, counting reads within the bins, fitting smooth spline curves to the binned counts, identifying gene end peaks of smooth curves, and detecting points at which the curves decay from the gene end peaks towards zero ( Figure S3a,b). For this analysis, we applied constraints on the number of bins used for quantification, and the number of knots for the spline fits, as described in the vignette. For the spline fit within the peak search region, we identified the largest peak and then we determined the point at which the trace decays to a point below a specified threshold of the peak height. The function get.TTS() implements the TTS evaluation procedures described above.

Annotation of de novo transcriptional units
The second problem addressed by the primaryTranscriptAnnotation R package is the annotation of transcriptional units (TUs), identified de novo, using primary transcript coordinates. The first step to annotating TUs is to intersect the TU coordinates with the primary transcript coordinates. We separately considered TUs that overlap with single primary transcripts and TUs that overlap with multiple primary transcripts. For this analysis, we considered the identified primary transcript coordinates to represent a 'ground truth' annotation, and we only made minimal modifications to the primary transcript coordinates if they had very close overlaps, as defined by the user, with identified TUs. Regions of identified TUs that did not very closely match identified transcripts were assigned generic TU identifiers. The analyses described below are implemented by the single.overlaps() and multi.overlaps() functions, which are called by get.tu.gene.coords().
For TUs that overlap with single transcripts, we considered a reference case in which the transcript coordinates are contained within the boundaries of the TU ( Figure S5a). We term this reference case 'class 1'. This analysis evaluates whether the beginning of a TU is in close proximity to the beginning of the transcript. If this criteria was not met, the portion of the TU that is upstream of the gene annotation was assigned an arbitrary identifier and the identified TSS defines the start of a TU with named for the overlapping gene. Similarly, if the annotated transcript end was far from the boundary of the TU, then we set the end of the TU, named by the respective overlapping transcript, to the identified TTS. Here we would then annotate the downstream region of the TU with an arbitrary label. For example ( Figure S5a), if the distance between the starts of transcript X and TU X is less than 200 bp, and if the distance between the respective ends is less than 1 kb, we set the identifier of TU X to transcript X. If the distance between starts is greater than 200 bp, we annotate the initial segment of the TU with a generic identifier (tu class1 1, see main text Figure 1f and the associated browser session: https://genome. ucsc.edu/s/warrena%40virginia.edu/primaryTranscriptAnnotation_20190801). If the distance between ends is greater than 1 kb, we annotate the final segment of the TU with a generic identifier (tu class1 2). Hence, we maintain a close correspondence between the identified transcript and an overlapping TU and we label marginal regions of transcription with generic identifiers.
Next we considered cases in which a TU was enclosed within a transcript (class 2, Figure S5b). Because this analysis was completed under the assumption that the identified TSS/TTS coordinates are veritable primary transcript boundaries, we simply extended the region of the enclosed TU to match the transcript. Then we implemented the same rules described for class 1 above. In this case, the algorithm will default to assigning the gene identifier to the TU, as the respective coordinates are identical.
We further considered cases in which TU-transcript overlaps were characterized by 'overhangs' (class3-4, Figure S5c,d). Again, here we extended the appropriate TU boundaries to match the corresponding transcripts such that no part of the transcript extended beyond the TU. We then applied the class 1 logic to assign TU identifiers based on the identified transcript coordinates.
For TUs that overlap with multiple transcripts, we considered a reference case in which all of the transcript coordinates were contained within the boundaries of the TU (class 5, Figure S6a). For the upstream-most and downstream-most overlaps, we applied analyses comparable to those for the class 1 scenario in which we used either the existing transcript annotation or introduce an arbitrary identifier depending on the proximities of the boundries to the TSSs and TTSs. If the regions between transcripts were sufficiently large, we introduced generic identifiers (e.g., tu class5 1; see regions i1 and i2 in Figure  S6a). As described above for single overlap classes 2-4, the logic implemented for class 5 could be applied to other scenarios of multiple overlaps ( Figure S6b-d). For instance, in cases of upstream and/or downstream overhangs, the TU could be extended such that all overlapping transcripts were enclosed, then the class 5 logic could be applied to classes 6-8 ( Figure S6b

TSS and TTS identification
We evaluated the performance of primaryTranscriptAnnotation by comparing results obtained by using transcript coordinates inferred using our package against largest interval coordinates derived from GENCODE. The largest interval coordinates were obtained by taking the upstream-most TSS and the downstream-most TTS for each gene. To evaluate the performance of our TSS evaluation method, we examined the distances between read density peaks and inferred TSSs. For both annotations, the distributions of these distances showed a peak downstream of zero, corresponding to promoter proximal polymerase pausing ( Figure S1a). However, we observed that the distribution for the inferred annotations showed a more focused peak with reduced variance (p = 2.1 ×10 −34 , Levene's test) ( Figure S1b,c). This shows that taking the longest of conventional mRNA annotations results in assigning transcription initiation sites to regions that are farther from pause sites than the corresponding distances based on our inferred TSSs.
We estimated TTSs using the novel method described above. We specified a distance beyond the most distal gene end annotation and evaluated whether there were intervening TSSs within this region. If TSSs were present, we clipped the region at the site of the most proximal TSS. Figure S2a shows the distribution of clip distances when we considered an interval of 100 kb beyond the most distal annotated gene ends. A prominent mode of clip distances is apparent at 100 kb, because many mammalian genes occur in tandem such that the end of one gene is close to the start of another. We used the clip distances to define a gene end peak search region for each transcript. We operationally defined TTSs as the points where spline fits to binned reads decayed to a specified fraction of the largest peak of the spline trace. Figure S3a,b shows examples of splines (blue or red) and estimated TTSs (vertical lines), in which the traces are shown throughout the TTS search regions. For example, Figure S3b shows the identified primary transcript coordinates for Lypla1 and Tcea1 (plus strand, red). The TTS for Lypla1 was identified at the end of the search region because Lypla1 immediately preceded the TSS for Tcea1. Note that the inferred TTSs typically extend beyond the largest interval TTSs ( Figure S4d).
To characterize the differences between inferred primary transcript coordinates and largest interval coordinates, we evaluated the differences between the TSSs and TTSs of the two annotation sets. We found that 49% of the cases in which the TSS did not change, there was only a single exon 1, in which case our analysis was not designed to re-annotate the TSS. We found that 57% of the expressed genes with multiple exon 1 isoforms were re-annotated from the largest interval TSS to an inferred TSS. Figure  S4a,b show the distribution of TSS differences. For the TSSs that were re-annotated relative to the largest interval, we found that 75% of the inferred TSSs are within 813 bp of the largest interval TSS ( Fig  S4b). As expected, the chance of being re-annotated increases as more exon 1 isoforms are characterized per gene (Fig S4c) As gene annotations progressively incorporate newer and more comprehensive TSS inference technologies, including 5 RACE and CAGE-based approaches [19,20], our methods will facilitate context-specific TSS inferences with enhanced precision.
The distribution of TTS differences is shown in Figure S4d. We found that 75% of TTS differences are within approximately 15 kb, and in the majority of cases, the inferred TTSs are downstream of the largest interval TTSs (Fig S4e). These data show that inferred annotations result in substantially longer transcripts due to the inference of more downstream TTSs.
We estimated an upper bound on the percentage of TTS inferences from our analysis that could have been influenced by the presence of transcribed regulatory elements at the 3 gene ends (e.g., enhancer RNAs). We assumed that all of the cases for which the inferred TTS was upstream of the largest interval TTS could potentially occur due to divergent transcripts resulting in more upstream TTS estimates. For instance, a large 'spike' in read density at the end of a gene could be identified as a peak in the fits used for TTS inference. If such a peak were significantly larger then the peak in read density corresponding to polymerase deceleration at the 3 gene end, the decay of this peak could influence the position of the inferred TTS. We found that only 1.9% of the TTS were inferred upstream of the largest interval annotations (∆TSS < 0, Fig S4d). Visual inspection of several transcripts for which ∆TSS < 0 was observed suggested that low read depth could account for the TTS inferences upstream of the largest interval gene ends. This suggests that the upper bound of 1.9% exaggerates the potential influence of divergent transcripts on TTS inference. Fig S4f shows an example of bidirectional transcription at the end of a gene for which the inferred TTS was upstream of the largest interval TTS. The low prevalence of such occurrences is consistent with the robustness of our TTS inference method.
Interestingly, while Fig S4f shows an example of how instances of bidirectional transcription influence TTS inferences, this example also shows a case in which the divergent transcript appears to be an unnannotated TSS. Whereas enhancer RNAs generally show symmetric read profiles on the plus and minus strands, the plus strand reads for Hmcn2 are apparent through several exons and significantly exceeding the distance observed for the minus strand. This example highlights the importance of obtaining TSS annotations from methods such as PRO-cap and START-seq [21,22].

Evaluation of transcript expression dynamics and RNA polymerase pausing
To determine whether applying inferred primary transcript coordinates leads to genome-wide variations in expression dynamics, we mapped the adipogenesis time-series PRO-seq reads to the inferred coordinates and projected the data onto principal components (PCs, Figure S7a). For comparison, we mapped the same data onto the conventional largest interval annotations and visualized the PC projections ( Figure  S7b). The results of this analysis show that the PC projections are nearly identical for both annotation sets. Thus, applying inferred primary transcript coordinates does not lead to genome-wide variations in expression dynamics, as defined by data projections onto PCs 1-3, which capture 83-89% of the variation in the data.
Next we addressed whether applying inferred primary transcript coordinates alters the results of differential expression analyses, as compared to the results obtained using largest interval annotations. To evaluate differential expression, we used a likelihood ratio test to determine whether transcript expression varies with respect to time over four hours. This analysis is analogous to using a 1-factor ANOVA to determine the effect of the time factor. We implemented this test for both inferred coordinates and largest interval coordinates. Figure S8a,b shows that the negative log values of the resulting false discovery rates (FDRs) tend to be higher for the expression levels quantified using inferred coordinates (i.e., lower FDRs; higher density above the unity line in Figure S8b). At a threshold of FDR < 0.001, we observed 11,351 genes that were differentially expressed for both annotations, 520 genes that were differentially expressed only for the inferred coordinates, and 241 genes that were differentially expressed only for the largest interval coordinates ( Figure S8c). To test whether the counts of exclusive differentially expressed genes were identical for both annotations, we applied a binomial test of the null hypothesis that the proportions of the 761 exclusive genes (520+241) were identical (i.e., 50% each). The analysis was not consistent with the null hypothesis in which 50% of genes are proposed to be differentially expressed for each annotation (p < 2.2 × 10 −16 ). This suggests that there are more differentially expressed transcripts after mapping read counts to inferred coordinates as compared to largest interval coordinates.
To determine whether the finding of elevated differential expression associated with inferred annotations was robust to the FDR threshold for defining significance, we varied this threshold and evaluated differential expression for both inferred and largest interval annotations. The number of differentially expressed genes observed for both annotations increased as the FDR was relaxed to larger values (Fig S8d, left). In contrast, the counts for exclusive differentially expressed genes decreased with respect to the FDR threshold ( Fig  S8d, right). Importantly, the counts of genes differentially expressed only for the inferred annotations were consistently higher than the corresponding counts associated with using the largest interval annotations. To evaluate the significance of this trend, we compared the results for the two gene annotations at each FDR using the binomial test as described above. We then applied the Benjamini-Hochberg correction to adjust the resulting set of binomial test p-values [23]. The resulting set of adjusted p-values ranged from p adj = 1.0 × 10 −35 for a likelihood ratio test FDR threshold of 1 × 10 −6 to p adj = 4.6 × 10 −6 for an FDR threshold of 1 × 10 −1 . These analyses demonstrate that mapping reads to inferred primary transcript coordinates results in improved sensitivity for the detection of differential transcript expression, as compared to the results observed when using conventional mRNA annotations.
Finally, we evaluated whether data-driven inferred coordinates enhance the detection of promoterproximal polymerase pausing. For this analysis we considered expressed genes with inferred annotations, and we evaluated pausing across all pre-adipogenic time points. The inferred annotations were associated with elevated pause region composite profiles (Fig S9a). To examine the average precision of the paused regions associated with inferred versus largest interval coordinates, we plotted 0-1 scaled composites (Fig S9b). The results show that inferred annotations are associated with sharper pause region read distributions, on average, as compared to profiles based on largest interval coordinates. We computed pause indices for inferred and largest interval annotations (see Methods above). The results revealed that the distributions were significantly different based on applying the Wilcox test at each time point (FDR < 5 × 10 −140 , Fig S9c). These analyses demonstrate that the application of inferred coordinates enhances the precision for quantifying the position and degree of promoter proximal RNA polymerase pausing from genome-wide run-on data.