Ranked choice voting for representative transcripts with TRaCE

Abstract Summary Genome sequencing projects annotate protein-coding gene models with multiple transcripts, aiming to represent all of the available transcript evidence. However, downstream analyses often operate on only one representative transcript per gene locus, sometimes known as the canonical transcript. To choose canonical transcripts, Transcript Ranking and Canonical Election (TRaCE) holds an ‘election’ in which a set of RNA-seq samples rank transcripts by annotation edit distance. These sample-specific votes are tallied along with other criteria such as protein length and InterPro domain coverage. The winner is selected as the canonical transcript, but the election proceeds through multiple rounds of voting to order all the transcripts by relevance. Based on the set of expression data provided, TRaCE can identify the most common isoforms from a broad expression atlas or prioritize alternative transcripts expressed in specific contexts. Availability and implementation Transcript ranking code can be found on GitHub at {{https://github.com/warelab/TRaCE}}. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Genome sequencing projects often use complex, automated annotation pipelines to build reference sets of gene models. These pipelines mask repeats in the assembled genome, align protein and transcript evidence, and build gene models by aggregating overlapping alignments that adhere to known or inferred splice site patterns (Campbell et al., 2014;Haas et al., 2003;Hoff et al., 2019). Before a project releases a set of high-confidence gene models, additional filtering steps may remove transcript models that lack homology or are subject to non-sense-mediated degradation.
Alternative splicing contributes to the functional diversity of a genome (Black, 2003); and new sequencing technology such as PacBio IsoSeq can capture splice variants at an unprecedented scale (Bruijnesteijn et al., 2018;Wang et al., 2016;Zhang et al., 2019). However, this heightened sensitivity can lead to the detection of transcriptional noise, which can be misreported by gene builders as biologically relevant splice variants. Furthermore, it is possible for partially processed transcripts containing retained introns that neither disrupt the reading frame nor introduce stop codons to be promoted to canonical transcripts (Fig. 1).
Comparative gene tree analysis platforms such as Ensembl Compara (Herrero et al., 2016) operate on a single canonical transcript for each gene locus. In the absence of a curated canonical transcript, this is usually defined as the longest transcript with the longest translation, but this definition does not necessarily select the best representative transcript for a gene locus. Subsequently developed techniques have defined canonical isoforms based on expression level, sequence conservation, annotation of functional domains or some combination of these features (Li et al., 2014;Pruitt et al., 2002;Rodriguez et al., 2018;The UniProt Consortium et al., 2016). For example, NCBI's RefSeq Select dataset uses an evidence hierarchy to identify a transcript in each protein-coding human and mouse gene model. The Matched Annotation from NCBI and EMBL-EBI (MANE) project has the goal of providing a unified set of human protein-coding gene annotations, but it is not known if and when such efforts will be applied to other species.
We developed Transcript Ranking and Canonical Election (TRaCE) to choose canonical transcripts based on data typically available at the time of a new genome annotation. In this approach, transcripts are ranked by length, domain coverage, and how well they represent a diverse population of transcriptome RNA-seq data. An 'election' based on ranked-choice voting selects a canonical transcript that is the first-or second-choice transcript for the majority of samples. The election proceeds through multiple rounds, effectively sorting all transcripts by relevance. Here we present the TRaCE algorithm and results obtained by running TRaCE on Zea mays and Homo sapiens gene annotations. In addition, we describe validation

Materials and methods
The first step in preparing to run TRaCE is to gather a diverse set of RNA-seq expression data covering a wide variety of tissues or conditions to act as 'voters' in the upcoming elections. The next step is to align the reads, assemble sample-specific transcripts, and quantify their expression. Each reference gene model with multiple transcripts (candidates) will hold an election to sort the reference transcripts by relevance (Fig. 2).
In each election, samples rank the candidate transcripts based on the annotation edit distance (AED) to the most highly expressed overlapping sample-specific transcripts (Eilbeck et al., 2009). AED scores range from 0 (perfect agreement) to 1 (no overlap) and are calculated from the pairwise similarity of reference transcripts and aligned evidence based on the proportion of exonic overlap. Because there may be insufficient data to assemble full-length transcripts from samples in which the gene is expressed at low levels, the AED score calculation is restricted to overlapping portions of candidate transcripts. A maximum AED score cutoff (default, 0.5) prevents samples from voting for candidate transcripts with very little similarity. There are also cutoff parameters for minimum expression level (default TPM, 0.5) and proportion overlapping (default, 0.5) to filter out some noise in the sample transcriptome data. The election includes additional voters that rank transcripts based on domain coverage, protein length and transcript length. To avoid overwhelming the length-based voters when running TRaCE with many samples, sample votes are weighted to balance the electorate. Default weights were selected to prioritize functional domain coverage over protein length and total transcript length.
Once each sample voter and the length-based voters have ranked the transcripts, the election proceeds in multiple rounds selecting winners until no candidates remain. In each round, TRaCE tallies votes for top-ranked candidates; and so long as there is a tie for first place, votes for the subsequent rankings are added to the tally.

Results
We ran TRaCE on a pre-release set of Zea mays B73 gene models with the set of 10 RNA-seq samples that had already been aligned to the genome as part of the evidence-based gene annotation pipeline (Hufford et al., 2021). The samples were derived from shoot, root, embryo, endosperm, ear, tassel, anther and three leaf sections (base, middle and tip). StringTie version 1.3.5 (with the -rf flag) was used for transcript assembly and quantification (Pertea et al., 2016) and InterProScan version 5.38-76.0 was run to identify Pfam domains (Mulder and Apweiler, 2007). The Zea mays B73 V5 annotation set (Zm00001eb) has 15 162 multi-transcript protein-coding gene models; for 5616 of these (37%), the canonical transcript chosen by TRaCE was not the longest isoform. TRaCE selected canonical transcripts for the genome annotations of 25 additional maize accessions, 33-38% of which were not the longest isoform (Supplementary Table S1).
We used two approaches to validate TRaCE's predictions on maize genes. First, we modified an interactive gene tree viewer, designed to flag problematic gene models by visual inspection of the multiple sequence alignment and domain annotations (Tello-Ruiz et al., 2021). We used this interface to compare maize B73 V5 canonical transcripts (Zm00001eb) selected by TRaCE with the prior set of maize V4 canonical transcripts (Zm00001d) selected by length criteria alone. A random selection of 173 pairs of genes for which the TRaCE canonical was not the longest transcript were evaluated in the gene tree viewer and flagged if the alignment was inconsistent with outgroup orthologs. Genes were flagged if there was a relative gain or loss of conserved sequence within the transcript or at either  Table S2). The most common issue in the flagged Zm00001d gene models was gain of sequence due to an intron retention. Thus, according to this approach, TRaCE was selecting betterconserved isoforms than the prior length-based algorithm.
In the second approach, TRaCE predictions were validated by student curators who were given a subset of 48 gene models with two to five transcripts, for which TRaCE's top-ranked isoform was not the longest isoform. The students, who were not aware of TRaCE's output, were asked to rate transcripts as best, good or poor, based on viewing the gene structure and expression evidence in the Apollo genome browser (Dunn et al., 2019). Each gene model was curated by at least three different students. The transcript ratings were mapped to a score (best 2, good 1, poor -1). Transcript rankings from TRaCE and rankings based on length alone were compared to rankings based on curator scores. For each rank (1-5), we calculated the sum of the curator scores for the associated transcripts. The correlation of these sums between the length-based ranking and the curator-based ranking was 0.917, whereas the TRaCE and curator ranking sums had a higher correlation coefficient of 0.985 (Supplementary Table S3).
The GRCh38 annotation set has 13 848 multi-transcript proteincoding gene models that were classified by both APPRIS and MANE Select. The TRaCE canonical was not the longest isoform in 3717 (27%) of these gene models. For comparison, the principal isoform according to APPRIS and the MANE Select transcript was not the longest isoform in 3061 (22%) and 4292 (31%) of gene models, respectively. There are 1202 gene models where APPRIS and MANE Select disagree. In these cases, TRaCE agrees with APPRIS on 408 (34%) genes, MANE Select on 597 (50%) genes and neither on 197 (16%) genes. On the 12 646 multi-transcript gene models where APPRIS and MANE Select agree, TRaCE gives 10 677 (84%) transcripts rank 1, 1470 (12%) rank 2, 351 (3%) rank 3 and 148 (1%) rank 4 or higher. To assess TRaCE's performance on gene models with many transcripts, we compared TRaCE to APPRIS and MANE Select on the 90% of genes with 2-10 transcripts and the remaining 10% of human protein-coding gene models with 11-151 transcripts. There are 1399 genes with many transcripts where APPRIS and MANE Select agree. In these cases, TRaCE selects 1021 (73%) of these as the canonical transcript, 215 (15%) have rank 2, 92 (7%) have rank 3 and 71 (5%) have rank 4 or higher. On the 11 247 genes with fewer transcripts where APPRIS and MANE Select agree TRaCE assigns 9656 (86%) rank 1, 1255 (11%) rank 2, 259 (2%) rank 3 and 84 (1%) rank 4 or higher. For the initial release of TRaCE, we manually tuned the weights on TRaCE's length-based Fig. 2. Flowchart of preparation of TRaCE inputs and a schematic of the rank-choice voting (RCV) approach to select transcripts for an example gene with three transcripts (blue, red, gray). Exon thickness corresponds to non-coding, coding and functional regions with Pfam domains. Voters are represented by rectangles, and rank transcripts by length criteria (9, 6 or 3 votes) or AED (1 vote per sample). Eight of the samples rank the red and blue transcripts equally (blue-red gradient), so both get tallied in round 1. RCV selects the blue transcript first with 24 rank 1 votes. After removing the blue votes from consideration, the red and gray transcripts tie with 10 rank 1 votes, but the red transcript is elected with 14 rank 2 votes votes, but future versions may benefit from an automated parameter sweep to minimize these differences.