Annotating high-impact 5′untranslated region variants with the UTRannotator

Abstract

Summary: Current tools to annotate the predicted effect of genetic variants are heavily biased towards protein-coding sequence. Variants outside of these regions may have a large impact on protein expression and/or structure and can lead to disease, but this effect can be challenging to predict. Consequently, these variants are poorly annotated using standard tools. We have developed a plugin to the Ensembl Variant Effect Predictor, the UTRannotator, that annotates variants in 5′untranslated regions (5′UTR) that create or disrupt upstream open reading frames. We investigate the utility of this tool using the ClinVar database, providing an annotation for 31.9% of all 5′UTR (likely) pathogenic variants, and highlighting 31 variants of uncertain significance as candidates for further follow-up. We will continue to update the UTRannotator as we gain new knowledge on the impact of variants in UTRs.

Availability and implementation: UTRannotator is freely available on GitHub: https://github.com/ImperialCardioGenetics/UTRannotator.

Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Upstream open reading frames (uORFs) are short sequences within 5′UTRs that regulate the rate at which the downstream coding sequence is translated into protein. Variants that create or disrupt uORFs (uORF-perturbing variants) have been shown to cause rare disease (Calvo et al., 2009; Whiffin et al., 2020). We recently used data from the Genome Aggregation Database (gnomAD) to systematically characterize the deleteriousness of different categories of uORF-perturbing variants and prioritize those that are more likely to be disease causing (Whiffin et al., 2020). Current variant annotation approaches focus on the impact of protein-coding variants, with only limited annotation of predicted consequences for noncoding variants. For example, the Ensembl Variant Effect Predictor (VEP) (McLaren et al., 2016) only annotates variants within UTRs as 3′ or 5′ to the coding sequence, without any further information about their predicted effect.
To aid the assessment of high-impact uORF-perturbing variants, we have developed a plugin for VEP to identify 5′UTR variants that create upstream start sites (uAUGs), disrupt the start or stop codon of existing uORFs, create a new stop codon within existing uORFs, or shift the frame of an existing uORF. In each case, the tool outputs detailed annotations that allow the user to predict the likely impact of the variant on protein translation.
Recently, the MORFEE tool was described (Aïssi et al., 2020); however, it is limited to annotating single nucleotide variants (SNVs) that create uAUGs. The UTRannotator is, to our knowledge, the first comprehensive annotation tool for 5′UTR uORF-creating and uORF-disrupting variants. Our tool was initially created to characterize the impact of uORF-perturbing variants; however, it will be updated to annotate additional UTR variants as we learn how to interpret these for a role in human disease.

Approach
For any SNV, 1-5 bp small insertion/deletion (indel) or multinucleotide variant (MNV) in a 5′UTR, we first summarize the number of uORFs in the 5′UTR in the reference sequence. Then, for each variant within the 5′UTR we evaluate whether it would have any of the following consequences, on any annotated transcript: (i) creating a new start codon (AUG) that introduces a new uORF; (ii) removing an existing start codon (AUG); (iii) removing the stop codon of an existing uORF; (iv) creating a new stop codon that shortens an existing uORF; (v) disrupting an existing uORF with a frameshift deletion or insertion, in which the number of nucleotides inserted or deleted is not a multiple of three. Where a variant has multiple annotation consequences, it is evaluated for each separately.
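The five consequences above can be illustrated with a minimal Python sketch. This is not the plugin's implementation: the function names `find_uorfs` and `annotate`, the 0-based coordinates, and the representation of the 5′UTR as a DNA string on the transcript strand are all assumptions made for illustration. For brevity the sketch checks indels only for frameshifts, and it treats a new stop codon in an ORF that previously had no in-frame stop within the 5′UTR as a stop gain.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
NO_STOP = float("inf")  # marks an ORF with no in-frame stop inside the 5'UTR


def find_uorfs(utr):
    """Map each ATG start index to the index of its first in-frame stop codon."""
    uorfs = {}
    for start in range(len(utr) - 2):
        if utr[start:start + 3] != "ATG":
            continue
        stop = NO_STOP
        # Walk codon by codon in the frame set by this ATG.
        for i in range(start + 3, len(utr) - 2, 3):
            if utr[i:i + 3] in STOP_CODONS:
                stop = i
                break
        uorfs[start] = stop
    return uorfs


def annotate(utr, pos, ref, alt):
    """Return the set of uORF consequences for one variant (0-based pos)."""
    if utr[pos:pos + len(ref)] != ref:
        raise ValueError("ref allele does not match the 5'UTR sequence")
    ref_uorfs = find_uorfs(utr)
    consequences = set()

    if len(ref) != len(alt):
        # Indel: this sketch only checks for frameshifts inside existing uORFs.
        if (len(ref) - len(alt)) % 3 != 0:
            end = pos + len(ref)
            for start, stop in ref_uorfs.items():
                if start + 3 <= pos and end <= min(stop, len(utr)):
                    consequences.add("uFrameShift")
        return consequences

    # Substitution (SNV or MNV): compare uORFs before and after the change.
    mutant = utr[:pos] + alt + utr[pos + len(ref):]
    mut_uorfs = find_uorfs(mutant)
    if mut_uorfs.keys() - ref_uorfs.keys():
        consequences.add("uAUG_gained")
    if ref_uorfs.keys() - mut_uorfs.keys():
        consequences.add("uAUG_lost")
    for start in ref_uorfs.keys() & mut_uorfs.keys():
        if mut_uorfs[start] < ref_uorfs[start]:
            consequences.add("uSTOP_gained")  # earlier stop shortens the uORF
        elif mut_uorfs[start] > ref_uorfs[start]:
            consequences.add("uSTOP_lost")    # original stop no longer present
    return consequences
```

For example, `annotate("CCACGGCCTAAGG", 3, "C", "T")` turns `ACG` into `ATG` and reports a gained uAUG. Returning a set mirrors the point that a single variant can carry more than one consequence.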
To enable evaluation of the effect of each variant, the UTRannotator outputs detailed annotations for each type of uORF-perturbing variant (Table 1). This includes describing the subtype of uORF created and/or disrupted (i.e. whether this is a distinct uORF with a stop codon in the 5′UTR, or an ORF that overlaps the coding sequence either in- or out-of-frame), and the strength of the created and/or disrupted uORF start site match to the Kozak consensus sequence (Kozak, 1989). For a variant disrupting a uORF, we also evaluate whether the uORF has any experimental evidence of translation, by assessing a curated list of uORFs previously identified with ribosome profiling from the online repository of small ORFs (www.sorfs.org) (Olexiouk et al., 2018). Users can also use their own customized list of translated uORFs. Given that ribosome profiling datasets are currently limited in the cell types/tissues and conditions analysed, we output results for all possible uORF-disrupting variants and include experimental evidence as an annotation.
Since a 5′UTR can have multiple existing uORFs, for each 5′UTR variant we output the annotations for all disrupted uORFs.
Detailed information on installing and running UTRannotator can be found in Supplementary Information. The time complexity of our implementation is linear in the number of input variants. The ratio of running time without the plugin to that with the plugin, tested on 1000 random variants (60% annotated as 5′UTR variants), is 1.02-1.07 (5 replications).
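As a rough illustration of two of these annotations, the sketch below classifies a start site's match to the Kozak consensus and the subtype of ORF it initiates. The thresholds are an assumption based on a commonly used convention (a purine at position -3 and a G at position +4, counting the A of the AUG as +1): both matching is called Strong, one matching Moderate, neither Weak. The function names and subtype labels are hypothetical and are not taken from the plugin's output.

```python
def kozak_strength(seq, atg_index):
    """Classify Kozak context of the ATG at atg_index (0-based) in seq.

    Assumed convention: -3 should be a purine (A/G) and +4 should be G.
    Returns None when the flanking bases are not available.
    """
    if atg_index < 3 or atg_index + 3 >= len(seq):
        return None
    purine_at_minus3 = seq[atg_index - 3] in "AG"
    g_at_plus4 = seq[atg_index + 3] == "G"
    if purine_at_minus3 and g_at_plus4:
        return "Strong"
    if purine_at_minus3 or g_at_plus4:
        return "Moderate"
    return "Weak"


def orf_subtype(utr_length, start, stop):
    """Label an upstream ORF by where it ends relative to the coding sequence.

    start is the 0-based index of the ATG in the 5'UTR; stop is the index of
    an in-frame stop codon within the 5'UTR, or None if there is none. The
    coding sequence is assumed to begin at index utr_length.
    """
    if stop is not None:
        return "uORF"                 # terminates within the 5'UTR
    if (utr_length - start) % 3 == 0:
        return "inframe_oORF"         # reads into the CDS in the same frame
    return "outofframe_oORF"          # reads into the CDS in another frame
```

For instance, the context `GCCACCATGG` has an A at -3 and a G at +4, so `kozak_strength("GCCACCATGGCG", 6)` returns `"Strong"`.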

Results
To show the utility of the UTRannotator, we annotated all 5′UTR variants interpreted as pathogenic/likely pathogenic or of uncertain significance (VUS) in ClinVar (version 202005) (Landrum et al., 2018). These variants do not have a coding annotation on any transcript. However, we note that 5′UTR variants are under-represented in ClinVar as they are rarely sequenced and/or reported.
We used the detailed annotations from the UTRannotator to illustrate how to prioritize 5′UTR VUS that are most promising for further follow-up. We first restricted to variants that form new overlapping ORFs (oORFs) with start sites that are Strong or Moderate matches to the Kozak consensus sequence, or that are uORFs with documented evidence of translation, as we previously showed that variants with these consequences are under strongest negative selection (Whiffin et al., 2020). Finally, we took variants in 3191 genes previously identified as having a 'High' likelihood that uORF-perturbation could be an important disease mechanism (Whiffin et al., 2020). Through this approach, we identified 31 potential 'high-impact' ClinVar 5′UTR VUS (Supplementary Table S3).
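The prioritization steps can be sketched as a simple filter over annotated variants. The record fields (`gene`, `orf_type`, `kozak_strength`, `translation_evidence`) and the function name are hypothetical, not the plugin's output columns. The way the criteria are combined here (an oORF with a Strong or Moderate start site, or a uORF with translation evidence, restricted to the 'High' likelihood genes) is one reading of the description above.

```python
def is_high_impact(record, high_likelihood_genes):
    """Apply the two-step prioritization to one annotated variant record."""
    # Step 2 in the text: keep only genes flagged as 'High' likelihood.
    if record["gene"] not in high_likelihood_genes:
        return False
    # Step 1, first branch: new oORF with a Strong or Moderate Kozak match.
    strong_oorf = (
        record["orf_type"] in {"inframe_oORF", "outofframe_oORF"}
        and record["kozak_strength"] in {"Strong", "Moderate"}
    )
    # Step 1, second branch: uORF with documented evidence of translation.
    translated_uorf = (
        record["orf_type"] == "uORF" and record["translation_evidence"]
    )
    return strong_oorf or translated_uorf


def prioritize(records, high_likelihood_genes):
    """Return the records that pass the filter, preserving input order."""
    return [r for r in records if is_high_impact(r, high_likelihood_genes)]
```

Keeping the gene restriction as a set lookup makes it easy to swap in a different gene list without touching the variant-level criteria.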

Discussion
We have created a freely available tool, as a plugin to the Ensembl VEP, that annotates variants that create or disrupt uORFs. The

Fig. 1. 5′UTR variants in ClinVar annotated by the UTRannotator. (a) A schematic showing the five distinct consequences of 5′UTR variants annotated by the tool: those that create an upstream AUG (uAUG_gained), those that disrupt the start site of an existing upstream open reading frame (uORF; uAUG_lost), those that cause a frameshift in the sequence of the uORF (uFrameShift), those that introduce a new stop codon into an existing uORF (uSTOP_gained) and those that disrupt the stop site of an existing uORF (uSTOP_lost). (b) The counts of each variant category that are classified as Pathogenic/Likely Pathogenic (teal) or Uncertain Significance (VUS; grey) in ClinVar.

Table 1. Details of the annotations provided for different categories of uORF-perturbing variants.