aenmd: annotating escape from nonsense-mediated decay for transcripts with protein-truncating variants

Abstract Summary DNA changes that cause premature termination codons (PTCs) represent a large fraction of clinically relevant pathogenic genomic variation. Typically, PTCs induce transcript degradation by nonsense-mediated mRNA decay (NMD) and render such changes loss-of-function alleles. However, certain PTC-containing transcripts escape NMD and can exert dominant-negative or gain-of-function (DN/GOF) effects. Therefore, systematic identification of human PTC-causing variants and their susceptibility to NMD contributes to the investigation of the role of DN/GOF alleles in human disease. Here we present aenmd, a software for annotating PTC-containing transcript-variant pairs for predicted escape from NMD. aenmd is user-friendly and self-contained. It offers functionality not currently available in other methods and is based on established and experimentally validated rules for NMD escape; the software is designed to work at scale, and to integrate seamlessly with existing analysis workflows. We applied aenmd to variants in the gnomAD, Clinvar, and GWAS catalog databases and report the prevalence of human PTC-causing variants in these databases, and the subset of these variants that could exert DN/GOF effects via NMD escape. Availability and implementation aenmd is implemented in the R programming language. Code is available on GitHub as an R-package (github.com/kostkalab/aenmd.git), and as a containerized command-line interface (github.com/kostkalab/aenmd_cli.git).


Introduction
Nonsense-mediated mRNA decay (NMD) is a wellcharacterized, evolutionarily conserved quality-control mechanism that is essential for embryogenesis and other developmental processes, and it is known to play a role in human disease (Lykke-Andersen and Jensen 2015).NMD guards against compromised transcripts by affecting their degradation versus translation, including transcripts with variants that introduce premature termination codons (PTCs).PTC-causing variants where a resulting transcript is subject to NMD can exert loss-of-function (LOF) effects in case of haploinsufficiency, where transcripts from both chromosomes are required for normal protein function.For PTC-harboring transcripts that escape NMD, there are additional possibilities of dominant-negative (DN) or gain-offunction (GOF) effects, where the altered protein may interfere with the wild-type version (DN) or where it can possess an altered molecular function or activity domain (GOF).While molecular mechanisms of DN/GOF effects are generally less well understood compared with LOF effects (Backwell and Marsh 2022) and the pathogenicity of NMD-escaping variants is gene/transcript-specific, they do play a significant role in human disease (Cheng et al. 1994, Inoue et al. 2004, 2007, Khajavi et al. 2006, Ina ´cio et al. 2007, Bhuvanagiri et al. 2010, Miller and Pearce 2014, Lykke-Andersen and Jensen 2015, Coban-Akdemir et al. 2018, Miyake et al. 2020, Gerasimavicius et al. 2022).
Given the potential contribution of PTC variants with NMD escape in causing disease, significant new insights into mechanisms of disease pathogenicity can emerge from annotating PTC-containing transcripts with a prediction about their escape from NMD. (Hall and Thein 1994, Frischmeyer and Dietz 1999, Kerr et al. 2001, Ina ´cio et al. 2004, 2007, Inoue et al. 2004, 2007, Khajavi et al. 2006, Mort et al. 2008, Bhuvanagiri et al. 2010, Miller and Pearce 2014, White et al. 2015, 2016, Bayram et al. 2017, Jansen et al. 2017, Poli et al. 2018, Hamanaka et al. 2020, Miyake et al. 2020, Supek et al. 2021, Kim et al. 2022).Using an exon-exon junction complex-dependent model of NMD, a notable fraction of PTC-harboring transcripts is predicted to escape NMD (Coban-Akdemir et al. 2018).However, this model only describes about half of NMD-escaping human variation accurately, prompting the development of additional approaches for predicting transcript escape from NMD (Nagy and Maquat 1998, Le Hir et al. 2000, 2001, Ishigaki et al. 2001, Kim et al. 2001, Lykke-Andersen et al. 2001, Gehring et al. 2003, Ina ´cio et al. 2004, Silva et al. 2006, Lappalainen et al. 2013, Schweingruber et al. 2013, Lykke-Andersen and Jensen 2015, Rivas et al. 2015, Lindeboom et al. 2016, 2019, Hoek et al. 2019, Nogueira et al. 2021, Supek et al. 2021).Nevertheless, and despite the relevance of annotating PTC-causing variants with respect to a modified transcript's susceptibility to NMD, there is a lack of scalable and accessible software addressing that task comprehensively (i.e. for all types of PTC-causing variants).Therefore, we developed aenmd -a software tool for comprehensive annotation of PTC-causing variant-transcript pairs with (predicted) escape from NMD. aenmd makes use of well-established and experimentally validated rules based on PTC location within a transcript's intron-exon structure (Lindeboom et al. 2016(Lindeboom et al. , 2019)), and it integrates well into existing variant analysis pipelines.In the following, we describe aenmd in more detail and report statistics of NMD escape for PTC-causing variants in the Clinvar (Landrum et al. 2018), gnomAD (Karczewski et al. 2020), and NHGRI-EBI GWAS catalog (Sollis et al. 2023) resources.

Annotating escape from NMD
aenmd predicts escape from NMD for combinations of transcripts and PTC-generating variants by applying a set of NMD escape rules, which are based on where the PTC is located within the mutant transcript.First, the location of the 5 0 -most (novel) PTC is determined, and then escape from NMD is predicted by the following five rules (Lindeboom et al. 2016(Lindeboom et al. , 2019)): Whether • the PTC located in the last coding exon (last exon rule), • the PTC located within d_pen bp upstream of the penultimate exon boundary (penultimate exon rule; default: d_pen ¼ 50) • the PTC located within d_css bp downstream of the coding start site (css rule default: d_css ¼ 150) • the PTC located within an exon spanning more than 407bp (407 bp rule) • the transcript is intronless (single exon rule) See Fig. 1A.Distances (in bp) are calculated using the PTC nucleotide closest to the coding start site (CSS) or exon boundary for the css and penultimate exon rules, respectively; variants are assumed to be left-normalized (Tan et al. 2015) (aenmd provides this functionality).Variants that overlap exon-intron boundaries or splice regions are not currently analyzed by aenmd.Variant-transcript pairs with a PTC conforming to any of the above rules will be annotated to escape NMD, but results for all rules are reported individually by aenmd; this allows users to focus on subsets of rules, if desired.aenmd is implemented in the R programming language (www.r-project.org),making use of the VariantAnnotation (Obenchain et al. 2013) packages for calculating rules.An index containing all PTC-generating SNVs is pre-calculated for a given transcript set and stored in a trie data structure for lookup, using the triebeard package.For non-SNV variants, alternative alleles for overlapping transcripts are explicitly constructed and assessed.This strategy allows us to assess frameshift variants where a PTC is produced downstream of the variant location, and it accounts for both the size and content of sequence insertions, deletions, and insertion-deletions.

Data on genetic variants and transcript models
We obtained gnomAD version v2.1.11liftoverGRCh38, Clinvar version 20221211, and the NHGRI-EBI GWAS version 20220730, catalog from their respective download sites and annotated variants using aenmd.For our analyses, we used transcript models from ENCODE version 105, where we focused on protein-coding transcripts on standard chromosomes that: (i) have an annotated transcript support level of one (or NA for single exon transcripts), and (ii) have a coding sequence length divisible by three.

aenmd data package
The aenmd R-package provides functionality to annotate variant-transcript pairs for predicted escape from NMD within the R ecosystem.Data dependencies (i.e. transcript models) are implemented via specific data packages (see below), and functionality for data import and export (VCF files) is also provided, as is functionality for variant leftnormalization.Key differences that set aenmd apart from currently available tools for annotating escape from NMD are: all types of PTC-causing variants (including frameshift variants that do not cause stop codons at the variant site) are annotated, variants are annotated at scale, inserted sequence is considered for indels, and differentiated (i.e.rule-specific) output is provided for each transcript-variant pair where NMD escape rules are applicable.This enables users to focus on the subset of rules most applicable to their situation; for example, some users may choose to focus on the exon-exon junction complex related "canonical" NMD rules only and ignore the "css proximal," "single exon," and "407 bp plus" rules (see Section 2).In addition to the R-package, we also provide a command-line interface to aenmd's functionality.

aenmd_cli command-line interface
We constructed a containerized version of aenmd with all dependencies, which also provides a command-line interface.This allows end-to-end annotation of variants.An input VCF file is read, PTC-generating variants that overlap a specific transcript set (see the aenmd data packages section below) are annotated, and the annotation results are then included in the INFO column of an output VCF file.In this way, the aenmd_cli command-line tool makes aenmd easily accessible and its results reproducible; there are no external dependencies, no knowledge of the R programming language is required, and it can be seamlessly integrated into existing variant processing workflows.

aenmd data packages
Annotation for (predicted) escape from NMD is based on the location of a PTC in the context of a transcript model.With aenmd, we provide precompiled annotation packages that provide comprehensive protein-coding transcript sets for the GRCh37 and GRCh38 assemblies of the human genome (data packages: aenmd.data.gencode.v43and aenmd.data.gencode.v43.grch37,respectively), based on GENCODE version 43 annotations.We also provide a more stringently filtered transcript set based on ENSEMBL (version 105), containing transcripts with the highest level of transcript support (data package: aenmd.data.ensdb.v105).The aenmd package provides functionality to select between different transcript sets, allowing convenient prediction of NMD escape for GRCh37 and GRCh38 variants.

Annotation of gnomAD, Clinvar, and the GWAS catalog
We used aenmd with a high-quality ENSEMBL transcript set (aenmd.data.ensdb.v105annotation package, see above) to annotate the gnomAD, Clinvar, and GWAS catalog databases of human genetic variation for PTC-generating variants predicted to escape NMD.Our results are summarized in Supplementary Tables S1-S3.We observe that the fraction of NMD escape PTC-generating variants varies between 36% (ClinVar), 50% (gnomAD), and 57% (GWAS catalog).The fraction of coding variants in each database that introduce PTCs also varies (10% for ClinVar, 4.1% for gnomAD, and 4.5% for the GWAS catalog).While the absolute number of PTC-generating variants is low for the GWAS catalog (most of its variants are non-coding), we learn from gnomAD that half of the 300k PTC-generating variants recovered from 125k exome sequences are predicted to escape NMD.Analyzing the ClinVar database (Fig. 1B, Supplementary Table S2), we find that for the subset of variants that are considered pathogenic and generate PTCs, 34% (31k variants) are predicted to escape NMD.This suggests that escape from NMD may play a substantial role in the disease mechanisms underlying variants of clinical significance annotated in ClinVar.

Comparison with VEP NMD plugin
We note that Ensemble's Variant Effect Predictor (VEP, Hunt et al. 2022) provides an NMD annotation plugin that annotates escape from NMD for "stop_gained" variants.This set of variants does not include frameshift variants with a downstream PTC, so the set of variants considered by aenmd and the VEP plugin are inherently different.For example, aenmd annotates 200k variant-transcript pairs for ClinVar, while VEP considers 77k due to its restrictions on variant type (Supplementary Table S4).
Nevertheless, we systematically compared VEP and aenmd NMD escape predictions for the ClinVar database for variants that overlap in transcript set and variant type between the two Annotating escape from nonsense-mediated decay methods.Overall, we find high consistency of NMD escape predictions (97.5% identical predictions), with 773 (out of 75 840) variant-transcript pairs annotated as NMD-escaping by aenmd but not VEP, and with 1096 pairs annotated as NMD escaping by VEP but not aenmd.We manually examined a limited set of 20 randomly selected variants with different predictions; results are summarized in Supplementary Table S5, and differences are often due to understandable technical differences in the implementation of NMD escape rules.

Discussion
Here we present aenmd, a self-contained, accessible, and scalable computational tool for annotating (predicted) escape from NMD for variants that generate premature termination codons (PTCs) in a transcript.While there exist other tools that annotate escape from NMD for PTC variants, aenmd is unique in its specific features.For instance, VEP annotates NMD escape but is limited to "stop_gained" variants.Additionally, a user is unable to readily interpret which NMD escape rules underlie a certain prediction.This interpretability shortcoming is shared with NMDetective (Lindeboom et al. 2019), a tool that annotates annotate escape from NMD and provides a NMD efficacy prediction, a feature that aenmd lacks; however, NMDetective's approach does not consider inserted sequence for indels that cause PTCs.The latter is also the case for the tool NMDescPredictor (Coban-Akdemir et al. 2018), which is also different from aenmd in that it does not provide batch annotation functionality, and it implements a smaller set of NMD escape rules.The tool ALoFT (Balasubramanian et al. 2017) more generally predicts pathogenicity of loss-offunction variants, but it has the capability to annotate NMD escape.However, its output is less fine-grained than aenmd's as to which specific rules drive NMD escape annotations.Similarly, the variant annotator SNPEff (Cingolani et al. 2012) provides NMD escape prediction, but it only considers two NMD escape rules (penultimate and last exon rules).In summary, aenmd stands out in terms of its functionality, flexibility, and interpretability of results.
We also performed a detailed comparison of aenmd with the VEP NMD plugin, which annotates fewer variant types, uses a smaller set of NMD escape rules, and does not report the outcome of individual rules.While aenmd annotates substantially more variants, we nevertheless found that overlapping predictions were highly consistent.
We note that the rules aenmd (and other tools) utilize for predicting escape from NMD do not yield perfect annotations, and not all the rules are believed to work equally well.For instance, Lindeboom et al. (2019) in their NMDetective-B model observe the highest efficacy for the "last exon" rule, followed by the "CSS proximal" rule, followed by the "penultimate exon" rule, followed by the "407bp plus" rule when analyzing cancer data.Further on, it is conceivable that the efficacy of different rules changes across different tissues/ cell-types where affected transcripts are expressed.However, a recent study leveraging GTEx data (Teran et al. 2021) found that NMD effects were highly stable across tissues and individuals, and it concludes that NMD prediction tools' predictive power should be stable across tissues.
In addition, we note that NMD plays a role in designing CRISPR gene editing experiments (Lindeboom et al. 2019), and therefore aenmd's functionality will potentially be useful in this context as well.
In summary, aenmd's comprehensive features, flexibility, and ease of use allow for improved annotation of PTCgenerating variants at low computational cost.
2014) and vcfR(Knaus and Grunwald 2017) packages for importing/exporting variants from/ into variant call format (VCF) files, and the Biostrings(Lifschitz et al. 2022) and GenomicRanges (Lawrence et al. Figure 1.(A) Rules for predicting escape from NMD, with purple-shaded regions indicating areas that would harbor predicted NMD-escaping PTCs.The single exon rule and the 407 bp rule are not shown.(B) ClinVar variants, stratified by pathogenicity and annotated with predicted escape from NMD. transcript-dependent: the same variant overlaps multiple Transcripts and has differing NMD escape predictions.