RMDisease V2.0: an updated database of genetic variants that affect RNA modifications with disease and trait implication

Abstract Recent advances in epitranscriptomics have unveiled functional associations between RNA modifications (RMs) and multiple human diseases, but distinguishing the functional or disease-related single nucleotide variants (SNVs) from the majority of ‘silent’ variants remains a major challenge. We previously developed the RMDisease database for unveiling the association between genetic variants and RMs concerning human disease pathogenesis. In this work, we present RMDisease v2.0, an updated database with expanded coverage. Using deep learning models and from 873 819 experimentally validated RM sites, we identified a total of 1 366 252 RM-associated variants that may affect (add or remove an RM site) 16 different types of RNA modifications (m6A, m5C, m1A, m5U, Ψ, m6Am, m7G, A-to-I, ac4C, Am, Cm, Um, Gm, hm5C, D and f5C) in 20 organisms (human, mouse, rat, zebrafish, maize, fruit fly, yeast, fission yeast, Arabidopsis, rice, chicken, goat, sheep, pig, cow, rhesus monkey, tomato, chimpanzee, green monkey and SARS-CoV-2). Among them, 14 749 disease- and 2441 trait-associated genetic variants may function via the perturbation of epitranscriptomic markers. RMDisease v2.0 should serve as a useful resource for studying the genetic drivers of phenotypes that lie within the epitranscriptome layer circuitry, and is freely accessible at: www.rnamd.org/rmdisease2.


INTRODUCTION
Advances in high-throughput sequencing have revealed millions of single nucleotide variants (SNVs) in genomes. A key challenge lies in the functional annotation of genetic variants, especially if the mutations are synonymous or from the non-coding regions. There is increasing evidence that synonymous variants can affect essential biological functions via epigenetic regulation (1). Moreover, accurate identification of functional SNVs is crucial to better understand the molecular mechanisms underlying human diseases. To annotate the genetic variants with putative downstream mechanisms, enormous computational efforts have been made in exploring the effects of the mutations on various genomic phenomena, including post-transcriptional protein modification (2)(3)(4)(5)(6)(7)(8), transcriptional regulation (9,10), RNA-protein interaction (11), ceRNA networks (12), calpain cleavage (13), polyadenylation (14) and RNA modifications (15)(16)(17)(18).
To date, >170 different types of post-transcriptional RNA modifications (RMs) have been detected, which occurs on different types of RNA and regulates nearly every stage of RNA life. Emerging evidence has revealed that dysregulation of the modification status is involved in multiple human diseases including cancer contexts (19,20). RMs have also been shown to play an essential role in regulating the function of tRNA and mitochondrial RNA. Covalent modifications such as 2 -O-methylation (Nm) within eukaryotic and prokaryotic tRNAs can inhibit the innate immune responses (21)(22)(23). Mitochondrial RMs have been reported to shape metabolic plasticity in metastatic cancers (24), and the specific locations of 5-methylcytosine (m 5 C) and its derivative 5-formylcytosine (f 5 C) on mitochondrial transcriptome can be a potential therapeutic target for metastasis.
To understand the effects of genetic variants on RMs, we have built a database of RM-associated variants with an emphasis on their potential disease associations (17). In RMDisease, by integrating human genetic variants and 303 426 experimentally validated modified sites from eight types of RMs, a total of 202 307 human RM-associated variants were identified and each labeled with the association level (AL). Among them, >4000 disease-relevant variants were annotated, shedding light on the disease mechanisms potentially acting through altering the epitranscriptome layer.
To date, various techniques have been developed to profile RMs. Among these detection techniques, MeRIP-Seq (or m 6 A-Seq) occupies the majority of the market. MeRIP-Seq enables transcriptome-wide RM profiling with a resolution of ∼100 nt (25,26). Besides MeRIP-Seq, many techniques have been developed to detect RMs at base resolution, such as m6A-CLIP (27), m6ACE-Seq (28), Nm-Seq (29), m7G-Seq (30), Pseudo-Seq (31), m1A-Seq (32), RNA-BisSeq (33), FICC-Seq (15), miCLIP-Seq (34), f5C-Seq (35) and Rhp-Seq (36). MeRIP-Seq together with highresolution techniques revealed the transcriptomic profiles of RM sites from multiple species. These detected RM sites can be a valuable resource for computational analysis to unveil the functions of epitranscriptomes in gene regulation and their implication in human diseases (37)(38)(39). In response, efforts have been made to identify the RMassociated variants in human and mouse transcriptomes, resulting in the construction of databases such as RMDisease (17) and RMVar (18) (please refer to Supplementary Table S1 for a brief comparison in the coverage of databases serving a similar purpose). As data from more modification types and species are now available, it is necessary to extend our previous analyses to them.
We have recently upgraded RMDisease to v2.0 by collecting all available RM-associated variants and annotating their potential involvement in diseases and traits. By integrating 873 819 experimentally validated modified sites and a total of 146 396 315 genetic variants, RMDisease v2.0 reports a total of 1 366 252 RM-associated variants that may affect (add or remove) modified sites of 16 types of RM [N 6 -methyladenosine (m 6 A), 5-methylcytosine (m 5 C), , dihydrouridine (D) and 5-formylcytidine (f 5 C)] in 20 species (human, mouse, rat, zebrafish, maize, fruit fly, yeast, fission yeast, Arabidopsis, rice, chicken, goat, sheep, pig, cow, rhesus, tomato, chimpanzee, green monkey and SARS-CoV-2), including 14 749 disease-and 2441 trait-associated variants that may function through disturbing the epitranscriptome. In addition, RNA-binding protein (RBP)-binding sites, mi-croRNA (miRNA) targets and splicing sites were annotated at each RM-associated variant to highlight potential effects on post-transcriptional regulation. The overall design of RMDisease v2.0 is outlined in Figure 1.

Data resource
In RMDisease v2.0, we collected the epitranscriptome profiles of 16  For RMs identified using base-resolution techniques, the genomic coordinates of modified residues were extracted from the corresponding GSE file or supplementary materials of their original publications. For the low-resolution MeRIP-Seq data, the raw FASTQ files were downloaded and re-processed using a common pipeline. Specifically, the raw reads were firstly aligned to the reference genome with hisat2, and the peak-calling process was then performed using exomePeak2 (40) with GC contents corrected.
The genetic variants analyzed in this study consist of two groups. The germline variants in various species were derived from dbSNP (v151) (41), Ensembl 2022 (Ensembl release 106) (42) and 1000 Genomes (Phase3 Mitochondrial Chromosome Variants set). The somatic variants were obtained from 27 different human cancer types in the Cancer Genome Atlas (TCGA) (release version v27.0-fix) (43). We considered in this study a total of 144 117 977 germline variants and 2 278 338 somatic variants, identified in 20 species, respectively (see Supplementary Table S4).

Derivation of RNA modification-associated variants
An RM-associated variant is defined as the genetic mutation leading to either the gain or loss of a specific RM site, as predicted by our previously developed deep neural network model (44). The deep learning-based models were trained by modified residues from one modification type of a specific species. The tissue homogeneity assumption used is consistent with existing studies (17,18) while, in practice, little variation across training tissues is observed on the inference outcomes of cross-tissue validation (Supplementary Figure S1). The obtained RM-associated variants were further classified into three confidence levels. Specifically, these confidence levels were: (i) high: an experimentally validated RM site was directly altered by a genetic variant, resulting in the loss of its modified nucleotide; (ii) medium: a genetic variant altered a nucleotide within the 41 bp flanking window of a base-resolution modification site (the modified site lies in the center of the 41 bp sequence) or an experimentally validated RM-containing region (MeRIP-Seq with a resolution of ∼100 nt), leading to the loss of its modification sta- tus in the mutated sequence evaluated by the deep-learning model; and (iii) low: the transcriptome-wide prediction was performed for a genetic variant altering a nucleotide within the 41 bp flanking window centered on a modification site or the RM-containing region, resulting in the significant decrease or increase in the predicted probability of the modification status, compared between the original and the mutated sequence.
Using the same definition as RMDisease 1.0, we calculated the AL between genetic variant and RM site with the following: where P WT and P SNP represent the probability of RM status for the wild-type and mutated (SNP) sequences, respectively. The AL was then calculated. AL ranges from 0 to 1, with 1 indicating the greatest impact of the variant on the modification status. The statistical significance of AL was evaluated by calculating the P-values using the null distribution of AL from all genetic variants. We retained only the RM-associated variants passing a strict cut-off (AL >0.4 and P-value <0.05 or P-value <0.01 for species with an extremely low number of available variants) as predicted by the deep-learning model.

Functional annotation of RNA modification-associated variants
To aid functional interpretation, we annotated the identified variants with genomic information such as transcript region The tRNA information was annotated by extracting genomic ranges of tRNAs of seven species from GtRNAdb (53). RNAfold software (54) was used to calculate the secondary structure information of modification sites and to generate the corresponding graphic visualization. In addition, the potential post-transcriptional regulations of identified variants were annotated by checking whether they locate in RBP-binding regions from POSTAR2 (55), could be involved in miRNA-RNA interaction based on data from miRanda (56) and startBase2 (57), and/or lie at splicing sites as annotated by UCSC browser (58) annotation with a GT-AG role within 100 bp upstream and downstream of RM-associated variants.

Disease and trait association analysis of RNA modificationassociated variants
To explore potential epitranscriptome-related pathogenesis, an analysis was performed as follows. We integrated the human disease-associated variants and trait association TagSNPs from ClinVar (59) National Genomics Data Center (63). The linkage disequilibrium (LD) analysis was computed for each trait association TagSNP using PLINK (64) software (parameters: -r2 -ld-snp-list -ld-window-kb 1000 -ld-window 10ld-window-r2 0.8). The RM-associated variants were then mapped to these disease-related variants, trait association TagSNPs and their LD mutations.

Database and web interface implementation
RMDisease v2.0 web interfaces were constructed using HyperText Markup Language (HTML), Cascading Style Sheets (CSS) and Hypertext Preprocessor (PHP). All metadata was stored using MySQL tables. EChars was exploited to present statistical diagrams and the Jbrowse genome browser (65) was applied for interactive exploration and visualization of relevant genome coordinate-based records.

Database content
RMDisease v2.0 contains a total of 1 366 252 genetic variants that may affect (add or remove) various types of RMs in multiple species. This represents a 6-fold increase in RMassociated variants, as well as a significant expansion in covered species (from human only to 20 species) and type of  Table  S5). Eight new types of RMs (ac 4 C, hm 5 C, D, f 5 C, Am, Cm, Um and Gm) were covered for the first time, and the number of supported species was increased from two (human and mouse in RMVar) to 20, providing a more comprehensive landscape of the genetic factors potentially involved in epitranscriptome layer dysregulation. Please refer to Supplementary Tables S6-S8 for complete collections of RMassociated variants identified in RMDisease v2.0.

Disease and trait association analysis
Next, the disease-related SNPs, trait-TagSNPs and their LD mutations were mapped to all RM-associated variants to unveil the phenotypes (disease or trait) potentially regulated at the epitranscriptome layer. We found that a total of 14 749 disease-relevant RM-associated variants are also linked to human diseases, which is more than three times larger than the previous version (Table 1). For example, for the m 6 A RM, 6477 genetic variants that may alter m 6 A status localized on 2026 genes were associated with 1709 known diseases and phenotypes according to the ClinVar and GWAS databases. In addition, 571 009 RM-associated somatic variants were also recorded in TCGA, linking them with 27 types of cancers (Supplementary Table S6). Furthermore, 2441 RM-associated variants were linked to various traits in four species (rice, sheep, cow and maize). We then calculated the disease and trait phenotypes that are most enriched within each type of RM; please refer to Supplementary Table S9 for a summary.

Enhanced web interface
We redesigned a user-friendly web interface to enables users to efficiently query, carry out customized searches and quickly download all collected RM-associated variants from the database. In addition, two new real-time data analysis modules were developed and presented on the web interface, namely VarFinder and Enrichment analysis tool. Users can upload a list of genetic variants in VCF format and perform inference on the database collections.
Query. RMDisease v2.0 provides two modes to query all collected RM-associated variants. The user can query the database by selecting both species and the modification type ( Figure 2A). For example, when users query the database by clicking the 'N6-methyladenosine (m 6 A)' button, the returned page will display m 6 A-associated variants identified with available species ( Figure 2B). Users can further select a specific species and click the individual RM ID to check the summary table ( Figure 2C), which includes basic information of the modification site and the associated variant(s), reference and mutated sequence, data source and post-transcriptional regulations involved (miRNA targeting, RBP-binding region and alternative splicing, Figure  2D-F). In addition, a statistics plot is provided for the visualization of the global data distribution ( Figure 2H).  Figure 2G). The query will return a full list of relevant data on a specific trait.  Download and share. All data collected in RMDisease v2.0 can be freely accessed and shared. Users can download all the data or their section of interest from the 'Download' page. Additionally, users can also access the application program interface (API) on the web interface, which offers a personalized query and download of all collected data. Please refer to the 'Help' and 'API' pages for more complete data descriptions.

Search. In
Enhanced analysis tools. Two data analysis modules were firstly introduced. (i) Enrichment analysis: the statistical significance and fold enrichment of user-provided human variants over 33 TCGA cancer types and 16 modification types were calculated. Specifically, when users provide a list of human variants, the enrichment analysis calculates whether this variant set is significantly correlated to any specific type of TCGA cancer variants or modification disturbance, based on the statistical significance of the enrichment reported by the P-value of a binomial test. (ii) VarFinder: calculating the associations between a list of user-uploaded variants and 16 types of RNA modifications. Users can upload a list of interested variants to evaluate their potential effects over sites of a specific RM.

Case study on COVID-19: spike glycoprotein
The 'COVID-19' page of RMDisease v2.0 collects the m 6 Aaffected variants located on COVID-19. Studies have confirmed that mutations in the COVID-19 spike gene (S) play a crucial role in infectivity enhancement and immune escape (66,67). m 6 A modifications in COVID-19 have also been reported to be associated with the innate immune response of the host cell (68). Of interest here are the m 6 A-affected variants located on the COVID-19 spike gene (S). Searching by gene 'S' in COVID-19 in the search box returns a total of 823 records ( Figure 3A), comprising 197 m 6 A-gain variants and 626 m 6 A-loss variants. It is possible to further filter the records by confidence level; the user can select the 'medium' level so that only variants that may destroy experimentally validated m 6 A sites identified on COVID-19 spike genes remain. More information can be viewed by clicking a specific RM ID ( Figure 3B), including supported study, gene type, gene region, association level, alternative sequence and Jbrowse genome browser ( Figure  3C). Besides the spike gene, the user can view all COVID-19 m 6 A-affected variants collected in RMDisease v2.0 ( Figure 3D).

DISCUSSION
Increasing numbers of studies have reported that RNA modifications regulate essential biological processes and are involved in the mechanisms of multiple diseases. To further elucidate the genetic basis of epitranscriptome regulation, RMDisease v2.0 collected a total of 1 366 252 RMassociated variants that can alter 16 types of RM in 20 species. A large number of disease-/trait-associated variants are identified to elucidate the potential impact on phenotype of perturbations at the epitranscriptome layer. Compared with RMDisease V1.0 and RMVar, substantial improvements have been made in RMDisease v2.0 to cover wider RM types and more species. The number of RM-associated variants presented in RMDisease v2.0 is six times more than the previous release. As high-throughput epitranscriptome data become increasingly available, advanced RM profiling techniques are continually being developed to identify previously uncovered modification sites, along with transcriptome-wide detection of new types of modification from various species. RMDisease will be continuously updated and expanded in the future to serve as a useful resource for the research communities of RM and genetics.

DATA AVAILABILITY
All data used in this study are already publicly available in the GEO database, National Genomics Data Center, The Cancer Genome Atlas (TCGA), dbSNP (v151), 1000 Genome and Ensembl 2022 (Ensembl release 106). The accession number and detailed description can be found in Supplementary Tables S2-S4. All the identified RMassociated variants collected in RMDisease v2.0 are freely accessible at: www.rnamd.org/rmdisease2.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.