ConVarT: a search engine for matching human genetic variants with variants from non-human species

Abstract The availability of genetic variants, together with phenotypic annotations from model organisms, facilitates comparing these variants with equivalent variants in humans. However, existing databases and search tools do not make it easy to scan for equivalent variants, namely ‘matching variants’ (MatchVars) between humans and other organisms. Therefore, we developed an integrated search engine called ConVarT (http://www.convart.org/) for matching variants between humans, mice, and Caenorhabditis elegans. ConVarT incorporates annotations (including phenotypic and pathogenic) into variants, and these previously unexploited phenotypic MatchVars from mice and C. elegans can give clues about the functional consequence of human genetic variants. Our analysis shows that many phenotypic variants in different genes from mice and C. elegans, so far, have no counterparts in humans, and thus, can be useful resources when evaluating a relationship between a new human mutation and a disease.


INTRODUCTION
The progress of the genomics field is accelerating with the aid of next-generation sequencing technologies, most recently exemplified by the launch of the Exome Aggregation Consortium (ExAC) and the Genome Aggregation Database (gnomAD). These genomic databases were gathered from >140 000 randomly selected individuals, providing large-scale collections of human genetic variants (1,2). Humans possess over 700 million genetic variants in the protein-encoding and non-protein-coding regions of the genome. However, the growing number of human genetic variants does not automatically mean that our understanding of the functional effects of these variants concurrently increases (3). Functional consequences are available for only a handful of genetic variants, typically those studied because of their clinical importance, while biological impacts are unavailable for most human genetic variants (4). A biological understanding of these genetic variants is crucial for better disease diagnosis and therapeutic decision-making, particularly given the contribution of individual variants to rare and complex diseases.
Though examining the functional effects of variants is an efficient strategy for categorizing them as either pathological or benign, the considerable workload and financial demand are limiting factors for such experimental efforts (5–9). In this regard, model organisms are regularly used to examine the functional influence of clinically relevant variants, but unprecedented quantities of variant and phenotype data from model organisms largely go unnoticed. We assume that when an amino acid that is conserved among homologous sequences undergoes an intolerable change in a non-human species, as observed with phenotype-changing variants, the human-equivalent variant will likely disrupt the function of the gene in humans. The basis for this claim is that the function of proteins is likely to be compromised when highly conserved amino acid positions are altered (10). Despite the availability of various orthology search tools, there is no existing search engine that: (1) allows users to scan for equivalent variants between human and non-human species; (2) incorporates available experimental data from model organisms into corresponding variant positions and (3) enables users to submit empirical variant data (11–15).
Here, we develop an integrated search tool and database called ConVarT (Congruent clinical Variation Visualization Tool) to search for equivalent variants (pathological variants, phenotypic variants, and variants of uncertain significance) between human and non-human species. As a proof of concept, we matched amino acid-changing variants from humans with variants from mice and the nematode Caenorhabditis elegans and generated a list of equivalent variants called matching variants (MatchVars) between humans and these two organisms. The availability of organism-specific databases (WormBase for C. elegans and Mutagenetix for mice) containing non-naturally occurring variants and phenotype data for C. elegans and mice enables us to equate them with human genetic variants from gnomAD, Clinical Variants (ClinVar), COSMIC and dbSNP (Figure 1) (2,16–21). We initially derived the human gene orthologs for M. musculus and C. elegans. We then collected amino acid-changing variants from gnomAD, ClinVar, COSMIC and dbSNP for humans, from WormBase for C. elegans, and from the Australian Phenome Bank and Mutagenetix for mice. gnomAD describes over 16 179 380 variations, while the ClinVar database contains over 500 000 variants, including pathogenic and benign variants (2,20). COSMIC and dbSNP contain over 6 842 627 variations and 1 086 546 variations, respectively (16,19). Our analysis revealed that gnomAD, ClinVar, COSMIC and dbSNP contain both shared and database-specific variants (Figure 2A). Mutagenetix and WormBase present almost 800 000 variants from these organisms (17,18). To compare human variants with those from mice and C. elegans, we matched the protein sequences of the orthologous genes and then performed 2 379 397 multiple sequence alignments (MSAs) of all combinations of these protein sequences to ensure alignment of the corresponding amino acid positions. The schematic workflow of ConVarT is presented in Figure 1.
We then integrated genetic variants and variant-specific annotations (such as pathogenicity, phenotypes, and allele frequency) from humans, mice, and C. elegans into corresponding amino acid positions in each gene. We restricted our comparisons to amino acid-changing variants, as amino acids are more conserved than non-amino acid-changing variants, and protein function is likely to be affected by an alteration in the corresponding amino acids. For visualization of MSAs together with variants from humans, mice, and C. elegans, we used in-house software (22). Previous studies revealed that many human variants overlap with amino acids undergoing post-translational modifications (PTMs) and are enriched at the protein domains. Therefore, Pfam protein domains for humans and 383k PTMs from PhosphoSitePlus were systematically integrated into corresponding positions (23,24).
Our analysis revealed that many equivalent amino acid substitutions exist between humans and these two organisms (Figure 2B). For example, both human AP1M1 and C. elegans UNC-101, the homologue of AP1M1, have the identical residue change at the corresponding positions (Human NP_001123996.1: p.P375S, Variant ID: rs1490425955 versus C. elegans NP_001040675: p.P362S, Variant ID: WBVar00679460). We have named these equivalent variants 'matching variants' (MatchVars) (Figure 2B, and Supplementary Tables S1 and S2). For a variant to be considered a MatchVar, the amino acid residue at the corresponding positions must be matched between the human gene and the ortholog genes. The Materials and Methods provide a detailed explanation. There are 50 430 MatchVars between humans and mice, while the number of MatchVars between humans and C. elegans is 42 614 (Figure 2B, and Supplementary Tables S1 and S2). Finally, to make the MatchVars available to the community, we created a web-based search engine and database called ConVarT (http://www.convart.org/) to facilitate consistent and rapid visualization of MatchVars and variant annotations on MSAs together with human PTMs and protein domains. Below the MSAs, ConVarT displays a protein sequence similarity index based on the matching residues for the three species, which researchers can use to gain more insight into how well the sequence is conserved in the model organism compared to humans.
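The MatchVar criterion described above can be illustrated with a minimal sketch. It assumes two already-aligned protein sequences (gaps written as "-") and variants given as (reference residue, 1-based position, alternate residue) tuples. The function names, the toy sequences, and the choice to report whether the substitution itself is identical are illustrative assumptions, not the ConVarT implementation; the Materials and Methods remain the authoritative definition.

```python
from typing import Dict, List, Tuple

GAP = "-"
Variant = Tuple[str, int, str]  # (reference residue, 1-based position, alternate residue)


def column_to_position(aligned: str) -> Dict[int, int]:
    """Map each non-gap alignment column (0-based) to its residue position (1-based)."""
    mapping: Dict[int, int] = {}
    position = 0
    for column, residue in enumerate(aligned):
        if residue != GAP:
            position += 1
            mapping[column] = position
    return mapping


def find_matchvars(
    human_aln: str,
    model_aln: str,
    human_vars: List[Variant],
    model_vars: List[Variant],
) -> List[Tuple[Variant, Variant, bool]]:
    """Pair human and model-organism variants that fall in the same alignment
    column and share the same reference residue (hypothetical helper).

    The boolean in each result records whether the substitution is identical too.
    """
    if len(human_aln) != len(model_aln):
        raise ValueError("aligned sequences must have equal length")

    human_pos_to_col = {pos: col for col, pos in column_to_position(human_aln).items()}
    model_col_to_pos = column_to_position(model_aln)

    model_by_pos: Dict[int, List[Variant]] = {}
    for variant in model_vars:
        model_by_pos.setdefault(variant[1], []).append(variant)

    matches = []
    for h_ref, h_pos, h_alt in human_vars:
        col = human_pos_to_col.get(h_pos)
        if col is None or col not in model_col_to_pos:
            continue  # position outside the sequence, or aligned to a gap
        m_pos = model_col_to_pos[col]
        for m_ref, _, m_alt in model_by_pos.get(m_pos, []):
            # Reference residues must agree with each other and with the alignment.
            if h_ref == m_ref == human_aln[col] == model_aln[col]:
                matches.append(((h_ref, h_pos, h_alt), (m_ref, m_pos, m_alt), h_alt == m_alt))
    return matches


if __name__ == "__main__":
    # Toy alignment and variants, invented for illustration only.
    human = "MK-PLA"
    model = "MKTP-A"
    for match in find_matchvars(
        human, model,
        [("P", 3, "S"), ("L", 4, "F"), ("A", 5, "T")],
        [("P", 4, "S"), ("A", 5, "V")],
    ):
        print(match)
```

In this toy input, human p.P3S pairs with model p.P4S because both sit in the same alignment column and share the reference proline, mirroring how human p.P375S and C. elegans p.P362S occupy different numeric positions yet correspond on the alignment. Human p.L4F has no partner because it aligns to a gap.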
ConVarT reduces the time needed to find phenotypic MatchVars and MatchVars of unknown significance in the genomes of three organisms from hours to seconds. A comprehensive list of identifiers that can be used to search for genes on ConVarT can be found in Supplementary Table S17. Furthermore, users can now submit any protein sequence as an input, and ConVarT will perform sequence comparison and alignment to find the closest human orthologue of that protein sequence, as well as display human variants on the pairwise sequence alignment of the human protein and the submitted protein sequence (Supplementary Figure S2). This is especially useful for researchers working with model organisms, such as S. cerevisiae, that have yet to be incorporated into ConVarT.
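The idea of aligning a submitted protein against candidate human proteins and keeping the best-scoring one can be sketched with a basic global (Needleman-Wunsch) alignment. The aligner and scoring scheme that ConVarT actually uses are not stated in this section. The simple match/mismatch/gap scores, the function names, and the toy sequences below are therefore assumptions for illustration. Practical protein alignment typically relies on substitution matrices and affine gap penalties.

```python
from typing import Dict, Tuple


def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -2) -> Tuple[int, str, str]:
    """Needleman-Wunsch global alignment with linear gap penalty.

    Returns (score, aligned_a, aligned_b), with gaps written as '-'.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def closest_sequence(query: str, candidates: Dict[str, str]) -> Tuple[str, int, str, str]:
    """Return (name, score, aligned_query, aligned_candidate) for the best-scoring candidate."""
    results = {name: global_align(query, seq) for name, seq in candidates.items()}
    best = max(results, key=lambda name: results[name][0])
    return (best,) + results[best]


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    humans = {"protein_A": "MKPLA", "protein_B": "GGGGG"}
    print(closest_sequence("MKPA", humans))
```

Once the pairwise alignment is available, human variants can be placed on it by mapping each variant's residue position to its alignment column.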

Usage of ConVarT to conjecture about the functional effects of variants in disease analysis
Disease-causing variants are annotated as pathogenic in humans, whereas variants with phenotypic annotations in non-human species indicate disease-like conditions for them, suggesting that both types of annotations represent the disruption of the protein function. The majority of human genetic variants in the ClinVar database are single nucleotide variants (SNVs), and variants of uncertain significance (VUS) make up the majority of these SNVs (Supplementary Figure S1A and B). Expectedly, compared to the fraction of phenotypic MatchVars coinciding with benign variants of ClinVar, the proportion of phenotypic MatchVars of C. elegans overlapping with pathogenic variants of ClinVar is calculated to be significantly higher (P < 0.0001; chi-square (χ²) test), which is consistent with the equivalence of pathogenic and phenotypic variants. For phenotypic MatchVars from mice, the mouse phenotypic MatchVars/ClinVar pathogenic variants ratio is slightly higher than the ratio of the mouse phenotypic MatchVars/ClinVar benign variants (P < 0.05; χ² test) (Figure 2C and D). ConVarT can serve researchers and human geneticists in several ways. First, they can assess whether human variants of interest already have MatchVars associated with a phenotype in mice or C. elegans, potentially adding a new layer of support to their study. For example, the MatchVar p.G368E (allele name n4435 and NP_495455) in C. elegans KAT-1 (Human ACAT1) was identified as phenotypic (short life span) in 2010. The p.G388E variant (NP_000010) in human ACAT1 was reported as pathogenic seven years later, resulting in human acetyl-CoA acetyltransferase deficiency (25,26). We presented all the phenotypic MatchVars of mice and C. elegans overlapping with pathogenic and other variants in ClinVar, COSMIC, dbSNP, and gnomAD (Figure 2E–J, and Supplementary Tables S1–S16). Second, there may currently be no MatchVars from humans for the phenotypic MatchVars of mice or C. elegans.
When a new human variant associated with a disease appears, scientists and human geneticists may look for the corresponding phenotypic MatchVars in ConVarT, thereby helping to assign certain variants as potentially detrimental. Finally, the Million Mutation Project and Mutagenetix have many strain collections that bear unique MatchVar mutations. Many of these mutations are poorly characterized, such as those in C. elegans and mouse mutants bearing MatchVars (C. elegans DYF-18/CDK7 NP_502232: p.P245L and p.R26K; C. elegans OSM-3/KIF17 NP_001023308: p.T89I, p.P161L, p.S208L, p.A464T, and p.W572*; mouse Ros1, NP_025412: p.T220A and p.V2050A; mouse Tg, NP_033401: p.S164G, p.T1555S, and p.C1993R) (Figure 3A–D) (17,27). Therefore, the functional significance of these MatchVars is currently unknown. These mutants can be obtained from the distribution centers for functional analysis (17,27). Researchers can also easily submit their variants and phenotype data to ConVarT, thereby sharing their phenotypic and variant data with the community.
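The comparison of phenotypic MatchVar proportions between pathogenic and benign ClinVar variants reported in this section relies on a chi-square test. As a minimal sketch of that kind of test, the code below computes Pearson's chi-square statistic (without continuity correction) for a 2×2 table, using the closed-form P value for one degree of freedom. The table layout is one plausible reading of the comparison, and the counts in the example are hypothetical placeholders, not values from this study.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for the table

                     overlaps phenotypic MatchVar   does not overlap
        pathogenic              a                        b
        benign                  c                        d

    Returns (chi-square statistic, P value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not data from the paper.
    chi2, p = chi_square_2x2(20, 10, 10, 20)
    print(f"chi2 = {chi2:.3f}, P = {p:.4f}")
```

With small expected cell counts, Fisher's exact test or a continuity correction would usually be preferred over this uncorrected form.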

Utilizing phenotypic variants from non-human species to add one layer of support for understanding human variants
Eight missense variants in C. elegans ben-1 (Human TUBB4B) encoding tubulin beta chain (NP_497728: p.E69G, p.Q131L, p.S145F, p.F167Y, p.A185P, p.F200Y, p.M257I and p.D404N) were implicated in displaying resistance to benzimidazole (BZ) and were labeled as putative loss-of-function (LoF) variants (28,29) (Figure 3E). We found that the C. elegans missense residues were conserved in humans and mice, except for p.F200. Next, we investigated whether these conserved residues had MatchVars in humans, which revealed no MatchVars for p.E69G, p.Q131L, p.S145F, p.A185P and p.D404N in humans. This result suggested that these variants were under negative selection in the human population, potentially due to their detrimental effects. All these variants were consistently predicted to be deleterious by SIFT (Supplementary Table S3). However, p.M257I and p.F167Y in C. elegans ben-1 (NP_497728) are the MatchVars of p.M257I and p.F167Y (variant IDs: COSM3929966 and COSM6451925) in human TUBB4B (NP_006079.1), respectively, which are predicted to be pathogenic by FATHMM (Figure 3E) (19). Indeed, the absence of the p.M257I and p.F167Y mutations in human TUBB4B in gnomAD and TOPMed suggests that they were likely negatively selected. This indicates that the experimental data already available from C. elegans and mice, together with allele frequency, may provide additional evidence to deduce the functional implication of human variants.

DISCUSSION
Model organisms have contributed significantly to our knowledge of human biology and diseases. For this reason, the Alliance of Genome Resources recently launched an integrated database that shares genetic and functional genomic data from a variety of model organisms in order to aid scientists in their understanding of human biology and disorders, though they did not provide a search engine for MatchVars (30). This current work fills this gap by presenting a search engine for MatchVars and a database that displays, side by side, pathogenic variants, phenotypic variants, and variants of unknown significance from humans, mice, and C. elegans. ConVarT was designed with the goal of expanding and incorporating variants from other organisms, such as zebrafish and Drosophila. ConVarT recently added 24 658 526 human variants from TOPMed to its database (Supplementary Table S18) (3). The model organism community can easily submit new variant data and expand ConVarT to become an open platform for exchanging information about new MatchVars, thereby aiding the presentation of empirical evidence from model organisms to provide one more layer of support for interpretation of human variants.

The gene homology list for humans and non-human species
The gene homology list for human and non-human organisms was first built using organism-specific orthology resources such as DIOPT, MGI, and the Zebrafish Information Network (ZFIN) (20,31,32). In addition, as we uncovered inaccuracies and missing data in the orthology data, we manually corrected them using a reciprocal BLAST (33) analysis to find the counterparts of human genes. For example, the reciprocal BLAST analysis reveals that