VarMap: a web tool for mapping genomic coordinates to protein sequence and structure and retrieving protein structural annotations

Abstract Motivation Understanding the protein structural context and patterning on proteins of genomic variants can help to separate benign from pathogenic variants and reveal molecular consequences. However, mapping genomic coordinates to protein structures is non-trivial, complicated by alternative splicing and transcript evidence. Results Here we present VarMap, a web tool for mapping a list of chromosome coordinates to canonical UniProt sequences and associated protein 3D structures, including validation checks, and annotating them with structural information. Availability and implementation https://www.ebi.ac.uk/thornton-srv/databases/VarMap. Supplementary information Supplementary data are available at Bioinformatics online.

The uploaded file can contain additional tab-separated columns, but these will be ignored by VarMap. A heading line in the input file is optional and will also be ignored. As the upload of very large files can take a long time -and possibly result in the server timing out -it is recommended that the file be stripped of surplus data columns prior to upload. The maximum number of coordinates recommended is 50,000.
Selection and checking of build Users can specify which genome build their DNA coordinates are taken from -either GRCh37 or GRCh38 -by clicking the appropriate radio button on the input form. If there are enough coordinates in the file (i.e. at least 20), VarMap will check the build automatically. It does this by taking a random set of 20 coordinates from the input file and checking the original base against that returned by the Ensemble REST API (Cunningham et al., 2019) for builds GRCh37 and GRCh38. The build giving the best agreement is then used for the entire set of coordinates in the file. Any coordinates then found not to match this build are flagged with a warning. VEP The first step in the process is to call VEP (McLaren et al., 2016). If the input contains fewer than 20 entries, the VEP API is called for each coordinate in turn, and progress is reported on screen. For larger data sets, the user is asked for their e-mail address so that the processing can be performed in batch mode on our processor farm using an in-house installation of VEP. A link to the results is then e-mailed to the user when all processing is complete.
VEP output For each input coordinate, VEP returns the corresponding list of transcripts, identified by an ENST code, together with additional data (see table below). VarMap identifies the protein isoform corresponding to each ENST transcript using a list of transcript-isoform pairs (where the translated transcript sequence is identical to an isoform sequence) provided by UniProt(UniProt, 2019). Given the isoform, the following data can be obtained: UniProt isoform accession number, amino acid position, amino acid change, gene symbol, PolyPhen (Adzhubei et al., 2013) score, SIFT (Vaser et al., 2016)score, CADD score (Rentzsch et al., 2019) and VEP consequence. Additionally, the RefSeq (O'Leary et al., 2016) accession for each transcript, where available, is retrieved via the Ensemble BioMart (Kinsella et al., 2011) download.
UniProtKB/SWISS-PROT canonical isoform Of particular interest are isoforms that correspond to the canonical sequences in UniProtKB/SWISS-PROT (Boutet et al., 2007), as these will have the curated annotations. Other transcripts may map to alternative isoforms, or may not map to any isoform at all (i.e. the DNA variant is not in a known coding region of the genome). Only those mapping to the UniProtKB/SWISS-PROT canonical sequences are further annotated with 3D protein structural information (if available). The canonical is found by comparing a UniProt transcript-isoform pair list with SWISS-PROT.

VEP consequences
There is a wide range of possible consequences of any given variant, as defined by VEP. Those of interest here are missense, synonymous, stop gain, or stop loss variants. Others are likely to fall in an untranslated region such as a 5'UTR or intron, and no further mapping can take place for these. For missense and synonymous variants, the corresponding amino acid returned by VEP is checked against the amino acid at the given protein position in the SWISS-PROT sequence. A mismatch suggests a problem with the identification of the canonical isoform, so a warning is returned in the output.

VarMap output
The complete list of fields returned by VarMap for each valid variant is given in Table S1. The sources of additional data are: • RefSeq identifier is retrieved from the HGNC database (Braschi et al., 2019) using the ENSG gene identifier returned by VEP. This is the Select RefSeq for the gene to which the variant maps, and may represent a transcript which is not associated with the UniProt canonical isoform. • Residue conservation is computed using the ScoreCons algorithm (Valdar, 2002) from the pairwise alignments obtained from a BLAST (Pearson, 2014)search of the canonical protein sequence against the UniProt Knowledgebase. • Natural variants are obtained from the gnomAD database (Lek et al., 2016) processed by VarMap to map DNA coordinates to protein residue positions. • Diseases associated with variants at the given residue position come from UniProt and ClinVar (Landrum et al., 2018). • PFAM domain data come from the PFAM(El-Gebali et al., 2019) ftp server.
• CATH domain data come from the CATH (Dawson et al., 2017) ftp server.
• The closest PDB structure is found by a FASTA (Pearson, 2014) search of the canonical protein sequence against the protein sequences in PDBe(ww, 2019). The closest match, by E-value, to the region of the protein covering the residue position of interest is taken. Other sufficiently close hits (i.e. sequence identity at least 30%) are retained for the structural information they can provide. For example, the closest PDB structure might comprise just the protein, whereas the structures of other, related proteins may have information about any intermolecular interactions the residue of interest might be involved in -e.g. interaction with ligand, DNA, metal, or other protein.
• Secondary structure information comes from PDBsum (Laskowski et al., 2018)and includes the secondary structure assignment (strand, helix or coil), whether the residue is a catalytic residue (as defined in M-CSA (Ribeiro et al., 2018), and whether it is part of a disulphide bond. Other information supplied by PDBsum includes any interactions the residue is involved in (i.e. with ligand, DNA, metal, or other protein).
The flow diagram below (figure S1) illustrates the VarMap pipeline and the sources of data it makes use of.    Figure 1D -The VarMap output columns USER_BASE and USER_VARIANT were used to count those variants were both the reference and variant allele were a single nucleotide. Figure 1E -The VarMap output column CHANGE_TYPE was used to count the types of variant consequence and those that could not be mapped using the tool. Figure 1F -The VarMap output column PDB_UNIPROT_MATCH was used to establish whether a variant could be mapped to the exact human protein structure corresponding to the gene or whether it could be mapped to a closely related homologous structure. Figure 1G -The VarMap output columns NTO_DNA, NTO_LIGAND, NTO_METAL and NTO_PROTEIN were used to count how many variant amino acids have intermolecular contacts. If more than one contact is seen for a variant then both interactions are counted. If no interactions are seen then it is counted as 'no intermolecular interactions'.