Structure-PPi: a module for the annotation of cancer-related single-nucleotide variants at protein–protein interfaces

Motivation: The interpretation of cancer-related single-nucleotide variants (SNVs) considering the protein features they affect, such as known functional sites, protein–protein interfaces, or relation with already annotated mutations, might complement the annotation of genetic variants in the analysis of NGS data. Current tools that annotate mutations fall short in several respects, including the ability to use protein structure information and to interpret mutations in protein complexes.

Results: We present the Structure–PPi system for the comprehensive analysis of coding SNVs based on 3D structures of protein complexes. The 3D repository used, Interactome3D, includes experimental and modeled structures for proteins and protein–protein complexes. Structure–PPi annotates SNVs with features extracted from the UniProt, InterPro, APPRIS, dbNSFP and COSMIC databases. We illustrate the usefulness of Structure–PPi with the interpretation of 1 027 122 non-synonymous SNVs from COSMIC and the 1000G Project, which provides a collection of ∼172 700 SNVs mapped onto the 3D structures of 8726 human proteins (43.2% of the 20 214 SwissProt-curated proteins in UniProtKB release 2014_06) and onto protein–protein interfaces with potential functional implications.

Availability and implementation: Structure–PPi, along with a user manual and examples, is available at http://structureppi.bioinfo.cnio.es/Structure; the code for local installations is available at https://github.com/Rbbt-Workflows

Contact: tpons@cnio.es

Supplementary Information: Supplementary data are available at Bioinformatics online.

Yes. Accepts user genomic or protein coordinates

ANNOTATE
Annotates genomic mutations based on the protein features that are overlapping amino-acid changes.

ANNOTATE_MI
Annotates mutated isoforms based on the protein features that are overlapping amino-acid changes.

ANNOTATE_NEIGHBOURS
Annotates genomic mutations based on the protein features that are in close physical proximity to amino-acid changes.

ANNOTATE_MI_NEIGHBOURS
Annotates mutated isoforms based on the protein features that are in close physical proximity to amino-acid changes.

INTERFACES
Find variants that affect residues in protein-protein interaction interfaces. It uses the PDB files of protein-protein complexes annotated in the Interactome3D database (release 2014_1).

MI_INTERFACES
Find mutated isoforms with affected residues in protein-protein interaction interfaces.

MI_NEIGHBOURS
Find residues within physical proximity to amino-acid changes in mutated isoforms.

NEIGHBOUR_MAP
For a given PDB file, find all pairs of residues that fall within a given distance of each other. It uses the PDB files of individual proteins annotated in the Interactome3D database (release 2014_1).

NEIGHBOURS_IN_PDB
Use a PDB file to find the residues neighbouring, in three-dimensional space, a particular residue in a given sequence.

PDB_ALIGNMENT_MAP
Find the correspondence between sequence positions in a PDB file and in a given sequence. The PDB positions are reported as 'chain:position'.

PDB_CHAIN_POSITION_IN_SEQUENCE
Translate the positions of amino acids in a particular chain of the provided PDB file into positions inside a given sequence.

SEQUENCE_POSITION_IN_PDB
Translate the positions inside a given amino-acid sequence to positions in the sequence of a PDB file by aligning them.

Wizard
Retrieve all annotations, including neighbors and interfaces, by using genomic mutation, mutated isoform, or an identifier such as associated gene name or gene symbol.
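
The idea behind a task such as NEIGHBOUR_MAP (all residue pairs within a given distance of each other) can be illustrated with a minimal Python sketch. This is not the Structure-PPi implementation: the function name `neighbour_map`, the 5 Å cutoff, the use of a single representative atom per residue, and the toy coordinates are all assumptions made for illustration. Only the 'chain:position' labelling follows the convention described for PDB_ALIGNMENT_MAP.

```python
from itertools import combinations
from math import dist


def neighbour_map(coords, cutoff=5.0):
    """Return pairs of residue labels whose coordinates lie within `cutoff`.

    `coords` maps a 'chain:position' label to an (x, y, z) tuple for one
    representative atom of that residue (an assumption of this sketch).
    """
    pairs = []
    # Compare every unordered pair of residues exactly once.
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sorted(coords.items()), 2):
        if dist(xyz_a, xyz_b) <= cutoff:
            pairs.append((res_a, res_b))
    return pairs


# Hypothetical coordinates in angstroms, not taken from any real PDB entry.
toy_coords = {
    "A:10": (0.0, 0.0, 0.0),
    "A:11": (3.8, 0.0, 0.0),
    "A:12": (7.6, 0.0, 0.0),
    "B:5": (0.0, 4.0, 0.0),
}

pairs = neighbour_map(toy_coords, cutoff=5.0)
for res_a, res_b in pairs:
    print(res_a, res_b)
```

A pair that spans two chains, such as A:10 and B:5 in the toy data, is the kind of contact an interface-oriented task would report. The all-against-all comparison is quadratic in the number of residues; a spatial index would be the usual choice for large complexes.
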
We illustrate the performance and usefulness of the Structure-PPi module by applying it to a validation set of mutations (14 pathogenic and 10 neutral) defined in Lee et al., 2010. Mutations included in the validation set (Table 1 and Supplementary Table S1 in Lee et al., 2010) were classified by genetic or integrative methods that combined data from different sources: co-occurrence with known deleterious mutations, personal and family history of patients carrying the variant, and co-segregation of the variant with disease within pedigrees. As shown below, Structure-PPi achieves a level of performance similar to that of MetaSVM, a support vector machine algorithm that incorporates results from state-of-the-art methods (e.g., SIFT, PolyPhen-2, MutationTaster, Mutation Assessor, FATHMM, and LRT) and the maximum frequency observed in the 1000G project (for details see the dbNSFP v2.8 database at https://sites.google.com/site/jpopgen/dbNSFP). In addition, Table S3 shows the utility of Structure-PPi for providing information complementary to the prediction methods. This complementary information facilitates the discrimination of false positive results (bold letters in the column MetaSVM) and also identifies mutations that should be studied in more detail (bold letters in the column StructurePPi).
For the purpose of comparison, we assume that the Structure-PPi annotations support a "(D)eleterious" prediction in the following two scenarios: i) the mutation lies in a protein-protein interface AND either the mutated position OR its neighboring residues accommodate variants in human diseases, and ii) the mutation lies outside protein-protein interfaces AND both the mutated position AND its neighboring residues accommodate variants in human diseases. Otherwise, Structure-PPi suggests a careful experimental study of the mutation. Although the goal of Structure-PPi is to annotate mutations rather than to predict damage, we used these assumptions to calculate the Accuracy, Recall (or Sensitivity), Precision, and Matthews Correlation Coefficient (MCC). Hereafter, we use the following abbreviations: True positives (TP), correctly predicted disease-associated mutations; False positives (FP), neutral mutations predicted as disease-associated; True negatives (TN), correctly predicted neutral mutations; False negatives (FN), disease-associated mutations predicted as neutral. Accuracy is the fraction of mutations correctly predicted out of the total number of mutations. Recall is the proportion of correctly predicted disease-associated mutations out of all the disease-associated mutations in the dataset. Precision is the proportion of correctly predicted disease-associated mutations out of all the mutations predicted as disease-associated.
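
The two scenarios and the four metrics described above can be sketched in Python as follows. This is an illustrative sketch, not the code used for the reported evaluation: the function names, the reading of scenario i) as "interface AND (position OR neighbours)", the treatment of every non-deleterious call as neutral, and the toy records are assumptions. The toy records are invented and are not the Lee et al., 2010 validation set.

```python
from math import sqrt


def supports_deleterious(in_interface, position_has_variant, neighbours_have_variant):
    """Apply the two scenarios: looser evidence is accepted inside interfaces."""
    if in_interface:
        return position_has_variant or neighbours_have_variant
    return position_has_variant and neighbours_have_variant


def confusion_counts(calls, labels):
    """Count TP, FP, TN, FN from predicted calls and true pathogenic labels."""
    tp = sum(1 for c, l in zip(calls, labels) if c and l)
    fp = sum(1 for c, l in zip(calls, labels) if c and not l)
    tn = sum(1 for c, l in zip(calls, labels) if not c and not l)
    fn = sum(1 for c, l in zip(calls, labels) if not c and l)
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision and MCC; undefined ratios are returned as 0.0."""
    total = tp + fp + tn + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical records: (in_interface, position_has_variant,
# neighbours_have_variant, is_pathogenic).
toy = [
    (True, True, False, True),
    (True, False, True, True),
    (False, True, True, True),
    (False, True, False, True),
    (True, False, False, False),
    (False, False, True, False),
    (True, True, True, False),
]

calls = [supports_deleterious(a, b, c) for a, b, c, _ in toy]
labels = [is_pathogenic for *_, is_pathogenic in toy]
counts = confusion_counts(calls, labels)
scores = metrics(*counts)
print(counts, scores)
```
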
The Accuracy, Recall, Precision, and MCC were calculated according to the following formulas:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

Features: UniProt key names in the "Feature Table" line; Freq: number of nsSNVs in a feature; %: percentage of nsSNVs in a feature with respect to the total number of nsSNVs in the dataset; %PPi/%not_PPi: indicates how frequent a feature is at protein-protein interfaces or outside them; not_PPi: nsSNVs outside protein-protein interfaces; PPi: nsSNVs in protein-protein interfaces; Tot_mutations: total number of nsSNVs in the dataset; Firestar_Cat: catalytic site residues ("Cat_Site_Atl") predicted by Firestar; Firestar_Bind: binding site residues predicted by Firestar; Appris_Membr: a "Damaged" transmembrane helix predicted by the THUMP method implemented in Appris; Appris_Signal: a "Signal peptide" region predicted by the CRASH method implemented in Appris.
This preliminary analysis suggests that a large proportion of coding nsSNVs are located in functional domains and in secondary structure regions, both in COSMIC and in the 1000 Genomes Project (1000G). In addition, we observe an enrichment of features such as VARIANT (sequence variations), MOD_RES (post-translationally modified residues), and DNA_BIND (DNA-binding residues) at protein-protein interfaces in COSMIC in comparison with 1000G. Note that results for features with a low percentage of nsSNVs are less reliable.
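
One plausible way to compute a per-feature comparison such as %PPi/%not_PPi is sketched below. The exact definition used for the table is not given here, so the formula is an assumption: the percentage of interface nsSNVs carrying a feature, divided by the percentage of non-interface nsSNVs carrying it. The function name and the counts in the example are hypothetical and are not values from COSMIC or 1000G.

```python
def feature_interface_ratio(ppi_count, not_ppi_count, ppi_total, not_ppi_total):
    """Return (%PPi, %not_PPi, ratio) for one feature.

    ppi_count / not_ppi_count: nsSNVs carrying the feature inside / outside
    protein-protein interfaces; ppi_total / not_ppi_total: all nsSNVs inside /
    outside interfaces. The ratio definition is an assumption of this sketch.
    """
    pct_ppi = 100.0 * ppi_count / ppi_total
    pct_not_ppi = 100.0 * not_ppi_count / not_ppi_total
    # A feature never seen outside interfaces gives an unbounded ratio.
    ratio = pct_ppi / pct_not_ppi if pct_not_ppi else float("inf")
    return pct_ppi, pct_not_ppi, ratio


# Hypothetical counts for a single feature.
pct_ppi, pct_not_ppi, ratio = feature_interface_ratio(
    ppi_count=30, not_ppi_count=50, ppi_total=200, not_ppi_total=1000
)
print(f"%PPi={pct_ppi:.1f} %not_PPi={pct_not_ppi:.1f} ratio={ratio:.1f}")
```

A ratio above 1 would indicate that the feature is relatively more frequent at interfaces. With small counts the ratio changes sharply when a single variant is added or removed, which is why low-percentage features give less reliable results.
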