Human germline and pan-cancer variomes and their distinct functional profiles

Identification of non-synonymous single nucleotide variations (nsSNVs) has exponentially increased due to advances in Next-Generation Sequencing technologies. The functional impacts of these variations have been difficult to ascertain because the corresponding knowledge about sequence functional sites is quite fragmented. It is clear that mapping of variations to sequence functional features can help us better understand the pathophysiological role of variations. In this study, we investigated the effect of nsSNVs on more than 17 common types of post-translational modification (PTM) sites, active sites and binding sites. Out of 1 705 285 distinct nsSNVs on 259 216 functional sites we identified 38 549 variations that significantly affect 10 major functional sites. Furthermore, we found distinct patterns of site disruptions due to germline and somatic nsSNVs. Pan-cancer analysis across 12 different cancer types led to the identification of 51 genes with 106 nsSNV affected functional sites found in 3 or more cancer types. 13 of the 51 genes overlap with previously identified Significantly Mutated Genes (Nature. 2013 Oct 17;502(7471)). 62 mutations in these 13 genes affecting functional sites such as DNA, ATP binding and various PTM sites occur across several cancers and can be prioritized for additional validation and investigations.


INTRODUCTION
Non-synonymous single nucleotide variation (nsSNV) can have profound effects on protein function because of resulting changes to the amino acid sequence of the protein (1)(2)(3)(4). By knowing the distribution of all nsSNVs on the human proteome, the number of expected variations in a specific protein can be calculated (5). Additionally, mapping of disease-related and functional information onto the proteome can provide a comprehensive view of the impact of nsSNVs (2,(6)(7). There are several databases that contain SNV data and disease-related annotations (OMIM (8), ClinVar (9), SwissVar (10), LSDB (11), HGMD (6), dbSNP (12)). At the same time, Next-Generation Sequencing (NGS) technology is rapidly becoming a mainstream approach for identifying thousands of novel mutations and polymorphisms through national and international collaborative projects like the 1000 Genome Project (13), TCGA project (http://cancergenome.nih.gov/), NCI-60 panel project (14) and others (15,16). There are not yet standardized methods for NGS analysis, so there is a high level of variability between different analysis pipelines (17). Thus, there is a developing need for the creation of secondary curated databases to provide mechanisms for biocuration and standardization of NGS analysis pipelines and comparisons of their results (3,(18)(19).
To take advantage of the vast volume of human variation data accumulated over the last few years, the integration and unification of the data is critical, as a comprehensive data set is necessary to better summarize the variation trends. Based on our previous work (2,5,20), we have mapped and analyzed nsSNV impact in terms of N-linked glycosylation and active sites of enzymes for single-nucleotide polymorphisms (SNPs) available in db-SNP (12), somatic mutations from COSMIC (21) and variations reported in UniProtKB (22). To ensure a comprehensive study, we have now further extended our nonredundant nsSNV data set to cover all somatic mutations published by TCGA, ICGC (15) and IntOGen (23) as well as cancer-related mutations from NCI-60 cell lines and ongoing Curated Short Reads (CSR) (3) project. All variations were mapped to functional sites and unified based on the UniProtKB/Swiss-Prot defined complete human proteome (22) while different disease annotations were labeled and unified using Disease Ontology (DO) terms (24) for comparative analysis purposes.
Protein function has been observed to rely on select essential sites instead of requiring all sites to be indispensable. Many of these sites are crucial for normal physiological functions and are categorized according to function as active sites, binding sites, post-translational modification (PTM) sites, etc. An enzyme active site is the region where substrates bind to and are catalyzed by an enzyme, but the exact definition can differ across databases and publications (2,(25)(26)(27). Binding sites, on the other hand, exist not only in enzymes but in any protein that interacts with other biological components in a cell. Both UniProtKB/Swiss-Prot (22) and NCBI Conserved Domain Database (CDD) (26) provide rich annotation on active sites and binding sites of human proteins. CDD stores a significant amount of binding site annotations about small molecules like ligands and ions as well as macromolecular protein complexes. PTMs have always been considered important for their role in control of protein functionality and activity under different physiological conditions (28) and they have been extensively studied in various species (29). With the advent of new and powerful technologies, hundreds of thousands of diverse PTMs have been identified from organisms ranging from prokaryotes to eukaryotes. Cutting edge PTM research includes understanding the biological functions of PTMs (30), associating PTM to traits and the interplay and roles of different types of modifications (28,(31)(32). There exist several databases that store PTM-related information. Several efforts have been made to resolve this heterogeneity and to organize the data in a better way. A controlled vocabulary has been developed by UniProKB/Swiss-Prot with more than 100 types of PTMs annotated in the feature (FT) line. Annotations from CDD use a similar but different vocabulary. Besides these two major protein annotation sources, dbPTM 3.0 (33) carries out a comprehensive compilation of publicly available databases and generates a data set containing only experimentally verified entries, which include data from UniProtKB/Swiss-Prot, Phospho.ELM (34), PhosphoSitePlus (35), O-GLYCBASE (36), dbSNO (37), SysPTM (38) and HPRD (39). However, heterogeneity issues still left unresolved include (i) annotations are not mapped to a unified proteome reference and (ii) naming of PTMs is not standardized. In this study, we integrated and curated data from all of the above data sources and mapped the functional sites to the UniProtKB/Swiss-Prot human proteome to allow comprehensive and uniform analysis of the effects of germline and somatic nsSNVs.
In contrast to the extensive research performed on the functional impacts of variation (1,5,30,40) or on interpreting the possible function of the PTM/active/binding sites (2,28,(31)(32), combined studies, though of great importance, are relatively rare. Recently, however, publications on pan-cancer analysis showcase the importance of these types of studies. Examples include studies by Jia et al. which analyzed somatic mutations in nine major cancers and resulted in the proposal of 3-5 predominant mutational processes that likely underlie each cancer genome (41). Analysis performed by the pan-cancer genome program (42) demonstrates how such analyses may aid in identifying new pat-terns of drivers for cancer (43)(44)(45)(46). Although pan-cancer analysis is being heavily pursued because of the availability of cancer genomics data, to the best of our knowledge, there is no study that systematically profiles the interplay between germline and somatic variants mapped to an array of functional sites to answer the critical questions like how many sites are impacted and what is the possible relationship between nsSNVs and various functional sites. Here, we present a study that provides a detailed view of the human and tumor protein coding variome and their functional effects. Furthermore, a phylogenetic analysis of NGS whole exome sequencing samples of 30 breast cancer tumor patients and cell lines is provided as a demonstration of how this type of functional analysis and patient classification can provide a novel direction in personalized diagnostics and therapeutic research.

Integration of nsSNV data
The comprehensive non-redundant data set of nsSNVs was compiled from the following sources. The Cancer Genome Atlas (TCGA) and CGHub data portal (https://cghub.ucsc. edu/) was used to download TCGA variation data sets. The latest release of International Cancer Genome Consortium (ICGC) (15) was downloaded in January of 2014 from its FTP site. IntOGen (23) variation data was downloaded in January 2014 also. CSR (3) data set from 25 TCGA breast cancer patient samples and 5 breast cancer cell lines from NCI-60 panel was downloaded in January 2014. Mutation data from NCI-60 Panel was retrieved from CellMiner verson 1.4 (47). Additional data was obtained from files generated from SNVDis (5) which integrates data from Catalog Somatic Mutations in Cancer (COSMIC), db-SNP, UniProtKB/Swiss-Prot and nextProt (48). It is known the projects mentioned above may collect data from each other. For example, COSMIC is the portal of somatic data mined from literatures and TCGA release; ICGC is built on TCGA and raw data from partner institutes across all over the world; IntOGen collects data mainly from ICGC (15) and TCGA (46). UniProtKB provides curated protein mutations from literature. All somatic nsSNV data was integrated into BioMuta using previously described methodologies (19) for pan-cancer analysis.

Integration of functional sites
The complete protein functional site data set includes the curated annotations from UniProtKB/Swiss-Prot (22), CDD (26) and dbPTM 3.0 (33). The UniProtKB/Swiss-Prot complete human proteome was downloaded in January 2014. Active and binding sites were retrieved based on the FT line description. Modification data was extracted using PTMlist which is a controlled vocabulary provided by UniProtKB/Swiss-Prot. The NCBI CDD-based annotations of functional sites was retrieved using BATCH CD-Search (49) against CDART database in January 2014. Customized Perl/Python scripts were used to filter out entries such as domains, repeats and motifs with longer than five consecutive amino acids. Filtered sites were categorized manually into various types of PTM sites, active sites and binding sites with original annotations maintained in a separate column. Other PTM records were adopted based on dbPTM 3.0 (33) which collects PTM data from more than 10 different sources.

Generating unified and non-redundant data sets
Records with genomic positions and variants were translat ed and annotated through Seattleseq Annotation pipeline (http://snp.gs.washington.edu/SeattleSeqAnnotation138/). Resulting RefSeq accessions and positions were then mapped to UniProtKB/Swiss-Prot complete human proteome using methods we developed in previous works (5,50). Annotations from NCBI/CDD with Refseq protein accessions were directly translated to UniProtKB/Swiss-Prot using methods described earlier (3,5,50). Other annotations with UniProtKB/Swiss-Prot accessions were integrated into the data set following a validation process that confirms the existence of the wild-type amino acid in UniProtKB/Swiss-Prot entry.

Protein conserved sites and conservation ratio
Degree of conservation for each functional site and nsSNV was measured by using Basic Local Alignment Search Tool (BLAST) (51) against five mammalian proteomes (Rattus norvegicus, Mus musculus, Canis familiari, Bos Taurus and Equus caballus) downloaded from UniProtKB in January 2014. The five species are selected based on their proteome quality, completeness and evolutionary relatedness to humans based on Representative Proteome Group algorithm developed earlier by us and collaborators (52). Only best hits with e-value lower than 0.00001 for each species were considered homologs. Conservation was calculated as the number of homologs in which the site is conserved. For each site, if the amino acid was found to be conserved across all five selected mammalian species then it was regarded as 'conserved'. To facilitate the comparative analysis on conservation status of different types of functional sites being affected by nsSNVs, the conservation ratio was calculated. The conservation ratio of a specific amino acid was determined based on the percent of nsSNV impacted amino acids that were found to be conserved. Significance was calculated based on methods described earlier (5,53) where the expected number of nsSNV impacted conserved functional site is = (total number of conserved nsSNV/total number of nsSNV) × total number of the functional site impacted by nsSNV.

Mapping nsSNV to functional sites/motifs
Customized python scripts were developed to map nsSNVs against functional sites and motifs using specific keys, which is the combination of UniProtKB accession and position and the corresponding variation, as shown in Figure 1. After a variant was successfully mapped to the corresponding functional site, an evaluation process determined whether the variant can affect or cause a loss of function based on the type of functional site (sites/types considered for this study are listed in Table 1). Site information was obtained  (77). Variants that can be tolerated without potential loss of function, like T to S variation for the third position in the N-linked glycosylation motif NXS/T (where N is an asparagine, X is any amino acid except proline and serine or threonine is the third amino acid), were not regarded as functional site affecting nsSNVs. The final results of this mapping were organized into two files: a file with a list of impacted functional sites and variants with additional annotations such as conservation, frequency, etc.
Significance was calculated based on methods described earlier (5) and represented as the ratio of expected minus observed impact of nsSNVs (53). Therefore, the resultant Pvalue describes the deviance degree between a global ratio and an observed ratio. The calculation of the expected number n(E) of nsSNVs affecting a certain type of functional site is shown below: where N, n(F) are total number of variations and the total number of positions for a specific functional site, respectively. The probability p(F) of observing an amino acid from human proteome as a functional site equals to the value of n(F) divided by total length of human proteome L. Based on the expected number of functional site being affected by nsSNV n(E), and the actual observed number n(O), P-value is calculated through adopting the Binomial statistic as ap-Nucleic Acids Research, 2014, Vol. 42, No. 18 11573   Table 1. The knowledge summary of 18 common PTM types from the human proteome and their counts in our integrated data set Notes: Table 1 lists only modification sites that are either prevalently distributed on the human proteome or well characterized/defined. Some rare types of sites (less than 10 annotated in the human proteome) that are also included in our data sets are not shown here, but can be found in the Supplementary  Table S1b. plied by Mi et al. (53)

Gene Ontology (GO) and pathway analysis
Proteins affected by nsSNVs were analyzed for GO term and pathway enrichment using PANTHER Classification System 9.0 (78,79). In order to identify GO terms and pathways significantly affected, we calculated the P-value based on the ratio between nsSNV affected genes and genes that harbor the specific functional site according to UniProtKB/Swiss-Prot annotation of GO terms (5,80).

Pan-cancer functional site analysis
BioMuta (19) is a curated cancer-centric variation and disease association database in which variations are mapped to the genome/protein/gene. Cancer and somatic mutation data is automatically collected from a variety of data sources including TCGA, ICGC, COSMIC, ClinVar, IntO-Gen, CSR, UniProt in addition to cancer-associated mutation sites manually extracted from literature. All nsSNVs are mapped to the UniProtKB/Swiss-Prot defined human proteome and associated with UniProtKB accession, position, actual and altered amino acids. A total of 645 706 somatic nsSNVs in BioMuta 2.0 version are associated with at least one cancer type (includes both large-scale studies and small-scale studies). All cancer terms in BioMuta are assigned a DO (24) term to facilitate pan-cancer analysis. All variations were scanned against BioMuta to identify how mutations in different cancer types affect various functional sites. The calculation of significance of observed and expected functional site disruption under each DO term was based on methods described in earlier sections. Heat map and clustering analysis were performed using heatmap2 function from R package (http://www.R-project.org) and pan-cancer visualization was performed using Circos plots (81).

SNV-based phylogenetic analysis and group analysis of nsSNV-affected functional sites
The SNV-based phylogenetic analysis using variants called from Whole Exome Sequencing (WXS) data from 25 TCGA breast cancer tumor patients (30 samples) and 5 NCI-60 breast cancer cell lines is based on our previously developed methods (3). SNV-based phylogenetic tree requires two steps, SNV-based genome condensation and alignment followed by phylogenetic tree generation. Alignment of sequences from tumor samples from 25 patients and the NCI-60 cell lines containing a range of genomic sequence around SNVs (0-3) was created using PhyloSNP (3,82). FastTree (83) was used to generate phylogenetic trees with 1000 bootstrap values and a matrix of different sample IDs versus distinct nsSNV-affected functional sites was generated. The Newick format phylogenetic tree was overlaid with functional site information and visualized using Interactive Tree Of Life (ITOL) (84). A clustering of the same group of samples was performed based on their mutational functional site profiles. This profile was built based on the statistical significance of various types of mutation-affected functional site from each sample using the same binomial statistical method mentioned in the early section. For clustering, we only kept those functional site types being impacted by at least one nsSNV across 35 samples. The unsupervised hierarchal clustering is conducted in R package, heatmap2 function with enabled hclust function.

Disease association data
SwissVar (10) was downloaded in January 2014. Disease association data available through dbSNP including clinical/LSDB variations was downloaded from dbSNP March 2014. The NHGRI GWAS Catalog (85) was integrated based on information obtained in January 2014.

Integration and unification of functional sites and nsSNVs
As shown in the flowchart (Figure 1), the very first step includes integration of data from various sources. The integration process is described in Materials and Methods and the integrated data set is available on the Web (http://hive.biochemistry.gwu.edu/tools/var2function/) for users to browse, search/retrieve and download.
For the nsSNV data set (Supplementary Table S1a), we integrated various germline and somatic variation data sources (COSMIC, ICGC, TCGA, IntOGen, dbSNP, CSR, NCI-60, UniProtKB/Swiss-Prot). For cancer-centric somatic variation databases, the percentage of entries that overlap with dbSNP is less than 11% (COSMIC: 7.2%, ICGC: 10.9%, TCGA: 7.6%, IntOGen: 6.6%), which implies that the majority of variations are most likely not germline polymorphisms. For the functional sites data set, we integrated 17 major types of PTM covering 125 subtypes of non-redundant sites from UniProtKB/Swiss-Prot, CDD and dbPTM 3.0. Additionally, we integrated enzyme active sites and binding sites which gave rise to 1125 different subterms. Table 1 and Supplementary Table S1b provide details of the sites analyzed in this study.

nsSNV impact on functional sites
The integration described above resulted in two comprehensive non-redundant UniProtKB/Swiss-Prot human proteome-centric data sets with 1 705 285 and 259 216 nsS-NVs and functional sites, respectively. Mapping of protein variations to functional sites generated a list of variants that cause functional site disruption. A total of 38 549 nsSNVs that potentially affect protein functionality through replacing original residues at active sites, binding sites or PTM sites were identified (Table 2a). Particular variations for certain types of PTMs are tolerated without loss of function. The list of tolerated residues can be found in Table 1. Table 2a provides an overview of how nsSNVs obtained from nine different sources impact the various functional sites in the human proteome. Note that 14 of 19 common functional sites have significantly lower (eight) or higher (six) numbers of affected sites than expected by comparison to the overall distribution of nsSNVs along the human proteome. Acetylation, O-linked glycosylation, phosphorylation and ubiquitylation are the top underrepresented PTMs, methylation and N-linked glycosylation are the top overrepresented PTMs.
nsSNVs can be divided into two general categories: germline and somatic. Because germline variations are heritable and thus under relatively higher selection pressure than somatic variations, somatic nsSNVs can be more randomly distributed on an individual genome. Due to their different properties genetic diseases can be commonly categorized as a simple genetic disorder, such as Mendelian diseases which are caused by germline mutations, or a complex genetic disorder, like cancer which is associated with a spectrum of somatic and germline variants. Although there is no perfect method to distinguish between somatic and germline variation, in this study we differentiate variant type based on dbSNP categorization. Effects of germline and somatic variations in functional sites are shown in Table 2b. Active sites and binding sites show the sharpest difference between the two groups of nsSNVs where somatic nsSNVs have a significantly higher frequency of occurrence than expected compared to germline variations. Although several PTMs have higher rates of potential loss of function due to variation, methylation and O-linked glycosylation stand out because of the distinct level of significance in terms of overrepresentation observed for germline versus somatic types of nsSNVs.
Based on our findings that the distribution of nsSNVs is not random (2-3,5) we analyzed different data sources to better understand the germline and somatic variome. Table 2b and Figure 2 provides an overview of the analysis results. For Figure 2, green cells represent overrepresentation while red cells show underrepresentation. The intensity of the color is based on P-value significance. COS-MIC, TCGA, ICGC and IntOGen maintain thousands of cancer-associated variations, most of which are not in db-SNP (more than 92% of COSMIC, 89% of ICGC, 93% of IntOGen, 92% of TCGA) and can be considered somatic Nucleic Acids Research, 2014, Vol. 42, No. 18 11575   mutations. The distribution of nsSNVs from these data sets is similar to each other in terms of their distribution on functional sites. The minor differences observed in Figure 2 among different cancer genomics data portals may be due to the different data sources, collection and analysis methods, and the number of cancer study data sets in these resources. NCI-60 and CSR variants are called from the CSR pipeline, which results in between 80% and 98% of variants currently available in dbSNP for both human tumor samples and cell line samples, explaining the similar pattern they share with dbSNP. Below we provide additional details in terms of GO and pathways that are enriched with genes that have variations mapped to key functional sites (complete list is available in Supplementary Tables S3a and S3b).  Phosphorylation sites: Out of 77 734 experimentally verified phosphorylation sites reported on the human proteome, 9383 were replaced by an amino acid residue that cannot be phosphorylated. A total of 5466 of those sites are replaced by nsSNVs not in dbSNP (P-values shown in Table 2b and Figure 2). Most of the variants impacted serine (S), which causes the loss of 6063 phosphoserine sites (pS). There were 2033 phosphothreonine (pT) and 1283 phosphotyrosine (pY) sites lost, respectively. Phosphorylation sites have significantly fewer than expected variations (Table 2a, P-value: 1.28E-110) for both somatic and germline variations (Table 2b). This is not surprising as phosphorylation is involved in protein activation/deactivation in various pathways including transcription (86), translation (87), cell cycle (88) and signal transduction (89), and is necessary for normal cellular condition such that loss of phosphorylation has been implicated in many diseases (90,91). It is possible that variations that affect phosphorylation sites have a higher chance of being functionally relevant and hence disease causing. Pathway analysis of genes with loss of phosphorylation sites (Figure 3a Table  S3a) identified 32 biological processes, 23 molecular function terms having over-/underrepresentation with respect to frequency of nsSNV occurrence, and 9 cellular components terms. A wide variety of biological processes including proteins in metabolic process, cellular process, cell cycle and nucleobase-containing metabolic process are overrepresented, while for molecular functions terms overrepresented terms include kinase activity and nucleic acid binding. Unique terms are listed in rows and each column represents the degree of impact that a specific type of functional site has on that term. For visualization convenience, the calculated P-values are then transformed to -log(P-value). Underrepresentation is colored in red and overrepresentation is in blue. The darkness of the color in each box reflects the absolute value of -log(P-value). According to the patterns of -log(P-value) among various GO terms and PANTHER pathways, functional sites are clustered into different groups.
N-linked glycosylation sites: Among 16 170 experimentally verified N-linked glycosylation sites, 4372 nsSNVs alter the NXS/T motif (80), where N is an asparagine, X can be any amino acid except proline and S/T is serine or threonine. Note that 2375 of those impacted N-glycosylation motifs are disrupted by variations not in dbSNP (somatic), the remaining 1997 are germline variations (found in db-SNP) with P-values 2.09E-118 and 3.95E-163, respectively (Table 2b and Figure 2). Note that 2387 of the variants disrupt the motif by altering the Ser/Thr residue, while 1938 nsSNVs directly change the asparagine and 70 loss of Nglycosylation motif is caused by a change to proline (P). We observed higher numbers of loss of N-linked glycosylation motifs than expected (P-value 4.53E-273, shown in Table 2a) for both somatic and germline nsSNVs (Table 2b), and also across different data sources ( Figure 2). This phenomenon suggests that unlike phosphorylation, the N-linked glycosylation landscape is more likely to be altered. These results are counterintuitive to published results by Park et al. where they show that N-linked glycosylation motifs are more conserved, unlike what they observed for phosphorylation sites (92). One explanation could be that the two types of PTM sites might be involved in additional yet unknown functions and the two processes deal with loss of functional sites differently. Pathway analysis (Figure 3b, Supplementary Table S3b) highlighted two distinct pathways, which are cadherin signaling pathway (Pvalue 1.55E-03) and Wnt signaling pathway (P-value 4.34E-03). GO enrichment analysis (Figure 3a, Supplementary Table S3a) identified 46 terms such as cell adhesion (P-value 1.24E-06), nervous system development (P-value 7.62E-06) and ectoderm development (P-value 6.73E-04).
Ubiquitylation sites: Out of 22 549 experimental verified ubiquitylation sites, 2055 were found to be impacted due to nsSNVs. More than half of those sites (1214) are not in dbSNP (underrepresented, P-value 1.57E-71), the rest (841) are affected by nsSNVs found in dbSNP (underrepresented, P-value 1.10915E-61) (Tables 2a and 2b and Figure 2). Ubiquitylation sites, like phosphorylation sites, are significantly less altered than what is expected by both germline variation (based on dbSNP records) (Table 2b) and somatic variation ( Figure 2). Key ubiquitylation related tasks such as protein degradation, transcriptional regulation and genomic maintenance (93)(94)(95) are therefore less prone to variation-related changes.
O-linked glycosylation sites: O-linked glycosylation is important for proprotein processing, biosynthesis of mucins, formation of proteoglycan core proteins and blood group proteins (96,97). Out of 2549 O-linked glycosylation sites in the human proteome, 205 are found to be disrupted. Note that 97 of those impacted O-linked glycosylation site are replaced by nsSNVs not in dbSNP, the remaining 108 are mutated by nsSNVs found in dbSNP (Table 2b and Figure 2). As shown in Table 1, the main type of O-linked glycosylation studied is 2150 O-N-acetylgalactosamine (O-GalNAc) site, out of which 138 are affected by nsSNV. Note that 51 sites out of 335 O-N-acetylglucosamine (O-GlcNAc) sites are changed by nsSNV, some of which can be potentially linked to cross-talk with other PTMs to cause cancer (98). O-linked glycosylation sites unlike N-linked glycosylation sites are significantly underrepresented in terms of the number of proteins that are observed to contain variation at the site versus the number of proteins expected to be impacted (P-value 8.62E-24, Table 2a) which underscores the need for studying these two types of glycosylations independently. This underrepresentation trend is true for variations in both dbSNP (Table 2b) and somatic variation data sources (Figure 2), which implies that both cancer-related variations and germline SNPs have a resistance to disruption at the sites to maintain capability of O-linked glycosylation, which also plays role in protein secretion (O-GalNAc) (96) and pathway regulation with phosphorylation (O-GlcNAc) (99,100).
Acetylation sites: Out of 8253 acetylation sites in the human proteome, 863 lose acetylation due to nsSNV. 512 of those are replaced by nsSNVs not in dbSNP, the remaining 351 are mutated by nsSNVs found in dbSNP. Acetylation occurs by two distinct biological mechanisms: enzyme systems, substrate and position preference, which are a co-translational modification termed N-terminal acetylation (101,102); and PTM via Lysine acetylation (103). Nacetylation contributes 2194 sites, 201 of which are affected by nsSNVs. N6-acetyllysine, on the other hand, has 673 sites altered among a total of 6056 sites. According to Tables 2a and 2b, nsSNVs from both germline and somatic variations that cause the loss of acetylation sites are significantly less abundant than expected (P-value 4.39E-17 and 3.82E-15, respectively). If one considers the two different types of acetylation, total nsSNV affecting N-acetylation and N6-acetyllysine has similar constraints for variation (P-value 1.08226E-14 and 6.03084E-17, respectively). Further investigations into how nsSNVs in dbSNP and those not in dbSNP affect these two types of modification revealed a different type of behavior for the two subtypes of nsSNVs. For nsSNVs in dbSNP, N6-acetyllysine sites are less affected than N-acetylation sites with P-value 3.54E-12 to 2.18E-05, while for somatic variations the behavior is opposite (N6-acetyllysine with P-value of 2.19E-07 compared to N-acetylation P-value 3.68E-11). N6-acetylation is needed for transcriptional regulation and N-acetylation is involved in protein's synthesis, stability and localization (104)(105)(106). This indicates that transcriptional regulation is targeted more by somatic mutations.
Methylation sites: Histone methylation as a PTM has been extensively studied (107). Non-histone methylation is prevalent yet relatively poorly understood (108). Methylation can be subtyped according to the modified residue, either lysine (Lys) or arginine (Arg). Subtypes are studied separately as they involve different enzyme families (108)(109)(110). Out of 680 experimentally verified methylation sites reported on the human proteome, 224 sites were affected by nsSNV. Note that 163 of those sites were replaced by nsSNVs not in dbSNP, the remaining 61 were found to be mutated by nsSNVs found in dbSNP. There are 388 unique methylarginine sites and 287 unique methyllysine sites in our data set. nsSNV changed 172 arginine sites and 52 lysine sites. From Table 2a, methylation, like N-linked glycosylation, is significantly affected (overrepresentation, P-value 2.24E-25) by nsSNV. After subgrouping nsSNVs by somatic and germline origins, a clearer view of how methylation sites are affected at the proteome level was revealed. nsSNVs not in dbSNP show a significant overrepresentation (P-value 2.86E-28) compared to germline variations (4.84E-03). Figure 2 shows that loss of methylation site is more prevalent in cancer/somatic mutation centric databases than in dbSNP. The above statistical significance indicates that the PTM methylation is greatly affected in tumor cells which corroborate previous studies linking loss of methylation site to cancer where the authors propose such loss of methylation site with global effects which may alter epigenetic states in a variety of pathologies (111).
Crotonylation sites: Out of 204 experimentally verified crotonylation sites reported in the human proteome, 52 sites are modified by nsSNVs. Note that 42 of these sites are impacted by variations not in dbSNP, the remaining 10 are impacted by polymorphisms found in dbSNP. Loss of crotonylation due to somatic mutations is significantly high (Pvalue 8.62E-07) unlike germline variations (P-value 2.69E-01). It was recently shown that lysine crotonylation is a conserved histone PTM found in somatic cells (59). Frequent disruption by somatic mutations of crotonylation sites corroborates previous knowledge that the histone code is more likely to be affected in cancer cells (112).
S-nitrosylation sites: Out of 755 experimentally verified S-nitrosylation sites reported in the human proteome, 75 sites are impacted by variation. Note that 32 of those are changed by nsSNVs not in dbSNP, the remaining 43 are affected by nsSNVs found in dbSNP. Somatic variations are underrepresented (P-value 2.24E-06). Certain exceptions can be seen from the Figure 2, like COSMIC and IntOGen which provide cancer-centric somatic variation sets that are not significantly underrepresented. As a PTM related to various major diseases (113) including cancer (114), heart diseases (115,116) and Alzheimer's disease (117), it is not clear why S-nitrosylation sites are under functional pressure in cancer cells. Availability of additional experimentally verified S-nitrosylation sites and functional analysis of the loss of nitrosylation sites will provide a better understanding of this PTM.
Enzyme active sites: Out of 14 708 experimental verified active sites reported on the human proteome, 2385 are detected to be substituted by an amino acid residue that can potentially interrupt original activity (2). Note that 1574 of the impacted active sites are affected by somatic nsSNVs, the remaining 811 are from dbSNP. Overall, there appears to be more variation on these sites than expected (Table 2a; P-value 1.83E-04). Further comparison between somatic and germline nsSNVs gives a clear indication that active sites are under opposite pressures in terms of significance between the two variation types (Table 1b). Somatic variations have a significantly higher chance of loss of active sites due to variation (P-value 1.56E-14) while germline variations usually do not occur at such sites (8.53E-05), a phenomenon we have reported previously (2). As discussed in the previous study (2), nsSNV is assumed to cause a loss or change of function at active or catalytic sites by replacing the essential site with a residue not optimized to the specific interaction required by the enzyme for proper functionality. There is a variety of experimental evidence which describes this mechanism of functional loss and related diseases (118,119).
Binding sites: The abundance of binding sites can be seen from both numbers of types of functional sites and numbers of sites (Supplementary Table S1b and Table 1). A total of 113 758 sites belonging to 1105 subtypes of binding sites were collected. There are several studies on how binding sites are affected by nsSNVs and how they are linked to diseases (120,121). This study provides a comprehensive view of how germline and somatic mutations affect such sites. Out of 113 758 experimentally verified binding sites reported on the human proteome, 18 681 are affected by nsSNVs. Note that 12 286 of those impacted binding sites are replaced by nsSNVs not in dbSNP, the remaining 6395 are mutated by nsSNVs found in dbSNP. Similar impact patterns can be seen (Tables 2a and 2b and Figure 2) between binding sites and active sites, however, binding sites are overrepresented with an abundance of residue-changing variants (Table 2a; P-value 4.81E-32). Further analysis between somatic and germline nsSNVs provides a similar profile as observed for active site variations: somatic variation is overrepresented (P-value of 2.26E-110) while germline variations are underrepresented (P-value of 7.86E-20). The top six binding sites that are affected are DNA binding site, ATP binding site, Ca2+ binding site, dimer interface, substrate binding site and Cytokine receptor motif.
Pathway and GO analysis summary. There are recent studies that report the association and coordination between different types of PTMs from both experimental and computational aspects (29,31,(122)(123)(124)(125). GO overrepresentation analysis is often used in gene expression analysis and a variety of other large-scale studies (31,(126)(127)(128). However, there is no effort yet dedicated to the comprehensive study of the effects of germline and somatic variations on all major PTM sites. The GO term enrichment test and the follow-up clustering analysis among 10 main types of functional sites can be seen in Figure 3a. A total of 332 distinct GO terms of biological process, molecular function and cellular components showed over-/underrepresentation to at least one type of nsSNV affected functional site. It is interesting to see ubiquitylation and acetylation are grouped together by similarity of pathway hits as similar observations about functional associations are found in literature (31). We also see nsSNV affected phosphorylation, N-glycosylation, enzyme active sites and protein binding sites are enriched/depleted significantly among a wide range of GO terms with distinct patterns. Phosphorylation, especially, behaves similar to what is described in previous studies (31,129) and is more likely to be involved in protein interaction networks and pathways. Besides GO, PANTHER pathway analysis provided similar clustering results. Note that 249 distinct pathways were found to be influenced by proteins containing nsSNV-affected functional sites. Similar patterns and relationship between acetylation and ubiquitylation can be observed in Figure 3b. Details of the significantly influenced pathways and GO terms are described in the above section. It is important to note that GO and pathway annotations are different ways of classifying genes, and they have different coverage and focus and therefore can only reflect the current knowledge available to the scientific community. Maturation of methods in metabolic modeling is expected to provide additional insights into how these variations effects the biological system (130).

Proteome-wide pan-cancer analysis
Comparative analysis of the functional effects of variations on different cancer types provides a comprehensive overview of the similarities and differences between different cancer types (Figure 4, Supplementary Table S4b). In order to better understand the functional impact of mutations in different cancer types we undertook a proteomewide pan-cancer analysis. The systematic pan-cancer scale investigation of somatic mutations has been impossible to achieve until recent advances of sequencing technologies (131,132). Recently, the TCGA pan-cancer analysis project across 12 tumor types has been launched (133) and several others are ongoing. Recent studies by Alexandrov et al. (131) reported genome-wide mutational signatures and regions. A follow-up study by Jia et al. (41) have shown that for nine cancers, 3-5 independent mutational signatures in each cancer underlie each cancer genome where both mutagen exposure and changes in DNA repair systems were identified as key mutagenesis forces. Other previous works have shown that there exists specific mutational patterns in different cancers (134,135) and recent studies have started identifying genes to be 'truly associated' with cancer (23,43,(135)(136). Our pan-cancer study goes a step forward by identifying not just the genes but key mutations that potentially affect known protein function.
Our data set of cancer-associated nsSNVs was mapped to 73 DO terms (24). The DO terms were further collapsed into 12 DO terms to facilitate pan-cancer analysis (Figure 4, functional site affecting mutation counts can be found in Supplementary Table S4a and P-value is available in Supplementary Table S4b). The mapping result from cancer types to our total nsSNV data set (1 705 285) resulted in the assignment of 573 789 nsSNVs associated with at least one cancer type. Note that 38 549 of these nsSNVs affected a protein functional site and 13 159 nsSNVs have cancer associations. We found that across the different cancer types, N-linked glycosylation, methylation and binding sites are all preferentially affected by somatic mutation. Conversely, phosphorylation and ubiquitylation are significantly less impacted by variation compared to what is expected based on the distribution of all variations. O-linked glycosylation and acetylation show a similar trend albeit less significant. It is interesting to note that unlike most of the cancer types having statistically less nsSNVs interrupting phosphorylation sites, acute myeloid leukemia (DOID:9119) shows overrepresentation of cancer nsSNVs on phosphorylation sites (P-value 1.24E-03). It has been shown that phosphorylation status regulates the function of CCAAT/enhancerbinding protein alpha, a crucial factor associated with the development of various subtypes of acute myeloid leukemia (AML) (137). Furthermore, a recent NGS study on AML clinical samples revealed that 59% of case samples harbor at least one nsSNV in signaling genes which are pathogenesis related (138). However, further research is required to elucidate the role of phosphorylation on the development of AML.
Comparison across cancer types show that lung cancer (DOID:1324) and breast carcinoma (DOID:3459) are clustered together (Figure 4) as nsSNVs from both cancer types showing underrepresentation of phosphorylation, ubiquitylation and O-linked glycosylation and overrepresentation in N-linked glycosylation and methylation. Interestingly, for active sites, the two cancer types behave differently. Other cancer types such as urinary bladder cancer (DOID:11054), ovarian serous cystadenocarcinoma (DOID:5746), rectum adenocarcinoma (DOID:1996) and uterine corpus cancer (DOID:9460) are grouped together since they share similar mild or lack of significance for both groups of functional sites that commonly show over-(like N-linked glycosylation) and underrepresentation (such as phosphorylation). Skin melanoma (DOID:8923) and renal clear cell carcinoma (DOID: 4467) are grouped together because they share the overrepresentation of nsSNVs abolishing Nlinked glycosylation site and normal occurrence of nsSNVs on phosphorylation site, which is usually underrepresented. The nsSNVs from endometrial carcinoma (DOID: 2871) and colon adenocarcinoma (DOID: 234) share the pattern of overrepresentation on binding sites and underrepresentation for O-linked glycosylation sites compared to oropharynx cancer (DOID: 8857) which is clustered alone beside the clade of the former two. In terms of overrepresentation on phosphorylation sites, nsSNV from acute myeloid leukemia (DOID: 9119) is separated on one branch. It is interesting to note that the ontologically closely related cancer types from DO hierarchy are not clustered into the same branch but into adjacent branches as they show different functional impact patterns. For instance, colon adenocarcinoma is not clustered together with rectum adenocarcinoma, but they are all under the term colorectal cancer (DOID:9256) in DO. Similar patterns can be seen in uterine corpus cancer and endometrial carcinoma.
It has been suggested that the number of driver mutations required during oncogenesis is relatively small, so it is possible to identify such driver mutations by looking to see which mutations are shared across cancer types (43). Potential driver mutations were identified based on mutations that are present across multiple cancer types. We successfully assigned 5496 distinct genes with 13 132 functional site affecting nsSNVs to cancer types. Out of these 5496 genes, 514 genes with 755 nsSNV affected functional sites are associated with two or more cancers (full list in Supplementary  Table S4c). And 51 of these genes harboring 106 nsSNV affected functional sites were found in three or more cancer types.
In order to better understand what we found through this pan-cancer analysis, a comparison was performed by using the gene list of 127 significantly mutated genes (SMG) identified by a recent pan-cancer study that used TCGA data (43). By mapping the published SMG data it was found that 88 out of 127 SMGs have a functional site affecting mutation based on our analysis. Out of these 88 genes, 29 were found in our list of 514 genes affected in two or more cancers described above. From our study we wish to emphasize 13 genes which are (i) present in the list of 127 SMG set, (ii) present in our list of 51 genes which are present in three or more cancers and (iii) have key functional site-affecting mutations. Table 3 provides a list of these genes and mutations. While the majority of these 13 genes are well known to be involved in cancer, this work highlights for the first time, key point mutations through a comprehensive pan-cancer analysis. To better visualize the data we plotted all functional site-affecting cancer-associated mutations from the above 51 key genes using Circos plot ( Figure 5, the matrix seen Supplementary Table S4d). The plot includes a total of 990 mutations related to one or more cancers (106 of them are associated with three or more cancer types). Functional site-affecting mutations from genes like TP53, HIST1H4A, HIST1H3A, RELN, SMAD4, CTNN81, DICER1, KRAS, NRAS, BRCA2 and PTEN account for majority of the nsSNVs associated with cancers. Figure 5 provides an overview of distinct patterns on how different functional site-disrupting nsSNVs are contributed by each gene. The 127 SMGs consists of one miRNA (MIR142) and 126 pro-tein coding genes. Note that 21 of them either are found to have no functional site annotations (such as EGR3) or have functional site(s) which do not overlap with with nsS-NVs (e.g. PCBP1). Note that 17 of them contain at least one nsSNV affected functional site, which links to disease annotation(s) (e.g. MECOM) but not a cancer type; and the rest 88 of them harbor one or more mutations associated with at least one cancer type. For those proteins which have functional sites and nsSNVs in our data set but show no overlap between the two can point to passenger mutations or might indicate paucity of known/annotated functional sites. For example, PCBP1, which has around 40 functional sites and 38 nsSNVs on a 356 amino acids long sequence, shows surprisingly no overlap of nsSNVs and functional sites. It is possible that for such proteins either are protected from mutation of functional sites or the knowledge of all of the functional sites in the protein is incomplete.
In addition to the above analysis we wanted to interrogate the data to identify mutations that are reported in multiple studies in hopes of gaining a slightly different perspective of the genes and mutations associated in cancer and providing a measure of validation to our analysis of cancer-specific mutations. To achieve this we counted the number of publications associated with a specific variation (based on counting unique PubMed IDs). We found 3787 functional sitedisrupting nsSNVs supported by two or more published studies (Supplementary Table S4e). Out of those 3787 mutations, 379 are described in three or more studies. Based on our review of the publications associated with the mutations we conclude that although there are several publications that associate cancer to specific mutations, very few of these mutations are reported in databases. Focused biocuration can connect cancer, mutation and PMIDs together which in turn can help identify mutations that have been identified by multiple studies.
Conservation analysis. Conservation analysis can be used as a complimentary method to measure the importance of a specific site from an evolutionary perspective. The conservation ratio ranges from 0 to 1 as described in Materials and Methods. We found that functional sites impacted by nsSNVs are significantly more conserved than nsSNV sites which are not associated with any known function (Supplementary Table S5a). Other than the significance analysis using P-value, the absolute ratio provides finer resolution ( Figure 6a) and at the same time illustrates the distinct differences between germline and somatic SNVs. It is clear that somatic SNVs disrupts more conserved sites compared to germline mutations in both functional site affecting nsS-NVs and total nsSNVs (All+S and Func+S, respectively, in Figure 6a). Generally, we found nearly all the sites affecting a functional site has a higher conservation ratio than the global nsSNV conservation ratio, especially binding site (on average 0.78) and active site (on average 0.83) as can be seen in Figure 6b. Common PTM types, like N-linked glycosylation and phosphorylation, have a mild conservation ratio which is higher than the global ratio for total nsS-NVs. O-linked glycosylation is the only exception where the conservation ratio for nsSNV affecting the site has a value (Ser 0.44, Thr 0.26) even lower than the global nsSNV ratio (Ser 0.50, Thr 0.46). More experimentally verified O-  The conservation ratio is calculated based on the percent of nsSNV impacted amino acids found to be conserved according to the BLAST results. In the radar chart, the radiating axis shows the conservation ratio values ranging from 0.3 to 0.9. 'ALL' stands for a specific type of amino acid that is affected by nsSNV regardless of functional site annotation. 'Fun' stands for an amino acid containing a functional site affected by nsSNV. 'G' stands for germline nsSNV site while 'S' stands for somatic nsSNV site. 'T' represents total conservation ratio for both germline and somatic nsSNVs. (b) Amino acid-based conservation ratio profile of various types of functional site affected by nsSNVs. The conservation ratio, which is calculated by the same method as in Figure 6a, is presented as the height of each bar. For each amino acid, depending on the type of functional site there can be one or multiple bar(s) that are colored differently. Each colored bar is represents the conservation ratio of a specific type of functional site affected by nsSNVs for that particular type of amino acid. For instance, the color 'red' is for acetylation, so there are red bars that occur among A(Ala), K(Lys), M(Met) and S(Ser). glycosylation sites are needed to evaluate if this trend holds. Figure 6b only delineates an overall trend among different amino acids while Supplementary Table S5 provides the details of total conservation ratio, somatic/germline ratio on all the main amino acid grouped by different type of functional sites in our data set.
Example filtering and sorting of data to identify validation targets. There are several ways one can filter and sort the data to identify priority validation targets. In addition to the analysis mentioned above, we present below several examples of sorting the data to retrieve potential targets which can be further characterized and validated to connect genomic variation to functional impact and disease.
Through the comprehensive analysis of variations in all types of cancers that lead to loss of phosphorylation sites, we identified 143 mutations present in at least 2 types of cancers and 16 mutations found in 3 or more cancers (Supplementary Table S4c). From our analysis of TCGA breast tumor cancer samples (CSR porject (3)), we found a nsSNV (hg19 chr17:12915009) that changes reference nucleotide G to A. This mutation was found in 33% of the tumor samples and 12% of the control samples. The SNV maps to the protein zinc phosphodiesterase ELAC protein 2 (ELAC2; Q9BQ52) and changes 217 serine (S) to leucine (L). In UniProtKB, this mutation is recorded as a natural variant with the note 'in HPC2; does not affect the enzymatic ac-tivity'. However, in our data set, we found this mutation is shared among CSR, NCI-60 (27 out of 60 cell lines (14)), dbSNP and annotated in UniProtKB to cause the loss of a phosphorylation receptor Ser (a phosphoserine site (35)). We also found that the site is conserved in 4 out of 5 orthologous proteins in mammals (for list of organisms see Materials and Methods). This mutation is linked to a Swiss-Var record for 'prostate cancer, hereditary, 2'. This mapping and annotation bridges the gap between the phosphorylation site and disease-driven mutation and provides clues to test the possible mechanism of disease.
For loss of N-glycosylation sites we identified 63 mutations present in 2 or more types of cancers and 5 mutations found in 3 or more cancers (Supplementary Table  S4c). Here we use the replacement of an asparagine residue in position 130 of SLCO1B1 protein (UniProtKB accession: Q9Y6L6) as an example. The genomic variation on hg19 chr12:21329738 from A to G is shared by CSR, ICGC, dbSNP and NCI-60 exome project. The variation can be mapped to UniProtKB entry Q9Y6L6 (protein name: Solute carrier organic anion transporter family member 1B1). We found that there is nearly a 20% frequency difference between CSR breast cancer samples and normal samples for this specific variation and the variation can be widely detected on 28 cell lines across the NCI-60 cell line panel. The corresponding protein site is under moderate conservation (conserved in 3 out of 5 orthologous proteins).
A comprehensive analysis of variations in all types of cancers that lead to loss of active sites identified 33 such mutations that are present in no less than 2 type of cancers and 1 mutation were found in 3 or more cancers (Supplementary Table S4c). The mutation of an enzyme's active site can destroy its activity. In the 2385 nsSNV modified active sites, 222 are annotated to be associated with diseases. Here we choose one variation not yet well-studied to identify disease associations for which we have a distinct sample frequency among the case and control samples of our CSR TCGA breast cancer research. The genomic variation on human genome hg19 chr14: 24707479 from G to A occur in 13% of case samples but only 4.8% of control samples of our CSR TCGA breast cancer data. The same variation has also been reported by the NCI-60 cell line panel exome study (5 cell lines out of total 60). The protein is GMP reductase 2 (GMPR2) and the variation is mapped to the protein's active site Gly on 242 which is changed to Asp by the nsSNV. This may cause loss of function resulting in a disease phenotype. It is interesting to note that it has been shown that lack of expression of the proteins GMPR2 and PPRA are associated with the basal phenotype and patient outcome in breast cancer (139) and loss of function may play a role in carcinogenesis. For loss of binding sites we identified 439 such mutations that are present in 2 or more cancer types and there are 19 mutations that can be seen across 4 and up to 9 distinct DO cancer terms (Supplementary Table S4c).
As an example we highlight here a GTPase NRas (NRAS) protein and its variation in position 61 from Gln to Arg caused by genomic variation on hg19 chr1:115256529 from T to C. The SNV is present in TCGA, ICGC, IntOGen, NCI-60 cell line data set and dbSNP. The corresponding protein variation can also be found in the corresponding UniProtKB feature line. This binding site information is from the CDD database (26), which records the site as a part of GEF interacting site. General annotation in UniProtKB states that NRAS is activated by GEF. Thus, the mutated binding site of GEF interaction may lead to an impaired activation function of the protein, which can lead to the disease annotation for that site labeled 'lung carcinoma cell and melanoma' in SwissVar. By checking the reference (140) that identified the site as the disease-causing site, we found there is no explicit description how the disease is caused by the variation. The disrupted interaction with the activator GEF may serve as one possible explanation proposed by our study. Similar potential validation targets can be identified for all of the nsSNV-affected PTM and functional sites discussed in this paper.

Phylogenetic classification of patients
Cancers are primarily classified based on tissue of origin. Based on our pan-cancer analysis results described above we see that there is a need to develop additional methods of classification for biomarker discovery. Recently, we have developed a method that allows phylogenetic classification of tumor and normal samples based on mutation profile compared to the human reference genome (3). Such phylogenetic analysis promises a systematic view of multiple samples or a population which can help us to understand the background heterogeneity of each sample from patients with the same or similar disease phenotype. This analysis also facilitates personalized analysis which can lead to personalized diagnostics and therapeutics. In this study, we selected 30 breast cancer samples and compared them with 5 breast cancer derived NCI-60 cell lines. The cancer-centric phylogenetic tree in Figure 7a was generated based on SNVs extracted from the 35 samples (30 tumor samples and 5 cell lines). Phylogenetic analysis shows that the samples fall into few main groups and it is interesting to note that all NCI-60 cell lines, T-47D, HS-578T, MCF7, BT549 and MDA-MB-231 in the same clade with several patient samples associated with that specific branch. It has long been argued that breast cancer is a complex and heterogeneous disease. Till date several factors have been used to classify breast cancer types and cell lines that include expression analysis, histological type, tumour grade, lymph node status, presence of predictive markers such as oestrogen receptor and human epidermal growth factor receptor 2, luminal subtype, etc (141). The use of breast cancer cell lines to model breast cancer has benefits and we believe phylogenetic classification of the cell lines and tumor cells provides a higher level of classification that can complement existing methods and help investigate the functional impact of variation in different subgroups. Additionally, this type of analysis can also help provide some basic quality control step where multiple samples have been sequenced from the same patient from the same tissue. Figure 7a provides such an example where we can see that samples from the same tissue and same patient (share the same TCGA patient sample barcode) are paired in the same branch (TCGA-A7-A13E-01A and TCGA-A7-A13E-01B). It is important to remember that the phylogenetic tree is based on multiple sequence alignment of all exomes (patients and cell lines) and not just somatic mutations (3,19). If whole genome sequence data is present then Branches are colored to visualize the bootstrap value (color from red to yellow represents the corresponding bootstrap value from high to low). Heatmap demonstrates the status of loss of functional sites of each individual sample is overlaid at the tip of the phylogenetic tree branches. From inside to outside of the heatmap contains mutated functional sites from acetylation to ubiquitylation. (b) A hierarchal clustering is performed based on significance of mutational profile on functional site. Underrepresentation is colored in red and overrepresentation is in blue. The darkness of the color in each box reflects the absolute value of -log(P-value). Based on the patterns of -log(P-value), cancer samples/cell lines are clustered into different groups. based on the phyloSNP (82) algorithm one can classify the patients and the tumors with a higher resolution.
The merged matrix shows 527 nsSNVs that affect functional sites among 35 samples, which indicates overlapping of functional site impacting nsSNVs across samples. The heat map appended on the phylogenetic tree ( Figure 7a) provides a visual representation of the distribution of those nsSNVs based on clustered samples. From inside to outside, the colored boxes represent acetylation, active site, binding site, methylation, N-linked glycosylation, O-linked glycosylation, other modified residue, phosphorylation and ubiquitylation.
These 527 nsSNVs impact nine different types of functional sites. The statistical significance of each type of nsSNV affected functional site from each sample was calculated and is presented as a heatmap (Figure 7b). Based on different patterns of significance, an unsupervised hierarchical clustering was performed to show the association between different samples. From the figure, acetylation, active site, binding site, phosphorylation site and ubiquitylation sites were found to have significant impact by nsSNVs. N-linked glycosylation and O-linked glycosylation present the overall trend of overrepresention of nsSNVs with few exceptions, which indicates the heterogeneity of functional profile from the genomic landscape of tumor samples. The sample level resolution also allows the observation of other outstanding data points. For example, there are few samples which show myristylation or methylation being affected by nsSNVs while the majority show no significance impact for these functional sites. When looking across the clustered samples, we can summarize that the grouped five cancer cell lines are relatively less impacted by nsSNVs in binding and phosphorylation sites than most of the TCGA patient samples, which suggests cell lines are under less selection or have higher tolerance to variations interrupting these types of functional sites. Comparing this clustering with SNV-based phylogenetic tree on the same group of samples (Figure 7a), additional insight can be gained. Shared result from the two figures includes the clustering of five cell lines. The fact that the cell lines are grouped together under the two distinct methods, suggests the differences of mutational profile and mutated functional profile between cell lines and patient samples are significant. Although, we agree to Meyerson view 'millions and millions of people are getting cancer, and over time the statistical power of looking at primary tumors is going to be greater' (142); however, the boundary is sometimes bridged as can be seen in Figure 7b. There are three patient samples that cluster closely with the five cell lines as they share similar patterns of nsSNV affected functional sites. Rest of the samples are clustered in more or less different hierarchy than the SNV-based phylogenetic tree, indicating the heterogeneity of tumor samples. It is also interesting to see different mutated functional site profile among evolutionarily paired samples from the same tumor patient. Overall, we believe this type of analysis can provide an informative reference as a complement of traditional classification methodologies described above.

Linking the nsSNV impacted functional sites to diseases
For this analysis disease associated SNVs from SwissVar, GWAS catalog project and dbSNP was collected. The collection resulted in 63 927 distinct entries though some cancer entries from different sources may overlap. Among 63 927, there are 2110 that are present in nsSNV affected functional sites (Supplementary Table S6, column Disvar). The top 10 genes which have the highest number of mutations (normalized based on their length) related to diseases are TP53, HBB, VHL, TTR, HMBS, HBA1, SOD1, GCH1, F8 and KRAS. Although, we attempted to have a comprehensive disease-related mutation list we realized that additional biocuration and literature mining-based methods can significantly increase this type of mapping. Increased mappings may lead to additional genes associated with nsSNVs and their impact to functional site.

CONCLUSION
The significance of nsSNV impact calculated based on the source and type of functional site provides a comprehensive view of how germline and somatic variations are distributed on protein functional sites. Somatic nsSNVs collected from various cancer-centric databases, instead of showing random distribution, occurs on functional sites with significant over-or underrepresentation and therefore may be involved in the development of the tumor phenotype. Additionally, use of phylogenetic classification using SNVs and overlaying the tree with functional site variations provides a framework for novel translational and personalized medicine research and can lead to targeted diagnostics and therapeutics. It is important to note that many disease-related mutation sites are located in the non-coding regions of the genome (143). Analysis of those mutations using methods similar to the ones described in this paper may help to unravel some disease complexity. In this study even after collecting the most comprehensive information on protein functional sites we find only 2.29% affect the types of functional sites defined in our study. Therefore, the vast majority of cancer-associated mutations either have no effect on protein function or one can argue that our current knowledge of protein functional sites is highly limited. Additional protein functional site information in the coming years will help refine and improve this analysis. Our future plans include redoing the analysis described here every 2 years and compare and contrast the results over time.
It is also possible that neutral sequence variants may define individuals and their diseases (40). Additionally, certain disease phenotypes arise through combinations of many variants whose individual effects might not be damaging. With the development of sequencing technologies, highquality genomics and proteomics data will help unravel the complexities of biological systems and help connect genomes to phenomes, especially for complex diseases like cancer and diabetes. In order to get a better understanding of the system, functional annotation that facilitates the exploration of the impact of a variant is necessary and hence active biocuration of genomic and proteomic data is of utmost important.