Whole-exome and whole-genome sequencing have facilitated the large-scale discovery of de novo variants in human disease. To date, most de novo discovery through next-generation sequencing has focused on congenital heart disease and neurodevelopmental disorders (NDDs). Currently, de novo variants are one of the most significant risk factors for NDDs, with a substantial overlap of genes involved in more than one NDD. To facilitate better usage of published data, provide standardization of annotation, and improve accessibility, we created denovo-db (http://denovo-db.gs.washington.edu), a database for human de novo variants. As of July 2016, denovo-db contained 40 different studies and 32,991 de novo variants from 23,098 trios. Database features include basic variant information (chromosome location, change, type); detailed annotation at the transcript and protein levels; severity scores; frequency; validation status; and, most importantly, the phenotype of the individual with the variant. The browsable website allows any query result to be downloaded and also offers a downloadable file of the full database with additional variant details. denovo-db provides necessary information for researchers to compare their data to other individuals with the same phenotype and also to controls, allowing for a better understanding of the biology of de novo variants and their contribution to disease.
Each person contains novel variants not present in either of their parents and these variants are termed de novo. Most of the ∼70 (1) de novo single-nucleotide variants and small insertions/deletions (indels) found in an individual genome have no obvious phenotypic impact, but there are cases where de novo variants have been found to contribute to disease. Well-described examples include achondroplasia where mutations occur in FGFR3 (2) and Rett syndrome where in most cases the variants arise de novo in MECP2 (3). With the advancement of next-generation sequencing into the study of the whole complement of human genes via whole-exome or whole-genome sequencing, researchers are getting a clearer picture as to the contribution of these variants to ‘complex’ diseases such as autism (4–15) and schizophrenia (16–18). For example, in autism, de novo single-nucleotide variants and indels contribute to ∼7% of the attributable fraction (6) and as much as 21% of simplex cases of the disease (5). Considerable overlap has also been noted between genes with de novo variants contributing to several neurodevelopmental disorders (NDDs) (19).
While the primary focus in the literature for disease-causing de novo mutations has been on NDDs and congenital heart disease, other phenotypes have also been assessed with smaller sample sizes. As sequencing is applied to these and other disorders and diseases on a large scale, even more findings are sure to arise. denovo-db was designed with the objective of consolidating all published de novo germline variants, regardless of phenotype, and systematically annotating them with standardized analytical pipelines. This provides the research community with a one-stop location for assessing the significance of particular genes or mutations as they relate to their phenotype of interest. A researcher could then ask questions relevant to a disease, such as whether the number of de novo variants seen in a gene is statistically significant, using tools such as denovolyzeR (20), or whether the variants seen in a gene across many individuals are more clustered in disease than would be expected based on control data, using tools such as CLUMP (21). A researcher could also ask questions unrelated to disease, because patterns of de novo variants gathered across many individuals may provide novel insight into the biology of new mutation in the human genome. denovo-db thus provides a resource for specific and general analyses regarding de novo mutations.
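As an illustration of the first kind of question, the following minimal sketch tests whether a gene carries more de novo variants than expected under a simple Poisson model. It is not the denovolyzeR implementation; the per-chromosome mutation rate, the cohort size and the observed count are hypothetical values chosen only to make the example run.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)  # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k  # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def expected_de_novo(per_chrom_rate: float, n_trios: int) -> float:
    """Expected de novo count in an autosomal gene (two copies per child)."""
    return 2.0 * per_chrom_rate * n_trios


# Hypothetical inputs: a rate of 2e-6 per chromosome per generation,
# 2500 trios, and 3 observed de novo variants in the gene.
expected = expected_de_novo(per_chrom_rate=2e-6, n_trios=2500)
p_value = poisson_upper_tail(observed=3, expected=expected)
print(f"expected={expected:.3f}, p={p_value:.3g}")
```

In practice the expected rate would come from a calibrated per-gene mutation model, and the resulting P-value would need correction for the number of genes tested.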
We searched the literature for published studies where human de novo variants had been identified by next-generation sequencing technology (4–7,9,10,13–18,22–49). These studies were then carefully curated to gather essential information on each de novo variant, including sample identifier (if possible), chromosome, chromosome position, reference allele, alternate allele, and orthogonal validation status. A validation status of ‘yes’ indicates that the variant has been validated as de novo in the child and absent in the parents. The sample identifiers used in denovo-db originate directly from the published literature, and if there is not one available, a simple nomenclature is assigned: LastAuthorNameSampleX where X is a number. If source coordinates were not mapped to GRCh37, the coordinates were lifted over for consistent annotation among all studies. The data from each paper was then aggregated into a study table with a yaml (http://www.yaml.org/) configuration file corresponding to information in the file required for our pipeline. If any data was not available or was unclear, we queried the authors for additional information. Care was taken to avoid duplication of samples within the database. One example is individuals from the Simons Simplex Collection (SSC). To date, sequencing information has been published from ∼2500 SSC families and the data aggregated into denovo-db from eight studies (5,6,8–12,14). For this collection alone, there have been thousands of duplicates. In cases where this duplication occurs, orthogonal validation status takes precedence.
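The duplicate handling described above can be sketched as follows. This is a simplified illustration rather than the pipeline code: the `Variant` fields, the boolean `validated` flag and the exact concatenation used for fallback sample identifiers are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    sample_id: str
    chrom: str
    pos: int
    ref: str
    alt: str
    validated: bool  # assumed encoding of a validation status of 'yes'
    study: str


def fallback_sample_id(last_author: str, index: int) -> str:
    """Build an identifier when a paper supplies none (assumed format)."""
    return f"{last_author}Sample{index}"


def deduplicate(variants):
    """Keep one record per sample and site, preferring validated records."""
    best = {}
    for v in variants:
        key = (v.sample_id, v.chrom, v.pos, v.ref, v.alt)
        current = best.get(key)
        if current is None or (v.validated and not current.validated):
            best[key] = v
    return list(best.values())


# Hypothetical records: the same site in the same sample from two studies.
records = [
    Variant("S1", "1", 100, "C", "T", False, "StudyA"),
    Variant("S1", "1", 100, "C", "T", True, "StudyB"),
    Variant("S2", "2", 200, "G", "A", False, "StudyA"),
]
unique = deduplicate(records)
print([(v.sample_id, v.study) for v in unique])
```

Keying on the sample identifier as well as the site means that the same variant seen in two different individuals is kept as two records.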
Combination and annotation of data
Each study table was converted to a gcf file, a variant call format (VCF) file with one sample per line, and then all gcf files were combined to make a master VCF file of all the studies. This data was then run through the SnpEff (50) program to add annotation information. Post-annotation, variants were removed that did not validate (based on the orthogonal validation) or were found to be inherited and therefore not de novo. All variants were subsequently re-annotated using SeattleSeq (51) so that we could get annotation for all available RefSeq transcripts. Whenever a variant did not have annotation by SeattleSeq, we converted the SnpEff annotation to another label as described in
denovo-db is available at http://denovo-db.gs.washington.edu and requires no usernames or passwords. It is available to the public for querying and downloading data. We have tested it in Mozilla Firefox, Google Chrome and Apple Safari browsers. The download version of the full denovo-db dataset is available as a tab-delimited file on the ‘Download’ page of the website. It contains annotation to all transcripts, based on SeattleSeq, as well as additional columns related to scoring of variants. Each update of denovo-db is released with a version number, and old versions are maintained and archived by the Eichler laboratory using the git version control system. We will update the database and website four to six times per year depending on the number of new papers in the literature with de novo variant data. We have set up a mailing list at email@example.com for users with additional questions. Researchers can also use the mailing list to send us information on other published studies to include in the database. Upon receiving this information we will run the data through our pipeline and integrate it into denovo-db.
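Because the full download carries annotation for every transcript, a single variant can occupy several rows. The sketch below shows one way to read a tab-delimited file and group rows by variant. The column names, transcript identifiers and rows are hypothetical and may not match the real download, so the header of the actual file should be checked first.

```python
import csv
import io

# Hypothetical header and rows; the real file's columns may differ.
SAMPLE_TSV = (
    "SampleID\tChr\tPosition\tVariant\tGene\tTranscript\tFunctionClass\n"
    "Sample1\t1\t1000\tC>T\tGENE_A\tNM_000001.1\tmissense\n"
    "Sample1\t1\t1000\tC>T\tGENE_A\tNM_000002.1\tstop-gained\n"
    "Sample2\t2\t2000\tG>A\tGENE_B\tNM_000003.1\tmissense\n"
)


def read_rows(handle):
    """Read a tab-delimited table into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def group_by_variant(rows):
    """Map each (sample, chromosome, position, change) to its transcripts."""
    grouped = {}
    for row in rows:
        key = (row["SampleID"], row["Chr"], row["Position"], row["Variant"])
        grouped.setdefault(key, []).append(row["Transcript"])
    return grouped


rows = read_rows(io.StringIO(SAMPLE_TSV))
by_variant = group_by_variant(rows)
print(len(rows), "rows describe", len(by_variant), "distinct variants")
```

To read a file from disk, replace the `io.StringIO` object with `open(path, newline="")`. Counting variants per gene or per phenotype should be done on the grouped keys rather than on raw rows to avoid counting a variant once per transcript.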
As of July 2016, denovo-db consisted of 32,991 variants (n = 8,541 orthogonally validated) collected from 40 studies and affecting 31,996 unique sites in the genome (Figure 1A). The majority of variants come from controls (n = 17,698); most of the remainder come from individuals with NDDs, including autism (n = 12,358), schizophrenia (n = 810), epilepsy (n = 440) and intellectual disability (n = 197), and from individuals with congenital heart disease (n = 1,308). A number of other smaller studies have contributed variants found in people with amyotrophic lateral sclerosis (n = 42), congenital diaphragmatic hernia (n = 40), neural tube defects (n = 40), early onset Parkinson's (n = 20), early onset Alzheimer's (n = 14), Cantú syndrome (n = 11), sporadic infantile spasm syndrome (n = 5), anophthalmia microphthalmia (n = 4) and acromelic frontonasal dysostosis (n = 4).
From the 40 studies there are 16,605 individuals affected with a disorder or disease (n = 14 affected phenotypes represented) and 6,493 unaffected individuals. Thirty of the studies are from whole-exome sequencing, eight from whole-genome sequencing, and two from targeted resequencing. Annotation of variants corresponds to 10,170 genes and 20,108 transcripts. In total, there are 25 functional categories representing 1,161 likely gene-disrupting (LGD) events (412 stop-gained, 593 frame-shift, 58 splice-acceptor and 98 splice-donor) and 6,074 missense events (Figure 1B). A metric often used to assess variant severity is the Combined Annotation Dependent Depletion (CADD) (52) score. We have also included this in the database (Figure 1C) and there are notably 394 missense events with a CADD score >30.
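The LGD total above is the sum of four functional classes. A short sketch of that grouping follows, using the per-class counts reported in the text; the exact spelling of the class labels stored in the database is an assumption.

```python
# Per-class counts as reported in the text (July 2016).
REPORTED_COUNTS = {
    "stop-gained": 412,
    "frame-shift": 593,
    "splice-acceptor": 58,
    "splice-donor": 98,
    "missense": 6074,
}

# Classes treated as likely gene-disrupting (label spelling assumed).
LGD_CLASSES = frozenset(
    {"stop-gained", "frame-shift", "splice-acceptor", "splice-donor"}
)


def is_lgd(function_class: str) -> bool:
    """Return True when the functional class counts as an LGD event."""
    return function_class in LGD_CLASSES


lgd_total = sum(n for cls, n in REPORTED_COUNTS.items() if is_lgd(cls))
print(lgd_total)  # 1161
```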
Finally, we have included information on orthogonal validation, which is very important since true de novo variants are sometimes difficult to detect due to undercalling in parents. By searching the literature and/or contacting authors, we identified a total of 8,541 validated variants. Some studies, particularly those that are smaller, tend to validate all variants (Figure 1D), while larger studies tend to validate only a subset and from these extrapolate a false positive rate of discovery.
Novelty of denovo-db
We know of two other databases that are similar to denovo-db. Both of these databases focus on NDDs, in contrast to our database, which collects information on de novo variants regardless of phenotype. The first database is NPdenovo (53), which collects only de novo variants related to NDDs. The link listed in the paper does not seem to work anymore (http://220.127.116.11/NPdenovo/) but this appears to be the new link: http://www.wzgenomics.cn/NPdenovo/. Unlike NPdenovo, denovo-db does not limit collection of de novo data to neuropsychiatric disorders. The second is the Developmental Brain Disorder (DBD) Gene Database (54) (http://geisingeradmi.org/care-innovation/studies/dbd-genes/), which collects information on variants in developmental brain disorders. In particular, it keeps only LGD events such as splice-donor, splice-acceptor, stop-gain, and frame-shift mutations. It is a very useful website in that it calculates the relevance of each gene for NDDs. Our database differs by collecting de novo variants regardless of phenotype and functional class. denovo-db is meant to be a compendium of all de novo variants and does not make any assumptions on the researchers’ usage of the data.
The denovo-db website offers a number of options for querying the data. One way is to search by gene. This can be done by typing the gene name (e.g. CHD8) (Figure 2A), by typing the beginning of the gene name followed by an asterisk (*) to identify all variants in genes beginning with that text (e.g. CHD*), or by pasting a comma-separated list into the search (e.g. CHD8, MECP2, PAX4). The next way is to search by chromosome position, either by typing a single base position (e.g. chr14:21871373) or by typing a genomic range (e.g. chr14:21806838-21946382). An alternative is to search by typing the name of a phenotype: the website has a built-in function that matches the user's input text to the existing phenotypes in denovo-db and provides a dropdown of available options. For example, entering de returns the following phenotypes to choose from: developmental_disorder and neural_tube_defects. Querying by function is also available and works like the phenotype search. For example, typing sp returns the following function class options: splice-acceptor and splice-donor. It is also possible to search by CADD score greater than or equal to a designated value. Another important search option is to enter a sample name (Figure 2B), for which the website returns a list of all variants for that individual. Finally, the database can also be queried by study name.
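The query forms described above can be mimicked on a local copy of the data. The sketch below is an illustration only and does not reproduce the website's code: the function names are hypothetical, the wildcard handling relies on Python's `fnmatch` rules (which also give `?` and `[...]` special meaning), and substring matching for the dropdown is inferred from the de example.

```python
import re
from fnmatch import fnmatchcase

POSITION_RE = re.compile(
    r"^chr(?P<chrom>[0-9]{1,2}|X|Y|M):(?P<start>\d+)(?:-(?P<end>\d+))?$"
)


def parse_position(text):
    """Parse 'chr14:21871373' or 'chr14:21806838-21946382'."""
    m = POSITION_RE.match(text.strip())
    if m is None:
        return None
    start = int(m["start"])
    end = int(m["end"]) if m["end"] else start
    if end < start:
        return None
    return m["chrom"], start, end


def match_genes(query, genes):
    """Match a comma-separated list of gene names or 'PREFIX*' patterns."""
    patterns = [p.strip() for p in query.split(",") if p.strip()]
    return [g for g in genes if any(fnmatchcase(g, p) for p in patterns)]


def suggest(text, options):
    """Return the options containing the typed text (case-insensitive)."""
    needle = text.lower()
    return [o for o in options if needle in o.lower()]


# A small, hypothetical subset of gene names and phenotype labels.
genes = ["CHD8", "CHD2", "MECP2", "PAX4"]
phenotypes = ["autism", "developmental_disorder", "epilepsy", "neural_tube_defects"]
print(parse_position("chr14:21806838-21946382"))
print(match_genes("CHD*", genes))
print(suggest("de", phenotypes))
```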
There are other features available for browsing queried results on denovo-db. First, you can filter variants. This can be done by typing any term in the ‘Filter’ field on the top left side above the table and the variant table will display entries matching your term. For example, enter missense and only missense variants within the current queried result will be displayed in the table. Second, you can sort columns in ascending or descending fashion by clicking the arrows in the column headers. Third, you can select columns using the ‘Show/hide columns’ button on the top right side above the table. Fourth, you can select the table size per page by using the ‘Show entries’ pull-down menu on the top right side above the table. Finally, you can export data. The full queried data set can be exported to a tab-separated-value (TSV) file through the ‘Export to TSV’ button on the top right side above the table. The output TSV file may contain more entries than are displayed online without filtering, because it may include all annotated entries for alternative transcripts. It also contains more attributes than are displayed online. Of note, the results tables incorporate hyperlinks to PubMed study IDs, GeneCards for the gene name, and the dbSNP (55) variants when available.
We present denovo-db as a resource for human de novo variants found in the literature. Our new database provides a comprehensive collection and assessment of these variants with a standardized format for annotation. Getting data from the literature into this uniform annotation is a key benefit of our database as the original publications represent a number of formats. In some cases, the variants are presented in a table with the information readily usable but in many other cases, it is in a different format. Examples of these formats include a written description of the variant within the text of the paper, tables encoded into PDF documents that are not always exportable to Excel and require hand curation, variants listed only using their HGVS mRNA annotation, and variants containing the wrong reference base. In addition, some published variants are mapped to older versions of the human reference. We manage these formats through careful assessment of the publications and contact with authors as necessary.
Inclusion of orthogonal validation status is another unique aspect of denovo-db. While many studies reported the validated sites, in some cases the authors gave the number of validated events in the paper but did not report the validation status of the individual sites. All of the authors that we emailed readily provided us with the validated events, if available, for their study and we were able to integrate this information into our database. As seen in Figure 1D, the percent of events validated varies greatly by phenotype. Variants from controls make up the largest share of the database, but the majority of these events (88%) have not been tested for their validation status. This is very important for researchers to consider when using denovo-db.
denovo-db is the first public database, to our knowledge, focusing on de novo variants irrespective of phenotype. It includes many features of the variants, including their basic annotation, and more advanced information including severity scores and orthogonal validation status. One way to analyze data from our database is to look at the number of LGD events by case or control status. We assessed those genes with two or more LGD events in denovo-db (Figure 3) and identified genes with LGD events only in cases, genes with LGD events only in controls, and genes with LGD events in both cases and controls. Genes with LGD events in controls may not be as interesting to researchers studying specific biology; this is a key reason why including various phenotypes is important. To assess missense mutations, we can examine the CADD score of missense variants by phenotype, particularly in controls, autism, congenital heart defects, intellectual disability, and epilepsy. By looking at the empirical cumulative distribution functions of these events (Figure 4), we see that missense CADD scores are higher in intellectual disability and epilepsy than in the other phenotypes. These are just two ways to examine denovo-db and we look forward to seeing how other researchers are able to use this data to explore new biology.
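Both analyses can be approximated on exported data with a few lines of code. The following is a minimal sketch, not the code used to produce Figures 3 and 4; the gene names, case/control flags and CADD scores are invented for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def ecdf(values):
    """Return the empirical cumulative distribution function of values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("ecdf needs at least one value")
    n = len(ordered)

    def cdf(x):
        # Fraction of observations that are <= x.
        return bisect_right(ordered, x) / n

    return cdf


def classify_lgd_genes(events, min_events=2):
    """Label genes by where their LGD events occur.

    events is an iterable of (gene, is_case) pairs, one per LGD event.
    """
    counts = defaultdict(lambda: [0, 0])  # gene -> [case, control] events
    for gene, is_case in events:
        counts[gene][0 if is_case else 1] += 1
    labels = {}
    for gene, (n_case, n_control) in counts.items():
        if n_case + n_control < min_events:
            continue
        if n_control == 0:
            labels[gene] = "cases only"
        elif n_case == 0:
            labels[gene] = "controls only"
        else:
            labels[gene] = "cases and controls"
    return labels


# Invented data for illustration.
lgd_events = [("GENE_A", True), ("GENE_A", True), ("GENE_B", False),
              ("GENE_B", True), ("GENE_C", False)]
print(classify_lgd_genes(lgd_events))

control_cdf = ecdf([10.0, 20.0, 30.0, 40.0])
print(1 - control_cdf(30.0))  # fraction of scores above 30
```

Comparing `1 - cdf(30)` across phenotypes gives the share of missense events above a CADD score of 30 in each group, which is one way to summarize the shift visible in an ECDF plot.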
We thank T. Brown for assistance in editing this manuscript and members of the Eichler lab for their feedback on database features.
Simons Foundation [SFARI 303241 to E.E.E.]; National Institute of Mental Health [R01MH101221 to E.E.E.]; National Human Genome Research Institute and the National Heart, Lung and Blood Institute [2UM1HG006493 to D.A.N.]; National Human Genome Research Institute [postdoctoral training grant 2T32HG000035 to T.N.T. and H.A.F.S.]. T.N.T. is an Autism Science Foundation postdoctoral fellow and E.E.E. is an investigator of the Howard Hughes Medical Institute. Funding for open access charge: National Institute of Mental Health [R01MH101221].
Conflict of interest statement. E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc. and is a consultant for Kunming University of Science and Technology (KUST) as part of the 1000 China Talent Program.