AutismKB 2.0: a knowledgebase for the genetic evidence of autism spectrum disorder

Abstract
Autism spectrum disorder (ASD) is a complex neurodevelopmental disorder with strong genetic contributions. To provide a comprehensive resource for the genetic evidence of ASD, we have updated the Autism KnowledgeBase (AutismKB) to version 2.0. AutismKB 2.0 integrates multiscale genetic data on 1379 genes, 5420 copy number variations and structural variations, 11 669 single-nucleotide variations or small insertions/deletions (SNVs/indels) and 172 linkage regions. In particular, AutismKB 2.0 highlights 5669 de novo SNVs/indels due to their significant contribution to ASD genetics and includes 789 mosaic variants due to their recently discovered contributions to ASD pathogenesis. The genes and variants are annotated extensively with genetic evidence and clinical evidence. To help users fully understand the functional consequences of SNVs and small indels, we provided comprehensive predictions of pathogenicity with iFish, SIFT, Polyphen etc. To improve user experiences, the new version incorporates multiple query methods, including simple query, advanced query and batch query. It also functionally integrates two analytical tools to help users perform downstream analyses, including a gene ranking tool and an enrichment analysis tool, KOBAS. AutismKB 2.0 is freely available and can be a valuable resource for researchers.

Introduction
Autism spectrum disorder (ASD) is a severe neurodevelopmental disorder with core symptoms that include deficits in social interaction and social communication, as well as stereotypical and repetitive behaviors (1). Epidemiological studies in many countries have shown that the prevalence of ASD ranges from 1 to 2% of the population (2,3). Twin studies and cohort studies have established that genetic factors play a major role in the etiology of ASD (4-6). Inherited mutations and de novo mutations have both been found to contribute significantly to ASD (7-15). More recently, postzygotic genomic mosaicisms have also been associated with ASD (16-18).
Because of a highly heterogeneous genetic etiology, thousands of genes have been reported to be associated with ASD (19). These genes were identified with a variety of experimental approaches with variable evidence over a long period of time by many different groups. Thus, there is a strong need for databases that collect comprehensive evidence about ASD-associated genes from the extensive literature and research information resources. Autism KnowledgeBase (AutismKB), developed by our group in 2011, was the largest such database; its initial release included 2193 ASD genes, 2806 single nucleotide polymorphisms (SNPs) and indels, 4544 copy number variations (CNVs) and structural variations (SVs) and 158 linkage regions (20). Three other autism-related genetic databases are available to researchers. The Autism Chromosome Rearrangement Database (21) includes 372 ASD-associated chromosomal breakpoints, whereas the Autism Genetic Database (22) includes 743 CNVs of 226 ASD genes, and the AutDB (23) includes 2225 CNVs of 990 genes and 1165 animal models.
Since its publication, AutismKB has received 1 533 725 page views from 42 619 unique Internet Protocol (IP) addresses. However, new research developments, especially those fueled by next-generation sequencing (NGS) technologies, have revealed many new ASD-related genes and genetic variants, as well as new types of genetic variation, such as de novo variants and mosaic variants (16-18). Large-scale NGS studies revealed that de novo variants have important contributions to ASD (7,9,10,14,24-26) and might explain >10% of ASD probands (27). Dou et al. (17) estimated that 2.6% of the ASD diagnoses in the Simons Simplex Collection (SSC) could be explained by mosaic variants arising postzygotically in probands.
Here, in an effort to help researchers keep pace with the rapid growth in ASD-related genetic information, we updated AutismKB to version 2.0 (http://db.cbi.pku.edu.cn/autismkb_v2/) with significant expansion and changes.

Materials and methods
The framework of AutismKB 2.0
AutismKB 2.0 was created as a relational database using MySQL Server 5.6.26. The web interfaces were designed using PHP (5.5.18-pl0-gentoo), JavaScript and HTML. An overview of the construction of AutismKB 2.0 is shown in Figure 1. The framework consists of three major parts. The first part collects and updates autism-related genetic data and annotated data sets. The second part archives and presents the nine evidence data types and seven annotation data types. The third part is the user interface that displays our main data sets, three query methods and two analytical tools on our website. In this new version, we added new content and made corresponding changes to these three parts. In the first part, we added a new collection of mosaic-related literature. In the second part, we added mosaic variants as a new data type, as well as variant prediction in the annotations. In the user interface, we added new tools for batch query and enrichment analysis. We also changed the data categories by adding the category of de novo and mosaic variants, introducing function predictions and collecting large-scale single-nucleotide variants (SNVs) in the NGS categories. Finally, we optimized the table structure and table contents in the back end to speed up access and improve the user experience.

Data collection
We conducted a systematic review of the ASD-related literature by using the query term 'autis*[Title/Abstract]' to search the PubMed database monthly, and we updated the database every 6 months. For mosaic mutations, we used the query term 'autis* and mosaic*' to search the PubMed database. Next, we manually reviewed the search results. We collected genes, variations and evidence from the literature and integrated them into AutismKB 2.0. The selection criterion for the literature was that the publication presented defined ASD-related genes. For all publications that
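The two PubMed queries above could be issued programmatically. The following is a minimal sketch, assuming the public NCBI E-utilities ESearch endpoint is used; the paper does not say how the searches were run, and the function name `build_esearch_url` and the `retmax` default are illustrative choices. The sketch only builds the request URLs and performs no network access.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (an assumption: the paper does not
# state which interface was used to query PubMed).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query terms taken from the text; "AND" is written in upper case here,
# following common PubMed boolean-operator style.
ASD_TERM = "autis*[Title/Abstract]"
MOSAIC_TERM = "autis* AND mosaic*"


def build_esearch_url(term: str, retmax: int = 100) -> str:
    """Return an ESearch URL that would list PubMed IDs matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URLs that a monthly literature check might request.
    for term in (ASD_TERM, MOSAIC_TERM):
        print(build_esearch_url(term))
```

The returned identifiers would still need the manual review described above; the sketch covers only the search step.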
To provide evidence of the functional consequences of the reported variants, we added the predicted pathogenicity of genetic variants based on ANNOVAR (28) with RefSeq (build hg19) and iFish (integrated functional inference of SNVs in human) (29). iFish is a support vector machine-based classifier that uses gene-specific and gene family-specific attributes. In addition, iFish provides functional annotations from other classifiers such as SIFT (30), Polyphen2 (31) and MutationTaster2 (32). iFish utilizes a customized prediction cut-off for each classifier that maximizes the sum of sensitivity and specificity.
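The cut-off criterion in the last sentence can be illustrated with a small example. This is a sketch of the general technique (choosing the threshold that maximizes sensitivity plus specificity), not iFish's actual code; it assumes that higher scores mean "more deleterious" and that a labeled set of pathogenic and benign variants is available. The function name `best_cutoff` and the toy scores are hypothetical.

```python
from typing import Sequence, Tuple


def best_cutoff(scores: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Return (cutoff, sensitivity + specificity) with the largest sum.

    A variant is called deleterious when score >= cutoff.
    labels[i] is True for a known pathogenic variant, False for a benign one.
    Ties are resolved in favor of the lowest cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one pathogenic and one benign example")

    best = (float("nan"), -1.0)
    # Every observed score is a candidate cutoff.
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if not y and s < cutoff)
        total = tp / positives + tn / negatives
        if total > best[1]:
            best = (cutoff, total)
    return best


if __name__ == "__main__":
    # Toy data for illustration only (not real predictor output).
    toy_scores = [0.1, 0.2, 0.35, 0.6, 0.7, 0.9]
    toy_labels = [False, False, True, False, True, True]
    print(best_cutoff(toy_scores, toy_labels))
```

For a predictor whose lower scores indicate more damaging variants, the comparison direction would need to be reversed.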
To provide a user-tunable gene list with the strongest possible evidence, we provided a gene ranking algorithm identical to that included in AutismKB 1.0 (20). Briefly, we used an evidence-based candidate gene prioritization approach (33) that first assigns different weights to different types of experimental evidence using a benchmark ASD gene set, after which it calculates the weight of evidence of each gene by summing the weights of the positive evidence for that gene.
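The weighted-sum step described above can be sketched as follows. The evidence names, weights and gene labels are placeholders chosen for illustration; the actual weights are derived from the benchmark ASD gene set and are not reproduced here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical weights per evidence type (illustration only).
WEIGHTS: Dict[str, int] = {"gwas": 2, "cnv": 3, "linkage": 1, "ngs_de_novo": 5}


def gene_score(positive_evidence: Set[str], weights: Dict[str, int]) -> int:
    """Sum the weights of the evidence types that support a gene."""
    return sum(weights[e] for e in positive_evidence)


def rank_genes(
    evidence_by_gene: Dict[str, Set[str]], weights: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Return (gene, score) pairs sorted by descending score, then by name."""
    scored = [(g, gene_score(ev, weights)) for g, ev in evidence_by_gene.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Placeholder gene labels, not real database entries.
    evidence = {
        "GENE_A": {"gwas", "cnv"},
        "GENE_B": {"ngs_de_novo"},
        "GENE_C": set(),
    }
    for gene, score in rank_genes(evidence, WEIGHTS):
        print(gene, score)
```

Because the weights are passed in as an argument, a user-chosen weight set can be substituted without changing the scoring code, which mirrors the user-tunable design described in the text.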
Improved scoring system for ranking ASD candidate genes
AutismKB 2.0 implemented an improved gene scoring algorithm compared to AutismKB 1.0. First, we extended the six categories of experimental evidence to nine categories by dividing the previous category 'NGS and Low-Scale Gene Studies' into four different categories: 'NGS de novo Mutation Studies', 'NGS Mosaic Mutation Studies', 'NGS Other Studies' and 'Low-Scale Gene Studies'. For missense mutations identified from NGS studies, we only considered those predicted 'deleterious' by iFish as supportive evidence for ASD pathogenesis. The criteria and statistics of raw scores for each type of evidence are shown in Table 1. Second, we updated the benchmark data set. Third, the range of weights for each evidence type in AutismKB 2.0 was changed from 1-7 to 1-10, and the number of possible weight combinations increased dramatically from 7^6 to 10^9. Fourth, we re-benchmarked and updated the optimal weight matrix recommended by AutismKB 2.0, by ranking the 75th percentile of the benchmark data set to the highest rank (Supplementary Table 2). The AutismKB 2.0 web server also allows users to choose their own weights freely for each type of experimental evidence, as well as the cutoffs.
Table 1. Raw scoring criteria and number of genes for each type of evidence.
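The size of the weight search space and the benchmarking objective can be made concrete with a short sketch. The first function reproduces the arithmetic in the text (7^6 versus 10^9 combinations). The second is one possible reading of "ranking the 75th percentile of the benchmark data set to the highest rank": for a given weight set, find the rank reached by the benchmark gene at the 75th percentile of benchmark ranks, so that a smaller value indicates a better weight set. This interpretation, the function names and the toy scores are assumptions, not the authors' published procedure.

```python
import math
from typing import Dict, Sequence


def search_space_size(levels: int, categories: int) -> int:
    """Number of weight combinations with `levels` choices per category."""
    return levels ** categories


def benchmark_percentile_rank(
    scores: Dict[str, float], benchmark: Sequence[str], pct: float = 0.75
) -> int:
    """Rank (1 = best) below which `pct` of the benchmark genes fall.

    Ranks use competition ranking: 1 + number of genes with a strictly
    higher score. A smaller return value means the benchmark genes sit
    closer to the top of the list under the current weights.
    """
    all_scores = list(scores.values())
    ranks = sorted(1 + sum(s > scores[g] for s in all_scores) for g in benchmark)
    idx = max(0, math.ceil(pct * len(ranks)) - 1)
    return ranks[idx]


if __name__ == "__main__":
    print(search_space_size(7, 6))    # AutismKB 1.0: six categories, weights 1-7
    print(search_space_size(10, 9))   # AutismKB 2.0: nine categories, weights 1-10
    # Toy scores with placeholder gene labels.
    toy = {"A": 10, "B": 8, "C": 6, "D": 4, "E": 2}
    print(benchmark_percentile_rank(toy, ["A", "B", "D", "E"]))
```

With 10^9 combinations, an exhaustive search would evaluate this objective once per weight set. The text does not say whether the search was exhaustive.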
To help users perform downstream analyses, we integrated an enrichment analysis tool, KOBAS (41), into the new version. After a user uploads a list with gene symbols, AutismKB 2.0 automatically searches the background database. If the target genes are present in the database, the server will automatically convert their symbols to the appropriate Entrez gene indexes. Next, the website automatically submits the list to KOBAS for enrichment analysis. Finally, users can view and download the enriched functional categories from their queried gene lists.
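The conversion and enrichment steps above can be sketched as follows. The mapping table, gene labels and identifiers are placeholders. The statistics that KOBAS applies are not described in this section, so a one-sided hypergeometric test is shown only as a common choice for over-representation analysis, and the function names are hypothetical.

```python
from math import comb
from typing import Dict, Iterable, List, Tuple


def to_entrez(
    symbols: Iterable[str], symbol_to_entrez: Dict[str, int]
) -> Tuple[List[int], List[str]]:
    """Map gene symbols to Entrez IDs; return (found IDs, unmatched symbols)."""
    found: List[int] = []
    missing: List[str] = []
    for symbol in symbols:
        key = symbol.strip().upper()
        if key in symbol_to_entrez:
            found.append(symbol_to_entrez[key])
        else:
            missing.append(symbol)
    return found, missing


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K are in the category."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


if __name__ == "__main__":
    # Placeholder mapping; real IDs would come from the background database.
    mapping = {"GENE_A": 1001, "GENE_B": 1002}
    ids, unmatched = to_entrez(["gene_a", "GENE_B ", "GENE_X"], mapping)
    print(ids, unmatched)
    # 2 of 2 submitted genes fall in a category of 3 genes, background of 10.
    print(hypergeom_pvalue(k=2, n=2, K=3, N=10))
```

Symbols that cannot be matched are returned separately, so they can be reported to the user rather than silently dropped.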

Update plan for AutismKB 2.0
To keep AutismKB 2.0 up to date in the future, we plan to collect ASD-related literature from PubMed every month and classify it into nine categories according to the experimental methods used. We will then extract the phenotype and genotype data from each study. The collected data will be manually curated every 6 months and uploaded to the back-end database through a Perl-based script. We will also recalibrate the scoring system for each evidence type and gene and post an update log on the AutismKB 2.0 website (http://db.cbi.pku.edu.cn/autismkb_v2/new.php).

Database summary
We reviewed the abstracts of 13 749 published studies up to 30 June 2018 and retrieved the full text of 3208 selected studies. If the abstract of the literature provided phenotype and genotype information that fulfilled our requirements, the information was extracted directly from the abstract; otherwise, the genotype and phenotype information was extracted from the main text and/or the supplementary materials. With the rapid increase in the amount of data from NGS and other related studies, we have increased the amount of literature from NGS, and especially de novo mutation studies and mosaic mutation studies. Information from NGS studies was included in the sixth kind of evidence in the database.
We updated the knowledgebase every 6 months as shown in Supplementary Figure 1. Recent studies have shown that postzygotic mosaic mutations are an important, yet underestimated, genetic risk factor for ASD (16-18,42-47). AutismKB 2.0 is the only ASD database to include germline variants, 789 mosaic SNVs and 6 mosaic CNVs, including 583 mosaic variants detected and validated in the whole-exome sequencing data of 5947 families collected by the SSC and the Autism Sequencing Consortium, as well as 247 unvalidated, yet high-confidence, mosaic mutations from the sequencing data of 2264 families.
To conclude, compared with the initial version of AutismKB, the number of articles for GWAS, CNV, linkage, association, expression, NGS and other studies increased by 144, 165, 18, 57, 25 and 74%, respectively, in AutismKB 2.0 (Table 2), and mosaic mutations were included as an independent evidence type in the new version for the first time.

Database interface and access
In the updated version 2.0 of AutismKB, we improved the user interface by adding 'variant', 'View Mosaicism', 'enrichment analysis' and 'batch query' entrances (Figure 1). Among these, the 'variant' entrances included CNV/SVs, SNVs/indels, mosaics and linkage regions previously provided under CNV, linkage, NGS and other categories.
To accelerate the user navigation speed and improve the user experience, we optimized the database by adding tables containing mosaic information, tables containing functional annotations, tables with updated polymorphism information such as dbsnp150 and other information. The dbsnp150 table replaced the out-of-date table snp130. We also changed the table structure of gene score and all variants. The database now includes ∼91 different tables. Tables now include keys such as PubMed id, Entrez id, SNV id, Mosaic id, iFish id, CNV id and linkage id, which serve as the index between all tables.

Update of the gene annotation
We annotated the ASD-related genes with extensive information, including gene name and id, sequence, functional annotation, animal models, expression, regulation, pathways, associated diseases and related drugs. These annotations can help users learn more about these genes. Additionally, we have now added predicted pathogenicity scores and annotations for ASD-related gene variants (Figure 2 and Supplementary

Conclusion and future perspective
ASD is not a Mendelian disease. Rather, it is a complex and highly heterogeneous disease. Thousands of genes have been reported to be associated with ASD (10-13, 48, 49).
To provide a comprehensive and useful knowledgebase, we have updated AutismKB to version 2.0. We used the gene scoring algorithm and the latest benchmark data set to rank the genes collected in the database. In addition to 99 syndromic genes, we selected 1280 non-syndromic genes with a total score greater than four as candidate ASD-associated genes (Supplementary Table 3). Among them, 30 syndromic and 198 non-syndromic genes with a total score greater than 16 were designated as high-confidence ASD-associated genes (Supplementary Table 4). We will continue to maintain and update AutismKB 2.0 in the future, so that it will provide increased utility to the community. We plan to continue to read and integrate the ASD-related literature to collect data for ASD genes. One limitation of the database is that it does not contain detailed phenotypic information related to ASD genes. Therefore, we plan to follow the latest research methods and integrate more helpful annotations for ASD genes, including phenotypic scores for ASD probands. For example, if the literature reports Autism Diagnostic Interview-Revised (ADI-R) and/or Autism Diagnostic Observation Schedule (ADOS) scores, we will collect the detailed scores, which are strongly correlated with the severity of ASD symptoms. Another potential source of phenotypic data is public databases such as the Human Phenotype Ontology (HPO) (50). In the future, we plan to extract ASD-related gene and phenotype information from HPO and integrate it into AutismKB 2.0.
In summary, AutismKB 2.0 integrates multiscale evidence and detailed genetic information for ASD-related genes. We believe that this updated database will greatly facilitate ongoing and future research about ASD.

Supplementary data
Supplementary data are available at Database Online.