CADD: predicting the deleteriousness of variants throughout the human genome

Abstract Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertions and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu.


Known variants in CADD
In interacting with our users, we noted confusion about whether and how CADD integrates information about known variants into its predictions. CADD v1.0 to v1.3 never used information about variant presence, observed population frequencies or effect predictions derived from patient phenotypes. We reported variant presence, together with Exome Sequencing Project (ESP) and 1000 Genomes allele frequencies, in our annotation files only as a convenience for evaluating CADD's performance.
Since recent studies have highlighted the value of variant density information (21), we have added features based on gnomAD/BRAVO in CADD version 1.4. One feature is the distance between the nearest single nucleotide variants upstream and downstream of the site (ignoring variants at the site itself). For the second feature set, we count the number of frequent (MAF > 0.05), rare and single-occurrence single nucleotide variants in windows of 100, 1000 and 10000 bases around the position of interest. Because CADD uses linear models, variable positions cannot be inferred from the combination of these accumulated counts and the distance feature. Hence, CADD users can still use variant frequencies as a potentially orthogonal annotation.
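As an illustration of the two density features described above, the following is a minimal sketch and not the CADD implementation. It assumes that variants for one chromosome are given as a position-sorted list of `(position, MAF, allele_count)` tuples, that a "single occurrence" variant is one with an allele count of 1, and that each window is centered on the site and includes it. The function names and the input layout are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

FREQUENT_MAF = 0.05            # threshold for "frequent" variants, as stated in the text
WINDOWS = (100, 1000, 10000)   # window sizes in bases, as stated in the text


def classify(maf, allele_count):
    """Assign a variant to one frequency class (assumption: singleton = allele count of 1)."""
    if allele_count == 1:
        return "singleton"
    return "frequent" if maf > FREQUENT_MAF else "rare"


def flanking_distance(positions, site):
    """Distance between the nearest variants upstream and downstream of `site`.

    `positions` must be sorted. Variants at the site itself are ignored.
    Returns None if there is no variant on one of the two sides.
    """
    i = bisect_left(positions, site)    # first index with position >= site
    j = bisect_right(positions, site)   # first index with position > site
    if i == 0 or j == len(positions):
        return None
    return positions[j] - positions[i - 1]


def window_counts(variants, site, windows=WINDOWS):
    """Count frequent, rare and singleton variants in windows centered on `site`.

    `variants` is a position-sorted list of (position, maf, allele_count) tuples.
    Assumption: a window of w bases spans site - w//2 to site + w//2, inclusive.
    """
    positions = [v[0] for v in variants]
    result = {}
    for w in windows:
        half = w // 2
        lo = bisect_left(positions, site - half)
        hi = bisect_right(positions, site + half)
        counts = Counter(classify(maf, ac) for _, maf, ac in variants[lo:hi])
        result[w] = {k: counts.get(k, 0) for k in ("frequent", "rare", "singleton")}
    return result
```

For example, with variants at positions 90, 100, 120 and 700 and a site at position 100, `flanking_distance` skips the variant at the site and returns the distance between positions 90 and 120. Note that the counts are aggregated per window, so they do not expose the frequency of any individual variant.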
In addition to retrieving CADD scores via our graphical web user interface and offline scoring, it is possible to request SNV scores per variant or for short ranges via an Application Program Interface (API).
All API requests consist of a CADD version and the genome coordinate. The available CADD versions are `v1.0` to `v1.3` and the two v1.4 releases `GRCh37-v1.4` and `GRCh38-v1.4`. If you require annotations, you can add `_inclAnno` to the version string.
Single position access: The request path for SNV access is `https://cadd.gs.washington.edu/api/v1.0/<CADD-version>/<chrom>:<pos>` which returns a JSON list of the three SNVs at that position. Users can request a single SNV with reference and alternate base given via `https://cadd.gs.washington.edu/api/v1.0/<CADD-version>/<chrom>:<pos>_<ref>_<alt>`. This returns just a single SNV object in a list. In cases where ref or alt are not available, an empty list is returned.
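The request paths above can be assembled programmatically. The sketch below is a hedged illustration using only the Python standard library; the helper names `snv_url` and `fetch_json` are hypothetical and not part of any CADD client. It assumes that `_inclAnno` is appended directly to the end of the version string and that `v1.0` to `v1.3` covers `v1.0`, `v1.1`, `v1.2` and `v1.3`.

```python
import json
from urllib.request import urlopen

API_ROOT = "https://cadd.gs.washington.edu/api/v1.0"
# Version strings listed in the text (v1.0 to v1.3 plus the two v1.4 releases).
VERSIONS = {"v1.0", "v1.1", "v1.2", "v1.3", "GRCh37-v1.4", "GRCh38-v1.4"}


def snv_url(version, chrom, pos, ref=None, alt=None, with_annotations=False):
    """Build the request URL for one position, or for one SNV if ref and alt are given."""
    if version not in VERSIONS:
        raise ValueError(f"unknown CADD version: {version!r}")
    if (ref is None) != (alt is None):
        raise ValueError("ref and alt must be given together")
    # Assumption: the annotation suffix follows the version string directly.
    tag = version + ("_inclAnno" if with_annotations else "")
    url = f"{API_ROOT}/{tag}/{chrom}:{pos}"
    if ref is not None:
        url += f"_{ref}_{alt}"
    return url


def fetch_json(url, timeout=30):
    """Send the request and decode the JSON body (performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A call such as `fetch_json(snv_url("GRCh38-v1.4", "5", 2003402))` would then be expected to return the list of three SNVs at that position, and an empty list signals that the requested ref or alt is not available. The coordinate used here is an arbitrary placeholder.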
Range access: Range access mirrors the SNV range lookup on the website and has the same limit of 100 contiguous bases. It can be accessed via `https://cadd.gs.washington.edu/api/v1.0/<CADD-version>/<chrom>:<start>-<end>`. In contrast to single position access, this returns a list of lists in which the first item contains the field names.
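Because a range response is a list of lists headed by the field names, it is often convenient to convert it into one record per SNV. The following sketch shows one way to do this and to validate a range before requesting it. The helper names are hypothetical, the field names in the usage example are placeholders rather than the documented response fields, and the 100-base limit is assumed to apply to the inclusive span from start to end.

```python
MAX_RANGE = 100  # limit on contiguous bases, as stated in the text


def range_coordinate(chrom, start, end):
    """Format a <chrom>:<start>-<end> coordinate after checking the range limit.

    Assumption: the limit counts both end points (end - start + 1 bases).
    """
    if start > end:
        raise ValueError("start must not exceed end")
    if end - start + 1 > MAX_RANGE:
        raise ValueError(f"range is limited to {MAX_RANGE} contiguous bases")
    return f"{chrom}:{start}-{end}"


def rows_to_records(payload):
    """Turn a list of lists (first item = field names) into a list of dicts."""
    if not payload:
        return []
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]
```

For instance, a payload such as `[["Chrom", "Pos", "Ref", "Alt"], ["5", 2003402, "C", "A"]]` (placeholder field names and values) becomes a single record that can be accessed as `record["Pos"]`.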
Currently, there is no support for retrieving CADD InDel scores through the API.
Non-coding variants and more missense scores: AUROC performances of different scores on (A) non-coding variants and (B) the same missense variants from Figure 2B. Scores from VEST and REVEL are not included in the main comparison because they were trained on variants from these two test sets.

Correlation between CADD versions:
The plot shows Pearson correlation coefficients of PHRED-scaled scores between the six most recent CADD models. GRCh38 scores were lifted to GRCh37 for comparability. Since annotations are not perfectly correlated between genome builds, the GRCh38-v1.4 model is the least correlated with the other models. The figure is based on 100,000 randomly selected SNVs throughout the genome (below the diagonal) and only those with PHRED-scaled scores greater than 10 in CADD GRCh37-v1.4 (n=9,923, above the diagonal).

Splice prediction (dbscSNV) v1.4 (18)