Convolutional neural network model to predict causal risk factors that share complex regulatory features

Abstract Major progress in disease genetics has been made through genome-wide association studies (GWASs). One of the key tasks for post-GWAS analyses is to identify causal noncoding variants with regulatory function. Here, on the basis of >2000 functional features, we developed a convolutional neural network framework for combinatorial, nonlinear modeling of complex patterns shared by risk variants scattered among multiple associated loci. When applied for major psychiatric disorders and autoimmune diseases, neural and immune features, respectively, exhibited high explanatory power while reflecting the pathophysiology of the relevant disease. The predicted causal variants were concentrated in active regulatory regions of relevant cell types and tended to be in physical contact with transcription factors while residing in evolutionarily conserved regions and resulting in expression changes of genes related to the given disease. We demonstrate some examples of novel candidate causal variants and associated genes. Our method is expected to contribute to the identification and functional interpretation of potential causal noncoding variants in post-GWAS analyses.


INTRODUCTION
During the last decade, numerous efforts have been made to elucidate the genetic mechanisms underlying complex disorders. Major progress was made through genome-wide association studies (GWASs). However, developing methods to pinpoint the DNA variants that actually increase the risk of the associated disease is a major challenge that GWASs still face (1). GWASs cannot pinpoint causal disease variants but can only report linkage disequilibrium (LD) blocks including many neutral SNPs linked to causal loci. To exacerbate the problem, the majority of disease-associated DNA variations are thought to alter not the gene itself but the regulation of gene expression (2). Our incomplete knowledge of noncoding regions limits the functional interpretation of underlying DNA variants. Fortunately, the wealth of celltype-specific human epigenomes help with the identification of functional noncoding variants (1,(3)(4)(5).
Deep learning is a powerful approach for learning complex patterns (6) and has been applied in genomics especially for deciphering the complexity of noncoding DNA sequences. For example, a deep-learning model that predicts regulatory codes for RNA splicing has been developed (7). As one of the most successful deep learning architectures, convolutional neural networks (CNNs) have been used to systematically learn the sequence motifs or regulatory patterns embedded in genomic regions recognized by DNA-or RNA-binding proteins (8) or in DNase I hypersensitive sites (DHSs) (9).
In this study, we developed a deep learning framework based on CNNs to discover regulatory variants that may play a causative role in increasing the risk of the five major psychiatric disorders and four autoimmune diseases: autism spectrum disorder (ASD), attention deficithyperactivity disorder (ADHD), bipolar disorder (BPD), major depressive disorder (MDD), schizophrenia (SCZ), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), Crohn's disease (CD) and ulcerative colitis (UC). We utilized numerous functional features while combining them nonlinearly to model complex patterns shared by risk variants. We were able to discover novel candidate SNPs that may actually contribute to disease development.

Identification of association blocks
In order to make the input data for the CNN model, we first identified chromosomal association blocks. To this end, we employed a GWAS dataset for five major psychiatric disorders and four autoimmune diseases (Supplementary Table S1) (10,11). The association P values were downloaded from the Psychiatric Genomic Consortium portal (https:// www.med.unc.edu/pgc/results-and-downloads) for psychiatric disorders. As for the autoimmune diseases, the association P values were retrieved from the largest meta-analysis for each disease (12)(13)(14). For imputation, we used the EUR (European ancestry) samples of the 1000 Genomes Project phase 1 release (version 3) as the reference panel (15). Imputation of summary statistics was performed by using ImpG-Summary (16). The SNPs that showed the strongest associations and were at least 1 Mb apart from one another were defined as lead SNPs. To identify the lead SNPs, we sorted all SNPs across the whole genome according to the observed or imputed P values and picked SNPs from the top of the list while maintaining a >1 Mb distance from each of the previously selected ones. We then searched upstream and downstream regions flanking each lead SNP for the 30 most significant SNPs. We then discarded those with P > 5 × 10 −4 . In this manner, we identified association blocks carrying the lead SNP and their neighboring SNPs while constraining the maximum number of neighboring SNPs to 30 ( Figure 1A). The resulting number of association blocks was 340, 391, 474, 405 and 601 in ADHD, ASD, BPD, MDD and SCZ for psychiatric disorders, and 435, 849, 431 and 383 in RA, SLE, CD and UC for autoimmune diseases, respectively (Supplementary Table S2).

Feature set construction
The overall processes for our feature map construction are illustrated in Supplementary Figure S1. We obtained open chromatin profiles in 349 different samples as provided in the form of DHS peaks by the ENCODE Project (17) and the Roadmap Epigenomics Project (18). A total of 606 histone modification profiles from the Roadmap Epigenomics Project (18) were obtained as the narrow peak bed files in 124 different biological conditions. The KEGG database (19) was utilized to incorporate gene functions as features. Each SNP was mapped to their target gene and its KEGG pathway when they resided in the gene body or 500 kb upstream of the TSS. We ran FIMO (http://meme-suite.org/ doc/fimo.html) (20) to search the TRANSFAC (21) and JASPAR (22) databases of position weight matrices for 996 transcription factors (TFs). We used the P value threshold of 10 −4 for feature mapping. As a result, we compiled 2,252 functional features consisting of DHSs, histone modifications, KEGG pathways, and TF binding sites. Because using too many features results in overfitting, we filtered features that were shared by only a small number of SNPs in the association blocks (Supplementary Figure S1). Specifically, we retained the features that were mapped to any SNPs in > 95% of the association blocks in each disease model. The resulting number of features was 714 for ADHD, 711 for ASD, 730 for BPD, 714 for MDD, 739 for SCZ, 791 for RA, 741 for SLE, 845 for CD and 834 for UC.

Model design
The developed CNN model consists of hierarchical pattern detectors that learn the regulatory features commonly present in disease-associated genomic loci. In our model, we used two convolution layers; (i) the first layer acts as a local feature extractor at the individual SNP level and (ii) the second layer combines the detected patterns into more complex biological features.
Convolution layer 1. The convolution layer 1 takes an input matrix (X mn ) with a size of M × Nas described below: M is the number of regulatory features that survived the filtering step described above, and N is the total number of candidate SNPs in an association block. With this input matrix and tunable patterns represented as kernel w k , the first feature map array ( 1 c k n ) was obtained by the convolution process, where k is the index for 50 filters used in our model (1 ≤ k ≤ K,K = 50). Each convolution kernel w k is a weight vector with a length of M. This process is equivalent to performing a one-dimensional convolution (8) with a moving window of step size 1 on the list of consecutive N SNPs each having M channels. As implied in the above formula, K types of pattern detectors were used for each SNP without considering the effect of neighboring SNPs. After convolving the matrix X mn , we added a bias vector, b k , which was expected to make the model more flexible and allow the model to solve given problems more effectively. Subsequently, a rectified linear unit (ReLU) was applied as follows: We then stacked the output vector h k n to compose a matrix H kn with a size of K × N, which corresponds to scores measuring how well the features of each SNP match the patterns of the shared weights.
Convolution layer 2. The convolution layer 2 operates on the output matrix of the previous layer (H kn ). The second feature map vector ( 2 c n ) was obtained as  framework to detect regulatory patterns shared by risk variants residing in multiple association blocks centered on lead SNPs. In this example, block #1 carries n SNPs including the lead SNP. We apply k different kernels that learn particular patterns composed of various regulatory features encompassing DHSs, histone modifications, target gene function, and TF binding sites. At this stage, an autoencoder is used for pre-training. In this manner, the first convolution layer scores n SNPs with k pattern detectors. Afterward, another convolution layer is applied to combine the k scores, thereby enabling nonlinear combinatorial modeling of regulatory patterns. The output of the second layer serves as the prediction score for each SNP. The model is trained to maximize the likelihood derived from the block scores that are assigned by max pooling. In this step, only one tunable weight vector was used to linearly combine the K patterns for high-level feature scoring of each SNP. Then, we added a bias term b and scaled 2 c n + b to the 0-1 range by the sigmoid function: o n can be regarded as the prediction score of the n-th SNP. Prediction scores close to 1 indicate that certain common regulatory patterns are embedded in the features of the given SNPs. Finally, we applied max-pooling by taking the maximum of o n as This corresponds to the per-block score of the SNP whose features best match the common patterns shared by different association blocks.

Model training
The input data for the model was split into training, validation, and testing sets (Supplementary Figure S2 and Supplementary Table S2). The validation and testing sets were used to select the best hyperparameter set and report final performance levels, respectively. We trained the model parameters to minimize negative log likelihood (NLL) defined as follows: where f (θ ) s is an output for the sth block in a mini-batch of size B from the training set (B = 100 was used for all experiments). Each target, Y, serves as the dependent variable and can be either 1 (for true cases or the blocks that carry the pattern of SNPs associated with the disease) or 0 (for false cases or the blocks that are unlikely to carry the pattern of SNPs associated with the disease). For each SNP, false cases were constructed by shuffling the features. In doing so, it is anticipated that the features of false cases do not reflect any disease-related common patterns. The shuffling procedure was done separately for each feature group. We generated false cases that are ten times the true cases.
The NLL was composed of parameters (θ ) including w k m , b k , w k and b , which were updated by the standard backpropagation algorithm with momentum. To prevent overfitting, early stopping and pre-training by an autoencoder were used. More details of the model training processes are documented in Supplementary Information.

Feature importance analysis
We employed random forest to assess the relative importance of each feature (23). The input SNPs for training RF were labeled as positive (prediction score > 0.5) or negative (prediction score < 0.5) according to our CNN results. For each feature, we calculated the Mean Decrease Gini (MDG), which is defined as the average of total decrease in Gini impurities in each tree. A greater MDG indicates higher importance of a feature. The empirical P value of MDG was calculated by 1000 random permutations. The hypergeometric distribution was used to test whether disease-related features stood out in terms of the MDG. Neural and immune features were defined on the basis of open chromatin or histone marks in normal neuronal cells or tissues and in immune-related blood tissues (all lymphoid cells and granulocytes) or cell lines, respecitvely. KEGG pathways including 'nervous system', 'neurodegenerative disease', 'substance dependence', 'inflammatory process', 'immune process', and 'chemokine signaling pathway' were also included. As negative controls, we used irrelevant features consisting of open chromatin peaks, histone modifications, and KEGG pathways related to the digestive and circulatory system in the same way. In our network analysis, the neural features with a significant (P < 0.05) MDG were selected for visualization. RF model building and feature importance analysis were implemented by using R packages randomForest and rfPermute (24). Network visualization of the significant neural features was based on Cytoscape (v3.5.1) (25).

Functional analyses of predicted variants
Details on the TF binding and allelic imbalance analysis, evolutionary conservation analysis and target gene function analysis are provided in Supplementary Information. Additional dataset not used in the training process was employed for the analysis of the autoimmune diseases. A total of 100 histone modification profiles in 16 blood tissues or cell lines were obtained as peak bed files from the BLUEPRINT epigenome project (www.blueprint-epigenome.eu) (26).

Prediction model and performance
Our CNN model was trained on the feature vectors across multiple association blocks ( Figure 1A and Supplementary Figure S1). The dependent variable of the model is 1 for true cases or the blocks that carry the pattern of SNPs associated with the disease and 0 for false cases or the blocks that do not carry the pattern of SNPs associated with the disease. The premise of the model is that there are one or more functional variants in association blocks, and that many of the variants share certain patterns of regulatory features despite being scattered in different blocks. Therefore, the association blocks identified through GWASs served as true cases.
In analogy, a true case (association block) can be compared to a face image, and SNPs can be compared to eyes, nose, or mouth. By observing many face images, a CNN model can learn that a face has eyes, nose, and mouth in common, and decide whether a given picture is a face or not. Likewise, if a CNN model is trained with multiple true cases, it can learn that association blocks carry SNPs with certain patterns and can decide whether a given region is a true case.
The number of association blocks used as true cases for ADHD, ASD, BPD, MDD, SCZ, RA, SLE, CD and UC was 340, 391, 474, 405, 601, 435, 849, 431 and 383, respectively (Supplementary Table S2). These blocks were partitioned into the training set, validation set, and testing set (Supplementary Table S2). Details of the learning processes are summarized in Supplementary Figure S2. Performance evaluation was based on the area under the receiveroperator characteristic curve (AUC) and F1 value ( Figure  1B). We modified our model to learn the features of only the lead SNPs (i.e. the most significant SNPs indexing each association block) or to learn patterns composed of the linear combinations of features (by using only one convolution layer). The lowered performance of the modified models (gray and blue bars of Figure 1B) indicates that common regulatory patterns need to be searched for through all variants in each chromosomal block in a complex, non-linear fashion. This justified the usage of a CNN model for this task.
Overall, the lowest performance was achieved for MDD, probably reflecting that genetic factors play a less significant role relative to the other diseases (27). We defined positive calls as variants that were assigned a prediction score greater than 0.5. The list of these putatively causal variants in each disease is provided in Supplementary Table S3.

Biological validation of prediction results
First, true causal variants are expected to have a certain level of statistical association with the given phenotype. Indeed, our model assigned higher prediction scores to associated variants in the testing set, which is independent of the training processes (Supplementary Figure S3). In >50% of the chromosomal blocks with at least one positive call, the variant with the strongest statistical association (i.e. lead SNP) was positively predicted (Supplementary Table S2). Also, there were many cases in which the greatest prediction score was assigned to the lead SNPs (Supplementary Figure S4). Of importance, our prediction method was able to single out one of statistically indistinguishable variants (compare red diamonds and blue circles in Figure 2).
Second, disease-related features are expected to play an instrumental role when predicting causal variants. For example, psychiatric disorders should be associated with brain-related features while autoimmune diseases with immune-related features. To test this, we employed the random forest classifier to assess the contribution of each feature to the prediction processes. By randomizing each feature, the explanatory power of the given feature in discriminating positive and negative calls could be estimated. We used the MDG score for this measure (see Materials and Methods). With this metric, we observed higher discriminative power for neural features and immune features than irrelevant features in the psychiatric disorders and autoimmune diseases, respectively ( Figure 3A).
In addition, some neural features reflected the pathophysiology of the relevant psychiatric disorder ( Figure 3B). For example, astrocyte (red nodes) and dorsolateral prefrontal cortex (green nodes) are often implicated in ASD (and ADHD) and SCZ (and BPD), respectively. Fetal features (blue nodes) were important when characterizing neurodevelopmental disorders such as ASD and SCZ. Similarly, the BLUEPRINT epigenome data for various immune cell types were used for the analysis of the autoimmune disease results. Positive calls were more enriched in the regulatory regions of lymphocyte lineages rather than granulocytes. This is in good agreement with the pathophysiology of autoimmune diseases ( Figure 3C and Supplementary Figure S5). Moreover, a comprehensive target gene analysis also supported the clinical relevance of our prediction (Supplementary Figure S6).
We assessed the importance of features also by simply examining the relative weight ranking of biological features in each kernel (see Supplementary Information). Neural and immune features in psychiatric disorder and autoimmune disease, respectively, showed significantly higher level of importance compared to other features, in a subset of kernels (Supplementary Figure S7).
Third, causal noncoding variants are likely to lie in the regulatory regions of relevant tissue types. We first tested whether positive calls are enriched in regulatory regions derived from independent data. The BLUEPRINT epigenome data for various immune cell types were useful for this purpose because they were not used for model training. The fractions of positive calls for the autoimmune diseases were significantly higher than negative calls in H3K4me1 and H3K27ac regions from the BLUEPRINT epigenome data ( Figure 4A). Next, to compare disease-relevant tissues with others, we ordered all available tissue types from the Epigenome Roadmap project depending on the degree of positive call enrichment. Expectedly, positive calls were enriched in the disease-related tissues for both the psychiatric disorders and autoimmune diseases ( Figure 4B). Fourth, the mechanisms by which noncoding variants contribute to disease phenotypes should involve TF binding changes. For this test, we utilized TF footprint data generated by base-resolution DHS analyses (28). The positive hits included a significantly higher fraction of nucleotides that are in contact with cognate TFs than the negative calls (Figure 4C). A useful method to test the functionality of regulatory variants is to examine allelic imbalance in chromatin accessibility (29). By examining allelic patterns in footprint reads, we found that different alleles at positive calls are more likely to cause distinct regulatory variation (Supplementary Figure S8A). We also tested whether the predicted putative causal variants tend to affect gene expression levels. When examined using the 1000 Genomes whole-genome and transcriptome data, positive calls showed higher levels of expression association, further supporting the functionality of the predicted variants at the transcription level (Supplementary Figure S8B).
Finally, true causal variants for major psychiatric disorders are likely to reside in regions that are critical for brain development and function. Therefore, one can anticipate higher evolutionary conservation, especially among primates, for regions surrounding the putative causal variants. Indeed, the odds of positive calls in conserved sequences were statistically significant ( Figure 4D). This tendency was more distinct with ASD, ADHD and SCZ, in which aberrations in neural development play a critical role, as compared to MDD and BPD. Also, the average degree of sequence conservation was markedly higher for genomic regions centered on positive calls than negative calls (Supplementary Figure S9A). The degree of conservation was less e146 Nucleic Acids Research, 2019, Vol. 47, No. 22 PAGE 6 OF 11 Figure 2. Comparison of our prediction scores (red diamonds on the left y-axis) and association statistics (blue circles on the right y-axis) for individual SNPs in exemplary association blocks. Our functional prediction enables to discern statistically indistinguishable variants. significant when measured among mammals or vertebrates (Supplementary Figure S9B).
Additionally, we sought to test how other existing tools perform in the above biological validations in comparison to our CNN results. Among the tools that predict the pathogenicity of variants, we used CADD (30), PhastCons (31), and LINSIGHT (32). In contrast to the CNN model, pathogenic variants defined by CADD, PhastCons, and LINSIGHT showed weaker or unclear enrichment patterns in disease-related cell types from the Epigenome Roadmap project for both psychiatric disorders and autoimmune diseases (Supplementary Figure S10, compare with Figure  4B). We also assessed overlap with TF footprints (Supplementary Figure S11, compare with Figure 4C), allelic imbalance in chromatin accessibility (Supplementary Figure  S12, compare with Supplementary Figure S8A), and association with gene expression levels (Supplementary Figure  S13, compare with Supplementary Figure S8B). According to these results, we conclude that the positive calls predicted by the CNN model are biologically more meaningful than the pathogenicity calls by the other tools.

Examples of novel candidate causal variants
As shown in Figure 1B, our method was able to detect common patterns shared by variants other than the lead SNPs of association blocks. In other words, there must be numerous cases in which the tag SNPs or lead SNPs detected by the typical GWAS analysis based on statistical association may not act as causal variants for the disease. For example, a known GWAS variant of RA (33), rs773125, located on chromosome 12, has the strongest association with the RA phenotype ( Figure 5A). Previous GWAS studies assigned this SNP to CDK2 on the basis of physical distance. However, the nearest gene is not always the actual target gene. Only a small fraction of distal enhancers target the nearest transcript (34). Not surprisingly, CDK2 has no clear role in association with RA. Furthermore, rs773125 is not  located in active regulatory regions marked by H3K4me1 or H3K27ac in any types of immune-related tissues (Supplementary Figure S14A). Our prediction was negative on rs773125. Instead, there were two positive SNPs (rs773114 and rs1873914) that showed a lower association with RA. GM12878 capture Hi-C data (35) located these SNPs in an enhancer region of RPS26 ( Figure 5A). This site was an active regulatory region of CD14+ monocyte as well. There were several studies that implicated RPS26 in autoimmune diseases as a possible factor that can evoke autoimmunity (36,37).
We were able to find similar examples in psychiatric disorders. For example, rs150721234 has the strongest association strength with the SCZ phenotype in the LD block located on chromosome 10 ( Figure 5B). This SNP resides in an intron region of C10orf68, which has no clear role in association with SCZ. Moreover, rs150721234 is not located in active regulatory regions marked by H3K4me1 or H3K27ac in any types of brain-related tissues (Supplementary Figure S14B). Our prediction was negative on rs150721234. Instead, there was a positive SNP (rs117885390) that showed a lower association strength  Figure 5. Examples of novel candidate causal SNPs. (A) Example for autoimmune diseases. Known GWAS SNP rs773125, located on chromosome 12, has the strongest association with RA. However, this SNP was a negative call from our prediction and not located in active regulatory regions in any immune-related tissues. The two positive SNPs (rs773114 and rs1873914) were located in an active regulatory region of CD14+ monocyte and connected to RPS26 according to GM12878 capture Hi-C data (35). (B) Example for psychiatric disorders. rs150721234 has the strongest association strength with SCZ in the LD block located on chromosome 10. However, this SNP was a negative call from our prediction and not located in active regulatory regions in any brain-related tissues. Instead, there was a positive SNP (rs117885390) that showed a lower association strength with SCZ. Dorsolateral prefrontal cortex Hi-C data (38) located this SNP in an enhancer region of ITGB1. This site was an active regulatory region of the dorsolateral prefrontal cortex tissue. with SCZ. Dorsolateral prefrontal cortex Hi-C data (38) located this SNP in an enhancer region of ITGB1 ( Figure 5B). This site was an active regulatory region of the dorsolateral prefrontal cortex. There were several studies showing that ITGB1 gene has a possible role during the development of schizophrenia (39,40). These cases show that our method may be able to pinpoint functional variants that could be the actual cause of diseases among all variants associated with tag SNPs. Furthermore, these examples illustrate that once candidate variants other than tag SNPs are identified, one may be able to specify novel target genes that may shed light on the pathophysiology of the relevant phenotypes.

DISCUSSION
Our prediction model is different from the conventional architecture of CNNs intended for image processing. For biological reasons, we perform one-dimensional convolution while using a vector instead of a matrix for kernels. Only one-dimensional convolution is applicable for our purpose because genetic information is encoded in linear DNA strands. This type of convolution has been applied for predicting the sequence motifs of DNA-and RNAbinding proteins (8). While binding motifs are consecutive nucleotides that can be represented as a positional matrix, the order of SNPs along the chromosome does not carry meaningful biological information. This is why only vector kernels were applicable for our purpose. The power of our method stems from incorporating feature data that comes with external annotation. These features were not learned from DNA sequences ab initio but were incorporated on the basis of domain knowledge, which probably contributed to achieving high performance with a relatively small number of convolution layers. In addition, biological annotation helped with the validation and interpretation of prediction results. However, it must be noted that the model is based on the premise that at least one causal variant exists in the locus. Therefore, the presence of false positive GWAS signals may lead to undermine the performance of our approach.
Statistical approaches for fine-mapping are not applicable for rare variants because of limited power. In contrast to statistical fine-scale mapping, our prediction method is applicable to rare variants for which statistical association is difficult to estimate. This is important because the combined effects of rare variants may explain a significant proportion of genetic susceptibility to common diseases or traits (41)(42)(43). Expression quantitative loci with large effects detected in a human family were enriched with rare regulatory variants (44). A burden test for enrichment revealed a significant excess of rare regulatory variants at both extremes of gene expression, implicating their potential role in contributing to disease by driving high or low transcription (45). Our method can contribute to the identification and prioritization of rare variants.

DATA AVAILABILITY
The following repository contains source codes for feature set construction and CNN model training: https: //github.com/kaistomics/cnnGWAS. Resources for model construction are also available at: https://omics.kaist.ac.kr/ resources.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.