Identification of disease-associated loci using machine learning for genotype and network data integration

Abstract Motivation Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci. Results We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals’ ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user’s research needs. Availability and implementation An R package (cnmtf) is available at https://lgl15.github.io/cnmtf_web/index.html. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Polygenic diseases result from the contribution of multiple loci but only those with large effect size are detected by traditional methods for Genome-Wide Association Studies (GWAS) (Arkin et al., 2014;Bush and Moore, 2012). Although regression models (RMs) have been useful for discovering significant single nucleotide variant (SNV) associations, most of these variants explain just a fraction of the phenotypic variability and several loci with a small effect size on the trait are not being captured (Auer and Lettre, 2015;Zuk et al., 2012). It is also evident that genotyping data alone cannot expand our understanding on how SNVs alter protein sequences, mRNA transcript stability, transcription factors and ultimately, the observed phenotypes. Instead, data integration frameworks must be developed to analyze other sources of omics data and for pinpointing variants that associate additively with the trait.
Here, we address the problem that causal loci are underpowered in single-SNV analysis because few subjects carry the same causal variants. Thus, for most of the subjects, different causal variants are observed within the gene region and in genes participating in related cellular functions (Leiserson et al., 2013). This problem is of particular interest in the area of network-assisted analysis of GWAS data (Jia and Zhao, 2014), where the prioritization of candidate genes is enhanced by aggregating GWAS summary statistics onto protein-protein interactions (PPI) or metabolic pathways, followed by searching connected modules within the network (Lee et al., 2011;Leiserson et al., 2013;Liu et al., 2017a, b;Rossin et al., 2011). Despite of their applications, these methods ignore the presence of confounding variables in the input data (e.g. subjects' genetic background), and the information lost by aggregating SNVassociation signals into the gene level.
Taking into account these limitations and the need of integrative frameworks for GWAS and omics data, we developed cNMTF (corrected non-negative matrix tri-factorization). Based on the principle that causal genes participate in similar cellular functions, cNMTF uses the PPI network (PPIN) to share genotyping information among SNVs and identifies novel genes closer to known loci. The method relies on machine learning techniques for clustering, which have been used for studying high-dimensional heterogeneous data in systems biology (Gligorijevi c and Pr zulj, 2015;Gligorijevi c et al., 2014Gligorijevi c et al., , 2016aHwang et al., 2012;Xiao et al., 2018;Zitnik and Zupan, 2015). Our algorithm is itself an extension of nonnegative matrix factorization (NMF) into the context of GWAS, allowing the clustering of subjects with common phenotypes, and simultaneously, the prioritization of data entries (e.g. genes, SNVs) acting jointly on the phenotype.
A first major feature of cNMTF resides in the integration of heterogeneous sources of information: genotypes, phenotypes, subjects' ancestry, the predicted deleteriousness effect of SNVs and PPIs. The integration is performed using raw genotyping data rather than aggregated GWAS statistics, so there is no loss of information when moving into the gene level. A second novelty of cNMTF is its ability for correcting the results against population structures via kernel functions. To our knowledge, cNMTF is the first framework that brings together matrix factorization and kernel methods to solve the problem of confounding factors in heterogeneous data analysis, and adheres to similar improvements for support vector machines (SVMs) (Li et al., 2011).
Our algorithm was tested on three lipid traits: low-density lipoproteins cholesterol (LDL-C), high-density lipoproteins cholesterol (HDL-C) and triglycerides (TG) levels. A number of studies have placed these traits as strong determinants of cardiovascular disease risk; however, the genetic architecture of lipid metabolism is far from understood (Paththinige et al., 2017). This offers an opportunity to identify new variants and loci using cNMTF.
We obtained genotyping data from two consortia to evaluate our method: The Northern Finland Birth Cohort 1966 (NFBC1966), a relatively genetically homogeneous cohort of Finnish individuals with lipid traits associations reported by Sabatti et al. (2009); and a cohort of white American subjects of diverse European ancestry obtained from the Electronic Medical Records and Genomics network (eMERGE) (McCarty et al., 2011).
Across cohorts, cNMTF significantly prioritized a mean of 93 variants and 44 genes per trait (P cNMTF < 0.005), replicating a total of 23 variants from prior studies. We showed that cNMTF complements results of RMs by capturing patterns of association that do not reach statistical significance in the single-SNV methods. In addition, we proved the capabilities of our algorithm for identifying population structures in the data and correcting the results against false positive associations.

Input matrices
cNMTF is a data integration framework that scores SNV-trait associations and finds clustering patterns in the genotypes of subjects. Its search is guided with prior knowledge on the phenotypes, subjects' ancestry and molecular SNV data (Fig. 1).
The algorithm takes a matrix of genotypes, R ðnÂmÞ , that encodes the number of recessive allele copies carried by the m subjects across n SNVs. Considering G i;j , the genotype of patient j in the SNV i with alleles A and B; then, Thus, we set R ðnÂmÞ to contain the genotypes under an additive genetic model (Lewis and Knight, 2012).
We decompose the relationship matrix R in three lowdimensional matrices U ðnÂk1Þ ; V ðmÂk2Þ and S ðk1Âk2Þ : This dimensional reduction is provided by rank parameters k 1 and k 2 ðk 1 ( n; k 2 ( mÞ, which are chosen a priori. Both U ðnÂk1Þ and V ðnÂmÞ can be seen as cluster indicator matrices for SNVs and subjects, respectively, where k 1 and k 2 are the number of clusters. For example, each subject is assigned to a cluster by finding the maximum entries in the rows of V (i.e. the values in each row of V represent the importance or 'learned weights' of a given subject with respect to a subject cluster). On the other hand, S ðk1Âk2Þ is a compressed version of R that describes the importance of a given SNV cluster with respect to a subject cluster (Gligorijevi c and Pr zulj, 2015).
One of the novelties of cNMTF resides in the definition of three sources of information to guide the solution of U, V and S as follows: The solution of V is guided with prior knowledge about subjects' clinical phenotypes. That information is added through a phenotype matrix V oðm Â k2Þ , of m subjects by k 2 phenotype categories. We set k 2 ¼ 2 because we are studying binary phenotypes or cases-control designs. The entries of this matrix are V o½j;1 ¼ 1 if patient j is a control (V o½j;1 ¼ 0 otherwise) and V o½j;2 ¼ 1 if patient j is a case (V o½j;2 ¼ 0 otherwise).

SNV-SNV network (L U )
The solution of U is guided with prior knowledge about the SNV potential to disrupt the function of interacting proteins. Our hypothesis is that SNVs do not act independently but their effects are dependent on other polymorphisms elsewhere in the genome. Therefore, we constructed an SNV-SNV network containing two kinds of nodes: damaging variants and candidate variants ( Supplementary Fig. S1). The set of damaging variants correspond to SNVs with high or moderate impact in the genome according to ENSEMBL (e.g. frameshift variant, stop gained variants). In addition, SNVs annotated as deleterious by Sift or Polyphen (VEP-ENSEMBL query, McLaren et al., 2016) and SNVs known to be associated with the trait in the GWAS catalogue were labelled as damaging variants. All the remaining SNVs were labelled as candidate variants. Then, we connected two SNVs in the network if both SNVs are harboured by the same gene. Damaging SNVs from different genes were also connected because they are more likely to disrupt the PPI (BioGrid multi-validated interaction list). The number of candidate and damaging variants is reported in Table 1.
Afterwards, the network was weighted to give preference to edges between damaging variants. The weighting aims to reduce bias in the node degree when genes harbour thousands of variants, so the edges were divided by N À 1, where N is the number of variants in the gene.

Subjects' ancestry (A)
The solution of V is corrected for the individual's ancestry or population origin. We used the Hilbert-Schmidt independence criterion (HSIC), which has been previously used for correcting SVMs (Li et al., 2011).
The HSIC is a measure of statistical independence between two random variables (X and Y) computed in terms of kernel functions (Li et al., 2011). Let X be a random variable from the domain X, with feature space F and associated kernel k : X Â X ! R. Let Y be a random variable from the domain Y, with feature space G and associated kernel a : Y Â Y ! R. The empirical estimator of HSIC for a finite sample of points x and y is shown to be: where K and A are the kernel matrices on the random variables X and Y, given by: Because small values of HSIC are expected when X and Y are independent, we define the following kernel matrices: • Kernel matrix K i;j : Suppose m subjects with clustering information vectors (v 1 ; . . . ; v m ). The K i;j ¼ kðv i ; v j Þ is a kernel matrix generated by the linear kernel k on the clustering information: K ¼ VV T • Kernel matrix A i;j : Generated by the kernel a on the population structure information. The population origin is frequently unknown, so it must be inferred via principal components analysis using the similarity between subjects in the principal components space.

The objective function
To obtain U, S and V, we solve an optimization problem denoted by the objective function, J cNMTF : where jj Á jj 2 F refers to the Frobenius norm, trðÁÞ is the trace of a matrix and Á T is the transpose of a matrix. The goal in Equation (3) is to minimize the difference between the genotyping data (R) and the Step 1: The algorithm takes four sources of data in the input: R is a relationship matrix for the genotyping data. LU is the Laplacian matrix of an SNV-SNV network. V o is a phenotype matrix. A is a kernel similarity matrix encoding the population origin of the subjects.
Step 2: The R matrix is approximated as the product of low-dimensional matrices U, V and S. Here, we use LU ; Vo and A to penalize the factorization and guide the solutions of U and V.
Step 3: The dimensional reduction provides information for clustering tasks, so U and V are taken as cluster indicator matrices for SNVs and subjects, respectively. Simultaneously, we compute the product of U and S to generate a score matrix X. This matrix summarizes the effect of single SNVs on clusters of subjects with specific phenotypes and can be used to prioritize the SNVs. When SNV scores are compared between clusters we observe the relative importance of each SNV on a trait; therefore, those SNVs with high delta score, DX, can be prioritized for further analysis new matrices using the Frobenius norm, while guiding the solution with the penalization terms: • The phenotype matrix V o maximizes the separation between subjects with different phenotypes in the term jjV À V o jj 2 F , while c 2 weights the influence of the phenotype. • The SNV-SNV network is integrated in the form of a graph Laplacian, L Uðn Â nÞ ¼ D Uðn Â nÞ À W Uðn Â nÞ , where W U is the weighted adjacency matrix and D U is the diagonal degree matrix of W U . Hence, a new term, trðU T L U UÞ, guides the solution of U in the objective function and c 1 weights the influence of the network. We also tested the algorithm when using a normalized graph Laplacian: where I ðnÂnÞ is the identity matrix with ones on the main diagonal. • The HSIC term trðVV T HAHÞ is added to correct for population structures and parameter c 3 weights that regularization.
In Supplementary File S1, Sections S4-S6, the optimization problem is solved using iterative update rules (Lee and Seung, 2000). We show how to achieve unique solutions and robust prioritizations for the different traits by running the algorithm multiple times with different initializations, extracting the clustering solutions from each repetition and combining these results to conform a consensus solution. The computational complexity of this minimization problem is shown to be quadratic in the number of SNVs, Oðn 2 Þ.

Delta scores, DX
The minimization of the objective function (Equation (3)) allows us to find the clusters of subjects from V. Simultaneously, we use U and S to prioritize SNV-trait associations. We summarize the number of copies of recessive alleles and the connectivity patterns of the network into an SNV score, X s;l , which tells how important is the variant s for a given cluster of subjects l (Equation (4)).
After comparing the SNV scores between clusters enriched in controls and cases (Equation (5)), we can prioritize variants DX s ¼ X s;controls À X s;cases The delta score captures the overall effect of the SNV in the cluster of controls. For instance, a positive DX s means a protective effect of the recessive allele B, because the controls are carrying more copies of the recessive allele when compared with the cases. In contrast, negative DX s mean that recessive alleles are detrimental and preferentially observed in cases.

Significance of the Delta scores, DX
We assessed the significance of the observed DX in order to compare objectively our results [i.e. a P-value for the SNV-trait association is provided (P cNMTF )]. For each trait, the null distribution of DX was generated by running cNMTF with 1 000 randomizations of phenotypes. We set a significance level, a ¼ 0:01, to prioritize approximately 100 variants in the tails of the distribution. Thus, cut-off points at a 2 ; 1 À a 2 À Á are obtained from the cumulative distribution function ( Supplementary Fig. S3). These cut-off points are estimated for every genetic dataset, trait and setting of the algorithm.

Parameters selection
There are two sets of parameters to be chosen sequentially. In a first step, we explore the structure of the data to select the number of clusters of SNVs and subjects, k 1 and k 2 , respectively. We selected k 2 ¼ 2 because we follow a case-control study design. On the other hand, the selection of k 1 is based on a grid search while tracking the cluster stability. We used a dispersion coefficient to summarize the LGR4 This section shows the most significant novel gene not reported in GWAS catalogue. It lists the SNVs with the lowest P-value within each gen (P cNMTF ) and their delta score (DX). The complete list of prioritized genes and SNVs is annexed in Supplementary Files S2 and S3. consistency of clustering assignments throughout repetitions of the algorithm (Kim and Park, 2007) (Supplementary Fig. S5.A).
The second step is to explore the contribution of data sources (R, L u , A and V o ), while changing the thresholding parameters. Here, the optimal c 1 maximizes the number of known loci-trait associations in the results, whereas optimal c 2 and c 3 maximize the separation of case-controls. See Supplementary File S1, Sections S8 and S9 and Figure S4 for details on the grid search and how we quantified the transference of information.

Genotyping data and phenotypes
Array genotyping data and clinical variables from 5 402 Finnish and 6 100 white American subjects (NFBC1966 and eMERGE cohorts) were pre-processed as described in Supplementary File S1 and (Section S1). We also simulated genotyping data with embedded population structures.
With regards to the phenotypes, we focussed on three lipid traits: LDL-C, HDL-C and TG. For each individual we obtained the following clinical variables which are potential confounders of the associations: age, alcohol consumption, smoking status, body mass index (BMI), BMI at birth, pregnancy and use of oral contraceptives (Supplementary Table S1).
As our method is aimed at the study of categorical phenotypes, we categorized the trait by using cut-off levels and compared subjects at extreme categories. For LDL-C, we divided our cohort in controls (subjects with LDL-C < 100 mg/dl) and cases (subjects with LDL-C > 160 mg/dl). These cutoffs were chosen following the optimal and high LDL-C levels for cardiovascular risk prevention (Supplementary Table S2, ATP III guidelines). Similarly, for the other traits, subjects were divided in cases and controls using the cutoffs reported in Table 1. Subjects with intermediate lipid levels were excluded from our analysis.
We subtracted the effect of confounders from the phenotype by fitting a multiple RM. The residuals of the model were used as a corrected phenotype that is explained by the genetic component of the subjects rather than by the confounders (Supplementary File S1 and Section S3). Notably, the confounding effect from subjects' ancestry is corrected simultaneously within the cNMTF algorithm, constituting one of the main contributions of this work.

A subset of SNVs in the relationship matrix (R)
We aimed at reducing the universe of variants to a meaningful subset that could explain the phenotype. This pre-selection of SNVs reduces the data noise, increases the performance of the method and alleviates the computational complexity during the matrix operations (Fig. 2).
First, we queried the GWAS catalogue for all known variants associated with the trait (under the genome-wide significance threshold P < 10 À8 ) (MacArthur et al., 2017) (available at: www. ebi.ac.uk/gwas, accessed 1 August 2018, version 1.0.2). These variants can be seen as a seed group that will be expanded to a larger group of variants. To achieve this, we identified the genes harbouring known associations (seed genes) and mapped them onto a PPIN (BioGrid interactions validated by multiple publication sources and experimental systems; available at: https://thebiogrid.org, accessed 1 August 2018, version 3.4.162). Please note that seed genes with no interactions in the PPIN are analyzed as well (i.e. isolated nodes).
Then, variants located in the region of seed genes and variants located in the interacting gene partners (first neighbourhood in the PPIN) were selected to form the subset of SNVs. The gene region is determined by the start and stop genomic coordinates of the gene. Hence, the final genotyping data input, R, contained on average 8 800 SNVs per trait-cohort analyzed.
We further compared the results of the algorithm when clumping SNVs in high linkage disequilibrium (LD) from the input. The SNV clumping is a pre-processing step where a proxy SNV is chosen to represent a gene region in high LD (r 2 > 0:5) whereas non-proxy SNVs are excluded from that gene region. The proxy variant is not selected at random but we select the variant most associated with the trait using Logistic Regression Models (LRM) (Euesden et al., 2018). For purposes of the SNV-SNV network construction, nonproxy SNVs are removed but their functional impact (i.e. damaging variant) is inherited to the proxy SNV in the network, so this information is not lost during the clumping. On average, this pre-processing step reduces the data input to $7 800 SNVs.

Results
The dispersion of delta scores for LDL-C is displayed in Supplementary Figure S7, where the prioritized variants are those above and below the significance cut-off points. The delta score indicates the trend of the association: (þ) protective effect of the recessive allele B towards optimal conditions of LDL-C; (À) We mapped the prioritized variants to genes and ranked the genes using the variant with lowest P cNMTF . A mean of 93 variants and 44 genes was prioritized per cohort and trait (Table 1).

Prioritization of SNVs
We replicated a total of 23 genome-wide significant associations from GWAS catalogue (P < 10 À8 ) using smaller sample sizes than the original studies. For LDL-C, we replicated two out of four significant associations reported by Sabatti et al. (2009) in the Finnish cohort: rs174546, rs1535; and two more associations reported in other studies (rs2228671, rs10490626) (Aulchenko et al., 2009;Teslovich et al., 2010). For HDL-C and TG, the numbers were broadly similar; our method recalled 7 and 12 known associations in both cohorts, respectively (Supplementary Figs S8-S11).
We compiled the information for prioritized SNVs, including their delta score, P cNMTF , minor alleles, genes and consequences on the transcript (Supplementary File S2). It was found that 19 out 23 variants conserve the trend reported in the original GWAS. This result supports the codification of recessive alleles in the input matrix R, and the mathematical formulation of the scores from lowdimensional matrices, because there is an agreement between the sign of the delta score and the sign of beta coefficients from RMs.
In the formulation of the SNV-SNV network, we labelled the variants as damaging or candidates. The set of prioritized variants contains 7% damaging variants, compared to 3% in the input network. Thus, an enrichment of damaging variants is observed in the results (P ¼ 8 Â 10 À8 ), but still a large proportion of prioritized variants have no evidence of impact in the protein (e.g. synonymous variants). In fact, most of the SNVs are located in intronic regions or affect non-coding transcripts with no further consequences in the protein (Supplementary Fig. S12). Among the SNVs with high impact are rs2228671, a stop gain variant in the LDL receptor (LDLR), and 18 missense mutations (Supplementary File S2).
The LD structure among variants was considered in the preprocessing steps and we compared the lists of prioritized genes with/ without SNV clumping. We found that results are highly robust in the trait-cohort analyses, where $90% of prioritized genes were preserved between both settings (Supplementary Fig. S17 and File S6). This can be explained by the clustering of SNVs taking place in matrix U (Equation (1)). Therefore, SNVs in LD are clustered and weighted together because their profiles are analogous in cases/controls and they also reside in the same gene region (they are connected in the SNV-SNV network). For example, rs174556 and rs1535 are in LD (r 2 ¼ 0:92) in gene FADS1. Their prioritization scores (without clumping) are 4.004 and 3.444, respectively, for association with LDL-C in Finnish individuals. If clumping is performed, rs1535 is filtered out from the input. The proxy SNV is rs174556, whose score slightly changes from 4.004 to 4.006 and FADS1 is still prioritized (P cNMTF ¼ 3 Â 10 À4 ). In general, the method achieves similar performance with either a set of high LD SNVs in the input or just proxy SNVs representing the same loci.

Prioritized genes
We consider that cNMTF is a complementary tool to explore hidden patterns of association that usually pass undetected with LRM. To make this a fair comparison between methods, we choose the same level of significance for SNV prioritization (a ¼ 0:01), and mapped the SNVs to genes. Then, we compared the performance in retrieving significant genes from the Global Lipids Genetics Consortium (GLGC, European-ancestry individuals) ( Supplementary Fig. S14) (Willer et al., 2013). For cNMTF, 99 out of 265 prioritized genes are significant in GLGC (Precision, PR ¼ 37%); whereas for LRM, 136 out of 394 genes are significant (PR ¼ 35%).
In a second step, the results were benchmarked against the lipidassociated genes in GWAS catalogue, which includes curated information from GLGC and all published GWAS world-wide ( Supplementary Fig. S15). The PR increased to 49% and 48% for cNMTF and LRM, respectively. Among the true positives, there are 45 genes captured exclusively by cNMTF, 105 genes captured only by LRM and 84 genes in the intersection of both methods (Fig. 3). This supports our conclusion that cNMTF and LRM results are complementary and mutually enhance gene discovery for GWAS.
As seen in the dispersion plots of Supplementary Figure S14, one of the strengths of cNMTF is on leveraging genes with moderate association through data integration. We compared the P-values between both methods and found that 29% of the genes prioritized by cNMTF nearly reach the significance level in LRM (0:01 < P LRM < 0:10). These genes would have been disregarded due to insufficient statistical power in the case-control samples. cNMTF, on its own, makes use of the PPIN to share association signals between variants, which is the strength of this machine learning method over single-SNV analyses.
We searched for additional evidence to support our findings with different benchmarks (Fig. 4): 1. The benckmarking of cNMTF with GWAS catalogue gave us evidence for 129 out of 265 genes (49%). We noted that despite the low proportion of seed genes in the input, a significant enrichment of them is obtained in most of the outputs ( Supplementary Fig. S13). This means that our results are not biased by the subset of variants in the input, and that cNMTF is truly differentiating signals of association from any noise generated by the PPIN. 2. Further exploration of GWAS catalogue was conducted to include closely related traits (e.g. 'hypertriglyceridemia', 'total cholesterol levels'. The complete list of 101 related traits is annexed in Supplementary File S5). This procedure gave us evidence of association for eight more genes (3%). For instance, the gene CARM1 is reported with the combined phenotype 'C-reactive protein levels/LDL-C levels (pleiotropy)' (rs1529711, P < 10 À8 ) (Ligthart et al., 2016). Also, CARM1 association with LDL-C was updated recently (rs2304088, P < 10 À23 , GWAS catalogue, February 2019, retrospective validation).  Figure S15, we present Venn diagrams for specific trait-cohorts 3. In the remaining genes not reported in GWAS catalogue, we searched for functional implications in GO, KEGG, Reactome and OMIM database. We found 45 genes (17%) related to biological processes/pathways in the lipid metabolism (e.g. fatty acid biosynthesis). Some of them were significantly enriched in novel genes (P < 0.05, Supplementary  (Bock et al., 2003;Hebel et al., 2015)], and studies in other organisms [DLGAP1 in mice, NLGN1 in ducks (Chadwick et al., 2010;Zhu et al., 2019)]. Noteworthy, the association of NLGN1 with total cholesterol was reported by this year. A short description of our findings is presented in Supplementary File S3 for each gene. All in all, we collected evidence that supports 226 genes (85%).

Patterns of interaction in the PPIN
The interaction between novel genes and seed genes in the PPIN is presented in Figure 5, Supplementary Figures S9 and S11. We mapped the results onto the PPIN and retrieved only the edges between prioritized genes. The graph for LDL-C shows well-known genes for the lipid metabolism: LDLR, APOB, APOH and FADS1/ FADS2/FADS3. However, most of these genes appear isolated from other prioritized genes, which are a consequence of using only the first neighbours from the PPIN rather than the whole lipid pathways in the input. It is also clear in the network that Finnish and white Americans have just a few prioritized genes in common (12% of the genes intersect). Indeed, some of the interactions between novel loci and key proteins were observed only in specific populations. For example, SMARCA4-CARM1 is an interaction captured only in the Finnish cohort (the tumour-suppressive gene CARM1, interacts with a gene associated with abnormal LDL-C levels and required for tumour cell growth, SMARCA4). Similarly, in the white Americans, we noted the interaction HMGCR-INSIG1 (the regulator of lipogenesis INSIG1 binds an enzyme controlling the production of cholesterol, HMGCR). This specificity of results by population arise from variable allele frequencies and the LD pattern (Deo et al., 2009).

Discussion and conclusions
We developed a data integration framework to address the problem of SNV and loci prioritization. cNMTF extracts relevant patterns of information from genotypes, phenotypes and molecular data via dimensionality reduction, finds clustering patterns and scores the associations with the phenotype. A key feature of cNMTF is performing multiple-SNV analysis by means of the SNV-SNV network. This strategy allows for the sharing of information between variants depending on their network connectivity and the similarity of genotypes across individuals.  We have provided cNMTF with capabilities to capture gene-trait associations that are not significant in the univariate studies due to insufficient statistical power. The method also unveiled well-known genes involved in lipid metabolism that have not been prioritized in the Finnish and white American cohorts, thereby, complementing current findings of LRMs.
Another relevant feature of cNMTF is the correction for detrimental effects of population stratification, which are particularly problematic when using matrix factorization. In Supplementary File S1 (Supplementary Sections S10-S12 and Figs S18-S21) we expand the discussion on how the algorithm generates solutions regardless the subjects' ancestry, while minimizing the rate of spurious findings.
Although our study shows an implementation of cNMTF with SNV-SNV networks based on PPIs, the algorithm is suitable for studying other omic datasets in future works. The input can be modified to allow for the integration of genotypes with metabolic, transcriptomic or proteomic data by means of weighted networks. This would give insights on the genetic heterogeneity resulting from pathways, where current methods treat all the genes as equivalent and do not model their interactions (Leiserson et al., 2013). In addition, cNMTF can be easily adapted to go beyond the case-control design. For example, this can be used for patient stratification and definition of tumour subtypes in cancer research, where a number of clusters could be assessed simultaneously.
With regard to the computational features of the algorithm, our research expands the field of NMTF-derived methods (non-negative matrix tri-factorization). To date, regularizations of NMTF are characterized by the use of graph Laplacian (Shang et al., 2012), constrained clusters (Li, 2010), rules in matrix definition (Ding et al., 2006;Gu and Zhou, 2009) and knowledge transference between input matrices (Gligorijevic et al., 2016a, c). Here, we showed the regularization of NMTF via the combined use of kernels, and stated principles for adequate data weighting and confounder correction. Future work can extent these principles for the study of continuous phenotypes or the formulation of SNV scores from multilayer omic networks.
This work has limitations in the algorithm implementation. First, the method can only be evaluated on a subset of SNVs due to the computational cost of matrix operations at the genome-wide level. We limited the analyses to disease-associated genes reported in GWAS catalogue (seeds) and the first neighbourhood of seed genes in the PPIN. Genes outside these filtering rules are lost in our study cases, so we have potentially lost disease-associated locus not interacting with our seed genes. Nonetheless, the algorithm can be tailored to explore larger sets of seed genes or indirect PPIs, depending on the scope and the computational resources available to that aim.
Second, the inclusion of PPI data could bring bias to the results. Hub genes in the protein interactome have a higher chance to interact with our set of seed genes and they are more likely to be included in the input (e.g. TP53 connects 11% of proteins in the BioGrid network). Similarly, candidate SNVs in hub genes could be connected to more damaging SNVs. When the algorithm is executed, the diffusion of information through the network will favour the proximity between candidate and damaging variants.
To address this limitation, we introduced the normalized graph Laplacian in the objective function and analyzed the node degree of prioritized genes/SNVs in their networks (Supplementary File S1, Section S4.1 and Fig. S17). However, no significant differences in the median node degree of prioritized genes/SNVs were observed between settings of the algorithm (basic Laplacian versus normalized Laplacian, Wilcoxon test, P > 0.24). Please note that we are already performing a normalization in the edges of the SNV-SNV network to remove bias for genes harbouring multiple variants. Therefore, these results indicate that hub genes/SNVs could be prioritized because they may play a relevant role in the lipid processes and should be integrated in the analyses (e.g. APOB is a key hub gene connecting 12 proteins). We conclude that the connectivity of the network itself cannot prioritize variants unless they show moderate association signals in the genotype data.
Another limitation of cNMTF is that some genes associated with the lipid trait (e.g. CELSR2, LPL) were not prioritized. This is due to the clustering nature of the method, the different sources of information contributing to the final results and our still limited knowledge of the human interactome. Particularly, the SNV-SNV network leads to strong association signals boosting the weak variants; however, the opposite can also occur (the strong association is masked) if clustering patterns are not observed in the genotype data, or the network is saturated of very poor associations. Consequently, we see our method as a complementary tool for single-SNV studies because its performance depends on the clusters and the connectivity of weak-strong variants.
In conclusion, we have presented cNMTF as an alternative approach to prioritize variants and genes for follow-up studies. Given the satisfactory results with lipid traits and the flexibility of cNMTF to handle interrelated but disparate sources of data, this study provides valuable guidelines for future integrative approaches in the field.