Network-guided search for genetic heterogeneity between gene pairs

Abstract Motivation Correlating genetic loci with a disease phenotype is a common approach to improve our understanding of the genetics underlying complex diseases. Standard analyses mostly ignore two aspects, namely genetic heterogeneity and interactions between loci. Genetic heterogeneity, the phenomenon that genetic variants at different loci lead to the same phenotype, promises to increase statistical power by aggregating low-signal variants. Incorporating interactions between loci results in a computational and statistical bottleneck due to the vast amount of candidate interactions. Results We propose a novel method SiNIMin that addresses these two aspects by finding pairs of interacting genes that are, upon combination, associated with a phenotype of interest under a model of genetic heterogeneity. We guide the interaction search using biological prior knowledge in the form of protein–protein interaction networks. Our method controls type I error and outperforms state-of-the-art methods with respect to statistical power. Additionally, we find novel associations for multiple Arabidopsis thaliana phenotypes, and, with an adapted variant of SiNIMin, for a study of rare variants in migraine patients. Availability and implementation Code available at https://github.com/BorgwardtLab/SiNIMin. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Testing associations between individual genetic markers and a disease phenotype has shown large successes for Mendelian diseases, where individual point mutations at one genetic locus cause an individual to develop a disease phenotype (e.g. Kerem et al., 1989;MacDonald et al., 1992). However, for complex diseases that presumably result from a complex interplay between genetic markers, environmental factors and lifestyle (Hunter, 2005), genetic loci detected in univariate association studies often fail to explain the phenotypic variation. This phenomenon is referred to as missing heritability (Lee et al., 2011;Manolio et al., 2009;Visscher et al., 2008Visscher et al., , 2012. Focusing on interactions between genetic markers promises to increase our understanding of disease mechanisms and address the problem of missing heritability (Zuk et al., 2012). One such mode of interaction is genetic heterogeneity, meaning that different genetic variants affect a phenotype in a similar direction (McClellan and King, 2010). While technical advances allow genotyping individuals at ever-increasing resolution, this complicates the search for joint effects between individual genetic variants, as the number of potential interactions explodes with the number of genetic variants. We refer to the genetic variants as features in the following. A common approach is pairwise testing of single nucleotide polymorphisms (SNPs), for example, to test for epistatic effects (Cordell, 2002). This pairwise search scales quadratically with the number of features, posing not only a computational but also a statistical challenge in the form of a multiple hypothesis testing burden. If not accounted for, the multiplicity of tests can lead to high numbers of associations with the phenotype by random chance. Especially if they are to be evaluated in follow-up studies, strictly avoiding large numbers of false positives is indispensable. This is achieved by controlling the familywise error rate (FWER), i.e. the probability to observe false positives. While the classical Bonferroni correction (Bonferroni, 1936) accomplishes this, it is too conservative and avoids false positives at the expense of accepting large numbers of false negatives. Tarone (1990) showed that a less restrictive control of the FWER exists that increases statistical power for discrete tests. This observation has been used extensively in significant pattern mining to simultaneously alleviate the statistical and computational challenge imposed by testing large amounts of hypotheses (Llinares-Ló pez et al., 2015a;Minato et al., 2014;Papaxanthos et al., 2016;Terada et al., 2013). Recently, Tarone's procedure has shown success in various bioinformatics disciplines (Bock et al., 2018;Llinares-Ló pez et al., 2015b. Despite those advances, mining pairwise SNP interactions in datasets with highdimensional feature spaces in a statistically rigorous manner remains an open challenge. A different line of research aims at elucidating genetic interactions by adopting a gene-based view (Kwon et al., 2018;Povysil et al., 2019;Zhao et al., 2016), where genes are represented as variant-sets. While this is certainly promising, those methods might be underpowered if not the interaction between whole genes, but only sub-regions within the genes, are driving the association. At the same time, large numbers of high-quality biological networks describing various relationships between genes (or their products) have become available for different organisms (e.g. Li et al., 2017;Luijk et al., 2018;Kanehisa et al., 2017;Obayashi and Kinoshita, 2011). Including feature relationships encoded in these networks has boosted biological analyses in different areas (e.g. Azencott et al., 2013;Horn et al., 2018;Krogan et al., 2015;Shen et al., 2017;Zhang et al., 2018). Hence, having seen the advantages of biological networks, combining them with concepts from significant pattern mining and Tarone's procedure to guide the search for interactions seems promising. This article proposes Significant Network-Interaction Mining (SiNIMin), a method that combines Tarone's procedure with concepts from significant pattern mining and biological networks to detect pairs of genes that exhibit association with a phenotype. SiNIMin integrates biological networks with SNP data to test whether pairwise interactions between gene segments are associated with a phenotype of interest under a model of genetic heterogeneity. Since testing all such pairwise interactions is computationally and statistically intractable, we leverage networks to restrict the search space to combinations of segments coming from two different genes that share an edge. Combining this interaction-mining approach with techniques from significant pattern mining makes solving a previously intractable problem feasible. Our approach has multiple advantages: (i) The statistical and technical tricks from the pattern mining field alleviate the intrinsic multiple-hypothesis testing problem and provide an efficient criterion to prune the search space of interactions. (ii) The combination of segments within interacting genes generates biologically meaningful hypotheses. (iii) By analyzing the sub-parts of genes, as opposed to whole genes, edges or modules in a network, we get a 'close-up view' of the genetics underlying disease at the highest resolution, namely at the level of individual bases. (iv) With the assumption that various SNPs might influence a phenotype in a similar way, we gain additional power by aggregating them under a model of genetic heterogeneity (Llinares-Ló pez et al., 2015b.

Problem statement
Consider a study of n samples, each coming from one of two phenotypic classes. We store this class assignment in an n-dimensional, binary phenotype vector y 2 f0; 1g n . Furthermore, each of the n samples is represented by a d-dimensional binary genotype that could correspond to genetic variants, such as SNPs or rare variants in a dominant encoding. The genotypes are stored in a data matrix Network. Nodes correspond to genes, and each edge represents an interaction between two genes. (C) Mapping of genetic variants to genes in the network. All variants that overlap with a gene's introns and exons (dark orange) are mapped to the gene. Optionally, variants that lie in a fixed-size window around the gene (light orange) are mapped to the gene as well. (D) Analysis of an interaction. Both adjacent genes are represented by their overlapping genetic markers, highlighted in blue and green in the data matrix. All possible segments are considered in both genes (illustrated by horizontal colored bars, where segments of length 1, i.e. single SNPs, are omitted). Each segment from the first gene is tested against each segment from the second gene. (E) Heterogeneity encoding (allelic heterogeneity) of two segments in the blue and green genes, highlighted with black boxes in A. In white, heterogeneity encoding (locus heterogeneity) w of their interaction. The max function is the elementwise maximum over the two vectors D 2 f0; 1g nÂd . Optionally, there may be a discrete covariate vector c 2 f1; . . . ; kg n that assigns each sample to one of k discrete covariate classes (see Fig. 1A). In addition, we have information about interactions between genetic regions in the form of a network. We represent this network formally as G, consisting of a node set V and an edge set E, that is G ¼ ðV; EÞ. An example are protein-protein interaction (PPI) networks, where nodes correspond to gene products, and edges correspond to interactions between them (see Fig. 1B). Each gene can be represented by the set of variants that overlap with its exons and introns. Optionally, variants that fall into a window up-and downstream of the gene can be included in the representation (see Fig. 1C). We define a genetic segment as a range of subsequent variants along the sequenced genome and denote it with the tuple (s, l), where s corresponds to the starting position of the segment, that is its first variant, and l is its length. We restrict ourselves to segments within genes, and call them gene segments in the following.
Our goal is to find pairs of gene segments that (i) show a statistically significant association with the binary phenotype y under a model of genetic heterogeneity and that (ii) can be mapped to an edge in the network, that is, each segment comes from one of two genes, with genes connected by an edge in G. This poses three challenges: (i) enumeration of all possible segments of genetic variants within any two genes that share an edge, (ii) testing of all possible combinations of segments from both genes (see Fig. 1D) and (iii) correcting for the dependency structure between gene segment interactions, arising due to the extensive overlap between the segments. These challenges can be addressed using concepts from the field of significant pattern mining.

Patterns as interactions between gene segments
A pattern is generally defined as a discrete subset of the d features in the data matrix D, ranging from single features to large combinations. Here, we restrict ourselves to interactions between gene segments, and we will use the terms segment interaction and pattern interchangeably. Importantly, patterns exhibit sub-and superstructure relationships. For example, the interaction between segments (s 1 , l 1 ) and (s 2 , l 2 ) is a sub-pattern of the interaction between ðs 1 ; l 1 þ 1Þ and (s 2 , l 2 ), as it is fully contained in the latter. Inversely, the interaction between segments ðs 1 ; l 1 þ 1Þ and (s 2 , l 2 ) is a superpattern of the interaction between segments (s 1 , l 1 ) and (s 2 , l 2 ). Similarly, the interaction between segments (s 1 , l 1 ) and (s 2 , l 2 ) can be expanded in any other direction, giving rise to a multitude of superpatterns.
To test the combination of two gene segments (s 1 , l 1 ) and (s 2 , l 2 ) from interacting genes g 1 and g 2 , respectively, we first represent each segment as a sub-matrix of the data matrix of dimension n Â jl i j; i 2 f1; 2g (see black boxes in Fig. 1A). We reduce each segment to an n-dimensional vector m ðs;lÞ 2 f0; 1g n where m ðs;lÞ j ¼ maxðD j;s:sþl Þ and j ¼ 1; . . . ; n indexes the samples. If for sample j any variant in the segment has value 1, its representation will also be 1, and we call this heterogeneity encoding (see also Li and Leal, 2008;Morris and Zeggini, 2010). The combination represents allelic heterogeneity within the gene since multiple variants in a segment lead to the same effect on the final representation (Fig. 1E). To combine two segments ðs 1 ; l 1 Þ and (s 2 , l 2 ), we merge their representations m ðs1;l1Þ and m ðs2;l2Þ , again using the above-mentioned heterogeneity encoding. This combination gives rise to an n-dimensional vector w ðs1;l1Þ;ðs2;l2Þ 2 f0; 1g n , representing locus heterogeneity. For simplicity of notation, we drop the indices because they will be clear from the context, and denote the encoding as w throughout this section. While other popular combinations of variants, such as burden tests, exist, the heterogeneity encoding introduced here is a biologically interesting concept (McClellan and King, 2010), that also constitutes a key-component of our algorithmic contribution: given the binary representation of a segment interaction and the phenotype, contingency-based tests, such as Fisher's exact test, Pearson's v 2 test (Pearson, 1900) or a Cochran-Mantel-Haenszel test (CMH, Mantel and Haenszel, 1959) can be used to measure association. Importantly, to reduce the redundancy between patterns, we enumerate segments in a closed form: if a segment has the same heterogeneity encoding as any of its sub-patterns, it will not be enumerated.

Statistical testing and control of the FWER
Up to now, we have established how to define, represent and combine gene segments. The binary reformulation of the problem by means of the heterogeneity encoding allows us to exploit several concepts from the field of significant pattern mining (Minato et al., 2014;Papaxanthos et al., 2016;Terada et al., 2013). Those methods rely on the concept of the minimum p-value.

The minimum p-value
Given a pattern's binary encoding w and the phenotype vector y, we can generate a contingency table such as in Table 1 as the basis for a discrete test. For fixed table marginals x, n and n 1 , the table has exactly one degree of freedom (w.l.o.g., we use the table count a).
Since every value of a leads to a contingency table with a specific pvalue, there exists a value of a that leads to the minimum p-value of the table, and this minimum p-value can be computed analytically as a function of the table margin x (Llinares-Ló pez et al., 2015a; Papaxanthos et al., 2016). Here, we focus on the CMH test, which can account for categorical covariates by creating one separate contingency table for every covariate class k [see Papaxanthos et al. (2016) for details]. However, all approaches can easily be extended to Fisher's exact test and the v 2 test.

Tarone's procedure to control the FWER
Tarone's procedure (Tarone, 1990) is a method to control the FWER, that is to restrict the probability of observing one or more false-positive associations when large numbers of tests are conducted simultaneously. The FWER is controlled at a significance level a, i.e. one finds a per-hypothesis threshold d, such that FWERðdÞ a. To maximize statistical power, one determines an optimal d Ã ¼ maxfd j FWERðdÞ ag. A common approach for FWER-control is the Bonferroni correction (Bonferroni, 1936), where the target significance level a is divided by the number of simultaneous tests, resulting in a per-test significance threshold of d bon ¼ a=m, where m is the number of tests. For a large number of tests, we have d bon ( d Ã ; the Bonferroni correction is therefore highly restrictive. The false positives are controlled at the expense of a loss in statistical power and hence the ability to discover true associations. Tarone showed that, for discrete tests, a correction factor c m exists that guarantees control of the FWER using the concept of the minimum p-value. A hypothesis with minimum p-value above a significance threshold d does not contribute to the FWER, as it cannot become significant; it is therefore untestable. Conversely, patterns with minimum p-value below d are testable at that threshold. The correction factor c in Tarone's procedure corresponds to the number of testable patterns and depends on the threshold, i.e. d tar ¼ a=c, where c is a function of the current threshold d. In practice, Tarone's threshold can be computed efficiently using an iterative search strategy (Minato et al., 2014) that subsequently lowers the threshold d while enumerating patterns. We describe how this can be realized to mine segment interactions in Section 2.4.
Tarone's procedure has been applied extensively in the field of computational biology for different problems and with different discrete tests (Bock et al., 2018;Llinares-Ló pez et al., 2015bMinato et al., 2014;Terada et al., 2013). The extension to the Table 1. A 2 Â 2 contingency table to test the binary encoding w of a segment interaction ðs 1 ; l 1 Þ; ðs 2 ; l 2 Þ for its association with the binary phenotype y Class label w ¼ 1 w ¼ 0 Row totals y ¼ 1 a n 1 À a n 1 y ¼ 0 xa n À n 1 À x þ a n À n 1 Col. totals x nx n CMH test (Llinares-Ló pez et al., 2017;Papaxanthos et al., 2016) led to a broader field of application, especially in the realm of genome-wide association studies where confounding factors such as population structure-if not properly addressed or corrected-can result in a large number of false positives.

Westfall-Young permutations for FWER control
One disadvantage of Tarone's procedure is its inability to correct for dependencies between tests. Interactions between gene segments are highly affected by the aforementioned sub-and super-pattern relationships (Section 2.2). Llinares-Ló pez et al. (2015a) developed a scalable version of the computationally demanding Westfall-Young permutations (Westfall et al., 1993), called Westfall-Young light.
The principle underlying permutation testing is the empirical estimation of the test statistic's null distribution by permuting the class labels. This destroys all true association signals such that any significant association found is a false positive. Given a sufficient number of permutations n p , the FWER at a significance threshold d can be estimated as where 1ðÁÞ corresponds to the indicator function, and p i min is the smallest p-value across all hypotheses for permutation i. The optimal threshold d WY that controls the FWER at level a can be chosen as the lower a-quantile of P ¼ fp i min g np i¼1 . Westfall-Young light significantly improves the computational cost associated to permutation testing by incorporating Tarone's testability criterion: since label permutations do not affect the minimum p-value, untestable patterns can be pruned from the search space, and their permutations can safely be skipped.

Our contributions: SiNIMin and SiNIMin-WY
In this section, we introduce our method SiNIMin and an extension based on Westfall-Young permutations (SiNIMin-WY). We provide a Cþþ implementation of both methods under https://github.com/ BorgwardtLab/SiNIMin, including examples. Both methods use the Algorithm 1 SiNIMin approach Input: n Â d-dimensional binary dataset D, phenotypes y, covariates c, edge-list E, target FWER a Output: set of significant interactions S 1: function mainðD; c; y; E; aÞ 2: Initialize globald tar 1, globalâ 1 and T 1, 3: P init min pvalues 4: init SiNIMinWY " only required for SiNIMin-WY 5: for ðg 0 ; g 1 Þ 2 E do 6: process edgeðD; c; ðg 0 ; g 1 ÞÞ 7: end for 8: d Ã a=jT j " final significance threshold 9: return filter significantðd Ã ; yÞ 10: end function 11: 12: function process edgeðD; c; ðg 0 ; g 1 ÞÞ 13: I ðg0;g1Þ enumerate segment interactionsðg 0 ; g 1 Þ 14: for return interactions in T whose computed p-value d Ã 37: end function concepts introduced in the previous sections to enable networkbased mining of gene segment interactions and are based on the discrete CMH-test. Since they only differ in how they achieve control of the FWER, we start by explaining SiNIMin, and afterwards point out the modifications necessary for the SiNIMin-WY approach.

The SiNIMin method
SiNIMin enumerates all segment interactions in a depth-first strategy, by successively increasing the segment's lengths. For every interaction, three crucial steps are performed: (i) assessment of testability by computing the minimum p-value, (ii) adjustment of the significance thresholdĉ d tar to guarantee control of the FWER and (iii) potential pruning of super-patterns.
A high-level description of the method can be found in Algorithm 1. We refer the interested reader to the Supplementary Material, where a more detailed description of the method can be found. The method requires the binary data matrix D, a list of edges in form of gene-tuples E, the target FWER a (commonly a ¼ 0:05), the phenotype vector y and the covariate vector c as input. The output is a list of significant segment interactions S. The FWER estimateâ is set to 1 and the set of testable hypotheses T is initialized to the empty set, since no patterns have been explored yet. The significance thresholdd tar is initialized to 1, and it will be successively lowered to account for an increasing number of testable patterns found during the enumeration to guarantee control of the FWER. It will take on values from a list of decreasing, precomputed thresholds obtained by calling init_min_pvalues (Line 3). Lines 5-7 contain the heart of the algorithm: the call to process_edge, that tests all segments between the two genes and updatesd tar . When invoking process_SiNIMin on edge (g 0 , g 1 ), all segment interactions are enumerated and stored in I ðg0;g1Þ . For every interaction in I ðg0;g1Þ , we (i) compute its heterogeneity encoding w I and minimum p-value p min I and (ii) assess the testability (Line 17). In case a pattern is testable, it is stored in T and the function process_SiNIMin is invoked to ensure the FWER is controlled. An empirical FWER estimate can be obtained asd tar Á jT j (Line 28). Whenever this value exceeds the given target a, the thresholdd tar is lowered until the FWER criterion is fulfilled (Lines 29-32). By lowering the threshold, previously testable patterns might become untestable and have to be removed from T . Notably, sinced tar decreases monotonically, untestable patterns with minimum p-value aboved tar can never become testable, thus guaranteeing control of the FWER. Lines 21-23 check whether there exist testable super-patterns of I . This can be inferred efficiently (Papaxanthos et al., 2016), and constitutes a major increase in speed. Again, due to the monotonicity ofd tar , we guarantee that prunable patterns remain prunable. Ultimately, after all edges have been processed, the final significance threshold d Ã is computed (Line 8) as a=jT j. Only now the phenotype y is used to infer p-values of all testable patterns in T , and significant patterns are returned.

The SiNIMin-WY method
To incorporate Westfall-Young permutations, the algorithm has to be altered in three different places. First, in case of SiNIMin-WY, a method-specific initialization takes place (Algorithm 1, Line 4), during which the label permutations are generated, and an n p -dimensional vector of permutation p-values is initialized to 1 (Algorithm 2, Lines 1-6). It stores the smallest p-values across all n p permutations. Second, the FWER is estimated from the label permutations: To do so, we call process_SiNIMinWY instead of process_SiNIMin in Line 19. It uses the permuted labels to compute n p permutation p-values for the current interaction I (Algorithm 2, Lines 9-12). Those p-values falling below previously seen permutation p-values are stored, as they potentially correspond to newly occurring false positives (Algorithm 2, Line 11). After processing all permutations, the empirical FWER is estimated (see Eq. 1), and in case it exceeds the target value a, the significance thresholdd tar is lowered until the estimated FWER complies with a. The third change is the computation of the final significance threshold (Algorithm 1, Line 8). For SiNIMin-WY, we call threshold_SiNIMinWY instead. It computes the per-hypothesis significance threshold as the lower a-quantile of the vector p min . Since Westfall-Young permutations are computationally demanding, we parallelized the permutation testing using OpenMP.

Simulation study
We assessed the performance of our methods SiNIMin and SiNIMin-WY to find statistically significant gene segment interactions under a model of genetic heterogeneity on simulated data. We compared our methods to several comparison partners and quantified performance in the form of runtime, statistical power and type-I error. Further analyses showing the usefulness of testing segment interactions under a model of genetic heterogeneity, as well as the robustness of our methods to network modifications can be found Note: Columns indicate the features/feature-sets that were tested, rows the different method-classes that either fall into the FastLMM framework (Lippert et al., 2011(Lippert et al., , 2014, the SKAT-O framework (Lee et al., 2012), the Tarone framework (Llinares-Ló pez et al., 2017) or the PLINK framework (Chang et al., 2015) (see Supplementary Section S2.2 for details).
in Supplementary Section S2. The comparison partners are listed in Table 2, and can be categorized into six different groups (see columns) with respect to the types of feature-sets they test for association. A more detailed description can be found in the Supplementary Material.
Runtime analysis We compared runtimes of the direct competitors of our proposed methods, i.e. all methods that test interactions between gene segments along edges in a network. We simulated artificial data (see Supplementary Material) and varied the two parameters that dominate the runtime: the number of variants per gene and the number of edges in the network. All methods were run on a high-performance computing cluster. We observed that SiNIMin and SiNIMin-WY clearly outperformed FastLMM-interact and SKATO-interact for all network and gene sizes (see Supplementary Fig. S4). Especially SKATOinteract is only applicable to small networks. See Supplementary Material for a more in-depth analysis of the runtimes.
Power and FWER analysis To evaluate the performance of SiNIMin and SiNIMin-WY, we generated artificial data with known ground truth. We created small networks containing 75 genes and 100 edges to allow an evaluation of all comparison partners (including SKATO-interact). Each dataset contained one truly significant gene segment interaction. We varied its association strength p s to the phenotype, with 1 indicating a very strong association (see Supplementary Material for details). To evaluate type II error, we compared our methods against the baselines with respect to power. Figure 2A and B shows power as a function of the association strength p s for interaction-based (either on the level of gene segments or individual genetic variants) and non-interaction-based methods, respectively. In general, we observed that interactionbased methods were better suited to detect the true association, except for the PLINK methods that are unable to handle the excessive number of tests (see Supplementary Fig. S2). Although SKATOinteract performed similar to SiNIMin and SiNIMin-WY, its application is limited to small datasets due to computational restrictions (see Supplementary Fig. S4). While SKATO-edge showed a performance comparable to SiNIMin and SiNIMin-WY, our proposed methods have the advantage of analyzing the data at a higher resolution: we can pinpoint the variants that drive the association, as opposed to the edge only. Next, we showed that our methods controlled the type I error by evaluating the empirical FWER (Fig. 2C). As for SiNIMin and SiNIMin-WY, we found that Westfall-Young permutations resulted in a less stringent control of the FWER. In summary, we showed that SiNIMin and SiNIMin-WY were able to detect the truly associated pattern, and outperformed the competitors in at least one of the following characteristics: (i) statistical power, (ii) runtime or (iii) resolution of association signal.

Analysis of Arabidopsis thaliana phenotypes
We applied our novel methods to a widely used A.thaliana GWAS dataset from the study by Atwell et al. (2010), downloaded from easyGWAS (Grimm et al., 2017) and AraPheno (Seren et al., 2017;Togninalli et al., 2019).
Experimental setup We used genotype data for 20 dichotomous phenotypes (sample sizes ranging from 84 to 177, see Supplementary Table S2), and created categorical covariates with k-means, clustering the three leading eigenvectors of the empirical kinship matrix [following Llinares-Ló pez et al. (2017)]. The number of clusters k equaled the number covariate classes and was chosen such that the genomic inflation was as close to 1 as possible (see Supplementary Fig. S6). To represent interactions, we used the Interactome network (Consortium et al., 2011), consisting of 11373 interactions between 4866 A.thaliana genes. We downloaded gene annotations from AraPort (Krishnakumar et al., 2015), and represented each gene with SNPs that overlapped with its introns and exons. Out of the 4866 genes in the interactome network, we were able to represent 4431 genes with SNPs from the datasets (see Supplementary Material for details).
Results We applied our methods SiNIMin and SiNIMin-WY to the 20 A.thaliana phenotypes. We observed that including covariates into the analysis resulted in the desirable reduction of genomic inflation across 18 out of 20 phenotypes (Supplementary Fig. S6). Estimating the FWER using Tarone's procedure increases power compared to the standard Bonferroni correction, as can be seen by the substantially lower number of testable patterns (see Supplementary Fig. S7 and Table S5). With SiNIMin-WY, we found significant gene segment interactions in 36 gene interactions, spanning 10 different phenotypes (Supplementary Files and Tables S3  and S6). We compared our approach against a variety of comparison partners and considered any gene interaction found by SiNIMin and SiNIMin-WY as novel, if none of the comparison partners detected one or both of the interacting genes. In total, SiNIMin-WY found nine novel gene interactions across seven different phenotypes (see Table 3). When restricting the segment length to 1 in SiNIMin-WY (yielding edgeEpi-WY), we detected two novel gene interactions spanning two additional phenotypes. Although those SNP interactions were also tested with SiNIMin-WY, they did not reach significance due to the larger number of tests, and the more stringent significance threshold. When comparing SiNIMin-WY to FastLMM-interact we found that SiNIMin-WY had a low falsenegative rate, since only five interactions were missed (out of 25, Supplementary Table S4). We investigated the genes in Table 3 using the TAIR resource (Berardini et al., 2015). All nine gene interactions contained genes that were either involved in similar biological processes, located in the same cellular components, shared molecular functions or were expressed in similar plant structures and expressed during similar developmental stages. The interaction AT1G15750-AT1G17380 was significant for the phenotypes avrB and avrRpm1 (p avrB ¼ 6.76e-08, p avrRpm1 ¼1.15e-07), both are phenotypes of bacterial disease resistance. The two genes are involved in the jasmonic acid-mediated signaling pathway. Jasmonic acid mediates stress responses in plants (Delker et al., 2006), and the gene AT1G17380 is known to be involved in the defense response and the response to Power (1-type II error) and type I error analysis of the simulation study. We compared SiNIMin and SiNIMin-WY to (A) approaches that consider interactions between sets of genetic variants and (B) approaches that test sets of genetic variants (no interactions) (see Table 2 for comparison partners). Both figures show the power to detect the truly significant interaction for varying association strengths p s . (C) Empirical FWERs for varying association strengths p s . The black horizontal line indicates the target FWER (a ¼ 0:05) wounding. Another interesting interaction was the one between AT5G25150 and AT5G45600 for the Leafroll16 phenotype ( p ¼ 3.89e-09). Both genes code for subunits of the General Transcription Factor IID (TFIID) in A.thaliana (Lawit et al., 2007). The interaction between AT3G18490 and AT5G42980 for the Leafroll22 phenotype (p ¼ 5.24e-09) linked two genes that show similar biological processes: AT3G18490 is involved in the response to water deprivation, and AT5G42980 is involved in the plant's heat response. Other interactions contained related genes, for example the interaction AT4G16845-AT5G57380, where both genes are involved in the vernalization response in A.thaliana. However, the interaction's connection to the phenotype Hiks1, a protist disease-resistance phenotype, remains to be uncovered.

Study on low-frequency variants in migraine
As a second application, we aimed to improve our understanding of the genetics underlying two migraine subtypes, namely migraine with aura and migraine without aura. Patients that suffer from the first subtype experience visual disturbances before onset of other migraine symptoms. In this study, we restricted the dataset to lowfrequency variants, that is SNPs with a minor allele frequency below 5%. Note that, combinations of gene segments could still achieve frequencies above 5%.
Experimental setup We pooled data from five different migraine cohorts: a Dutch cohort with aura (DMA), a Dutch cohort without aura (DMO), a German cohort with aura (GMA), a German cohort without aura (GMO) and a Finnish cohort with aura (FMA). Cases from the five cohorts were combined, and a binary phenotype was created such that migraine patients with aura obtained label 1. This constituted the MaMo cohort, comprising 5013 samples. We repeated this procedure, and this time combined migraine patients with the same nationality, which resulted in one dataset of 1849 Dutch patients (dMaMo) and one of 2231 German patients (gMaMo). Covariates were inferred from the genetic relationship matrix (see Supplementary Material). We used the InBio network (Li et al., 2017) to derive interactions, and mapped SNPs to genes that fell within a 50 kb window up-and downstream of the genes (see Supplementary Table S7 and Fig. S8).
Results All three cohorts exhibited inflation (k MaMo  Table  S8). Out of those, one interaction in the dMaMo cohort was novel, between the EPHA6 and TIAM1 (p ¼ 2.64e-08) genes, two genes that are involved in ephrin signaling. The ephrin-B signaling pathway has been previously associated with migraine (Guyuron et al., 2014). Notably, this is an interaction between length-1 segments, and hence could also be detected with edgeEpi-WY. Restricting the segment length to 1 (edgeEpi-WY), two more novel interactions were detected (see Table 4), one of them between the BMP4 and BMPR1B genes (p ¼ 1.50e-07). They interact within the BMP signaling pathway, which in turn has been linked to neural development (Bond et al., 2012). The third novel gene interaction (HAO1-VDAC3, p ¼ 1.33e-07), was between two genes whose edge had a low-confidence (score 0.247) in the InBio network. It was not part of any of the underlying pathway or interaction data bases the InBio network was built on (e.g. because it might have been inferred by orthology), and hence, we could not deduct any biological interpretation. However, reporting edges that showed no strong evidence indicated the potential of our methods to provide further support for low-confidence interactions in networks, and might mark the starting point for additional experiments.

Discussion and conclusions
We introduce two novel methods, SiNIMin and its Westfall-Young permutation-based counterpart SiNIMin-WY (Westfall et al., 1993), to detect pairwise interactions between gene segments that are associated with a binary phenotype under a model of genetic heterogeneity. We guide our search for interactions using biological prior knowledge in the form of PPI networks and restrict the search space to interactions between gene segments that fall into genes connected by an edge. To enable this computationally and statistically challenging search for interactions, our methods are based on established methods in the field of significant pattern mining. This AT2G04630-AT3G56270 3.59e-07 0 0 1 0 Note: The last four columns contain the number of significant gene-segment interactions (SiNIMin, SiNIMin-WY) and SNP interactions (edgeEpi, edgeEpi-WY) in the novel hit. We report the lowest p-value for any pair of segments within the novel hit, and highlight the method in bold, for which this p-value is obtained. For more details on the significant segments, see Supplementary Table S6. Note: The last four columns contain the number of significant gene segment interactions (SiNIMin, SiNIMin-WY) and SNP interactions (edgeEpi, edgeEpi-WY) in the novel hit. We report the lowest p-value for any pair of segments within the novel hit, and highlight the method in bold, for which this p-value is obtained. For more details on the significant segments, see Supplementary Table S10. reformulation requires categorical data, and non-categorical data have to be preprocessed prior to the application of SiNIMin. While this might seem as a limitation, it permits exploring the enormous search space associated with mining gene segment interactions, a task that cannot be achieved with other methods. Using other statistical tests, for example likelihood ratio tests, requires the exploration of all possible segment interactions between adjacent genes, which is prohibitive from a computational and a statistical point of view, due to the intrinsic multiple hypothesis problem. We evaluated the performance of our method on simulated datasets and showed that it was well-powered to detect true associations while controlling the type I error. Application of our methods to A.thaliana and migraine data showed their capability to find novel interactions between genes that discriminate cases and controls.
In contrast to methods that rely on estimating joint effects of markers based on the marker's marginal statistics (e.g. Azencott et al., 2013;Reyna et al., 2018), our approach is not restricted to detecting significant interactions in which at least one of the interaction partners shows marginal associations. By combining the gene segments according to a genetic heterogeneity model prior to the statistical test, our approach is capable of detecting interactions that do not exhibit a marginal signal in any of the interaction partners. However, for increasing dataset sizes, our method scales with the number of genetic variants that are mapped to genes, i.e. the computational and statistical burden increases, which affects the applied set-based tests (Lee et al., 2012;Lippert et al., 2014) to a lesser extent, as they scale with the number of genes or edges in a network.
An interesting direction of future research is to deviate from the strict mining of interactions between pairs of genes, which could be achieved by taking larger groups of interacting genes into account and focusing on network modules (see e.g. Mezlini and Goldenberg, 2017;Reyna et al., 2018). Moreover, recent advances in the field of frequent itemset mining (e.g. Fowkes and Sutton, 2016) could be integrated into our approach to find representative patterns of the phenotypic classes and test those, instead of exploring all segment interactions.