## Abstract

**Motivation:** In both genome-wide association studies (GWAS) and pathway analysis, the modest sample size relative to the number of genetic markers presents formidable computational, statistical and methodological challenges for accurately identifying markers/interactions and for building phenotype-predictive models.

**Results:** We address these objectives via *maximum entropy conditional probability modeling* (MECPM), coupled with a novel model structure search. Unlike neural networks and support vector machines (SVMs), MECPM makes explicit and is *determined by* the interactions that confer phenotype-predictive power. Our method identifies both a marker subset and the *multiple* k-way interactions between these markers. Additional key aspects are: (i) evaluation of a select subset of up to five-way interactions while retaining relatively low complexity; (ii) flexible single nucleotide polymorphism (SNP) coding (dominant, recessive) within each interaction; (iii) no mathematical interaction form assumed; (iv) model structure and order selection based on the *Bayesian Information Criterion*, which fairly compares interactions at different orders and automatically sets the experiment-wide significance level; (v) MECPM directly yields a phenotype-predictive model. MECPM was compared with a panel of methods on datasets with up to 1000 SNPs and up to eight embedded penetrance function (i.e. ground-truth) interactions, including a five-way, involving less than 20 SNPs. MECPM achieved improved sensitivity and specificity for detecting both ground-truth markers *and* interactions, compared with previous methods.

**Availability:**http://www.cbil.ece.vt.edu/ResearchOngoingSNP.htm

**Contact:**djmiller@engr.psu.edu

**Supplementary information:**Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

It is widely accepted that individual variation in risk for complex disorders results from the joint effects of both environmental and genetic factors. Considerable progress has been made in identifying environmental factors associated with specific diseases. However, there remain gaps in knowledge about common genetic variants that also play a role in disease pathogenesis and especially about the extent to which genetic variants interact with each other and with environmental factors to increase disease propensity. There are statistical, computational and methodological challenges associated with discovery of gene–gene and gene–environment interactions. First, there is the computational challenge of analyzing the vast majority of common single nucleotide polymorphisms (SNPs), or SNP subsets identifying common haplotypes, in large numbers of individuals. These datasets, typically with hundreds of thousands of SNPs genotyped for thousands of individuals, are too large for exhaustive evaluation of even moderate-order [e.g. third (SNP–SNP–SNP)] interactions. Even with an initial screening to reduce the SNP pool to less than, e.g. 10 000, it is infeasible for most computational platforms to evaluate all combinations of SNP associations for even moderate interaction orders.

Even if exhaustive evaluations *were* possible for large SNP pools, a high proportion of identified associations will be false positives due to chance, as well as due to biases such as admixtures, non-random missing data or genotyping errors. There is also no full consensus on how to set an experiment-wide significance level, consistent with accurate estimation or control of the number of true and false discoveries (Allison *et al.*, 2006).

There are also several challenges that may confound ability to detect true interactions. A fundamental problem is that interactions may have *undetectable main (single SNP) or even low order (e.g. pairwise SNP) effects*, e.g. consider handedness, attributed to a non-linear XOR interaction Levy and Nagylaki (1972), which *only* has a significant *joint* effect. Linear models such as linear support vector machines (SVMs, Guyon *et al.*, 2002), often applied in past studies, cannot detect interactions without main effects.

Another difficulty is that the mathematical form of interactions is a priori unknown, with many possibilities. Indeed, there is even lack of consensus about what constitutes an interaction. Here, we define an interaction as occurring whenever a *combination* of two or more SNP features increases or decreases likelihood of the phenotype more than the features acting singly. Unfortunately, it is difficult to know which forms really occur in biology. The strong prior precedent for main effects between single SNPs and phenotypes argues that techniques should, at a minimum, be proficient at detecting these effects. However, more exotic forms are already known to exist (Levy and Nagylaki, 1972). Multiplicative interactions, within logistic models, are often assumed. However, model bias associated with this assumed form may reduce power to detect interacting SNPs. Even given a fixed form (e.g. multiplicative), the number of candidate interactions grows quickly with interaction order. Finally, detection power is also affected by low interaction penetrance and by an interaction's subpopulation size. Limitations of generalized linear and generalized additive models motivate the search for alternative methods and more flexible and efficient detection strategies.

One *machine learning* strategy is to treat the problem as one of *supervised feature selection* within statistical classification, with ‘cases’ and ‘controls’ as the classes. Here, one aims to select the feature subset which, when coupled with a classifier, leads to best classification performance. Standard *wrapper-based* feature selection (Guyon and Elisseeff, 2003 involves repeated application of classifier learning for different candidate subsets, with the chosen *marker* subset the one maximizing a criterion function closely allied to classification accuracy. A wrapper is thus specified by the subset search algorithm, the classifier and its learning algorithm, and by the feature selection criterion function. Searching can involve greedy forward selection, with ‘informative’ features added; backward elimination, with irrelevant features removed; or more complex algorithms. For SNP marker selection, Bhat *et al.*, (1999) applied wrappers using multilayer perceptrons (MLPs), while Kim and Kim (2001) applied SVMs. Several other SNP discovery methods have also been proposed. Ritchie *et al.*, (2001) proposed multifactor dimensionality reduction (MDR) to encode SNP interactions. Kooperberg *et al.*, (2007) proposed logic regression and reported promising results. Zhang and Liu (2007) proposed Bayesian epistasis association mapping (BEAM). In Section 4, we discuss these and other previous SNP discovery methods and their disadvantages and advantages, compared with the method proposed here.

Unlike most past approaches, the method developed here learns a *single* phenotype-predictive model that explicitly specifies all detected markers and marker interactions, and automatically estimates the number of interactions present. Key aspects of this method are: (i) while some methods only detect a marker subset, maximum entropy conditional probability modeling (MECPM) determines this SNP subset, the number of interactions and the specification of the marker SNPs involved in each interaction. For example, from 1000 candidate SNPs, MECPM might determine that there are six interactions—a five-way, three-way and two-way, along with three main-effect SNPs—each involving disjoint marker SNP subsets, and thus together comprising a marker set of 13 SNPs; (ii) MECPM is a posterior model (classifier), predicting phenotype given the genotype, that makes explicit and in fact is defined by its interactions; (iii) it uses a novel heuristic search that evaluates a selected, enriched subset of relatively high-order candidate interactions (up to five-way), while retaining relatively low complexity (worst case, quadratic in the SNP size); (iv) the method allows flexible SNP coding (dominant, recessive) within each interaction; (v) no mathematical interaction form is assumed; (vi) theoretically grounded model structure and order selection, based on the *Bayesian Information Criterion*(BIC; Schwarz, 1978), accounts for the finite sample size in (fairly) comparing candidate interactions at different orders and in determining the number of interactions; thus automatically, via model order selection, the experiment-wide significance level is set. This approach tends to determine markers and interactions with high specificity *and* better sensitivity than previous approaches (see Section 3).

In Section 2, we briefly review MECPM modeling and then specialize this framework for interaction discovery. In Section 3, we experimentally compare MECPM SNP discovery with other methods. Section 4 discusses methodology, capabilities and advantages/disadvantages of previous SNP discovery methods, compared with MECPM.

## 2 METHODS

### 2.1 The maximum entropy principle

The principle of maximum entropy (ME; Jaynes, 1989) prescribes that, when building a probability model, one should agree with all known information while remaining maximally uncertain (least-biased) with respect to everything else. For a physical system, e.g. gas particles, the known information may be the temperature, with the goal to determine the distribution over particle velocities. For six-sided die tossing, the average die value may be known. In the case of a SNP association study, the ‘known information’ is taken to be statistics involving the phenotype that can be accurately measured (via co-occurrence counts) from the given dataset, e.g. the pairwise probability of disease presence and a particular genotype for a given SNP. To achieve maximum uncertainty, the probability distribution is chosen to maximize Shannon's entropy function, while satisfying constraints that ensure agreement with the known information (Jaynes, 1989). While the constraints (e.g. on the average die value) do not uniquely specify the distribution, the distribution that *maximizes entropy* while satisfying the constraints is unique (Cover and Thomas, 1991, Ch. 11). Other distributions that are consistent with the known information (but with lower entropy) may make unjustified assumptions, such as a parametric distribution form. In Jaynes, (1989) and Cover and Thomas (1991, Ch. 11), it has also been shown that the ME distribution realizes the known information in an overwhelmingly greater number of ways than other distributions. Other justification for the ME principle can be found in Cover and Thomas (1991, Ch. 11). While ME techniques have found increasing use (Berger *et al.*, 1996), they have not been widely exploited in bioinformatics.

### 2.2 ME Modeling

To build a classifier within the ME framework, one learns an *a posteriori* class probability model (Duda *et al.*, 2000), consistent with the specified statistical constraints. The constraints that are chosen play a pivotal role in determining the accuracy of the resulting model. Without *any* constraints, the ME posterior is (an uninformative) uniform probability mass function over all classes. Each encoded constraint *reduces* the (maximum) entropy, yielding a more predictive class posterior. The most useful constraints achieve the greatest reduction in the ME distribution's entropy (Zhu *et al.*, 1997). To determine whether an individual has propensity for phenotype *C*, the (first-order) probability P[C=‘case’, smoker = ‘yes’] and the (second-order) probability P[C = ‘control’, SNP47 = ‘AA’, SNP32 = ‘AG’], specifying a two-way interaction, may be meaningful statistics to encode. These statistics are first measured from co-occurrence counts over a labeled subject pool. One then learns the ME posterior on *C* to agree with these constraints. If the constraints are accurately measured, the model should exhibit little or no overfitting.

The ME optimization problem is convex for linear constraints and solvable by standard approaches (Della Pietra *et al.*, 1997), guaranteed to converge to the (unique) ME solution—in contrast, neural network training is beset by non-convex cost surfaces and poor local minima. ME also naturally handles discrete-valued features (such as SNPs) and both binary and non-binary categorical phenotypes. Most importantly for SNP modeling, unlike MLPs and SVMs, each ME constraint, with an associated parameter, corresponds, one-to-one, to a SNP–SNP–phenotype interaction. Thus, ME makes explicit the interactions important for achieving accurate phenotype prediction.

### 2.3 The form of the ME posterior

Consider a discrete-valued feature space (*F*, *C*)=(*F*_{1},…, *F*_{N}, *C*), *F*_{i} ∈ 𝒜_{i}, *C* ∈ 𝒫, where 𝒜_{i}, ∀*i* and 𝒫 are finite sets. Here, *F*_{i} represents the SNP value, with 𝒜_{i} = {0, 1} under dominant or recessive coding. *C* is the phenotype, with 𝒫 = {0, 1}, representing cases and controls. The ME goal is to learn the posterior *P*[*C* = *c*|*F* = *f*] that maximizes entropy consistent with specified constraints measured based on a ‘supervised’ training set, 𝒯 = {(*f*^{(t)}, *c*^{(t)}), *t* = 1,…, *T*}. For the sake of clarity, we first specify the form of the ME posterior assuming simple constraints defining ‘first-order’ interactions. We then give the (notationally heavier) posterior form that arises for variable order interactions.

In this section, we assume that the constraints are given whereas in practice they must be chosen. A greedy algorithm for selecting the constraints (and, thus, the interactions) will be given in Section 2.5. Consider the (given) probabilistic constraints {*P*_{g}[*C* = *c*(*m*), *F*_{i(m)} = *v*(*m*)], *m* = 1,…, *N*_{c}}, *N*_{c} the number of constraints. Here, *c*(*m*), *i*(*m*) and *v*(*m*) are the class value, feature (SNP) index and feature value (genotype), respectively, associated with the *m*-th constraint. The constraints are measured from frequency counts over 𝒯: Here, ‘g’ denotes ‘ground truth’. The model's *estimate* of *P*_{g}[·] is: . The posterior is chosen to satisfy the specified constraints while maximizing the conditional entropy , based on assumed uniform support over the training set, i.e. . The supplement derives that this ME posterior takes the *Gibbs* form:

*c*(

*m*),

*v*(

*m*)] is the Lagrange multiplier associated with constraint

*m*and One can make phenotype predictions by choosing the class with largest posterior probability. Note that (1) makes explicit the (single feature) interactions that contribute to class predictions.

More general interactions are expressed if we encode higher order constraints. Specifically, suppose now the *variable order* constraint set {*P*_{g}[*C* = *c*(*m*), *F*_{i1(m)} = *v*_{1}(*m*),…, *F*_{iL(m)(m)} = *v*_{L(m)}(*m*)], *m* = 1,…, *N*_{c}}, where *L*(*m*) is the number of interacting features. Following the same derivation as in the Supplementary Material, we find the ME posterior:

*no*assumption is made about the mathematical

*form*of an interaction involving

*L*(

*m*) features (SNPs)—there is simply a numerical Lagrange multiplier value γ[·], whose magnitude indicates the interaction strength, and whose sign indicates whether the interaction increases or decreases phenotype likelihood.

The ME *model* (i.e. the parameters {γ[·]}) satisfying the given constraints, can be found via the method of *iterative scaling* (Della Pietra *et al.*, 1997), described in the Supplementary Material.

We allow interactions up to order *L*(*m*) = 5, with individual SNPs in an interaction flexibly coded as dominant or recessive. We do not use additive coding because ternary coding (relative to binary) exponentially reduces (for increasing order) an interaction's subpopulation—we have found that this adversely affects power to detect interactions, even when some true interactions do involve additive coding. Since flexible dominant/recessive coding is used, each Lagrange multiplier, associated with a constraint/interaction, *actually* has the form γ[*C* = *c*(*m*), *F*^{(bi1(m))}_{i1(m)} = *v*_{1}(*m*),…, *F*^{(biL(m))}_{iL(m)(m)} = *v*_{L(m)}(*m*)], where the superscript *b*_{j}(*m*) ∈ {0, 1} indicates dominant or recessive coding of feature *j* in interaction *m*.

### 2.4 BIC-based model selection

For choosing constraints and their number, we have developed a greedy induction algorithm which adds constraints one at a time. In principle, this algorithm could be used just to posit candidate interactions, with standard association testing applied to determine whether there are significant deviations from the null hypothesis of no interactions. However, there are difficulties for multiple hypothesis testing, both in determining an experiment-wide significance level and in comparing associations of different orders. Moreover, even when the sample size is inadequate for satisfying a target significance level, one would like to rank the most promising associations, for targeted followup studies. Since our method builds a posterior model, it possesses a natural alternative to conventional hypothesis testing, both for choosing between candidates at different orders and for estimating the number of interactions—the problem can be cast as *model order selection*.

The BIC (Schwarz, 1978) is a theoretically grounded estimator, asymptotically minimized by the true model (if in the model family) in the large sample limit. For finite sample sizes, BIC captures a basic tradeoff between data representation (data likelihood) and model complexity. This allows ‘level-playing-field’ comparison of interactions at different orders, with the preferred interaction the one which, when added to the model, yields the greatest decrease in BIC. We apply BIC both to compare candidate interactions during model growing and to *terminate* model growing, i.e. the interaction significance level is automatically set based on the global minimum of the BIC curve, which is sample-size dependent. As seen by our results (see Section 3), this approach tends to achieve high interaction specificity, but also better sensitivity than other methods. Section 2.5 describes our greedy model induction search. Here, we specify the BIC used within this search.

To develop the BIC cost for our ME model, we exploit a well-known equivalence between BIC and the minimum description length (MDL) criterion (Hastie, 2001). The latter, MDL (Rissanen, 1978), is explicitly based on an optimal source coding framework, which is convenient for our development here. Specifically, MDL (BIC) consists of a two-part *codelength*: the bits to specify the model, *B*_{Θ}, plus the bits needed to specify the data given the model. For our (posterior) model, the data consists of the class labels for each sample, with the number of bits to describe the data estimated by the negative posterior joint log-likelihood. Assuming each labeled sample is independently generated, MDL (BIC) can thus be written generically as:

*B*

_{Θ}, the number of bits to specify the model. In the Supplementary Material, we detail a simple coding scheme (with associated codelength) for specifying the MECPM model parameters. Here, we list the information that must be described by this code and give the formula for the total codelength. An MECPM model is fully specified by: (i) The number of interactions,

*N*

_{c}(in fact, we can ignore this codelength because we assume a uniform prior on the number of interactions) and (ii) for each interaction: (a) the interaction order, (b) the SNPs involved, (c) the coding and genotype of each SNP under its coding model, and (d) the interaction's real-valued Lagrange multiplier. For this last quantity, we assume the Laplace approximation within the BIC framework (Hastie, 2001), which gives the result that each real-valued parameter γ(·) incurs bits of description. Thus, based on the simple coding scheme given in the Supplementary Material, the total codelength is: where log

_{2}(5) specifies the interaction order,

*C*

_{L(i)}

^{N}is the number of combinations of

*L*(

*i*) SNPs from a pool of size

*N*, 4

^{L(i)}the number of (coding, genotype) configurations for

*L*(

*i*) interacting SNPs and ‘2’ the number of possible phenotype values.

### 2.5 Greedy interaction growing search method

We have developed a novel variant of greedy search ME model induction (Della Pietra *et al.*, 1997) that uses BIC (4) both for choosing the ME constraints (interactions) and their number. Our variant achieves great reduction in computation compared with standard greedy ME model-building, with little or no compromise on interaction discovery accuracy. Its key features are: (i) adding one constraint (interaction) at a time; (ii) flexibly considering coding models (dominant, recessive) for each SNP in a candidate interaction; (iii) for each new interaction, progressively evaluating candidates of increasing order by *accreting* ‘seed’ interactions at first *and* second order (to avoid the XOR problem); (iv) instead of treating *all possible* first- and second-order constraints as seeds and all possible single SNPs as candidates for seed accretion, we propose a novel, low-complexity method, used both for identifying promising, *small* seed subsets at first and second order and small candidate SNP pools for seed accretion.

For each seed, we choose (from amongst the seed's candidate SNP pool) the best single SNP conjunction to add to increase the seed's order by one (seed accretion), continuing in this greedy fashion until we reach the maximum order (five). To evaluate each candidate for an accreted seed, we trial-learn the associated Lagrange multiplier (taking 10 iterative scaling update steps), keeping all current model parameters held fixed, and then measure the change in BIC cost. The number of such trial-learnings is linear in the product of the number of seeds and the candidate SNP pool size for seed accretion. Supposing these numbers are the same (*K* ≪ *N*), trial-learnings are quadratic in *K*. Thus, our great reduction in computation time stems from the fact that we reduce the amount of trial-learning from *cubic in the number of SNPs*, *N* (if all second-order constraints are seeds and all SNPs are seed accretion candidates) to quadratic in *K* ≪ *N*—if there are, e.g. *N* = 1000 SNPs, the number of parameter trial-learnings may be reduced by several orders of magnitude.

Let us illustrate the seed accretion process. For example, starting from the second-order seed constraint *P*_{g}[*C* = 0, *F*^{(0)}_{47} = 0 , *F*^{(1)}_{94} = 1], trying each single SNP in this seed's accretion pool as a candidate conjunction (with all possible codings and genotype values) by trial-learning the associated Lagrange multiplier, we might find that the best third-order accretion of this seed (in the sense of BIC) is *P*_{g}[*C* = 0, *F*^{(0)}_{47} = 0, *F*^{(1)}_{94} = 1, *F*^{(0)}_{3} = 1]. Continuing, we might find that the best fourth-order accretion of this (third-order) accreted seed is *P*_{g}[*C* = 1, *F*^{(0)}_{47} = 0, *F*^{(1)}_{94} = 1, *F*^{(0)}_{3} = 1, *F*^{(1)}_{36} = 1] and, at fifth order, *P*_{g}[*C* = 0, *F*^{(0)}_{47} = 0, *F*^{(1)}_{94} = 1, *F*^{(0)}_{3} = 1, *F*^{(1)}_{36} = 1, *F*^{(1)}_{17} = 0]. Amongst these four accreted seed constraints, we save, as the best candidate starting from this second-order seed, the one that reduces the BIC cost the most, relative to the current model. Note that even though seed accretion starting from first order evaluates a set of interactions with orders ranging from one all the way up to five, the *number* of such evaluated interactions, rather than quartic, is at most *linear* in the number of SNPs (*N*), as is the computation associated with accreting a given seed up to order five [each accretion step, increasing a seed's order by one, evaluates 4*K* SNP interactions, based on four (coding, genotype) configurations per SNP, with *K* ≪ *N* the candidate SNP pool size for seed accretion].

We perform this same accretion process starting from *every* seed at first and second order, and then pick the best constraint (in the sense of BIC) over *all* the seeds (and thus over all greedily grown accretions of these seeds up to fifth order). This best constraint (with associated learned γ[·] parameter) is then added to the model. This process is then repeated to determine the next interaction to add. This sequential growing process, greedily adding variable-order interactions one at a time, continues until the BIC cost reaches its minimum, at which point the algorithm terminates, yielding both a phenotype-predictive model and a posited list of interactions and marker SNPs.

As noted above, computational requirements of this algorithm depend on the number of seeds and number of seed accretion candidates, *K*. We have considered several approaches. One is to take as seeds *all possible* first- and second-order constraints, i.e. at second-order, constraints involving all *N*(*N* − 1)/2 SNP pairs, and with all SNPs as accretion candidates. In this case, as noted earlier, the number of trial-parameter learnings for each added interaction and the overall algorithm complexity grow *cubically* with *N*. However, *another* possibility is to limit seeds to the ‘most promising’ SNPs and SNP pairs and seed accretion candidates to the most promising SNP tuples. In (Della Pietra *et al.*, 1997), constraints are exclusively evaluated by trial-learning, which is computationally heavy. Instead, we make the novel proposal of an objective criterion to use for efficiently identifying both seeds and seed accretion candidates *without* performing trial-learning—to select seeds, we measure the *Kullback–Leibler divergence* (Cover and Thomas, 1991, Ch. 2; Kullback, 1997), also often referred to as the ‘relative entropy’ or ‘information divergence’, between the probability mass functions (*P*_{g}[·], 1 − *P*_{g}[·]) and (*P*_{ME}[·], 1 − *P*_{ME}[·]), defined for *all possible* one- and two-way SNP constraints, with *P*_{ME}[·] measured based on the *current* model that has not yet encoded (been learned to agree with) the constraint *P*_{g}[·]. That is, for each first- and second-order candidate interaction, we evaluate *D*[*P*_{g}[·]‖*P*_{ME}[·]] ≡ *P*_{g}[·]log(*P*_{g}[·]*P*_{ME}[·])+(1 − *P*_{g}[·])log((1 − *P*_{g}[·])/(1 − *P*_{ME}[·]). It is *quite* plausible that the constraints at a given order that are furthest from currently being met by the existing model (relative entropy sense) are *also* those which, once added to the model, will increase log-likelihood (hence decrease BIC cost) the most. Some information-theoretic justification for relative entropy-based selection is given in the Supplementary Material. Thus, an alternative to accretion of all possible constraints at first and second order (with trial parameter learnings cubic in *N*) is to use seed pools (of size *K* ≪ *N*), comprising the constraints with largest relative entropies at first and second order, and with the candidate SNP pool for accreting a given seed at each order *also*, e.g. chosen as the *K* accretions of this seed with largest relative entropy. In this case, the number of trial parameter learnings is only quadratic in *K*. Exhaustive evaluation of cross-entropies at both first and second orders (which should be repeated, in order to update the seed and seed accretion pools after each new constraint is added to the model), will still require complexity quadratic in *N*. With the seed pools limited in size, the overall algorithm thus grows quadratically in *N*. It will be seen that limiting the seed and accretion pools in this way achieves great computational savings and accurate interaction discovery.

## 3 RESULTS

### 3.1 Datasets

Data on 223 individuals genotyped on the 317K Illumina SNP chip as part of the New York City Cancer Control Project (NYCCCP; PI Peter Gregersen) form the base from which the datasets were produced. A flexible simulation program was written that generates user-defined sample size, number of SNPs, no missing data or missing data patterns consistent with the observed missing data in the original genome scan, and affected or unaffected disease status under the null hypothesis (i.e. no associations in the genome) or under the alternative hypothesis (i.e. hard-coded penetrance functions). Missing data are filled in completely at random, proportional to the allele frequencies in the original data. The datasets were produced as follows. Consider a matrix with 223 rows corresponding to individuals and 317 503 columns corresponding to the SNPs. The elements of this matrix are the genotypes. The columns were partitioned into bins of 500 SNPs, the last bin containing only three SNPs. The simulated genome scan data for each individual was obtained by random draws (with replacement) from a real data matrix of 223 individuals and 636 bins of 500 SNPs. Specifically, the simulated data for an individual was generated by randomly selecting the first bin (first column) from the 223 individuals (rows), randomly selecting with replacement the second bin from the 223 individuals, randomly selecting with replacement the third bin from the 223 individuals and so on. Thus, the data retains the basic patterns of linkage disequilibrium, missing data and allele frequencies observed in the original genome scan data. The exception to this is only at the 635 breaks in the genome corresponding to the bin boundaries. The simulations correspond to 1000 cases and 1000 controls simulated under the alternative hypothesis described below and no missing data. Only autosomal loci are considered in the data. The various methods were applied to sets of 100, 250 and 1000 SNPs selected at random from the autosomal loci. These numbers of SNPs are consistent with a (GWAS) study following an initial SNP screening stage and also with pathway-based association studies. The simulations reported assume that the disease risk is explained by genetic factors and an underlying sporadic rate, with no interactions with any environmental effects. Seven polymorphisms-by-polymorphisms interaction models, described in the Supplementary Material, were used in creating the 100, 250 and 1000 SNPs datasets. Five of the seven models involve interactions. None is motivated by an additive or multiplicative model—all were generated by more complex Mendelian inheritance pattern. Seventeen SNPs were selected to participate in penetrance functions, and thus to influence disease status. We refer to these 17 SNPs as ‘ground-truth’ SNPs and to their associated penetrance functions as ‘ground-truth interactions’. Results on more datasets, based on nine penetrance function models with up to 20 marker SNPs, are presented in the Supplementary Material.

### 3.2 Illustrative MECPM results

Figure 1 plots BIC as a function of the number of constraints for a dataset with 100 SNPs (Table 1). MECPM model-building (based on seed and seed accretion pool sizes of *K* = 25) stopped at the point with minimum BIC value, with seven constraints added. The interactions discovered before stopping are marked above the curve. We also list interactions added just after the stopping point (below the curve). We considered an MECPM model interaction to be a true interaction *only* if it contained *all* the true interacting SNPs and *no* false SNPs. Comparing with Table 1, MECPM discovered 15 out of 17 ground-truth markers and 7 out of 8 ground-truth interactions, including the five-way. The interaction (44, 75) was missed, although a constraint associated with SNP 44 was added immediately *after* the stopping point. On this dataset, MECPM achieved both very high specificity (zero false positives) *and* very high sensitivity of *both* interaction and marker discovery. More generally, on a variety of datasets, the method typically achieves very high specificity; while sensitivity on this dataset was on the high end, we have found sensitivity of the method is also typically much higher than alternative methods. Some comparisons are next shown, with comparisons on more datasets given in the Supplementary Material.

Group index | SNP interaction |
---|---|

1 | 25 81 87 92 99 |

2 | 85 83 |

3 | 85 100 |

4 | 97 20 47 |

5 | 60 93 |

6 | 44 75 |

7 | 98 |

8 | 78 |

Group index | SNP interaction |
---|---|

1 | 25 81 87 92 99 |

2 | 85 83 |

3 | 85 100 |

4 | 97 20 47 |

5 | 60 93 |

6 | 44 75 |

7 | 98 |

8 | 78 |

There are eight interactions, including one five-way and one three-way interaction.

### 3.3 Marker discovery accuracy comparison

We compared MECPM marker discovery accuracy with the following methods described in the Supplementary Material: Linear SVM-RFE (Guyon *et al.*, 2002), Linear SVM-MFE (Aksu *et al.*, 2007), Pearson's χ^{2} test (Agresti, 2002), logistic regression LR; Hunter *et al.*, 2007), information gain IG; Moore *et al.*, 2006), full interaction model FIT; Marchini *et al.*, 2005) and BEAM (Zhang and Liu, 2007). Excepting BEAM, for which source code was provided by the authors, we implemented all other comparison methods, based on the descriptions in the cited papers. Only linear SVM results are shown because non-linear kernel SVMs performed worse than linear SVMs. Excepting MECPM, most methods can only discover a marker subset, with at best limited ability to infer interactions, e.g. IG and FIT can infer interactions but the interaction order is limited to two. Since the comparison methods have limited interaction inference capability, we restrict to comparing marker discovery accuracies.

Table 2 compares the marker discovery ability of eight methods. *N*_{tm} is the number of true markers at the point where there are *zero* false markers. At this point, MECPM discovered 15 true markers for both the 100 and 1000 SNPs datasets, better than all other methods. *N*_{fm} is the number of false markers when the number of true markers is fixed at the number discovered by MECPM at the (minimum BIC) stopping point (15 markers). Since MECPM has greater sensitivity than the other methods, this essentially evaluates specificity at high sensitivity. Note MECPM has perfect specificity on these datasets at its (BIC-minimum) stopping point whereas, to achieve the same sensitivity, the other methods must retain *many* false markers. More results are given in the Supplementary Material, including: (i) comparisons on other datasets; (ii) evaluation of MECPM applied to an existing Lupus Erythematosus GWAS study dataset;(iii) MECPM discovery accuracy as a function of sample size; and (iv) classification accuracy comparisons.

Method | 100 SNPs | 1000 SNPs | ||
---|---|---|---|---|

N_{tm} | N_{fm} | N_{tm} | N_{fm} | |

MECPM | 15 | 0 | 15 | 0 |

SVM-MFE | 7 | 13 | 6 | 531 |

SVM-RFE | 8 | 31 | 7 | 570 |

χ^{2}-test | 8 | 40 | 6 | 395 |

LR | 8 | 39 | 6 | 492 |

IG | 9 | 55 | 8 | 711 |

FIT | 7 | 36 | 6 | 400 |

BEAM | 6 | 49 | 6 | 774 |

Method | 100 SNPs | 1000 SNPs | ||
---|---|---|---|---|

N_{tm} | N_{fm} | N_{tm} | N_{fm} | |

MECPM | 15 | 0 | 15 | 0 |

SVM-MFE | 7 | 13 | 6 | 531 |

SVM-RFE | 8 | 31 | 7 | 570 |

χ^{2}-test | 8 | 40 | 6 | 395 |

LR | 8 | 39 | 6 | 492 |

IG | 9 | 55 | 8 | 711 |

FIT | 7 | 36 | 6 | 400 |

BEAM | 6 | 49 | 6 | 774 |

### 3.4 Fast approximate search

We compared three ME greedy induction methods for interaction discovery accuracy and execution time, along with all other evaluated methods. All MECPM variants perform seed accretion as discussed in Section 2.5. One method exclusively uses trial-parameter learning to compare candidate constraints, consistent with Della Pietra *et al.*, (1997). For this method, we performed exhaustive trial-learning of all possible constraints at first and second order and then selected as a *single* seed the constraint (amongst those at both orders) that decreased BIC the most. We then performed accretion on this seed using all SNPs as the accretion pool. This method is dubbed ‘Complete’. Even though this method accretes only one seed, it has high complexity due to exhaustive trial-learning of candidates at second order. A second method (‘MECPM-RE order2’) is as described in Section 2.5, using relative entropy evaluations to identify seed pools starting at both first and second order, both of size *K* = 25, and accretion pools at each order, also of size *K* = 25. A third method (‘MECPM-RE order1’) achieves further *great* reduction in complexity by, unlike ‘MECPM-RE order2’, only starting from first-order seeds (and, thus, only performing exhaustive relative entropy evaluations at first order, rather than first and second orders). This method has complexity of only *O*(*N*) (*N* relative entropy evaluations) or *O*(*K*^{2}), whichever is larger. Computer simulation was performed for all methods on an Intel Xeon 3.4 GHz CPU, 4 GB Memory, using Linux CentOS 5 and Windows XP operating systems. Table 4 shows that both ‘RE’ methods greatly reduce computation, compared with ‘Complete’, and that the ‘order 1’ method even has modest complexity for 1000 SNPs, comparable with BEAM and IG. Table 3 shows the inference accuracy for these three MECPM variants, with ‘RE’ in fact representing the performance of both the ‘order1’ and ‘order2’ methods, since they achieved precisely the same results. Note that use of relative entropy (and the additional ‘order1’ reduction) to reduce complexity did not degrade inference accuracy at all—in fact, the ‘RE’ methods achieved better results on the 1000 SNPs dataset, finding the fifth-order interaction which was missed by ‘Complete’.

Data | N_{gm} | N_{tm} | N_{gi} | N_{ti} | ||
---|---|---|---|---|---|---|

RE | Co. | RE | Co. | |||

100 SNPs | 17 | 15 | 15 | 8 | 7 | 7 |

1000 SNPs | 17 | 15 | 10 | 8 | 7 | 6 |

Data | N_{gm} | N_{tm} | N_{gi} | N_{ti} | ||
---|---|---|---|---|---|---|

RE | Co. | RE | Co. | |||

100 SNPs | 17 | 15 | 15 | 8 | 7 | 7 |

1000 SNPs | 17 | 15 | 10 | 8 | 7 | 6 |

*N*_{gi} is the number of ground−truth interactions, *N*_{ti} the number of detected true interactions, *N*_{gm} the number of ground-truth markers and *N*_{tm} the number of detected true markers; *K* = 25 was the seed and seed accretion pool size.

Method | 100 SNPs | 1000 SNPs |
---|---|---|

MECPM-Complete | 3.6 h | 750 h |

MECPM-RE order2 | 19 m | 7.5 h |

MECPM-RE order1 | 9 m | 24 m |

χ^{2}-test | 2 s | 4 s |

LR | 2 s | 5 s |

IG | 3 s | 2 m |

FIT | 4 s | 2 s |

BEAM | 12 s | 15 m |

Method | 100 SNPs | 1000 SNPs |
---|---|---|

MECPM-Complete | 3.6 h | 750 h |

MECPM-RE order2 | 19 m | 7.5 h |

MECPM-RE order1 | 9 m | 24 m |

χ^{2}-test | 2 s | 4 s |

LR | 2 s | 5 s |

IG | 3 s | 2 m |

FIT | 4 s | 2 s |

BEAM | 12 s | 15 m |

## 4 DISCUSSION

MECPM can be compared with past SNP discovery methods from a few standpoints: (i) it is a type of wrapper approach that sequentially builds explicit interactions into a phenotype-predictive classifier; (ii) it applies an information-theoretic principle (ME). Kim and Kim (2001) and Bhat *et al.*, (1999) applied wrapper-based SNP selection using SVMs and MLPs, respectively. We discuss some limitations of SVMs here and discuss both MLPs and SVMs in more detail in the Supplementary Material. Linear SVMs do not encode non-linear interactions and thus can only find single SNP (main) effects. This may explain the poor performance of linear SVMs seen in Table 2. Kernel-based SVMs do encode non-linear interactions. As an example, for an original feature vector *x* = (*x*_{1}, *x*_{2}, *x*_{3}), the implicit (mapped) feature coordinates generated using the polynomial kernel *K*(*x*, *z*) = (1 + *x*^{T}*z*)^{2}, which are linearly weighted via the SVM, are:

*X*

_{i}is involved in a single (multiplicative)

*ground-truth*interaction, with

*X*

_{j}. If we eliminate

*X*

_{i}, we also effectively eliminate

*all*its interactions {

*X*

_{i}

*X*

_{k}}, including

*X*

_{i}

*X*

_{j}. If we retain

*X*

_{i}, we not only preserve the true interaction

*X*

_{i}

*X*

_{j}, but also false interactions. This ‘all-or-nothing’ characteristic makes it difficult for kernel-based SVM feature selection to identify true interactions and may lead to including false markers. This may explain why kernel-based SVM performance was even worse than the linear SVM performance as shown in Table 2. Moreover, for kernel-based SVMs, it is impractical or even infeasible to selectively eliminate the non-linear coordinates (interactions) directly, since the number of these coordinates may even be

*infinite*(for the Gaussian kernel) and in practice is huge, even if finite.

A number of past works apply information-theoretic criteria to marker and interaction discovery. (Moore *et al.*, 2006) and (Dong *et al.*, 2008) use entropy gain (mutual information) to identify single SNP and two-way interactions. Some limitations of these methods are that they do not account for finite sample size in assessing interactions at different orders; there is also no methodology given for determining the number of (significant) interactions. (Moore *et al.*, 2006) does build a classifier, *after* identifying interactions. In contrast, our (wrapper) method integrates classifier design and interaction detection, with the most predictive interactions (BIC sense) selected. Another method based on an information-theoretic criterion is McKinney *et al.*, (2007). They perform backward SNP elimination inspired by the physical process of evaporative cooling, with ‘uninformative’ SNPs eliminated based on a combination of mutual information (for preserving main-effect SNPs) and Relief-F (for preserving epistatic interactions). A main limitation of this method is that it only identifies a marker subset. Two other related methods that apply mutual information for interaction discovery are Dawy *et al.*, (2006) and Manzour and Saraee, (2007). Similar to MECPM, these methods perform a forward-growing interaction search, via ‘relevance chains’, with a SNP marker added to multiple chains if its mutual information with the phenotype, conditioned on existing chains, is statistically significant (based on an approximate χ^{2}-test). These methods *do* account for finite sample size in evaluating candidate interactions. They also place special emphasis on the sequence order in which SNPs are added to chains (thus, inferring whether a SNP is a primary or secondary ‘causal’ marker). Such interpretation can also be given to (sequentially accreted) MECPM interactions. Unlike MECPM, these methods do not build a phenotype-predictive classifier and they do not apply flexible dominant/recessive coding. Moreover, the complexity of these methods is high for k-way interactions (*k* > 2); in Dawy *et al.*, (2006), the authors do not discuss ways to reduce complexity. In Manzour and Saraee (2007), heuristics to limit the number of interaction candidates are introduced. Finally, it must be emphasized that MECPM applies entropy in a quite different way than all methods just described—in MECPM, entropy is not the SNP marker selection criterion. Rather, the ME principle is used to derive a posterior that encodes interactions, with this distribution (and thus its interactions) actually evaluated using BIC (MDL).

Another recent method is Bayesian epistasis association mapping (BEAM; Zhang and Liu, 2007). This approach partitions the SNP pool into three subsets—SNPs unlinked to the disease, those with independent main effects, and those involved in a joint effect. A limitation of BEAM is that, for any given model hypothesis, specified by a SNP partitioning, there is only a *single* epistatic (multilocus) interaction. Thus, while BEAM scores candidate markers and interactions, it does not directly estimate the number of markers and interactions present. BEAM also does not give a phenotype-predictive model. Finally, BEAM has a relatively large number of hyperparameters requiring user setting.

While we have noted properties of MECPM and its advantages relative to past methods, there are also some disadvantages. One is high complexity when evaluating all second-order constraints (as may be needed to detect purely epistatic interactions). Another disadvantage is that, while we do have a theoretically grounded stopping point at the BIC minimum, we do not yet have efficient (closed form, analytical) ways of assessing statistical significance for the detected interactions (permutation testing could be applied, but with high-computational cost). Since stopping at the BIC minimum achieves very high specificity, the ability to efficiently assess statistical significance of detected interactions might allow us to continue adding interactions until a point is reached with lower (but still acceptable) specificity, yet with enhanced interaction sensitivity. Development of this capability will be a subject of our future work.

## 5 CONCLUSIONS

We have developed and evaluated an ME approach for SNP interaction discovery that makes explicit and is *determined by* the interactions that confer phenotype-predictive power. We coupled ME with novel, heuristic but effective and efficient searching techniques for identifying complex, multilocus interactions and with BIC model selection. Compared with a panel of alternative methods, MECPM achieved both improved sensitivity and specificity for identifying ground-truth markers and interactions.

## ACKNOWLEDGEMENTS

We would like to thank Joshua Grab for help with simulation experiments. We gratefully acknowledge Peter Gregersen for the use of the New York City Cancer Control Project genotype data. We also gratefully acknowledge John B. Harley, Marta E. Alarcón-Riquelme, Lindsey A. Criswell, Chaim O. Jacob, Robert P. Kimberly, Kathy L. Moser, Betty P. Tsao, Timothy J. Vyse, Carl D. Langefeld, ALR, and the International Consortium for Systemic Lupus Erythematosus Genetics (SLEGEN) for providing us with the SLEGEN study data and authorizing us to publish our results on this data.

*Funding*: Wake Forest University Health Science Center for Public Health Genetics.

*Confict of Interest*: none declared.

## Comments