Background Epistasis has been historically used to describe the phenomenon that the effect of a given gene on a phenotype can be dependent on one or more other genes, and is an essential element for understanding the association between genetic and phenotypic variations. Quantifying epistasis of orders higher than two is very challenging due to both the computational complexity of enumerating all possible combinations in genome-wide data and the lack of efficient and effective methodologies.
Objectives In this study, we propose a fast, non-parametric, and model-free measure for three-way epistasis.
Methods Such a measure is based on information gain, and is able to separate all lower order effects from pure three-way epistasis.
Results Our method was verified on synthetic data and applied to real data from a candidate-gene study of tuberculosis in a West African population. In the tuberculosis data, we found a statistically significant pure three-way epistatic interaction effect that was stronger than any lower-order associations.
Conclusion Our study provides a methodological basis for detecting and characterizing high-order gene-gene interactions in genetic association studies.
Understanding the mapping from genetic variation to phenotypic variation has a great potential helping us understand, predict, diagnose, and treat common human diseases. However, existing main-effect-centered methodologies and techniques that depend on fundamental assumptions about a simple genetic architecture can only find very limited individual associations with disease risks. Genome-wide association studies1–3 and next generation sequencing4 make millions of single nucleotide polymorphisms (SNPs) in the human genome available for testing associations with phenotypic traits. These developments call for new methodologies that embrace the complex genetic architecture of diseases.5,6 The non-additive effects of gene–gene interactions, that is, epistasis, are believed to be an important contributor to the complex relationship between genetic and phenotypic variations.7–11 The focus of recent disease association research is shifting from identifying single locus susceptibility to quantifying interaction effects between multiple candidate loci throughout the human genome.8,9,12
Detecting and quantifying epistasis is a very challenging task. First, the epistatic interactions could involve multiple genes from a pair to a large set, and this undetermined order of interactions imposes enormous computational complexity of enumerating all possible combinations of genetic attributes for varying orders.6,13 Second, as the order of interacting genetic attributes goes beyond two, it becomes mathematically difficult to separate the additive lower-order effects and the pure higher-order synergy, that is, the extreme case in which the association can only be observed when all attributes are considered together. Those are also the major reasons why most existing epistasis studies are limited to pairwise interactions on genetics data of moderate sizes.
Information-theoretic measures have emerged as a very useful tool to quantify synergistic interactions among multiple genetic attributes.14–21 These measures are based on the Shannon entropy, which quantifies the amount of information, or uncertainty, of a random variable.22 By considering genetic attributes and phenotypic traits as random variables, entropy-based information-theoretic measures can be used to quantify the shared information between one gene and a trait, i.e., the main effect, as well as the gained extra information about a trait obtained from combining multiple genes, i.e., the synergistic effect or epistasis. However, as discussed previously, due to the mathematical complexity, the application of information-theoretic measures in disease association studies is mainly limited to pairwise epistasis between two genetic attributes.
In this study, we propose a new measure to quantify the synergistic effects among three genetic attributes that contribute to disease susceptibility by extending the information-theoretic measure for pairwise synergy. In particular, we first measure the total amount of information that three attributes together can provide about the phenotypic status, and then subtract all lower-order effects including the main effects of the three attributes and all pairwise synergies between them. This yields a very strict measure of pure three-way epistasis. There have been previous attempts at extending information-theoretic measures on three-way and higher-order synergies.14,17,23,24 However, most existing measures are not able to decouple lower-order interaction effects from the higher-order effects and the formalization of higher-order synergy is still debatable. We compared our new measure to those existing ones and were able to show that our measure performed the best at separating all lower-order effects from the pure three-way epistasis by applying to both synthetic data containing artifact epistasis models and real human disease data from a candidate-gene association study of tuberculosis in a population from West Africa.25 Of particular interest, we identified a statistically significant three-way epistasis model in the tuberculosis data, in which the three-way synergy is stronger than all the main effects and pairwise-interaction effects combined. With further biological verification and interpretation, this model could be very valuable in advancing tuberculosis research.
In information theory,22 entropy is a measure of the uncertainty of a random variable. It can be explained as the amount of information required on average to describe a random variable. For a discrete variable with alphabet and probability mass function , its entropy is defined as
The dependency between two random variables can be described using mutual information.22 This is a measure of the amount of information that one random variable contains about the other, or can be thought of as the reduction of uncertainty of one random variable given the knowledge of the other. In the context of genetic association studies, mutual information can be very useful to quantify how much of a phenotypic status is explained by genotypic variations. We consider a genetic attribute and the phenotypic class , for example, case or control, are both discrete random variables. The mutual information measures the reduction in the uncertainty of the class C due to the knowledge about the genotype of (figure 1A), defined as
Further extension of the information-gain measure on more than two genetic attributes is non-trivial for epistasis studies because many complex human diseases could very likely involve genetic interactions of orders higher than two way. There is no widely accepted formal definition of information gain including genetic attributes higher than two. Here, we make an effort measuring three-way synergistic interactions using information gain.
In a previous attempt, Anastassiou14 and Varadan et al24 proposed to define the three-way information gain by comparing the integrated joint mutual information to the best-achieved subsets mutual information after breaking the whole into partitions, mathematically written as
The partition of the set chosen in this formula is the one that maximizes the sum of the amounts of mutual information connecting the subsets with the phenotypic class. This is referred to as ‘maximum-information partition’ of the set with respect to the class . This three-way quantifies the information that can be gained by combining and together comparing to its maximum-information partition. Although technically sound, this formula might include false-positive errors of pure three-way epistasis. For instance, assuming and is the maximum-information partition, after combining with , the gained information could be the result of either the pure three-way epistasis, or the pairwise epistasis between and one (or both) of , or the mixture of all above.
A more strict alternative measure23 was proposed as follows
We only subtract pairwise synergies, that is, positive information gain, because the failure of is due to the fact that it adds back information by subtracting negative information gain. By subtracting all lower-order effects and synergies, measures the pure three-way synergy that is observable only by considering three attributes together.
Also note that, when applying to genetics data, the above mutual-information and information-gain measures can be normalized by dividing the class entropy . The normalized measures range from , and provide the percentage of explaining the phenotypic class by giving the knowledge of one or multiple genetic attributes.
We applied both three-way synergy measures and to synthetic datasets and a real dataset on pulmonary tuberculosis from a West African population. The synthetic datasets were generated using genetic architecture model emulator for testing and evaluating software (GAMETES),26,27 a direct approach to simulating bi-allelic n-locus epistatic models. In GAMETES, an n-locus epistasis model is generated deterministically with specified genetic constraints such as heritability and minor allele frequencies, and then a population of samples can be generated for that model. Using GAMETES, we can have synthetic data of models with epistasis at desired orders, which are ideal for testing and comparing the two three-way synergy measures. We first generated a pure three-way epistasis model , where the association to phenotypic status was only observable when all three SNPs were considered together, that is, no main effects and no pairwise interactions. The total heritability of combining three SNPs was set to 0.27, and the minor allele frequencies of all three SNPs were set to 0.2. A corresponding dataset was generated by including 100 SNPs in total with 97 randomized SNPs, , to provide a null distribution. This synthetic dataset had 400 cases and 400 controls. Then we used GAMETES to generate a collective pairwise epistasis model , in which any two of the three attributes had strong pairwise interactions but there was no epistasis at the three-way level. Again there was no main effect for each attribute. The heritability of combining three SNPs was set to 0.32, and the heritability of combining any two SNPs, that is, , and , was about 0.1. The minor allele frequencies for all three SNPs were again set to 0.2. The second synthetic dataset also included 100 SNPs in total with 97 randomized SNPs and had 400 cases and 400 controls.
The pulmonary tuberculosis data are from a case–control study25 conducted at the Bandim Health Project in Bissau, the capital city of Guinea-Bissau. This area has a high prevalence of pulmonary tuberculosis and tuberculosis symptoms.28 Tuberculosis is one of the highest mortality diseases due to the infection of Mycobacterium tuberculosis. However, the majority of infected individuals keep the bacterium under control and never develop a clinical disease. Genetic variation among individuals is a promising direction to look into the factors that could influence the susceptibility to develop tuberculosis. Tuberculosis patients included in this study were residents or long-term guests of Bissau aged 15 years or older. From November 2003 to November 2005, 438 tuberculosis patients were screened at local health centers. A total of 344 subjects met the inclusion criteria and accepted participation, and DNA samples were successfully collected from 321 of them. Healthy controls were recruited from the study area with certain exclusion criteria, such as history of tuberculosis and household tuberculosis records within the past 2 years. Three hundred and forty-seven DNA samples were obtained from the healthy control group. All DNA samples were extracted using a standard salting-out procedure. DNA purities were estimated spectrophotometrically, and final concentrations were determined by PicoGreen. Samples (4 ng of DNA) were genotyped by TaqMan SNP assays (ABI; Applera International Inc, Foster City, California, USA) in 10 μl reaction volume, using the Rotor-Gene 3000 (Corbett Robotics Pty Ltd, Brisbane, Queensland, Australia) and the ABI 7500 real-time PCR systems. Fluorescence curves were analyzed with the Rotor-Gene Software V.6 and the 7500 Sequence Detection Software V.1.2.1 for allelic discrimination. The data include 19 SNPs from innate immunity genes, DC-SIGN (CD209), long pentraxin 3 (PTX3), toll-like receptors (TLRs), and vitamin D receptor (VDR), which are relevant to the defense against M tuberculosis. The missing genotypes (<5%) were imputed using a frequency-based method. The missing value of a sample was filled using the most common genotype of the corresponding SNP in the population.
Information-gain measurements on synthetic data
Figure 2A shows the results of applying information-theoretic measures to the first synthetic dataset that had a pure three-way epistasis model. The points in red represent the observed values of the model , and the box plots in black show the measures for the randomized SNPs . Neither main effect nor pairwise synergy was found using the information-theoretic measures as the observed data points do not distinguish from the null distributions. In addition, both three-way synergy measurements and successfully captured the pure three-way epistasis in the model.
The results of the second synthetic dataset with the collective pairwise epistasis model are shown in figure 2B. Again, the points in red represent the observed values of the model , and the box plots in black show the null distributions of 97 randomized SNPs. As we can see, all three pairwise synergies were detected. No three-way epistasis was detected by , but reported a strong three-way epistasis among , , and by including some portions of their pairwise synergies.
Information-gain measurements on real data
We used information-theoretic measures to quantify the main effects, two-way and three-way synergies on all possible combinations of attributes in the tuberculosis data. In addition, to show the collective and neighborhood structures of strong synergistic pairs, we built a pairwise epistasis network18 (figure 3, rendered by software Cytoscape29). The network was constructed through an incrementing process as follows. An edge and its two end vertices were added to the network only if their pairwise epistasis strength was greater than a given threshold. When we gradually decreased the threshold from its maximal observed value to its minimal value, the network started from a single edge with two vertices and eventually became a complete graph, that is, every single vertex is directly connected to every other vertex. We chose the highest pairwise synergy threshold, that is, 0.71%, when all 19 attributes were included in the network. Therefore, we can have a map of the strongest pairwise interactions showing the neighborhood structure of every attribute.
Figure 3 has 19 vertices and 32 edges that represent the top 32 strongest pairwise interactions out of all pairs. Using this network, we can easily identify three-way models that have strong collective pairwise synergies. In graph theory,30 the distance of a pair of vertices and is defined as the minimal number of edges for one vertex to reach the other. Given three vertices , , and , we define their trio distance as the sum of all pairwise distances, that is, . Therefore, for trios with , any two of them are directly joined by an edge, which indicates three strong collective pairwise interactions in a three-way model. If a trio has , one vertex is directly connected to the other two but the other two are not joined by an edge. A trio with does not have strong collective pairwise epistasis.
Figure 4 shows the comparison of the results of both three-way synergy measures. In general, and are positively correlated (Spearman's rank correlation ρ=0.8448, ). All the data points, that is, three-way models, are positioned on one side of the line, which indicates that the measure is always less than or equal to the measure. This is intuitive because subtracts more terms than does from . Colors indicate whether a trio has strong collective pairwise epistasis. As seen in the figure, the discrepancies (away from the line) between and are the most distinguishing for red data points, that is, trios that have either two or three strong collective pairwise interactions. These results also verify our previous discussion on these two three-way synergy measures using synthetic data. That is, probably includes pairwise synergies into its three-way epistasis measure when there are more than one pairwise synergies in a three-way model, and only captures the pure three-way epistasis that are beyond any lower-order synergies or effects.
We also looked into the correlations between the three-way synergies and lower-order synergies or effects (table 1). Both three-way synergy measures do not correlate with individual main effects. However, the three-way shows a correlation with pairwise synergy but does not. This further confirms our previous discussion on the differences between these two three-way synergy measures.
|Three-way IGpartition||Three-way IGstrict|
|Three-way IGpartition||Three-way IGstrict|
The best three-way models using those two synergy measures are reported in figure 5. The figure shows all the individual main effects, pairwise synergies, three-way synergies, and the total mutual information for a three-way model. The best model (figure 5A) includes SNPs rs11465421 from CD209, rs1544410 from VDR, and rs7975232 from VDR. It has the total mutual information 6.695% and three-way 3.668%. This model is clearly a mixed epistasis model by possessing both strong collective pairwise interactions and a pure three-way interaction. The best model (figure 5B) involves SNPs from three different genes, rs5743836 from TLR9, rs2305619 from PTX3, and rs4804803 from CD209. All three SNPs have very limited main effects and pairwise synergies. However, the strict three-way information gain is stronger than all other lower-order effects combined together, and contributes to the total association. An explicit test of epistasis31 was used to assess the statistical significance of those observations. Instead of shuffling the case–control class in a standard permutation test, the explicit test randomly shuffled genotypes of samples within each class. Therefore, the genotype frequencies within each class remained fixed. This preserved the independent main effects while randomizing any non-linear interactions, and provided the null hypothesis that the only genetic effects in the data were linear and additive. We performed the explicit test 1000 times for each observation, and both best models were statistically significant ( for the best and for the best ).
Epistasis is recognized as playing an important role in the genetic architecture of complex traits such as common human diseases. Quantifying interaction effects among multiple loci throughout the human genome has become the major focus of current research for understanding the complex relationship between genetic variations and phenotypic traits.8,9,12 However, this task is challenging due to the fact that first enumerating all possible combinations of genetic variants in a dataset of moderate size is computationally infeasible, and second it is difficult to separate the additive effects and the synergistic effects among multiple genetic factors. The majority of epistasis studies focus on pairwise interactions and rarely look beyond interactions of orders higher than two. It is computationally intensive to enumerate three-way combinations in human genome data and, furthermore, not many good methods of quantifying three-way epistasis are proposed in the literature.14
In the present study, we proposed a fast, model-free, and non-parametric measure to detect and characterize three-way epistatic interactions. It is a natural extension of the pairwise synergy measure using information gain by measuring the total three-way mutual information and then subtracting three main effects and three pairwise synergies. Our approach was shown to be able to detect pure three-way epistasis, that is, no observable effect at either the two-way or one-way level, in both synthetic and real population-based genetics data. Our study is not the first attempt to quantify three-way epistasis using information-theoretic measures. We compared our measure to a previously published one .32 Both measures exclude main effects. However, is biased towards three-way models that possess strong pairwise interactions. Our excludes all one-way and two-way effects and is able to detect pure three-way synergy that is only observable when all three attributes are considered together.
Both three-way synergy measures were applied to a pulmonary tuberculosis dataset from a West African population. Several potentially relevant innate immunity genes, CD209, PTX3, TLRs, and VDR, were included in these data to investigate their role in pulmonary tuberculosis susceptibility. The dendritic cell-specific intercellular adhesion molecule-3-grabbing non-integrin (DC-SIGN or CD209) is a crucial M tuberculosis receptor expressed on the surface of dendritic cells, and is involved in the initiation of innate immune response through identification of potential infectious agents. CD209 has previously been found to be associated with tuberculosis.33,34 Long pentraxin 3 (PTX3) is produced by innate immunity cells and vascular cells in response to proinflammatory signals and TLR engagement.35 PTX3 levels have been shown to be correlated to the degree of infection in lungs, and PTX3 plasma levels can be monitored to measure treatment efficacy because PTX3 concentration decreases as an infection is mitigated.36 TLRs are a family of receptors that are a key component in the innate immune system. TLRs recognize pathogenic molecules and control host immune response against them, and have been extensively proved to impact susceptibility to infectious and inflammatory diseases.37,38 The VDR has been shown to be linked to TLRs. It was found that TLR activation of human macrophages upregulated expression of the VDR and the vitamin D1-hydroxylase genes.39 This link suggests that the difference among human populations' ability in producing vitamin D may contribute to susceptibility to microbial infection. Furthermore, a case–control and family study reported the association between VDR polymorphisms and susceptibility to tuberculosis.40
The strongest three-way epistasis models we found (figure 5) could further extend and strengthen the understanding of how those relevant genes might work in a synergistic way. Although the synergistic effects were captured in a statistical manner, they may indicate either direct or indirect biological relationships among those genetic factors. In particular, the strongest model shows a ‘nested’ epistasis hierarchy with both strong pairwise interactions and a strong pure three-way interaction. These three SNPs are from VDR and CD209, which indicates a potential correlation between these two genes. More interestingly, the strongest model shows a three-way synergy among TLR9, PTX3, and CD209. There have been studies reporting correlations and molecular interactions between TLRs and CD209.41 However, no published research has indicated the three-way biological synergy among all three genes. Our findings may suggest that multiple pathways interweave in the innate immune system to defend the human body against M tuberculosis, and the deficiencies in all those three genetic factors greatly increase the risk of developing tuberculosis. We believe that with further biological validations, our findings could be very helpful in predicting high-risk M tuberculosis-infected individuals and preventing their tuberculosis clinical developments.
Investigating high-order genetic interactions is an arduous task, and yet essential for understanding the complex genetic architecture of human diseases. The effectiveness of our information-gain approach in detecting three-way interactions was verified using both synthetic and real genetics data. Note that our measurement is scalable regardless of the size of genetics data. However, when large-scale genome-wide data are considered and exhaustive enumeration of all possible three-way genetic attribute combinations is infeasible, data pre-screening using intelligent data-mining techniques or biology knowledge will be required. There are some interesting venues to extend our approach in future studies. First, it is important to study the genetic interactions on continuous traits. In that case, the probability density functions for continuous random variables will be used to replace the probability mass function in current discrete information-theoretic measures. Second, it will be interesting to extend the measures on synergies higher than three way. However, as more attributes are involved, the interaction hierarchy gets more complicated. More carefully designed mathematical measures are required. We anticipate that designing powerful and efficient methods to quantify high-order epistasis has the great potential in improving disease treatment and healthcare by revealing the genetic complexity of common human diseases.
This paper has been corrected since it was published Online First. A number of equations have been corrected, and the funding statement has been reworded.
TH designed the study, performed the analyses, and drafted the manuscript. YC participated in the design of the study and drafting the manuscript. JWK participated in the design of the study. RLC participated in the analyses. CW, GS, and SMW conducted the tuberculosis data collection. JHM conceived of the study and participated in its design.
This work was supported by the National Institutes of Health (NIH) of the USA by grants R01-LM009012, R01-LM010098, R01-AI59694, and R01-EY022300 to JHM.
Provenance and peer review
Not commissioned; externally peer reviewed.