Modeling sequence–sequence interactions for drug response

Motivation: Genetic interactions or epistasis may play an important role in the genetic etiology of drug response. With the availability of large-scale, high-density single nucleotide polymorphism markers, a great challenge is how to associate haplotype structures and complex drug response through its underlying pharmacodynamic mechanisms. Results: We have derived a general statistical model for detecting an interactive network of DNA sequence variants that encode pharmacodynamic processes based on the haplotype map constructed by single nucleotide polymorphisms. The model was validated by a pharmacogenetic study for two predominant beta-adrenergic receptor ( (cid:1) AR) subtypes expressed in the heart, (cid:1) 1AR and (cid:1) 2AR. Haplotypes from these two receptors trigger significant interaction effects on the response of heart rate to different dose levels of dobutamine. This model will have implications for pharmacogenetic and pharmacogenomic research and drug discovery. Availability: A computer program written in Matlab can be downloaded from the webpage of statistical genetics group at the University of Florida. online.


INTRODUCTION
The increasing number of genetic studies for complex biological characteristics in humans requires more advanced techniques of statistical analysis. This is due to two reasons. First, genes interact in a complex network to determine the final phenotype of a trait. Second, the phenotypic formation of every trait can be partitioned into its multiple continuous developmental steps on a time scale. Any statistical approaches that attempt to draw a picture of the genetic architecture of a complex phenotype should simultaneously consider the interactions among different genes and the dynamic changes of these interactions during the process of phenotypic formation.
It has been well recognized that a complex trait is controlled by a number of Mendelian-inherited, environmentally sensitive genes, of which some act additively, whereas many others are operational in a multiplicative or compensatory way (Frankel and Schork, 1996;Moore, 2003). Many current statistical techniques used in genetic research assume the additive control of genes (Lynch and Walsh, 1998), aimed to facilitate data analysis and modeling, which may provide misleading results when genetic interactions or epistasis actually occur. The mapping strategy, pioneered by Lander and Botstein (1989), has proven to be powerful for detecting individual genes, known as quantitative trait loci (QTL), that affect a complex trait based on the cosegregation of genotyped markers and the underlying QTL that are bracketed by the markers. QTL identified from this strategy presents a hypothesized chromosomal segment whose DNA sequence is unknown. More recently, Liu et al. (2004) have developed a new strategy that can identify specific DNA sequence variants for a complex trait. This strategy that relies on the recent advent of highthroughput single nucleotide polymorphism (SNP) technologies allows for the genome-wide scan of causal DNA sequences, called quantitative trait nucleotides or QTNs. Lin and Wu (2006) have developed a conceptual framework for detecting interaction effects between different QTNs that contribute to quantitative variation.
Considering the dynamic feature of many complex traits, a series of new statistical models, called functional mapping, have been recently developed to characterize QTL that are responsible for genetic variation in these dynamic traits (Ma et al., 2002;Wu et al., 2004a, b, c). Functional mapping capitalizes on the mathematical functions that describe biological processes and embeds them within the context of genetic mapping theory. By estimating the mathematical parameters that determine the patterns of longitudinal trajectories, functional mapping has proven efficient and effective for unveiling the genetic architecture of complex traits (Wu et al., 2003;Wu and Lin, 2006).
A different but statistically similar example of these so-called time series or longitudinal traits is drug response, a field that has gained much attention due to possible clinical applications, ranging from individualized therapy to new drug development (Arranz et al., 2003). Recent studies have suggested that a variable number of polymorphisms in various genes are involved in modulating the response and/or side effects to drugs (Serretti and Artioli, 2003;Tate et al., 2005). Lin et al. (2005) extended the idea of functional mapping to map and detect individually acting genetic variants for drug response by integrating the underlying pharmacodynamic mechanisms into functional mapping. But the model that can address how epistatic interactions between different QTNs affect the pattern and form of drug response and timing of various events in the response process has not been developed.
The motivation of this study is to develop a statistical method for detecting genetic interactions that control drug response. The new method will incorporate Lin and Wu (2006) QTN-QTN interaction model into the functional mapping framework to investigate the roles of epistatic interactions between different DNA sequence variants in encoding drug response. The epistatic model will take into account the autocorrelation of drug response measured at multiple dosage levels and provide a quantitative framework for a series of hypothesis tests regarding the interplay between genetic actions/interactions and the shape and pattern of drug response. The model can be used to estimate and test a network of genetic interactions, thus enabling clinical geneticists to draw a detailed picture of genetic control and regulation for the pharmacodynamic process of drug response. We will use an example of a pharmacogenetic study for two beta-adrenergic receptor(AR) subtypes expressed in the heart to validate the usefulness of the epistasis model.

The normal mixture model
Approaches for QTL and QTN mapping are statistically similar in the conceptual formation of a mixture model. For QTL mapping, each observation must arise from one of multiple QTL genotypes, although the QTL genotype for this individual is unknown (Lander and Botstein, 1989). The normal mixture model is constructed to contain the possible impact of these QTL genotypes each of which is assumed to follow a normal distribution. As shown in Liu et al. (2004), QTN mapping is also based on a normal mixture model but in which the phenotypic value of a heterozygous individual for two or more SNPs is thought to arise as one of multiple diplotypes constructed by a set of SNPs.
Diplotype difference can be assumed to result from the composition of different haplotypes. We define the haplotype that is different from the rest of haplotypes as reference or risk haplotype (Liu et al., 2004). Those remaining haplotypes are thus defined as non-reference or nonrisk haplotype. Consider a QTN (A) that is composed of a set of SNPs. Let A and A be the risk and non-risk haplotypes for this QTN, respectively, which thus form three different composite diplotypes, AA, A A and A A. In the mixture model, longitudinal observations for each composite diplotype, j, are characterized by a different multivariate normal distribution with mean vector u j and covariance matrix D.
For QTN mapping, only diplotypes that are heterozygous for two or more SNPs contain different composite diplotypes because these diplotypes are not consistent with their phenotypically observable genotypes. Therefore, the mixture model is formulated only for those diplotypes.

Epistatic effects
We developed an interactive model aimed to detect sequence-sequence epistasis. Let B and B be the risk and non-risk haplotypes at a second QTN (B), respectively.  (Lynch and Walsh, 1998). The time or dosedependent vector for the genotypic value (u jAjB ) of a joint composite diplotype at the two QTNs can be decomposed into nine different components as follows: where stand for the composite diplotypes at QTN A and B, respectively, a and d are the additive and dominant effect vectors at the corresponding QTN, respectively, and i, j, k and l are the additive Â additive, additive Â dominant, dominant Â additive and dominant Â dominant epistatic effect vectors between the two QTNs, respectively. If the maximum likelihood estimates (MLEs) of the genotypic value vectors at the left side of Equation (1) can be observed, we can solve the vectors for the overall mean, additive, dominant and four kinds of epistatic effects between two QTNs by

Likelihood and estimation
The mixture model used to map QTNs includes the proportions of each mixture component and the probability distribution functions of phenotypic observations given for that component. The mixture proportions are the relative frequencies of those diplotypes that are phenotypically the same, and they can be expressed as a function of haploid frequencies.
Suppose there are R and S SNPs for QTN A and B, respectively. The two alleles, 1 and 0, at each of these SNPs are symbolized by k 1 , . . . ; k R and l 1 , . . . ; l S , respectively. A haplotype frequency is denoted as p A k1k2ÁÁÁkR for QTN A and p B l1l2ÁÁÁlS for QTN B. In this article, two QTNs are assumed to be independent in their frequency. Thus, across-QTN haplotype frequencies can be calculated as the product of the corresponding haplotype frequencies from a different QTN, expressed as where, the parentheses are used to separate two different QTNs for a given across-QTN haplotype. With these across-QTN haplotype frequencies, expected across-QTN diplotype frequencies and across-QTN genotype frequencies can be calculated, respectively, under Hardy-Weinberg equilibrium (Lynch and Walsh, 1998;Wu et al., 2002).
With across-QTN diplotype and genotype frequencies, the likelihood function based on across-QTN genotype observations, can be constructed. We simplify the presentation of our model by assuming two SNPs for each QTN. In Supplementary Table 1, all possible genotypes and diplotypes are given as well as their frequencies at two SNPs genotyped from a QTN (say A). Each diplotype is composed of two haplotypes, one from the mother and the other from the father. The diplotype frequencies can be expressed in terms of the haplotype frequencies. The same genotype may contain two different diplotypes, depending on its heterozygosity. Supplementary Table 1 also provides the relative frequencies with which a genotype carries a particular haplotype. Such relative frequencies will be useful for partitioning observed genotypes into underlying diplotypes in the construction of likelihood function based on the phenotype. Let Þ be the population genetic parameters to be estimated for QTN A and B. Lin and Wu (2006) formulated a loglikelihood function of these unknown haplotype frequencies given observed genotypes (n) in a multinomial form. They also provided a series of closed-form solutions for the expectation-maximization (EM) algorithm to estimate these haplotype frequencies.
Haplotype frequencies can be expressed as a function of allelic frequencies and linkage disequilibria (LD). For a two-SNP haplotype for QTN A, we have where, D A is the linkage disequilibrium between the two SNPs at QTN A. Thus, once haplotype frequencies are estimated, we can estimate allelic frequencies and LD by solving Equation (4). Similar calculations are also done for QTN B. While traditional models assume the association between the genotype and drug response (Gong et al., 2004), our model can estimate the effects of different diplotypes on the pharmacodynamic response of drugs. A particular four-SNP genotype from two QTNs, expressed as ðk 1 k 0 1 =k 2 k 0 2 Þðl 1 l 0 1 =l 2 l 0 2 Þ, can be partitioned into four possible diplotypes, ½ðk 1 k 2 Þðl 1 l 2 Þ½ðk 0 1 k 0 2 Þðl 0 1 l 0 2 Þ, ½ðk 1 k 2 Þðl 1 l 0 2 Þ½ðk 0 1 k 0 2 Þðl 0 1 l 2 Þ, ½ðk 1 k 0 2 Þðl 1 l 2 Þ½ðk 0 1 k 2 Þðl 0 1 l 0 2 Þ and ½ðk 1 k 0 2 Þðl 1 l 0 2 Þ ½ðk 0 1 k 2 Þðl 0 1 l 2 Þ. Let : q be quantitative genetic parameters that specify the mean vectors of different diplotypes and the residual covariance matrix. The likelihood of unknown population (: p ) and quantitative genetic parameters (: q ) given a drug response measured at C dosage levels, y ¼ fyð1Þ, . . . ; yðCÞg, and SNP genotype observations, n, can be constructed (see Supplementary Material-Likelihood).
Assume that haplotype [11] (denoted by A) is the risk haplotype for QTN A, and haplotype [10] (denoted B) is the risk haplotype for QTN B, respectively. As shown above, this leads to nine different across-QTN composite diplotypes. Equation (1) provides the structure of an arbitrary across-QTN composite diplotype formed by the risk and non-risk haplotypes. For a given composite diplotype j A j B , we have a multivariate normal distribution expressed as f jAjB ðy i ; : q Þ ¼ 1 At a particular dosage c, the relationship between the observation and genotypic value can be described by a regression model, where, ijAjB is the indicator variable denoted as 1 if a composite diplotype j A j B is considered for subject i and 0 otherwise and e i (c) is the residual error (i.e. the accumulative effect of polygenes and errors). The error term is specified by a structured antedependence (SAD) model (Zhao et al., 2005), in which the error at a particular dosage level is determined by the error at a previous dosage level and innovative error that occurs at the current dosage level (Gabriel, 1962). Ultimately, the covariance matrix in Equation (5) is described by two parameters, antedependence parameter and innovation variance.

Modeling the mean-covariance structure
Traditional genetic mapping for multiple traits attempts to estimate each element in the mean vector and the covariance matrix. This may not be efficient and effective for two reasons. First, when dosage level c is high, an exponentially increasing number of parameters need to be estimated. Second, this approach does not consider biological principles underlying pharmacodynamic responses. The mainstay of modeling drug response is the Hill, or sigmoid Emax, equation, which postulates the following relationship between drug concentration (c) and drug effect (E) (Giraldo, 2003) E where E max is the asymptotic (limiting) effect, EC 50 is the drug concentration that results in 50% of the maximal effect and H is the slope parameter that determines the slope of the concentration-response curve. The larger H, the steeper the linear phase of the logconcentration-effect curve. When the effect is a continuous variable, estimates of E max , EC 50 and H are usually obtained by extended least squares or iteratively reweighted least squares when there is sufficient data for analysis of individual subjects. When sparse data are pooled from multiple patients, then a population analysis is a better approach. Different from such a traditional treatment, we will estimate these curve parameters separately for different composite diplotypes. The composite diplotype-specific mean vectors in Equation (5) will be modeled by the E max model, expressed as where the mathematical parameters ðE maxjAjB ; EC 50jAjB ; H jAjB Þ, denoted by ? jAjB , describe the drug response profile for composite diplotype j A j B . Thus, based on Equation (8), our estimates will be concentrated on ? jAjB rather than on u jAjB . This modeling of the mean vectors has two advantages: (1) clinically meaningful curves are used in genetic mapping so that the results will be closer to biological reality, and (2) the number of parameters to be estimated is reduced, thus increasing the power of the model to detect significant QTN and their interactions. It is not parsimonious to estimate all the elements in the withinsubject covariance matrix among different dose levels because some structure exists for dosage-dependent variances and correlations. The structure of the residual covariance matrix in (5) can be modeled by Zimmerman and N uñez-Ant on's (2001) SAD model. Using matrix notation, the error term in (6) can be expressed as where e ¼ ½eð1Þ, . . . ; eðCÞ 0 , ¼ ½ð1Þ, . . . ; ðCÞ 0 and for the first-order SAD or SAD(1) model The variance-covariance matrix of the longitudinal trait is then expressed as where, D is the innovation variance-covariance matrix and is expressed as: The closed forms for the inverse and determinant of matrix D help to estimate the parameters, : v ¼ ð 1 , . . . ; r ; 2 Þ, that model the matrix. The vector ? jAjB and : v form the quantitative genetic parameters : q .

HYPOTHESIS TESTS
With our epistatic model, we can make a number of hypothesis tests regarding the genetic control of overall drug response to a spectrum of dosages and other clinically important events. Two major hypotheses can be made in the following sequence: (1) the association between different SNPs within each QTN by testing their LD, and (2) the significance of an assumed across-QTN risk haplotype for its effect on drug response. The LD between two given SNPs within QTN A can be tested using the following hypotheses: The log-likelihood ratio (LR) test statistic for the significance of LD is calculated by comparing the likelihood values under the H 1 (full model) and H 0 (reduced model) using where, the tilde and hat denote the MLEs of unknown parameters under H 0 and H 1 , respectively. The LR A D is considered to asymptotically follow a 2 distribution with one degree of freedom. The MLEs of allelic frequencies under H 0 can be estimated using the EM algorithm described above, but with the constraint p A 11 p A 00 ¼ p A 10 p A 01 . A similar test can be made for QTN B.
Diplotype or haplotype effects on a complex disease can be tested using null and alternative hypotheses expressed as H 0 : ? jAjB ¼ ?ðj A ; j B ¼ À1; 0; 1Þ H 1 : at least one equality in H 0 does not hold ð13Þ The log-likelihood ratio test statistic (LR E ) under these two hypotheses can be similarly calculated. Unlike the significance test used in traditional QTL mapping (Lander and Botstein, 1989), the H 0 of hypothesis (13) for QTN haplotyping does not contain a nonidentifiable parameter and, therefore, the distribution of the LR calculated can be thought to asymptotically follow a 2 distribution with 24 degrees of freedom. In practice, if the sample size used is not adequately large, the permutation test approach proposed by Churchill and Doerge (1994), which does not rely upon the distribution of the LR E , may be used to determine the critical threshold for determining the existence of a QTN.
Different genetic effects, such as the additive (a A and a B ), dominant (d A and d B ) and additive Â additive (i aa ), additive Â dominant (i ad ), dominant Â additive (i da ) and dominant Â dominant effects (i dd ) between QTN A and B can also be tested individually. The critical thresholds for these individual effects can be obtained from a 2 -distribution table with one degree of freedom. But they can be empirically determined on the basis of simulation studies if the sample size used is not adequately large.
Our model allows for the tests of different outcomes for drug response. For overall drug response curves, hypothesis tests based on mathematical parameters can be informative for the characterization of genetic control. One alternative test can be performed on the area under curve (AUC). The AUC can be calculated by taking integral for each composite diplotype, expressed as The null hypothesis based on the AUC can be formulated as AUC jAjB ¼ AUC. The permutation tests can be used to determine the critical value for AUC-related hypothesis tests.

Worked example
To show how our model works in practice, we use it to analyze a real example from a pharmacogenetic study for drug response. Numerous genes have been investigated to affect heart function (Chagnon et al., 2003;Mason et al., 1999). The 1AR and 2AR genes are two such examples (Green et al., 1995;Large et al., 1997) in each of which there are several polymorphisms common in the population. Two common polymorphisms are identified at codons 49 (Ser49Gly) and 389 (Arg389Gly) for the 1AR gene and at codons 16 (Arg16Gly) and 27 (Gln27Glu) for the 2AR gene, respectively. The polymorphisms in each of these two receptor genes are in linkage disequilibrium, which suggests the importance of taking into account haplotypes, rather than a single polymorphism, when defining biological function. This study attempts to detect haplotype variants within these candidate genes which determine drug response.
To determine whether sequence variants at the two polymorphisms from one gene interact with those from the other gene to affect drug response, a group of 155 men and women were investigated with ages from 32 to 86 years old with a large variation in heart rates in response to different dosage levels of dobutamine. Dobutamine was injected into these subjects to investigate their response in heart rate to this drug. The subjects received increasing doses of dobutamine, until they achieved target heart rate response or predetermined maximum dose. The dose levels used were 0 (baseline), 5, 10, 20, 30 and 40 mcg/min, at each of which heart rate was measured. The time interval of 3 min is allowed between two successive doses for subjects to reach a plateau in response to that dose. Of the subjects, 57 reached the thresholds of their heart rates before the highest dosage. As a demonstration of the model, we simply ignored these subjects. Thus, only 98 subjects in whom there were heart rate data at all the six dose levels were included for data analysis.
Each of these patients was determined for their genotypes at codon 49 with two allele Ser49 (A) and Gly49 (G) and codon 389 with two alleles Arg389 (C) and Gly389 (G) within the 1AR gene on chromosome 10 as well as at codon 16 with two alleles Arg16 (A) and Gly16 (G) and codon 27 with two alleles Gln27 (C) and Glu27 (G) within the 2AR gene on chromosome five, and measured for the response in heart rate to dobutamine. Two SNPs from each gene theoretically form 81 across-QTN genotypes, but, because these two genes are independent, the frequencies of these genotypes can be expressed as the product of the genotype frequencies from each gene. The four haplotype frequencies within each gene were estimated, which are used to estimate allele frequencies and linkage disequilibrium at each gene (Supplementary Table  2). Highly significant LD,D 1 ¼ 0:04 for 1AR andD 2 ¼ 0:13 for 2AR, was detected between two SNPs for each gene (P50:001) based on the hypothesis test (12).
By assuming that one haplotype is different from the rest of haplotypes at each gene, this model can detect the risk haplotypes that display significant main and interaction effects  (13). Because of unobserved genotypes at a candidate gene, some composite diplotypes do not exist for several risk haplotype or risk and non-risk haplotype combinations. The maximal LR value (56.32) with all the nine composite genotypes was obtained when GC as the risk haplotype of the 1AR gene is combined with GG as the risk haplotype of the 2AR gene. The LR values for the other combinations range from 4.62 to 40.52. The optimal riskhaplotype combination for drug response is statistically tested with 1000 permutation tests which obtained the critical threshold value of 54.08 at the 5% significance level. This test suggests that there exist significant haplotype effects at these two candidate genes on heart rate curves.
Statistical tests for individual genetic effects for DNA sequence variants based on Equation (1) suggest that the additive and dominant effects at each of the candidate genes are highly significant. Four kinds of epistasis between the two genes are all significant at the 0.0001 significance level (Table 2). Figure 1 displays the profiles of heart rate to increasing concentration levels of dobutamine for nine across-QTN composite genotypes drawn with the estimated response parameters given in Supplementary Table 2. Considerable over-crossing among different curves suggests sequence-sequence interactions within 1AR and 2AR.

Simulation
This model has been examined through simulation studies (Supplementary Material-Simulation Result). The results suggest that the model provides reasonably precise estimates of the population genetic parameters, and the parameters that model drug response for different diplotypes and the covariance matrix of drug response measured at different dosage levels. The model is quite stable and robust in which reasonable power for QTN detection can be obtained for a small sample size (100) or a small heritability (0.1).

Concluding remarks
Given a recent upsurge of interest in pharmacogenetics and pharmacogenomics, there is a pressing need for the development of statistical models for unraveling the genetic etiology of pharmacological variation. QTL mapping, by superimposing real biological phenotypes on genome sequence, structural polymorphisms and gene expression data, can provide an unbiased view of the network of gene actions and interactions that build a complex phenotype-like drug response. Through an integrated approach, studies can move to characterize the best drug, the best dosage of the drug and the best injection time for individual patients based on their genetic structure. By incorporating genetic tests into the prescription process, physicians might improve outcomes for patients. Genetically, drug response is a complex trait, involving multiple biochemical pathways each controlled by different interacting genes (Marchini et al., 2005). Traditional QTL mapping can probe the important chromosomal segments for drug response, but is difficult to identify the causal DNA sequences of these segments. The completion of the human genome project makes it possible to genome-wide associate DNA sequence variants with complex phenotypes. The model proposed in this article has capacity to characterize the effects of genetic factors and their epistatic effects on drug response at the DNA sequence level.
There are several ways in which our model may be extended. First, the model can be modified to consider different interacting QTNs that are in linkage disequilibrium by estimating more disequilibrium parameters. Noureddine et al. (2005) identified the linkage disequilibrium between modifier genes and genes for Parkinson disease. Second, with the principle of two-SNP interactions, we can model an arbitrary number of SNPs whose sequences are associated with the phenotypic variation. As shown in Lin and Wu (2006), the multi-SNP sequencing model will encounter the problem of many haplotypes and haplotype pairs. An AIC-or BIC-based model selection strategy (Burnham and Andersson, 1998) has been framed to determine the haplotype that is most distinct from the rest of haplotypes in explaining quantitative variation. Third, our model is based on the data set without genotyping errors. It has been well observed that even small genotyping errors can substantially reduce power to genetic association (Gordon et al. 2002;Kang et al. 2004). The impact of genotyping errors and misclassification on the detection and estimation of sequence-sequence interactions deserves in-depth investigations. Dobutamine concentration (mcg) Heart rate increase (bpm) Fig. 1. Nine profiles of heart rate in response to different dosages of dobutamine, each representing a different across-QTN composite diplotype (foreground), identified for SNPs within two genes. The profiles of 98 studied subjects are also shown (background).