Abstract

Protein sequence data arise more and more often in vaccine and infectious disease research. These types of data are discrete, high-dimensional, and complex. We propose to study the impact of protein sequences on binary outcomes using a kernel-based logistic regression model, which models the effect of protein through a random effect whose variance–covariance matrix is mostly determined by a kernel function. We propose a novel, biologically motivated, profile hidden Markov model (HMM)-based mutual information (MI) kernel. Hypothesis testing can be carried out using the maximum of the score statistics and a parametric bootstrap procedure. To improve the power of testing, we propose intuitive modifications to the test statistic. We show through simulation studies that the profile HMM-based MI kernel can be substantially more powerful than competing kernels, and that the modified test statistics bring incremental gains in power. We use these proposed methods to investigate two problems from HIV-1 vaccine research: (1) identifying segments of HIV-1 envelope (Env) protein that confer resistance to neutralizing antibody and (2) identifying segments of Env that are associated with attenuation of protective vaccine effect by antibodies of isotype A in the RV144 vaccine trial.

1. Introduction

A protein sequence is a string of letters, with each letter representing an amino acid. There are 20 kinds of amino acids, each distinct from others, but some are more alike than others. One approach to include protein sequence covariates in the regression framework is to encode each amino acid as a point in the Euclidean space, also known as vectorialization. Two common encoding methods are: (1) mismatch encoding, which turns each letter into a categorical variable with 20 levels, one for each amino acid and (2) properties encoding, which represents each amino acid as a 5–20-variate vector where each dimension of the vector corresponds to one physicochemical property. The length of a protein or a fragment of a protein with functional significance varies from 10 to over 1000 amino acids, which translates to hundreds to thousands of variables in the Euclidean space. Interaction between amino acids is also expected to be important since a protein functions as a 3D structure. As such, a weakly parametric approach to modeling the effect of protein sequences is more attractive than a parametric approach.

We propose to study protein sequences via the following kernel-based logistic model:  
\[\mbox {logit}\{ \Pr (Y_{i}=1) \} ={\boldsymbol{{{x}}}}_{i}^{\rm T}\boldsymbol {\beta }+h({\boldsymbol{{{z}}}}_{i})\quad {\text {and}}\quad {\boldsymbol{{{h}}}}\sim {N}(\boldsymbol {0},\lambda K_{\rho }),\]
(1.1)
where |$i$| indexes the unit of observation, |$Y_{i}$| is the binary outcome of interest, |${\boldsymbol{{{z}}}}_{i}$| denotes the |$i$|th protein sequence, and |${\boldsymbol{{{x}}}}_{i}$| is a vector of covariates other than sequences. |${\boldsymbol{{{h}}}}=\{ h({\boldsymbol{{{z}}}}_{1}),\ldots ,h({\boldsymbol{{{z}}}}_{n})\} ^{\rm T}$|⁠, |$\lambda$| is a non-negative scalar, and |$K_{\rho }$| is a square matrix defined through a symmetric positive semi-definite kernel function |$f_{\rho }$|⁠: |$K_{\rho }[i,j] =f_{\rho }({\boldsymbol{{{z}}}}_{i}, {\boldsymbol{{{z}}}}_{j})$|⁠. Many but not all kernels depend on a scale or bandwidth parameter, denoted as |$\rho$| here. Model (1.1) defines a generalized linear mixed model, where |$\boldsymbol {\beta }$| is the fixed effect and |$h({\boldsymbol{{{z}}}}_{i})$| is the random effect. Unlike traditional random effects models where the random effect usually has a group structure and the variance–covariance matrix of |${\boldsymbol{{{h}}}}$| is block-diagonal, here all random effects are correlated with each other. Model (1.1) has been previously proposed to test for goodness-of-fit (le Cessie and van Houwelingen, 1995) and to test the effect of human genetic variation (Liu and others, 2008), among other applications.

A test of the variance component |$\lambda =0$| answers the question of whether the protein sequence |${\boldsymbol{{{z}}}}$| is associated with the outcome. A challenge here is the presence of a scale parameter in the variance model that vanishes under the null hypothesis. This has sometimes been referred to as the “Davies problem” in the literature (Davies, 1987). A common approach is to choose a score test statistic, standardize it to a common scale, and take the maximum over a grid of values of the scale parameter. To obtain the reference distribution for the maximum of score statistics, several authors have proposed tests based on bootstrap methods (e.g. Sinha, 2009). We follow these authors in proposing a parametric bootstrap approach for carrying out the hypothesis testing. We also propose a new and improved way to standardize the score statistic. Our simulation studies show that our testing procedure has the correct type I error rate and is moderately more powerful than existing procedures when the powers are under 80%.

The rest of the paper is organized as follows. We introduce a kernel for studying protein sequences in Section 2 and present the details of the testing procedure in Section 3. Section 4 contains two simulation studies, and Section 5 presents the analysis of two motivating datasets of importance to HIV-1 vaccine research. We end with a discussion in Section 6.

2. Profile hidden Markov model mutual information kernel

As with all kernel methods, the power of testing |$\lambda =0$| based on (1.1) critically depends on the choice of kernel function. A good kernel function should capture the similarity between two observations in the most domain-relevant fashion possible, particularly for complex data types such as protein sequences. Proteins sharing sequence similarity and evolution origin are called homologs. Because homologs tend to perform similar biological functions, identifying them received substantial attention at the start of model organism genome sequencing projects. Many methods have been developed, and some of the most successful ones are underpinned by the profile hidden Markov model (HMM) (Eddy, 1998). Using profile HMM to model a collection of protein sequences requires multiple sequence alignment. The goal of aligning the sequences is to insert gaps at select places between amino acids for each protein to optimize a criterion function that measures both similarity between sequences and penalty associated with gap creation and extension. Multiple sequence alignment is a difficult and important task, and many tools have been developed for it, the best of which produce “good-enough” solutions and are widely used in practice (e.g. Edgar, 2004).

A profile HMM used in homologous sequence identification has three sets of states: match, delete, and insert. For our purpose of using a profile HMM to model a multiple sequence alignment, an HMM architecture without the insert states is more appropriate (Mitchison, 1999; Fong and others, 2010). Figure 1 shows a profile HMM with |$2L+2$| states: a begin state, an end state, a sequence of |$L$|match states, and a sequence of |$L$|delete states. To generate a protein sequence from this model, we start at the begin state. A random decision is made to enter either the first match state or the first delete state. If we are in a match state, we are to emit an amino acid by simulating from a multinomial random variable of one trial with 20 categories. On the other hand, in a delete state, no amino acids are generated. We then make another random decision to transition to another state. This process ends when we are in the end state.

Fig. 1.

Profile HMM with |$L=3$|⁠. |$\mathbf {b}$| and |$\mathbf {e}$| denote the begin and end states, respectively.

Fig. 1.

Profile HMM with |$L=3$|⁠. |$\mathbf {b}$| and |$\mathbf {e}$| denote the begin and end states, respectively.

Our profile HMM is parameterized by two sets of transition probabilities and one set of emission probabilities. The emission probabilities are the means of the multinomial random variables associated with the match states. Denote the set of transition probabilities exiting from match states by |$\boldsymbol {\tau }=\{ \boldsymbol {\tau }_{1},\ldots ,\boldsymbol {\tau }_{L}\}$|⁠, i.e. |$\boldsymbol {\tau }_{l}=P(s_{l}|s_{l-1}={\text {match}})$|⁠. Since there are two possible states (match or delete) at each position, |$\boldsymbol {\tau }_{l}$| is a probability vector of length 2, with elements that sum to 1. Similarly, let |$\boldsymbol {\nu }=\{ \boldsymbol {\nu }_{1},\ldots ,\boldsymbol {\nu }_{L}\}$| denote the set of transition probabilities exiting from delete states, i.e. |$\boldsymbol {\nu }_{l} =P(s_{l}|s_{l-1}={\text {delete}})$|⁠. Finally, let |$\boldsymbol {\xi } =\{ \boldsymbol {\xi }_{1},\ldots ,\boldsymbol {\xi }_{L}\}$| denote the set of emission probabilities, where |$\boldsymbol {\xi }_{l}$| is a probability vector of length 20 that sums to 1. Let |$\boldsymbol {\theta }$| refer to all the parameters |$(\boldsymbol {\tau },\boldsymbol {\nu },\boldsymbol {\xi })$|⁠. The likelihood of the protein sequence can be written as a product of transition and emission probabilities as in Fong and others (2010).

To build a kernel from a probabilistic model of protein sequences, we adopt the general framework of mutual information (MI) kernels (Seeger, 2002). To use the MI kernels framework, we need a two-level hierarchical model. The first level is the profile HMM, and the second level is the distributional assumptions we make on the parameters of the profile HMM. In Seeger (2002), the distributions in the second level are termed the mediator distributions. Let |${\boldsymbol{{{z}}}}_{1}$| and |${\boldsymbol{{{z}}}}_{2}$| be two protein sequences. Let |$\pi (\boldsymbol {\theta })$| be the pdf of the mediator distribution. Let |$M({\boldsymbol{{{z}}}} _{1},{\boldsymbol{{{z}}}}_{2}) =\int f({\boldsymbol{{{z}}}}_{1}|\boldsymbol {\theta })f({\boldsymbol{{{z}}}}_{2}|\boldsymbol {\theta })\pi (\boldsymbol {\theta })\,{{\rm d}}\boldsymbol {\theta }$| and |$M({\boldsymbol{{{z}}}}_{1})=\int f({\boldsymbol{{{z}}}} _{1}|\boldsymbol {\theta })\pi (\boldsymbol {\theta })\,{{\rm d}}\boldsymbol {\theta }$|⁠. The MI score is defined as |$J({\boldsymbol{{{z}}}}_{1}, {\boldsymbol{{{z}}}}_{2}) =\log [M({\boldsymbol{{{z}}}} _{1},{\boldsymbol{{{z}}}}_{2}) /\{ M({\boldsymbol{{{z}}}}_{1}) M({\boldsymbol{{{z}}}}_{2})\} ]$|⁠. |$J( {\boldsymbol{{{z}}}}_{1},{\boldsymbol{{{z}}}}_{2})$| can be viewed as a measure of the amount of information |${\boldsymbol{{{z}}}}_{1}$| and |${\boldsymbol{{{z}}}}_{2}$| share via the mediator distribution |$\pi ( \boldsymbol {\theta })$|⁠. To form a positive-definite kernel from the MI score, Seeger (2002) suggested an exponential embedding, which leads to |$K({\boldsymbol{{{z}}}} _{1},{\boldsymbol{{{z}}}}_{2}) =M({\boldsymbol{{{z}}}}_{1},{\boldsymbol{{{z}}}} _{2}) /\sqrt {M({\boldsymbol{{{z}}}}_{1},{\boldsymbol{{{z}}}}_{1}) M({\boldsymbol{{{z}}}}_{2},{\boldsymbol{{{z}}}}_{2})}$|⁠. The kernel can be equivalently written as |${E}_{\boldsymbol {\theta }}\{ f({\boldsymbol{{{z}}}}_{1}|\boldsymbol {\theta })f({\boldsymbol{{{z}}}}_{2} |\boldsymbol {\theta })\} /\sqrt {{E}_{\boldsymbol {\theta } }\{ f^{2}({\boldsymbol{{{z}}}}_{1}|\boldsymbol {\theta })\} {E}_{\boldsymbol {\theta }}\{ f^{2}({\boldsymbol{{{z}}}} _{2}|\boldsymbol {\theta })\} }$|⁠, which shows it is the uncentered Pearson correlation between |$f({\boldsymbol{{{z}}}}_{1}|\boldsymbol {\theta })$| and |$f({\boldsymbol{{{z}}}}_{2}|\boldsymbol {\theta })$| induced by the mediator distribution |$\pi (\boldsymbol {\theta })$|⁠.

The profile HMM MI kernel thus defined has two weaknesses. First, even though the kernel incorporates biological knowledge about protein sequence evolution, it is constructed without regard to the dependent variables in specific regression problems. As it is reasonable to think that different outcomes may be impacted to different degrees by the protein sequence evolution, the performance of the kernel can be improved by introducing an outcome-dependent element. Secondly, the profile HMM makes the assumption that the transition probabilities and the emission probabilities are assumed independent across different columns of the multiple sequence alignment. While it is possible to extend the profile HMM to allow correlation between these probabilities, there is not much prior knowledge that can be used to integrate out this aspect of the model, partly because the correlation can be application-specific. Based on these considerations, we introduce an extra parameter |$\rho$| into the kernel. |$K_{\rho }({\boldsymbol{{{z}}}} _{1}, {\boldsymbol{{{z}}}}_{2}) =\{ K({\boldsymbol{{{z}}}}_{1}, {\boldsymbol{{{z}}}}_{2})\} ^{\rho ^{-2}}$|⁠. Hereafter, we will refer to this kernel as the profile HMM MI kernel.

The choice of the mediator distribution |$\pi (\boldsymbol {\theta })$| is central to the performance of the kernel. For the task of identifying homologous protein sequences, the profile HMM is always used together with a prior on |$\boldsymbol {\theta }$| to bring in biological knowledge, and we adopt the prior as the mediator distribution in the profile HMM MI kernel. In the protein sequence analysis literature, a popular prior for the emission probability |$\xi _{l}$| is a 9-component mixture of 20D Dirichlet distributions introduced by Sjölander and others (1996). Popular priors for the transition probabilities are beta priors with their parameters also estimated from a large collection of aligned sequences. The prior on |$\tau _{l}$| often favors transitions from match-to-match over match-to-delete moves, and the prior on |$\nu _{l}$| often favors transitions from delete-to-match over delete-to-delete moves. Finally, the priors for different parameters are commonly assumed to be independent. Given this mediator distribution, we have |$\boldsymbol {\xi } _{l}\sim w_{1}\mathsf {Dir}(\boldsymbol {\alpha }_{\boldsymbol {\xi }} ^{(9)})+ \cdots +w_{9}\mathsf {Dir}(\boldsymbol {\alpha }_{\boldsymbol {\xi }} ^{(9)})$| where |$\mathsf {Dir}(\boldsymbol {\alpha }_{\boldsymbol {\xi }}^{(\cdot )})$| denotes a Dirichlet distribution with parameter |$\boldsymbol {\alpha }^{(\cdot )}$|⁠, |$\tau _{l}\sim \mathsf {Beta}(\boldsymbol {\alpha }_{\tau }),$| and |$\nu _{l}\sim \mathsf {Beta}(\boldsymbol {\alpha }_{\nu })$|⁠. It follows that the |$M({\boldsymbol{{{z}}}}_{1},{\boldsymbol{{{z}}}}_{2})$| of the profile HMM MI kernel has a closed-form expression  
\[M({\boldsymbol{{{z}}}}_{1},{\boldsymbol{{{z}}}}_{2})=\prod _{l=1}^{L}\left \{ \dfrac {B(\boldsymbol {\alpha }_{\tau }+ \boldsymbol {c}_{\tau l})} {B(\boldsymbol {\alpha }_{\tau })}\dfrac {B(\boldsymbol {\alpha }_{\nu }+ \boldsymbol {c}_{\nu l})}{B(\boldsymbol {\alpha }_{\nu })}\sum _{j=1}^{9} w_{j}\dfrac {B(\boldsymbol {\alpha }_{\boldsymbol {\xi }}^{(j)}+ \boldsymbol {c}_{\xi l})}{B(\boldsymbol {\alpha }_{\boldsymbol {\xi }}^{(j)})}\right \} ,\]
where |$\boldsymbol {c}_{\tau l}$| is a vector of length 2 denoting the number of paths, |$\{ 0,1,2\}$|⁠, going through the match state at the |$(l-1)$|th position, |$\boldsymbol {c}_{\nu l}$| is defined similarly for the delete states, and |$\boldsymbol {c}_{\xi l}$| is a length 20 vector denoting the amino acid count at the |$l$|th position and |$B$| is the beta function. If |$\pi (\boldsymbol {\theta })$| is not chosen properly, the hypothesis testing procedure in Section 3 will continue to have the right size, but the power will suffer.

We now compare profile HMM MI kernel with two radial basis function (RBF) kernels based on mismatch encoding and properties encoding. In the latter, we use seven properties from Lohmann and others (1994): surface area, refractivity, hydrophobicity, polarity, residue volume, bulkiness, and hydrophilicity. Given two vectors |${\boldsymbol{{{z}}}}_{i}$| and |${\boldsymbol{{{z}}}} _{j}$|⁠, RBF kernel function is |$K_{\rho }[i,j] =\exp (-\rho ^{-2}\Vert {\boldsymbol{{{z}}}}_{i}-{\boldsymbol{{{z}}}}_{j}\Vert ^{2})$|⁠. The three kernels for protein sequences of length 1 are shown in Figure 2. It is reassuring to see that there is a similar pattern between properties RBF kernel and profile HMM MI kernel. For example, two amino acids with nonpolar, neutral side chains, Methionine (M), and Leucine (L), are similar to each other according to both kernels. However, profile HMM MI kernel and properties RBF kernel differ in several aspects. The profile HMM MI kernel appears to be less “absolute”, e.g. Alanine (A) shares commonality with more amino acids under this kernel. The difference depends on what physicochemical properties are used to form the latter kernel and what mediator distribution is used to form the former kernel, which highlights another important difference between the two kernels. For the profile HMM MI kernel, the choice of the mediator distribution can be driven by prior domain knowledge. For example, it may be appropriate to use the distribution of the emission probabilities and transition probabilities from a protein family as the mediator distribution. On the other hand, choosing the right set of physicochemical properties is much harder. Lastly, the two kernels handle gaps differently. While profile HMM MI kernel models gaps in the aligned protein sequences probabilistically by the “silent” delete states, an ad hoc modification has to be made to the properties RBF kernel. For example, one may define the Euclidean distance between the opening gap and the extension gaps in a stretch of gaps and an amino acid to be 3 and 1, respectively.

Fig. 2.

Comparison of kernels. |$\rho$| equals 1 for (a) and (c) and 0.3 for (b).

Fig. 2.

Comparison of kernels. |$\rho$| equals 1 for (a) and (c) and 0.3 for (b).

The profile HMM MI kernel we have proposed is related to the SVM-Fisher method by Jaakkola and others (2000), one of the first methods to use a biologically driven probabilistic model to define a kernel for complex data types. The model used by the SVM-Fisher method has only one level since it treats the parameters of the profile HMM as unknown constants. Its main disadvantage is the lack of a mechanism to bring in prior information through a second level of modeling. In the machine learning literature, another group of kernels exists for protein sequences that includes the composition kernel (Ding and Dubchak, 2001; Cai and others, 2001), the spectrum kernel (Leslie and others, 2002), the string kernel (Leslie and others, 2004; Kuang and others, 2005), and the context-tree kernel (Cuturi and Vert, 2005). These kernels do not make use of the profile HMM, and do not require multiple sequence alignment of the protein sequences, whose time complexity is on the order of |$O(n^{4})$| (Edgar, 2004). Instead, they reduce a protein sequence to a set of counters that count the occurrences of |$k$|-mers (substrings of |$k$| letters), where |$k$| is typically 3 or 4. While creating a multiple sequence alignment to construct a profile HMM MI kernel is computationally prohibitive when the sample size is in the thousands; this type of kernels scales up much better, but usually at the cost of reduced classification accuracy (Cuturi and Vert, 2005). There is also a third group of kernels (Lanckriet and others, 2004; Clark and Radivojac, 2013) that represents protein sequence as time series data.

3. Parametric bootstrap test using a modified maximum of score statistics

Following many in the literature (e.g. le Cessie and van Houwelingen, 1995; Liu and others, 2008), we take a score test approach to testing |$\lambda =0$| in model (1.1). The log restricted maximum likelihood (REML) given both the random effects and the fixed effects is given by  
\[l(\lambda ;\boldsymbol {\beta },{\boldsymbol{{{h}}}})=-\tfrac {1}{2}\log |X^{\rm T} V^{-1}X|-\tfrac {1}{2}\log |V|-\tfrac {1}{2}(\tilde {\boldsymbol {y}} -X\boldsymbol {\beta })^{\rm T}V^{-1}(\tilde {\boldsymbol {y}}-X\boldsymbol {\beta }),\]
where |$\tilde {\boldsymbol {y}}=X\boldsymbol {\beta }+ {\boldsymbol{{{h}}}} +D^{-1}(\boldsymbol {y}-\boldsymbol {\mu })$|⁠, |$V=\mbox {Var} (\tilde {\boldsymbol {y}})=\lambda K_{\rho }+D^{-1}$|⁠, |$D=\mbox {diag} \{ \boldsymbol {\mu }(\boldsymbol {1}-\boldsymbol {\mu })\} ,$| and |$\boldsymbol {\mu }=\mbox {logit}^{-1}(X\boldsymbol {\beta }+ {\boldsymbol{{{h}}}})$|⁠. Taking derivative of the REML with respect to |$\lambda$| and setting |$\lambda =0$|⁠, we get the score |$\mbox {Tr}(PK_{\rho }) -(\boldsymbol {y}-\boldsymbol {\mu }) ^{\rm T}K_{\rho }(\boldsymbol {y}-\boldsymbol {\mu })$|⁠, where |$P=P(D) =D-DX(X^{\rm T}DX)^{-1}X^{\rm T}D$|⁠. If |$\boldsymbol {\mu }$| is known, then the score test can be based on the random quadratic form |$Q_{\rho }=(\boldsymbol {y}-\boldsymbol {\mu })^{\rm T}K_{\rho }(\boldsymbol {y}-\boldsymbol {\mu })$|⁠. As |$\boldsymbol {\mu }$| is unknown, some sort of adjustment has to be made. le Cessie and van Houwelingen (1995) proposed a test statistic based on plugging in the maximum likelihood estimate of |$\boldsymbol {\beta }$| under the null hypothesis: |$\hat {Q}_{\rho }=(\boldsymbol {Y}-\boldsymbol {\hat {\mu }})^{\rm T}K_{\rho }(\boldsymbol {Y}-\boldsymbol {\hat {\mu }})$|⁠, where |$\boldsymbol {\hat {\mu } }=\mbox {logit}^{-1}(X\boldsymbol {\hat {\beta }})$|⁠. Under appropriate regularity conditions, |$\hat {Q}_{\rho }$| is asymptotically equivalent to a sum of weighted chi-squared distributions (Zhang and Lin, 2003). Under further conditions, it also has an asymptotic normal distribution (Jong, 1987). One sufficient condition is that the extra kurtosis of |$\hat {Q}_{\rho }$| tends to 0, an alternative sufficient condition only involving second moments is that the matrix |$(D^{1/2}) ^{\rm T}WD^{1/2}$| has negligible eigenvalues, where |$W=A^{\rm T}K_{\rho }A$| and |$A=I-DX(X^{\rm T}DX) ^{-1}X^{\rm T}$|⁠.

The test statistic |$\hat {Q}_{\rho }$| depends on the unknown parameter |$\rho$| which vanishes under the null hypothesis. Davies (1987) deals with the problem of testing a hypothesis when a nuisance parameter is present only under the alternative hypothesis by formulating a test statistic that assumes the unknown parameter is known and taking the supremum of the test statistics evaluated at a range of values of the unknown parameter. To follow a similar approach here, |$\hat {Q}_{\rho }$| has to be normalized/standardized in some way.

Denote the mean and variance estimators of |$\hat {Q}_{\rho }$| by |$m$| and |$v,$| respectively. To use the normal distribution as a reference distribution for |$\hat {Q}_{\rho }$|⁠, we define |$T_{\rho }^{N}=(\hat {Q}_{\rho }-m)/\sqrt {v}$|⁠. If the sample size is not sufficiently large, the distribution of |$\hat {Q}_{\rho }$| may be better approximated by a scaled chi-square distribution. When using a scaled chi-square |$c\chi _{\kappa }^{2}$| as the reference distribution, the parameters |$c$| and |$\kappa$| can be determined by matching the first two moments of |$\hat {Q}_{\rho }$| and |$c\chi _{\kappa }^{2}$|⁠. We can then either define |$T_{\rho }^{\chi ^{2}}=\mbox {Pr}(\hat {Q}_{\rho }/c\leq \chi _{\kappa }^{2})$|⁠, where |$c=v/(2m)$|⁠, |$\kappa =2m^{2}/v,$| and |$\chi _{\kappa }^{2}$| is a chi-squared distribution with |$\kappa$| degrees of freedom; or use the fact that |$(9\kappa /2)^{1/2}\{ (\chi _{\kappa }^{2}/\kappa )^{1/3}+ ( 9\kappa /2)^{-1}-1\}$| is approximately a standard normal random variable (le Cessie and van Houwelingen, 1995) to define a pivot. Our simulation results show that the performance of the two approaches are similar (not shown) and we will show results from the former only.

By the well-known generalized linear model theory |$\hat {Q}_{\rho }$| can be approximated by |$(\boldsymbol {Y}-\boldsymbol {\mu })^{\rm T}W(\boldsymbol {Y}-\boldsymbol {\mu })$|⁠, and it can be shown that asymptotically |$\hat {Q}_{\rho }$| has mean |$\mbox {Tr}(PK_{\rho })$| and variance |$2\mbox {Tr}(PK_{\rho }PK_{\rho }) + \sum _{i}\{ W_{i,i}^{2}\mu _{2i}^{2}(\mu _{4i}/\mu _{2i} ^{2}-3)\}$|⁠, where |$W_{i,i}$| is the |$i$|th diagonal elements of |$W$|⁠, and |$\mu _{2i}$| and |$\mu _{4i}$| are the second and fourth central moments of |$Y_{i}$|⁠, respectively (e.g. le Cessie and van Houwelingen, 1995). We refer to them by |$m_{0}$| and |$v_{0}$|⁠, respectively. One way to estimate the mean and variance of |$\hat {Q}_{\rho }$| is to replace |$\boldsymbol {\mu }$| with |$\boldsymbol {\hat {\mu }}$|⁠. Denote |$\hat {D}=\mbox {diag}\{ \boldsymbol {\hat {\mu }}(\boldsymbol {1}-\boldsymbol {\hat {\mu }})\}$| and |$\hat {P}=P(\hat {D})$|⁠. The plug-in estimator for the mean of |$\hat {Q}_{\rho }$| is simply |$\mbox {Tr}(\hat {P}K_{\rho })$|⁠. The plug-in estimator for the variance is defined similarly. We refer to the plug-in estimators of the mean and variance by |$m_{\rm I}$| and |$v_{\rm I}$|⁠, respectively. As noted by le Cessie and van Houwelingen (1995), the performance of plug-in estimators may not be optimal.

To improve the finite sample performance of the tests, we propose simple modifications to the mean and variance estimators of |$\hat {Q}_{\rho }$|⁠. Noting that |$\mbox {Tr}(\hat {D})=1^{\rm T}\boldsymbol {\hat {\mu }}-\boldsymbol {\hat {\mu }}^{\rm T}\boldsymbol {\hat {\mu }}$| and |$\boldsymbol {\hat {\mu }}^{\rm T} \boldsymbol {\hat {\mu }}\simeq E(\boldsymbol {\hat {\mu }}^{\rm T}\boldsymbol {\hat {\mu }})=\boldsymbol {\mu }^{\rm T}\boldsymbol {\mu }+ \mbox {Tr}\{ \mbox {Var} (\boldsymbol {\hat {\mu }})\}$|⁠, we define |$\tilde {D}=\hat {D}-\mbox {Tr} \{ \hat {D}X(X^{\rm T}\hat {D}X) ^{-1}X^{\rm T}\hat {D}\}$| and |$\tilde {P}=P(\tilde {D})$|⁠. We propose using |$m_{\rm II}\equiv \mbox {Tr}(\tilde {P}K_{\rho })$| to estimate the mean of |$\hat {Q}_{\rho }$|⁠. Theorem 1 of supplementary material available at Biostatistics online shows that the bias of |$\mbox {Tr}(\tilde {P}K_{\rho })$| is an order less than |$\mbox {Tr}(\hat {P}K_{\rho })$|⁠. For the variance estimator, we propose using an estimated variance of |$\hat {Q}_{\rho }-m_{\rm II}$|⁠, denoted by |$v_{\rm II}$|⁠, in the place of |$v$| in |$T_{\rho }^{N}$| and |$T_{\rho } ^{\chi ^{2}}$|⁠. The variance of |$\hat {Q}_{\rho }-m_{\rm II}$| is given in Theorem 2 of supplementary material available at Biostatistics online. An estimator of the variance can be obtained by replacing |$\boldsymbol {\mu }$| with |$\boldsymbol {\hat {\mu }}$| where appropriate and by using an empirical estimator for the last expectation term.

For any given set of estimators for the mean and variance of |$\hat {Q}_{\rho }$|⁠, we define the test statistics as the maximum of |$T_{\rho }^{N}$| or |$T_{\rho }^{\chi ^{2}}$| over a grid of |$\rho$|'s: |$T_{\max }^{N}=\mbox {sup} \{ T_{\rho }^{N}:\rho \}$| and |$T_{\max }^{\chi ^{2}}=\mbox {sup}\{ T_{\rho }^{\chi ^{2}}:\rho \}$|⁠. Each value of |$\rho$| corresponds to an instance of model (1.1) with a different |$K_{\rho }$|⁠. For both RBF kernels and profile HMM MI kernel, the diagonal elements of |$K_{\rho }$| are all 1, and |$K_{\rho }$| is the correlation matrix of the random effects |$\{ h( {\boldsymbol{{{z}}}} _{i})\} _{i=1}^{n}$|⁠. Intuitively, we would like to have a set of models with different assumptions of the level of correlation among the random effects. Hence, we choose |$\rho$|'s so that the median correlations among the random effects are uniformly spaced between 0 and 1. This is in contrast with choosing a grid uniform in |$\rho$|⁠, e.g. in Liu and others (2008); because |$K_{\rho }$| is not a linear function of |$\rho$|⁠, a grid uniform in |$\rho$| will be far from uniform in |$K_{\rho }$|⁠. The impact of different choices of the density and range of the grid is explored in the simulation studies.

To obtain the |$p$|-value, we carry out a parametric bootstrap-based testing procedure (Stute and others, 1993) as follows, where we use |$T_{\max }$| to refer to both |$T_{\max }^{N}$| and |$T_{\max }^{\chi ^{2}}$|⁠.

  1. Fit a logistic regression model on the observed dataset and obtain |$\hat {\boldsymbol {\beta }}$| and |$T_{\max ,{\rm obs}}$|⁠.

  2. From the |$\hat {\boldsymbol {\beta }}$| of the observed dataset and the observed design matrix and the null model, simulate |$B$| datasets. For each simulated dataset, fit the null model and compute a |$T_{\max ,b}$|⁠, where the subscript |$b$| indexes the simulated datasets.

  3. The |$p$|-value is the proportion of |$T_{\max ,b}>T_{\max ,{\rm obs}}$| in the |$B$| simulated samples.

As |$T_{\rho }^{N}$| is an affine transformation of |$\hat {Q}_{\rho }$|⁠, we expect some power gain by using modified estimators |$m_{\rm II}$| and |$v_{\rm II}$| in the place of |$m$| and |$v$| in |$T_{\rho }^{N}$| to compute |$T_{\max }^{N}$|⁠, because the resulting |$T_{\rho }^{N}$| is more comparable across different |$\rho$|'s. It is not clear whether the same modifications can improve tests based on |$T_{\max }^{\chi ^{2}}$|⁠, as |$T_{\rho }^{\chi ^{2}}$| is a more complicated function of |$\hat {Q}_{\rho }$|⁠, |$m$|⁠, and |$v$|⁠. In Section 4, we show through simulation studies that using the modified estimators can actually help tests based on both |$T_{\max }^{N}$| and |$T_{\max }^{\chi ^{2}}$|⁠.

4. Simulation studies

In the following studies, 10 000 and 2000 replications are performed to evaluate sizes and powers, respectively. In step 2 of the parametric bootstrap-based testing procedure, |$B$| is set to |$2000$|⁠.

4.1. Simulation with Euclidean space covariates

In the first simulation study, we simulate data from the Euclidean space to examine the operating characteristics of the various maximum of score tests. We use a simulation setup from Liu and others (2008). We generate |$n$| observations from |$\mbox {logit}\{ \Pr (Y=1) \} =x+ah(z)$|⁠, with |$h(z)=2(z_{1}-z_{2})^{2}+z_{2}z_{3}+3\sin (2z_{3})z_{4}+z_{5}^{2}+2\cos (z_{4})z_{5}$| and |$x=z_{1}+v/2$|⁠. All |$z$|'s and |$v$| are generated independently from the standard normal distribution. Three different values of |$a$| are used, with |$a=0$| to study the type I error rate and |$a=0.4$| and |$0.8$| to study the power of the tests. We use different values of |$n=50$|⁠, |$100$|⁠, |$200$| to demonstrate the effect of increasing sample size. The null hypothesis is that |$\mbox {logit}\{ \Pr (Y=1)\} =\beta _{0}+ \beta _{1}x$|⁠, the alternative is that |$Y$| also depends on |${\boldsymbol{{{z}}}}$|⁠, and we let |$K_{\rho }$| be based on the RBF kernel.

We first examine the distribution of |$\hat {Q}_{\rho }$| as a function of |$\rho$|⁠. The estimated density curves of |$\hat {Q}_{\rho }$| at |$n=100$| from 10 000 replications along with the best normal approximation and the best chi-square approximation are compared for three values of |$\rho$|⁠: 1, 3, and 10 (Figure C.1 of supplementary material available at Biostatistics online). A small |$\rho$| corresponds to small off-diagonal values in |$K_{\rho }$| and a high effective sample size, and we see that for |$\rho =1$| both the normal distribution and the chi-square distribution offer good approximation to the empirical distribution. A large |$\rho$| corresponds to a low effective sample size and we see that for |$\rho =10$| the scaled chi-square distribution is a reasonable approximation but not the normal distribution. When |$\rho =3$|⁠, the empirical distribution is not approximated well by either distribution, but the scaled chi-square distribution approximates the right tail better than the normal distribution. The type I error rates of different asymptotic distribution-based tests under fixed |$\rho$| are shown in Table B.1 of supplementary material available at Biostatistics online.

We now study the type I error rates and power of maximum of score tests. The grid of |$\rho$|'s is chosen as specified in Section 3. Specifically, let |$\boldsymbol {r}$| be 10 numbers spaced uniformly between |$[0.005,0.995]$|⁠, and |$\boldsymbol {\rho }=\sqrt {{\rm median}(\Vert {\boldsymbol{{{z}}}}_{1}-{\boldsymbol{{{z}}}}_{2}\Vert ^{2}) /\{ -\log (\boldsymbol {r})\} }$|⁠, where |${\rm median}(\Vert {\boldsymbol{{{z}}}}_{1}-{\boldsymbol{{{z}}}}_{2}\Vert ^{2})$| is the median of the pairwise Euclidean distances between |${\boldsymbol{{{z}}}}_{i}$|'s. Simulation results in Table 1 show that the size of the proposed test is close to the nominal value of 5% in all cases. There is a small to modest gain in power by using the modified estimates of the mean and variance of |$\hat {Q}_{\rho }$| (⁠|$m_{\rm II}$|⁠,|$v_{\rm II}$|⁠) rather than the plug-in estimates (⁠|$m_{\rm I}$|⁠,|$v_{\rm I}$|⁠). For example, under |$a=0.8$|⁠, |$T_{\max }^{\chi ^{2}}$|⁠, and |$n=100$|⁠, the power of using (⁠|$m_{\rm I}$|⁠,|$v_{\rm I}$|⁠) and (⁠|$m_{\rm II}$|⁠,|$v_{\rm II}$|⁠) are 69% and 79%, respectively; under |$a=0.4$|⁠, |$T_{\max }^{\chi ^{2}}$|⁠, and |$n=100$|⁠, the power of using (⁠|$m_{\rm I}$|⁠,|$v_{\rm I}$|⁠) and (⁠|$m_{\rm II},v_{\rm II}$|⁠) are 55% and 60%, respectively. The latter difference is statistically significant: a paired two-sample Wilcoxon test yields a |$p$|-value of |$10^{-15}$|⁠. Comparing |$T_{\max }^{\chi ^{2}}$| and |$T_{\max }^{N}$|⁠, we see that the former has a |$2$||$4\%$| advantage over the latter when |$n=50$| or |$100$|⁠. We also investigate the effect of a denser grid of |$\rho$|'s and a different range of |$\boldsymbol {r}$| under sample size 100. The results (Table B.2 of supplementary material available at Biostatistics online) indicate that the conclusions are not sensitive to the choices explored.

Table 1.

Size and power (%) of parametric bootstrap-based maximum of score tests from the first simulation study

|$a=0$|
|$a=0.4$|
|$a=0.8$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$|
|$n$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$|
50 4.7 4.6 4.6 4.7 21 25 19 23 26 35 23 31 
100 4.9 5.1 5.0 5.0 55 60 52 56 69 79 65 76 
200 5.1 5.2 5.3 5.3 96 96 95 95 100 100 99 100 
|$a=0$|
|$a=0.4$|
|$a=0.8$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$|
|$n$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$|
50 4.7 4.6 4.6 4.7 21 25 19 23 26 35 23 31 
100 4.9 5.1 5.0 5.0 55 60 52 56 69 79 65 76 
200 5.1 5.2 5.3 5.3 96 96 95 95 100 100 99 100 

Tests based on |$(m_{\rm II}, v_{\rm II})$| show |$8$|–11|$\%$| gain in power over |$(m_{\rm I}, v_{\rm I})$| when |$a=0.8$| and |$n=50$| or |$100$|⁠.

Table 1.

Size and power (%) of parametric bootstrap-based maximum of score tests from the first simulation study

|$a=0$|
|$a=0.4$|
|$a=0.8$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$|
|$n$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$|
50 4.7 4.6 4.6 4.7 21 25 19 23 26 35 23 31 
100 4.9 5.1 5.0 5.0 55 60 52 56 69 79 65 76 
200 5.1 5.2 5.3 5.3 96 96 95 95 100 100 99 100 
|$a=0$|
|$a=0.4$|
|$a=0.8$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$|
|$n$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$|
50 4.7 4.6 4.6 4.7 21 25 19 23 26 35 23 31 
100 4.9 5.1 5.0 5.0 55 60 52 56 69 79 65 76 
200 5.1 5.2 5.3 5.3 96 96 95 95 100 100 99 100 

Tests based on |$(m_{\rm II}, v_{\rm II})$| show |$8$|–11|$\%$| gain in power over |$(m_{\rm I}, v_{\rm I})$| when |$a=0.8$| and |$n=50$| or |$100$|⁠.

4.2. Simulation with protein sequence covariate

In our second simulation study, we compare the performance of three protein sequence kernels. To simulate the protein sequence data, we first download 328 subtype CRF01_AE HIV-1 V2 region sequences and 172 subtype B HIV-1 V2 region sequences from the HIV sequence database at http://www.hiv.lanl.gov/. We build a profile HMM for each subtype using HMMER (Eddy, 1998). Both profile HMMs have 44 sets of match/delete states. We also construct a hybrid profile HMM by merging the first 22 sets of states from subtype B profile HMM with the last 22 sets of states from subtype CRF01_AE profile HMM. To study the type I error of the tests, we generate all sequences from subtype B and randomly assign each sequence to either |$Y=0$| or |$Y=1$|⁠. To study the power of the tests, we first simulate |$Y$| from a Bernoulli distribution with mean 0.5, and then simulate the protein sequence from subtype B profile HMM if |$Y=0$| and from subtype CRF01_AE profile HMM or the hybrid profile HMM if |$Y=1$|⁠. There are no additional covariates other than sequences.

To use the profile HMM MI kernel, the protein sequences are aligned with MUSCLE (Edgar, 2004). For the properties RBF, seven-properties encoding is used (Lohmann and others, 1994). The grid of |$\rho$|'s is again chosen as specified in Section 3. Specifically, let |$\boldsymbol {r}$| be 10 numbers spaced uniformly between |$[0.005,0.995]$|⁠, and |$\boldsymbol {\rho }=\sqrt {{\rm median}(\log \{ K({\boldsymbol{{{z}}}}_{1},{\boldsymbol{{{z}}}}_{2}) \} ) /\log (\boldsymbol {r})}$|⁠, where |${\rm median}(\log \{ K({\boldsymbol{{{z}}}} _{1},{\boldsymbol{{{z}}}}_{2})\} )$| is the median of |$\log \{ K({\boldsymbol{{{z}}}}_{1},{\boldsymbol{{{z}}}}_{2})\}$| between all pairs of observed data. Different grid choices are also explored and reported in Table B.1 of supplementary material available at Biostatistics online, which indicate that the conclusions are not sensitive to the choices explored.

Table 2 shows the size and power of the tests for all three kernels for sample sizes 20, 50, and 100. In most cases, the sizes are close to the nominal value of 0.05, but it is worth noting that using the profile HMM MI kernel with (⁠|$m_{\rm I}$|⁠,|$v_{\rm I}$|⁠) can be anti-conservative when the sample size is 50 or smaller. For example, at |$n=50$|⁠, the type I error rate is 6.1% using |$T_{\max }^{\chi ^{2}}$|⁠, and the exact binomial test of the rejection rate being 5% is |$10^{-6}$|⁠. The profile HMM MI kernel is substantially more powerful than the two RBF kernels in all combinations. For example, when the sequences are generated from a mixture of the subtype B and hybrid profile HMMs and the sample size is 50, MI kernel rejects 72% of the time, while the two RBF kernels reject 34% and 11%, respectively. Surprisingly, the RBF kernel based on physicochemical properties performs worse than that based on match–mismatch. This could be due to the suboptimal definition of Euclidean distance between the opening/extension gap and an amino acid, but we experimented with several different definitions and did not see an improvement. It could also be due to a suboptimal choice of physicochemical properties used to calculate the distance, although this set of properties is a common choice in the literature (Lohmann and others, 1994).

Table 2.

Size and power (%) of parametric bootstrap-based maximum of score tests from the second simulation study

B only
|$\hbox {B} + \hbox {hybrid}$|
|$\hbox {B} + \hbox {AE}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$|
|$n$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$|
Profile HMM MI 
20 5.9 4.2 5.9 4.3 35 35 38 38 60 64 71 73 
50 6.1 5.1 6.0 5.1 72 73 74 74 86 86 90 90 
100 5.6 5.0 5.6 5.1 85 86 86 86 93 93 95 96 
Mismatch RBF 
20 5.6 4.8 5.6 4.8 11 10 13 13 40 41 48 49 
50 5.5 5.1 5.5 5.1 34 36 36 37 80 80 84 84 
100 5.1 4.9 5.1 4.8 68 71 68  91 91 93 93 
Properties RBF 
20 5.4 4.9 5.4 5.0 17 16 23 22 
50 5.2 4.9 5.1 4.9 11 13 11 13 52 53 54 54 
100 4.9 4.7 4.9 4.8 29 33 27 31 82 81 83 83 
B only
|$\hbox {B} + \hbox {hybrid}$|
|$\hbox {B} + \hbox {AE}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$|
|$n$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$|
Profile HMM MI 
20 5.9 4.2 5.9 4.3 35 35 38 38 60 64 71 73 
50 6.1 5.1 6.0 5.1 72 73 74 74 86 86 90 90 
100 5.6 5.0 5.6 5.1 85 86 86 86 93 93 95 96 
Mismatch RBF 
20 5.6 4.8 5.6 4.8 11 10 13 13 40 41 48 49 
50 5.5 5.1 5.5 5.1 34 36 36 37 80 80 84 84 
100 5.1 4.9 5.1 4.8 68 71 68  91 91 93 93 
Properties RBF 
20 5.4 4.9 5.4 5.0 17 16 23 22 
50 5.2 4.9 5.1 4.9 11 13 11 13 52 53 54 54 
100 4.9 4.7 4.9 4.8 29 33 27 31 82 81 83 83 
Table 2.

Size and power (%) of parametric bootstrap-based maximum of score tests from the second simulation study

B only
|$\hbox {B} + \hbox {hybrid}$|
|$\hbox {B} + \hbox {AE}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$|
|$n$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$|
Profile HMM MI 
20 5.9 4.2 5.9 4.3 35 35 38 38 60 64 71 73 
50 6.1 5.1 6.0 5.1 72 73 74 74 86 86 90 90 
100 5.6 5.0 5.6 5.1 85 86 86 86 93 93 95 96 
Mismatch RBF 
20 5.6 4.8 5.6 4.8 11 10 13 13 40 41 48 49 
50 5.5 5.1 5.5 5.1 34 36 36 37 80 80 84 84 
100 5.1 4.9 5.1 4.8 68 71 68  91 91 93 93 
Properties RBF 
20 5.4 4.9 5.4 5.0 17 16 23 22 
50 5.2 4.9 5.1 4.9 11 13 11 13 52 53 54 54 
100 4.9 4.7 4.9 4.8 29 33 27 31 82 81 83 83 
B only
|$\hbox {B} + \hbox {hybrid}$|
|$\hbox {B} + \hbox {AE}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$T_{\max }^{\chi ^{2}}$|
|$T_{\max }^{N}$|
|$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$||$m_{\rm I}$||$m_{\rm II}$|
|$n$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$||$v_{\rm I}$||$v_{\rm II}$|
Profile HMM MI 
20 5.9 4.2 5.9 4.3 35 35 38 38 60 64 71 73 
50 6.1 5.1 6.0 5.1 72 73 74 74 86 86 90 90 
100 5.6 5.0 5.6 5.1 85 86 86 86 93 93 95 96 
Mismatch RBF 
20 5.6 4.8 5.6 4.8 11 10 13 13 40 41 48 49 
50 5.5 5.1 5.5 5.1 34 36 36 37 80 80 84 84 
100 5.1 4.9 5.1 4.8 68 71 68  91 91 93 93 
Properties RBF 
20 5.4 4.9 5.4 5.0 17 16 23 22 
50 5.2 4.9 5.1 4.9 11 13 11 13 52 53 54 54 
100 4.9 4.7 4.9 4.8 29 33 27 31 82 81 83 83 

Using modified estimates of the mean and variance of the random quadratic form leads to only slight improvement in power than using the plugged-in estimate, presumably because there are no fixed effect covariates. For example, using the mismatch RBF kernel and the |$T_{\max }^{N}$|⁠, at |$n=100$| the power of (⁠|$m_{\rm II}$|⁠,|$v_{\rm II}$|⁠) is 71% while the power of (⁠|$m_{\rm I}$|⁠, |$v_{\rm I}$|⁠) is 68%. The |$p$|-value from a paired two-sample Wilcoxon test comparing testing outcomes is |$6\times 10^{-13}$|⁠. We do not see any improvement in profile HMM MI kernel-based tests when |$n=50$|⁠, and this may be due to the anti-conservative behavior of (⁠|$m_{\rm I}$|⁠, |$v_{\rm I}$|⁠) mentioned before. Comparing |$T_{\max }^{\chi ^{2}}$| and |$T_{\max }^{N}$|⁠, we see that in contrast with simulation study 1, using |$T_{\max }^{\chi ^{2}}$| leads to |$1$||$4\%$| decrease compared with using |$T_{\max }^{N}$| when |$n$| is 50 or 100. The disadvantage can be more severe when |$n$| is 20.

5. Applications in HIV-1 vaccine studies

5.1. Antigen resistance to neutralizing antibody

Neutralizing antibody is a class of antibodies that bind to virus particles and neutralize viral infectivity. Not until recently were broadly neutralizing antibodies for HIV-1 discovered from several HIV-1-infected individuals. One group of antibodies all target the V1/V2 region of the envelope (Env) protein. They achieve 70–80% breadth of neutralization against a diverse set of HIV-1 isolates. It is of great interest to learn how variation in particular segments of Env sequence is associated with HIV-1 strains resistant to neutralization. Based on structural information, we divide the V1/V2 region into four structural segments (McLellan and others, 2011): strand B (154–163), first linker (164–166), strand C (167–176), and second linker (177–184). The numbering is relative to HXB2. We analyzed a dataset of 32 Env proteins, from diverse HIV-1 strains for which there was published neutralization data for four antibodies: CH01, CH04, PGT141, and PGT145. Additional description and references for this dataset is available in supplementary material available at Biostatistics online.

Among the 32 Env proteins, 18/14 were resistant/sensitive to CH01, respectively; 19/13 were resistant/sensitive to CH04, respectively; 14/18 were resistant/sensitive to PGT141, respectively; and 7/25 were resistant/sensitive to PGT145, respectively. For each antibody and each segment of Env, we tested the association between resistance and protein sequence. The resultant |$p$|-values are listed in Table 3. For CH01 and CH04, which arise in the same donor, it was the sequences in the first linker segment that were associated with resistance to antibody neutralization. For PGT141, it was also the sequences in the first linker segment that were implicated. For PGT145, no significant association was found. These results suggest that the first linker segment may be an additional dominant determinant of resistance to CH01, CH04, and PG141. The native state of Env proteins on the HIV-1 Env is trimeric, and the four antibodies studied here, in contrast to other antibodies in this group (such as PG9), are more specific for trimeric forms of Env over monomeric forms (Walker and others, 2011). That the first linker segment may be a major determinant of resistance to the antibodies studied here fits well with the fact that the first linker appears to be located at the interface of the different Env monomers within the trimer (Julien and others, 2013; Lyumkis and others, 2013; Pancera and others, 2014).

Table 3.

Hypothesis testing p-values for the antigen resistance data example

Profile HMM MI
Mismatch RBF
Properties RBF
|$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$|
CH01 
 Strand B 0.338 0.342 0.374 0.383 0.091 0.091 
 First linker 0.001 0.001 0.001 0.001 0.085 0.088 
 Strand C 0.121 0.125 0.101 0.106 0.207 0.208 
 Second linker 0.502 0.515 0.840 0.849 0.805 0.814 
CH04 
 Strand B 0.735 0.742 0.627 0.640 0.394 0.394 
 First linker 0.005 0.005 0.005 0.005 0.118 0.117 
 Strand C 0.089 0.094 0.065 0.071 0.180 0.180 
 Second linker 0.775 0.778 0.953 0.951 0.303 0.310 
PGT141 
 Strand B 0.463 0.477 0.394 0.404 0.708 0.717 
 First linker 0.041 0.043 0.040 0.042 0.051 0.048 
 Strand C 0.238 0.237 0.277 0.274 0.218 0.217 
 Second linker 0.948 0.948 0.983 0.978 0.752 0.763 
PGT145 
 Strand B 0.189 0.189 0.297 0.304 0.652 0.637 
 First linker 0.557 0.563 0.768 0.761 0.092 0.090 
 Strand C 0.248 0.244 0.236 0.233 0.133 0.129 
 Second linker 0.472 0.459 0.352 0.343 0.367 0.381 
Profile HMM MI
Mismatch RBF
Properties RBF
|$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$|
CH01 
 Strand B 0.338 0.342 0.374 0.383 0.091 0.091 
 First linker 0.001 0.001 0.001 0.001 0.085 0.088 
 Strand C 0.121 0.125 0.101 0.106 0.207 0.208 
 Second linker 0.502 0.515 0.840 0.849 0.805 0.814 
CH04 
 Strand B 0.735 0.742 0.627 0.640 0.394 0.394 
 First linker 0.005 0.005 0.005 0.005 0.118 0.117 
 Strand C 0.089 0.094 0.065 0.071 0.180 0.180 
 Second linker 0.775 0.778 0.953 0.951 0.303 0.310 
PGT141 
 Strand B 0.463 0.477 0.394 0.404 0.708 0.717 
 First linker 0.041 0.043 0.040 0.042 0.051 0.048 
 Strand C 0.238 0.237 0.277 0.274 0.218 0.217 
 Second linker 0.948 0.948 0.983 0.978 0.752 0.763 
PGT145 
 Strand B 0.189 0.189 0.297 0.304 0.652 0.637 
 First linker 0.557 0.563 0.768 0.761 0.092 0.090 
 Strand C 0.248 0.244 0.236 0.233 0.133 0.129 
 Second linker 0.472 0.459 0.352 0.343 0.367 0.381 

|$T_{\max }^{\chi ^{2}}$| and |$T_{\max }^{N}:$| two parametric bootstrap tests using |$m_{\rm II}$| and |$v_{\rm II}$|⁠. |$10\,000$| bootstrap samples are used.

Table 3.

Hypothesis testing p-values for the antigen resistance data example

Profile HMM MI
Mismatch RBF
Properties RBF
|$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$|
CH01 
 Strand B 0.338 0.342 0.374 0.383 0.091 0.091 
 First linker 0.001 0.001 0.001 0.001 0.085 0.088 
 Strand C 0.121 0.125 0.101 0.106 0.207 0.208 
 Second linker 0.502 0.515 0.840 0.849 0.805 0.814 
CH04 
 Strand B 0.735 0.742 0.627 0.640 0.394 0.394 
 First linker 0.005 0.005 0.005 0.005 0.118 0.117 
 Strand C 0.089 0.094 0.065 0.071 0.180 0.180 
 Second linker 0.775 0.778 0.953 0.951 0.303 0.310 
PGT141 
 Strand B 0.463 0.477 0.394 0.404 0.708 0.717 
 First linker 0.041 0.043 0.040 0.042 0.051 0.048 
 Strand C 0.238 0.237 0.277 0.274 0.218 0.217 
 Second linker 0.948 0.948 0.983 0.978 0.752 0.763 
PGT145 
 Strand B 0.189 0.189 0.297 0.304 0.652 0.637 
 First linker 0.557 0.563 0.768 0.761 0.092 0.090 
 Strand C 0.248 0.244 0.236 0.233 0.133 0.129 
 Second linker 0.472 0.459 0.352 0.343 0.367 0.381 
Profile HMM MI
Mismatch RBF
Properties RBF
|$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$|
CH01 
 Strand B 0.338 0.342 0.374 0.383 0.091 0.091 
 First linker 0.001 0.001 0.001 0.001 0.085 0.088 
 Strand C 0.121 0.125 0.101 0.106 0.207 0.208 
 Second linker 0.502 0.515 0.840 0.849 0.805 0.814 
CH04 
 Strand B 0.735 0.742 0.627 0.640 0.394 0.394 
 First linker 0.005 0.005 0.005 0.005 0.118 0.117 
 Strand C 0.089 0.094 0.065 0.071 0.180 0.180 
 Second linker 0.775 0.778 0.953 0.951 0.303 0.310 
PGT141 
 Strand B 0.463 0.477 0.394 0.404 0.708 0.717 
 First linker 0.041 0.043 0.040 0.042 0.051 0.048 
 Strand C 0.238 0.237 0.277 0.274 0.218 0.217 
 Second linker 0.948 0.948 0.983 0.978 0.752 0.763 
PGT145 
 Strand B 0.189 0.189 0.297 0.304 0.652 0.637 
 First linker 0.557 0.563 0.768 0.761 0.092 0.090 
 Strand C 0.248 0.244 0.236 0.233 0.133 0.129 
 Second linker 0.472 0.459 0.352 0.343 0.367 0.381 

|$T_{\max }^{\chi ^{2}}$| and |$T_{\max }^{N}:$| two parametric bootstrap tests using |$m_{\rm II}$| and |$v_{\rm II}$|⁠. |$10\,000$| bootstrap samples are used.

5.2. IgA binding antibody attenuation of vaccine protective effect

HIV-1 vaccine development efforts have recently been boosted by encouraging reports from the RV144 HIV-1 vaccine trial, which showed a modest 31% vaccine efficacy. A follow-up immune correlates study was conducted to understand the potential mechanism of the vaccine effect among the vaccine recipients (Haynes and others, 2012). Their results suggested that the vaccine worked through eliciting antibodies of isotype G (IgG), and that some, but not all, vaccine-elicited antibodies of isotype A (IgA) might attenuate the protective effect of IgG antibodies. Specifically, IgA antibodies that bind two strains of Env exhibit the attenuation effect, while IgA antibodies that bind 16 others strains of Env do not. Here we seek to identify a segment of the Env sequence that is associated with the attenuation effect exhibited by the IgA antibodies.

Ideally we would like to focus on segments of Env that are bound by vaccine-elicited IgA antibodies. Lacking that information, we started from four small regions, or hot spots, of Env that have experimental evidence of IgG antibodies binding (Haynes and others, 2012). We named the four hot spots as C1 hot spot, C5 hot spot, V2 hot spot, and V3 hot spot, respectively, after the regions in which they are located. We excluded the V2 hot spot from consideration because experimental evidence showed little to no IgA antibody binding to it (Haynes and others, 2012). Of the remaining three hot spots, the C1 hot spot (HXB2-relative residue 104–133) has been known to also bind vaccine-elicited IgA antibodies (Haynes and others, 2012; Tomaras and others, 2013). There was no experimental evidence on whether the C5 hotspot (472–511) and V3 hot spot (301–326) bind IgA. For each of these three hot spots, we tested the null hypothesis that the sequences are not associated with attenuation. Results, shown in Table 4, suggest that only the C1 hot spot is associated with attenuation. The results are consistent with a recent work by Tomaras and others (2013), which showed experimentally that the IgA attenuation effect might be driven by the binding of IgA antibodies to the C1 region of the HIV-1 Env protein, and that the binding of the IgA antibodies blocked binding of the IgG antibodies and prevented IgG-dependent cell-mediated cytotoxicity from killing the HIV-1 viruses. Our results further suggest that the IgA attenuation may be mediated through the C1 hot spot within the C1 region.

Table 4.

Hypothesis testing p-values for the IgA binding antibody attenuation example

Profile HMM MI
Mismatch RBF
Properties RBF
|$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$|
C1 hot spot 0.140 0.017 0.141 0.019 0.140 0.016 
C5 hot spot 0.437 0.305 0.431 0.248 0.443 0.282 
V3 hot spot 0.407 0.280 0.418 0.288 0.387 0.267 
Profile HMM MI
Mismatch RBF
Properties RBF
|$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$|
C1 hot spot 0.140 0.017 0.141 0.019 0.140 0.016 
C5 hot spot 0.437 0.305 0.431 0.248 0.443 0.282 
V3 hot spot 0.407 0.280 0.418 0.288 0.387 0.267 

|$T_{\max }^{\chi ^{2}}$| and |$T_{\max }^{N}:$| two parametric bootstrap tests using |$m_{\rm II}$| and |$v_{\rm II}$|⁠. |$10\,000$| bootstrap samples are used.

Table 4.

Hypothesis testing p-values for the IgA binding antibody attenuation example

Profile HMM MI
Mismatch RBF
Properties RBF
|$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$|
C1 hot spot 0.140 0.017 0.141 0.019 0.140 0.016 
C5 hot spot 0.437 0.305 0.431 0.248 0.443 0.282 
V3 hot spot 0.407 0.280 0.418 0.288 0.387 0.267 
Profile HMM MI
Mismatch RBF
Properties RBF
|$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$||$T_{\max }^{\chi ^{2}}$||$T_{\max }^{N}$|
C1 hot spot 0.140 0.017 0.141 0.019 0.140 0.016 
C5 hot spot 0.437 0.305 0.431 0.248 0.443 0.282 
V3 hot spot 0.407 0.280 0.418 0.288 0.387 0.267 

|$T_{\max }^{\chi ^{2}}$| and |$T_{\max }^{N}:$| two parametric bootstrap tests using |$m_{\rm II}$| and |$v_{\rm II}$|⁠. |$10\,000$| bootstrap samples are used.

6. Discussion

In this paper, we used a kernel-based logistic regression model to study the effect of protein sequences on a binary outcome of interest. We applied the proposed methods to two problems from HIV-1 vaccine development and obtained new biological insights. The proposed methods have been implemented in a R package krm, which can be downloaded from the Comprehensive R Archive Network.

We studied three kernels, two RBF kernels based on Euclidean representation of protein sequence and one novel MI kernel based on the profile HMM. In simulation studies where sequences were simulated from profile HMM, profile HMM MI kernel showed significant advantage over RBF kernels. In the two real data applications, profile HMM MI kernel and mismatch RBF kernel performed similarly, while properties RBF kernel seemed to underperform. We also studied four different versions of test statistics. Simulation studies and real data applications both suggest that using |$T_{\max }^{N}$| with modified mean and variance estimators offers the best performance for studying protein sequences.

Supplementary material

Supplementary Material is available at http://biostatistics.oxfordjournals.org.

Funding

This work was supported by the Intramural Research Program of the Vaccine Research Center, National Institute of Allergy and Infectious Diseases (NIAID), NIH, the cooperative agreement W81XWH-07-2-0067 between the Henry M. Jackson Foundation and the Department of Defense, NIAID grant UM1-AI-068618 to the HIV Vaccine Trials Network.

Acknowledgement

The authors thank the HIV-1 vaccine community for compiling and making publically available extensive neutralization data and binding antibody data. The authors thank Larry Liao at Duke for providing Env protein sequences in the RV144 correlates study. Conflict of Interest: None declared.

References

Cai
Y. D.
Liu
X. J.
Xu
X. B.
Zhou
G. P.
(
2001
).
Support vector machines for predicting protein structural class
.
BMC Bioinformatics
2
(
1
),
3
.

Clark
W. T.
Radivojac
P.
(
2013
).
Vector quantization kernels for the classification of protein sequences and structures
.
Pacific Symposium on Biocomputing
, Volume
19
.
Singapore
:
World Scientific
, pp.
316
327
.

Cuturi
M.
Vert
J. P.
(
2005
).
The context-tree kernel for strings
.
Neural Networks
18
(
8
),
1111
1123
.

Davies
R. B.
(
1987
).
Hypothesis testing when a nuisance parameter is present only under the alternative
.
Biometrika
74
(
1
),
33
43
.

Ding
C. H. Q.
Dubchak
I.
(
2001
).
Multi-class protein fold recognition using support vector machines and neural networks
.
Bioinformatics
17
(
4
),
349
358
.

Eddy
S. R.
(
1998
).
Profile hidden Markov models
.
Bioinformatics
14
,
755
763
.

Edgar
R. C.
(
2004
).
MUSCLE: multiple sequence alignment with high accuracy and high throughput
.
Nucleic Acids Research
32
(
5
),
1792
.

Fong
Y.
Wakefield
J.
Rice
K.
(
2010
).
Bayesian mixture modeling using a hybrid sampler with application to protein subfamily identification
.
Biostatistics
11
(
1
),
18
33
.

Haynes
B. F.
Gilbert
P. B.
McElrath
M. J.
Zolla-Pazner
S.
Tomaras
G. D.
Alam
S. M.
Evans
D.T.
Montefiori
D.C.
Karnasuta
C.
Sutthent
R.
and others. (
2012
).
Immune-correlates analysis of an HIV-1 vaccine efficacy trial
.
New England Journal of Medicine
366
(
14
),
1275
1286
.

Jaakkola
T.
Diekhans
M.
Haussler
D.
(
2000
).
A discriminative framework for detecting remote protein homologies
.
Journal of Computational Biology
7
(
1–2
),
95
114
.

Jong
P.
(
1987
).
A central limit theorem for generalized quadratic forms
.
Probability Theory and Related Fields
75
(
2
),
261
277
.

Julien
J.-P.
Cupo
A.
Sok
D.
Stanfield
R. L.
Lyumkis
D.
Deller
M. C.
Klasse
P. J.
Burton
D. R.
Sanders
R. W.
Moore
J. P.
Ward
A. B.
Wilson
I. A.
(
2013
).
Crystal structure of a soluble cleaved HIV-1 envelope trimer
.
Science
342
(
6165
),
1477
1483
.

Kuang
R.
Ie
E.
Wang
K.
Wang
K.
Siddiqi
M.
Freund
Y.
Leslie
C.
(
2005
).
Profile-based string kernels for remote homology detection and motif extraction
.
Journal of Bioinformatics and Computational Biology
3
(
03
),
527
550
.

Lanckriet
G. R. G.
De Bie
T.
Cristianini
N.
Jordan
M. I.
Noble
W. S.
(
2004
).
A statistical framework for genomic data fusion
.
Bioinformatics
20
(
16
),
2626
2635
.

le Cessie
S.
van Houwelingen
H. C.
(
1995
).
Testing the fit of a regression model via score tests in random effects models
.
Biometrics
51
(
2
),
600
614
.

Leslie
C. S.
Eskin
E.
Cohen
A.
Weston
J.
Noble
W. S.
(
2004
).
Mismatch string kernels for discriminative protein classification
.
Bioinformatics
20
(
4
),
467
476
.

Leslie
C. S.
Eskin
E.
Noble
W. S.
(
2002
)
. The spectrum kernel: a string kernel for SVM protein classification
.
Pacific Symposium on Biocomputing
, Volume
7
.
Singapore
:
World Scientific
, pp.
566
575
.

Liu
D.
Ghosh
D.
Lin
X.
(
2008
).
Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models
.
BMC Bioinformatics
9
(
1
),
292
.

Lohmann
R.
Schneider
G.
Behrens
D.
Wrede
P.
(
1994
).
A neural network model for the prediction of membrane-spanning amino acid sequences
.
Protein Science
3
(
9
),
1597
1601
.

Lyumkis
D.
Julien
J.-P.
de Val
N.
Cupo
A.
Potter
C. S.
Klasse
P.-J.
Burton
D. R.
Sanders
R. W.
Moore
J. P.
Carragher
B.
Wilson
I. A.
Ward
A. B.
(
2013
).
Cryo-em structure of a fully glycosylated soluble cleaved HIV-1 envelope trimer
.
Science
342
(
6165
),
1484
1490
.

McLellan
J. S.
Pancera
M.
Carrico
C.
Gorman
J.
Julien
J.-P.
Khayat
R.
Louder
R.
Pejchal
R.
Sastry
M.
Dai
K.
and others. (
2011
).
Structure of HIV-1 gp120 v1/v2 domain with broadly neutralizing antibody pg9
.
Nature
480
(
7377
),
336
343
.

Mitchison
G. J.
(
1999
).
A probabilistic treatment of phylogeny and sequence alignment
.
Journal of Molecular Evolution
49
(
1
),
11
22
.

Pancera
M.
Zhou
T
Druz
A.
Georgiev
I.S.
Soto
C.
Gorman
J.
Huang
J.
Acharya
P.
Chuang
G.Y.
Ofek
G.
and others. (
2014
).
Structure and immune recognition of trimeric pre-fusion HIV-1 Env
.
Nature.
514
(
7523
)
455
461
.

Seeger
M.
(
2002
).
Covariance kernels from Bayesian generative models
.
Advances in Neural Information Processing Systems
2
,
905
912
.

Sinha
S. K.
(
2009
).
Bootstrap tests for variance components in generalized linear mixed models
.
Canadian Journal of Statistics
37
(
2
),
219
234
.

Sjölander
K.
Karplus
K.
Brown
M.
Hughey
R.
Krogh
A.
Mian
I. S.
Haussler
D.
(
1996
).
Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology
.
Computer Applications in the Biosciences: CABIOS
12
(
4
),
327
.

Stute
W.
Manteiga
W. G.
Quindimil
M. P.
(
1993
).
Bootstrap based goodness-of-fit-tests
.
Metrika
40
(
1
),
243
256
.

Tomaras
G.
Ferrari
G.
Shen
X.
Alam
S.
Liao
H.
Pollara
J.
Bonsignori
M.
Moody
M.
Fong
Y.
Chen
X.
and others. (
2013
).
Vaccine-induced plasma IgA specific for the C1 region of the HIV-1 envelope blocks binding and effector function of IgG
.
Proceedings of the National Academy of Sciences
110
(
22
),
9019
9024
.

Walker
L. M.
Huber
M.
Doores
K. J.
Falkowska
E.
Pejchal
R.
Julien
J.-P.
Julien
Jean-Philippe
Wang
S.-K.
Ramos
A.
Chan-Hui
P.
Moyle
M.
and others (
2011
).
Broad neutralization coverage of HIV by multiple highly potent antibodies
.
Nature
477
(
7365
),
466
470
.

Zhang
D.
Lin
X.
(
2003
).
Hypothesis testing in semiparametric additive mixed models
.
Biostatistics
4
,
57
.

Author notes

Equal contribution.

Supplementary data