Summary

For a class of latent Markov models for discrete variables having a longitudinal structure, we introduce an approach for formulating and testing linear hypotheses on the transition probabilities of the latent process. For the maximum likelihood estimation of a latent Markov model under hypotheses of this type, we outline an EM algorithm that is based on well-known recursions in the hidden Markov literature. We also show that, under certain assumptions, the asymptotic null distribution of the likelihood ratio statistic for testing a linear hypothesis on the transition probabilities of a latent Markov model, against a less stringent linear hypothesis on the transition probabilities of the same model, is of χ̄² type. As a particular case, we derive the asymptotic distribution of the likelihood ratio statistic between a latent class model and its latent Markov version, which may be used to test the hypothesis of absence of transition between latent states. The approach is illustrated through a series of simulations and two applications, the first of which is based on educational testing data that have been collected within the National Assessment of Educational Progress 1996, and the second on data, concerning the use of marijuana, which have been collected within the National Youth Survey 1976–1980.

1. Introduction

The latent Markov (LM) model was introduced by Wiggins (1973) for the analysis of longitudinal data and has been successfully applied in several fields, such as psychological and educational measurement (Langeheine et al., 1994; Vermunt et al., 1999), sociology (Van de Pol and Langeheine, 1990; Humphreys, 1998; Mannan and Koval, 2003), medicine (Auranen et al., 2000; Cook et al., 2000), criminology (Bijleveld and Mooijaart, 2003) and the analysis of customer behaviour (Poulsen, 1990). The basic assumption of this model is that any occasion-specific response variable depends only on a discrete latent variable, which in turn depends on the latent variables corresponding to the previous occasions according to a first-order Markov chain. Therefore, the LM model may be seen as an extension of a Markov chain model which allows for measurement errors, but also as an extension of the latent class (LC) model (Lazarsfeld and Henry, 1968; Goodman, 1974), in which the assumption that any subject belongs to the same LC throughout the survey is suitably relaxed. As such, this model enables several aspects to be taken into account which characterize longitudinal studies, i.e. serial dependence between observations, measurement errors and unobservable heterogeneity.

The LM model may be constrained in several ways to make it more parsimonious and easier to interpret. In particular, we deal with a class of LM models in which

  • (a)

    the conditional distribution of the response variables given the latent process may be formulated as in a generalized linear model and

  • (b)

    the latent process is time homogeneous with transition probabilities that may be formulated as in a linear probability model.

For the maximum likelihood estimation of these models, we outline an EM algorithm (Dempster et al., 1977), in which the E-step is based on well-known recursions in the hidden Markov literature (see MacDonald and Zucchini (1997)) and the M-step is based on Newton–Raphson-type iterative algorithms. To simplify the implementation of the algorithm, we often make use of matrix notation and we give some simple rules to avoid numerical instability when long sequences of response variables are analysed. We also deal with the asymptotic distribution of the likelihood ratio (LR) statistic for testing linear hypotheses on the parameters of the model assumed on the transition probabilities of the latent process. Among these hypotheses, those formulated by constraining certain transition probabilities to be equal to 0 are of particular interest. An example is the hypothesis of no transition between latent states, which may be tested by comparing an LC model with an LM model based on a transition matrix with at least one off-diagonal element allowed to be positive. Note that hypotheses of this type cannot be tested within more conventional approaches, in which the transition probabilities are modelled through a link function based, for instance, on multinomial logits (Vermunt et al., 1999). However, the parameters may be on the boundary of the parameter space under these hypotheses and therefore we are dealing with an inferential problem under non-standard conditions (Self and Liang, 1987), which typically occurs in constrained statistical inference (Silvapulle and Sen, 2004). By applying some results that are known in this literature, we show that the LR statistic in question has an asymptotic χ̄²-distribution under the null hypothesis. The finite sample accuracy of the inference based on this asymptotic distribution is investigated through a series of simulations.

In short, the main contribution of the present paper is the formulation of a general framework for likelihood inference under linear hypotheses on the transition probabilities of a class of LM models. Within this framework, we also allow for hypotheses that are expressed by constraining one or more probabilities to be equal to 0. Hypotheses of this type are of interest in many situations. To our knowledge, a framework like this has not previously been considered in the literature concerning LM models. It has not been explicitly considered either within other inferential approaches or within the hidden Markov literature, which is strongly related to the literature concerning LM models (for a review see Archer and Titterington (2002) and Scott (2002)). In contrast, maximum likelihood estimation of LM models has been considered for a long time (Poulsen, 1982) and, under standard conditions, the use of the LR statistic for testing hypotheses on the parameters of an LM model has already been suggested (see, for instance, Visser et al. (2002)).

The paper is organized as follows. The class of LM models is illustrated in Section 2 and maximum likelihood estimation of these models in Section 3. In the latter section we also deal with the Fisher information matrix and the problem of classifying subjects on the basis of the latent states. The asymptotic distribution of the LR statistic under linear hypotheses on the transition probabilities is dealt with in Section 4. Finally, two applications involving real data sets are illustrated in Section 5 and the main conclusions are drawn in Section 6.

2. A class of homogeneous latent Markov models

Let Y={Yt,t=1,…,s} be a sequence of s discrete random variables with support {1,…,d}. The basic assumption of the LM model is that these variables are conditionally independent given a latent process X={Xt,t=1,…,s} which follows a first-order Markov chain with state space {1,…,c}. This assumption obviously makes sense only if the elements of Y correspond to measurements that are repeated at different occasions on the same subject. This is the case not only for longitudinal data but also for data that are derived from the administration of a set of test items to a group of subjects, which frequently arise in psychological and educational measurement (see also example 1 in Section 2.1). In the latter case, an LM model may be validly applied only if the items are administered to all the subjects in the same order. Note also that assuming that the latent process follows a first-order Markov chain is equivalent to assuming that any latent variable Xt is conditionally independent of X1,…,Xt−2 given Xt−1. This assumption is seldom considered restrictive and, also because of its easy interpretation, is preferred to more complex assumptions on the dependence structure of the latent variables. In this paper, we also assume that the latent process is time homogeneous, so that the transition probabilities πx|w=p(Xt=x|Xt−1=w), w,x=1,…,c, do not depend on t, whereas the initial probabilities λx=p(X1=x), x=1,…,c, are completely unconstrained.

The assumptions above imply that the distribution of X may be expressed as

p(x) = λ_{x_1} ∏_{t=2}^{s} π_{x_t|x_{t−1}},    (1)

whereas the conditional distribution of Y given X may be expressed as

p(y|x) = ∏_{t=1}^{s} φ_{t,y_t|x_t},    (2)

where φ_{t,y|x} = p(Y_t=y|X_t=x). Consequently, for the manifest distribution of Y we have the expression

p(y) = Σ_x p(x) p(y|x),

where the sum ranges over all the possible configurations x of the latent process.

Using matrix notation, this probability may also be expressed as

p(y) = q(y_s)′ 1_c,

where 1_j is a column vector of j 1s and, for t=1,…,s, y_t = {y_j, j=1,…,t} and q(y_t) is a column vector with elements p(X_t=x, Y_1=y_1,…,Y_t=y_t), x=1,…,c. This vector may be computed through the recursion

q(y_1) = diag(φ_{1,y_1}) λ,    q(y_t) = diag(φ_{t,y_t}) Π′ q(y_{t−1}),  t = 2,…,s,    (3)

with λ = {λ_x, x=1,…,c} denoting the initial probability vector, Π = {π_{x|w}, w,x=1,…,c} denoting the transition probability matrix and φ_{t,y} = {φ_{t,y|x}, x=1,…,c} denoting the vector of the conditional probabilities of the response variables given the latent process. When s is very large, the elements of q(y_t) could become so small as to vanish during recursion (3). To avoid this, we can multiply the quantity that is obtained at any step of this recursion by a suitable constant a. Thus, the vector that is computed at the tth step of the recursion, which is denoted by q*(y_t), will be equal to a^t q(y_t), whereas p*(y) = q*(y_s)′ 1_c will be equal to a^s p(y). As will be clear in what follows, this dummy renormalization does not affect the estimation process.
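For concreteness, recursion (3) together with the dummy renormalization can be sketched in a few lines (a minimal illustration; the function name and the choice a = 10 are ours, not part of the model):

```python
import numpy as np

def forward_prob(lam, Pi, phi, a=10.0):
    """log p(y) via recursion (3): q(y_1) = diag(phi_1) lam and
    q(y_t) = diag(phi_t) Pi' q(y_{t-1}); each step is rescaled by the
    constant `a` to avoid underflow for long sequences, and the scaling
    is removed on the log scale at the end."""
    # phi[t] is the c-vector with elements p(Y_t = y_t | X_t = x)
    # for the observed response y_t.
    q = phi[0] * lam * a                # q*(y_1) = a q(y_1)
    log_scale = np.log(a)
    for t in range(1, len(phi)):
        q = phi[t] * (Pi.T @ q) * a     # q*(y_t) = a^t q(y_t)
        log_scale += np.log(a)
    return np.log(q.sum()) - log_scale  # log p(y) = log p*(y) - s log a
```

Because the scaling is removed on the log scale at the end, the returned value is exactly log p(y), in line with the observation that the renormalization does not affect the estimation process.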

In the basic LM model illustrated above, the conditional probabilities of the response variables given the latent process and the transition probabilities between the latent states are completely unconstrained. In many situations, however, it makes sense to parameterize these probabilities to give a more parsimonious and easily interpretable LM model.

2.1. Modelling conditional probabilities of the response variables

Let φ be a column vector with elements φt,y|x, t=1,…,s, x=1,…,c, y=1,…,d, arranged by letting the index y run faster than x and x run faster than t, and let η=η(φ) be a link function that maps φ onto ℝsc(d−1). We assume that

η = Zγ,    (4)

where Z is a full rank design matrix and γ is a vector of parameters. Obviously, regardless of the specific choice of the link function, letting Z=Isc(d−1), with Ij denoting an identity matrix of dimension j, is equivalent to assuming that the conditional distribution of any Yt given Xt is unconstrained. More interesting cases are illustrated below.

2.1.1. Example 1

In the case of binary variables, by parameterizing the conditional distribution of Yt given Xt=x through the logit ηt|x= log (φt,2|x/φt,1|x) and assuming that

η_{t|x} = ψ_x − δ_t,  t = 1,…,s, x = 1,…,c,    (5)

we can formulate an LM version of the Rasch (1961) model, the best-known item response theory model (see, among others, Hambleton and Swaminathan (1985) and De Boeck and Wilson (2004)). This model finds its natural application in educational assessment, when a group of examinees must be assessed on the basis of the responses that they provide to a series of test items. In this setting, the parameter ψx may be interpreted as the ability of the subjects in the xth latent class and δt as the difficulty of the tth item. In a similar way, we can formulate a multidimensional version of the Rasch model based on one of the parameterizations that were considered by Kelderman and Rijkes (1994). Note that, by assuming that the latent process follows a Markov chain, we allow a subject to move from one LC to another; this is not allowed within the LC formulation of the Rasch model that was studied by Lindsay et al. (1991). Thus, we take into account the possibility of an implicit learning phenomenon or of an item which may provide clues for responding to other items. In these cases, the assumption of local independence, which is crucial in item response theory, is violated; for a discussion, see Hambleton and Swaminathan (1985), Section 2.3. By means of the approach proposed here, we can obviously test for the presence of phenomena of this type.
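As an illustration of assumption (4) in this example, a design matrix Z for the Rasch parameterization (5) can be assembled as follows; the row ordering (x running faster than t) matches the ordering of η, while the identifiability constraint δ_1 = 0 is a common convention that we adopt here as an assumption of this sketch, not of the paper:

```python
import numpy as np

def rasch_design(s, c):
    """Design matrix Z for eta_{t|x} = psi_x - delta_t (binary items),
    with rows ordered so that x runs faster than t.
    Identifiability: delta_1 = 0 (an assumed convention)."""
    Z = np.zeros((s * c, c + s - 1))
    for t in range(s):
        for x in range(c):
            row = t * c + x
            Z[row, x] = 1.0               # ability psi_x
            if t > 0:
                Z[row, c + t - 1] = -1.0  # difficulty delta_t, t >= 2
    return Z
```

With this Z, η = Zγ reproduces η_{t|x} = ψ_x − δ_t for γ = (ψ_1,…,ψ_c, δ_2,…,δ_s).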

2.1.2. Example 2

When the response variables have more than two levels, we can parameterize the conditional distribution of Yt given Xt=x through logits with the first category as base-line, i.e. η_{t,y|x} = log(φ_{t,y+1|x}/φ_{t,1|x}), and assume that

η_{t,y|x} = ψ_{x,y} − δ_{t,y},  y = 1,…,d−1.

However, when the response variables have an ordinal nature, global logits are more suitable; these logits are based on the ratio between the survival function and the distribution function for any possible cut point y, i.e.

η_{t,y|x} = log{(φ_{t,y+1|x} + ⋯ + φ_{t,d|x}) / (φ_{t,1|x} + ⋯ + φ_{t,y|x})},  y = 1,…,d−1;

see also Samejima (1996).

In the present approach, we also allow for inequality constraints of the type Kγ ≥ 0_k on the parameters γ, where K is a full rank matrix with k rows and 0_j denotes a vector of j 0s. The main use of these constraints is for making the latent states uniquely identifiable.

2.1.3. Example 3

If the same logit link function as that which has been considered in example 1 is used to parameterize the distribution of a set of binary response variables, we can require that

η_{t|1} ≤ η_{t|2} ≤ ⋯ ≤ η_{t|c},  t = 1,…,s,    (6)

so that the latent states are ordered from that corresponding to the smallest probability of success to that corresponding to the largest probability of success. We may also combine this requirement with a Rasch parameterization of type (5). In this case, we need a reduced set of inequality constraints, ψ_1 ≤ ⋯ ≤ ψ_c, to ensure that condition (6) holds.

2.2. Modelling transition probabilities

Let π=vec(Π), where vec(·) is the row vector operator, and ρ denote the subvector of π in which the diagonal elements of Π are omitted because they are redundant and note that π=a+Aρ, with a and A defined in  Appendix A.1. We assume that

ρ = Wβ,    (7)

where W is a full rank matrix of size c(c−1)×b with at most one positive element in any row and all the other elements equal to 0. Note that we are formulating a linear model directly on the transition probabilities and not on a vector of parameters that are related to such probabilities through a link function that maps them onto ℝ^{c(c−1)}. However, as already mentioned in Section 1, our approach allows the formulation of the hypothesis that one or more elements of Π are equal to 0. But, to ensure that all the transition probabilities are non-negative, we must impose suitable restrictions on the parameter vector β. Because of these restrictions, we are not in a standard inferential problem when deriving the asymptotic distribution of the LR statistic for testing hypotheses on β; this will be discussed in detail in Section 4. In particular, it may be easily verified that the constraint that all the diagonal elements of Π are non-negative is equivalent to the constraint TWβ ≤ 1_c, with T = I_c ⊗ 1′_{c−1}. Moreover, since we require W to have at most one positive element in any row, the constraint that all the off-diagonal elements of Π are non-negative may simply be expressed as β ≥ 0_b. As shown by the examples below, this requirement on the design matrix W does not limit the applicability of the present class of models. Note that the model in which the transition probabilities are unconstrained may be formulated by letting W = I_{c(c−1)}.

2.2.1. Example 4

A significant reduction in the number of parameters of an LM model may be achieved by constraining all the off-diagonal elements of the transition matrix Π to be equal to each other; with c=3, for instance, we have

Π = | 1−2β   β     β   |
    | β     1−2β   β   |
    | β      β    1−2β |.    (8)

This constraint may be formulated by letting W = 1_{c(c−1)}. A less stringent constraint is that Π is symmetric; this is equivalent to assuming that the probability of transition from latent state w to latent state x is the same as that of the reverse transition:

π_{x|w} = π_{w|x},  w, x = 1,…,c, w ≠ x.

This hypothesis may be formulated by letting W=U+L, where U and L consist respectively of the subset of columns of Ic(c−1) corresponding to the elements of ρ which are upper triangular and lower triangular in Π (excluding those in the diagonal).
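The design matrices of this example are simple to construct explicitly. The sketch below (function names ours) builds W for the two hypotheses and fills Π from ρ = Wβ, implementing π = a + Aρ directly as "each diagonal element equals 1 minus its row sum":

```python
import numpy as np

def offdiag_index(c):
    """Pairs (w, x), w != x, in the order in which the off-diagonal
    elements of Pi appear in rho (row by row)."""
    return [(w, x) for w in range(c) for x in range(c) if w != x]

def W_common(c):
    """All off-diagonal transition probabilities equal: W = 1_{c(c-1)}."""
    return np.ones((c * (c - 1), 1))

def W_symmetric(c):
    """Symmetric Pi: one parameter shared by the pairs (w,x) and (x,w)."""
    pairs = [(w, x) for w in range(c) for x in range(w + 1, c)]
    W = np.zeros((c * (c - 1), len(pairs)))
    for i, (w, x) in enumerate(offdiag_index(c)):
        W[i, pairs.index((min(w, x), max(w, x)))] = 1.0
    return W

def transition_matrix(c, W, beta):
    """Fill Pi from rho = W beta; diagonal = 1 - row sums."""
    Pi = np.zeros((c, c))
    rho = W @ beta
    for i, (w, x) in enumerate(offdiag_index(c)):
        Pi[w, x] = rho[i]
    Pi += np.diag(1.0 - Pi.sum(axis=1))
    return Pi
```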

2.2.2. Example 5

When the latent states are ordered in a meaningful way by assuming, for instance, that condition (6) holds, it may be interesting to formulate the hypothesis that a subject in latent state w may move only to latent state x = w+1,…,c or only to x = w+1. With c=3, for instance, we have respectively

Π = | 1−β_1−β_2   β_1     β_2 |
    | 0           1−β_3   β_3 |
    | 0           0       1   |

and

Π = | 1−β_1   β_1     0   |
    | 0       1−β_2   β_2 |
    | 0       0       1   |.

The first hypothesis may be formulated by letting W=U and the second by letting W equal a suitable subset of columns of U. Another formulation which is appropriate with ordered states is based on the assumption that, for any w and x, the probability πx|w is proportional to 2−|x−w| and therefore decreases when the distance between w and x increases, i.e.

π_{x|w} = 2^{−|x−w|} β,  w ≠ x.

In this case W = 2^{−1} i_1 + ⋯ + 2^{−(c−1)} i_{c−1}, where i_j is obtained by summing the columns of I_{c(c−1)} corresponding to the elements π_{x|w} of ρ with |x−w| = j.
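These designs can be sketched analogously (function names ours; states indexed from 0):

```python
import numpy as np

def offdiag(c):
    """(w, x) pairs, w != x, in the order of the elements of rho."""
    return [(w, x) for w in range(c) for x in range(c) if w != x]

def W_upward(c):
    """Transitions allowed only towards higher states (W = U): columns of
    I_{c(c-1)} for the pairs with x > w; rows with x < w stay identically 0."""
    up = [(w, x) for w in range(c) for x in range(w + 1, c)]
    W = np.zeros((c * (c - 1), len(up)))
    for i, (w, x) in enumerate(offdiag(c)):
        if x > w:
            W[i, up.index((w, x))] = 1.0
    return W

def W_decay(c):
    """Single-parameter model pi_{x|w} = 2^{-|x-w|} beta for w != x."""
    W = np.zeros((c * (c - 1), 1))
    for i, (w, x) in enumerate(offdiag(c)):
        W[i, 0] = 2.0 ** (-abs(x - w))
    return W
```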

2.3. Parameter space

By combining different choices of the parameterization of the conditional distribution of the response variables (given the latent process) and of the transition probabilities, we give rise to a class of LM models which obviously includes the basic LM model that has been illustrated at the beginning of this section. We recall that these parameterizations are formulated as η = Zγ, Kγ ≥ 0_k, for a suitably chosen link function η = η(φ), and ρ = Wβ. Therefore, the vector of the non-redundant parameters of any of these models may be expressed as θ = (α, β, γ), where α = {log(λ_{x+1}/λ_1), x=1,…,c−1} is a vector of logits for the initial probabilities of the latent states.

An important point is the shape of the parameter space. We have that

Θ = 𝒜 × ℬ × 𝒞,

where 𝒜 = ℝ^{c−1}, ℬ = {β: β ≥ 0_b, TWβ ≤ 1_c} and 𝒞 = {γ: Kγ ≥ 0_k}. This notation will be useful in Section 4 to illustrate the derivation of the asymptotic distribution of the LR statistic for testing linear hypotheses on β.

3. Maximum likelihood estimation

Let n_y be the frequency of the response configuration y in a sample of n subjects, let 𝒴 be the set of the distinct response configurations that have been observed at least once and let n = {n_y, y ∈ 𝒴} be the vector of the corresponding frequencies. By assuming that these subjects are independent of each other, the log-likelihood of any model in the class of LM models that has been outlined in the previous section may be expressed as

l(θ) = Σ_{y∈𝒴} n_y log p(y),

where p(y) is computed as a function of θ by using recursion (3). Since the cardinality of 𝒴 is always less than or equal to n, l(θ) may be computed even for large values of s, provided that n is reasonable. Note also that, if a dummy renormalization is used in recursion (3), the log-likelihood of the model may be computed on the basis of the renormalized probabilities p*(y) as

l(θ) = Σ_{y∈𝒴} n_y log p*(y) − n s log(a),

where the last term does not depend on θ and therefore may be omitted.

In what follows, we show how, in order to estimate θ, we can maximize l(θ) by means of an EM algorithm (Dempster et al., 1977). We also deal with the Fisher information matrix and the estimation of class membership probabilities which may be used to classify subjects on the basis of the response configurations that they provide.

3.1. The EM algorithm

Suppose that we knew the frequencies mx,y of the contingency table in which the subjects are cross-classified according to the latent configuration x and the response configuration y. We could then compute the complete-data log-likelihood

l*(θ) = Σ_{x∈𝒳} Σ_y m_{x,y} log{p(y|x) p(x)},    (9)

where 𝒳 is the support of X and p(y|x) and p(x) are defined in equations (2) and (1) respectively. The following decomposition of log-likelihood (9) holds:

l*(θ) = l_1(α) + l_2(β) + l_3(γ),    (10)

where

l_1(α) = Σ_x f_x log λ_x,    l_2(β) = Σ_w Σ_x g_{w,x} log π_{x|w},    l_3(γ) = Σ_t Σ_x Σ_y h_{t,x,y} log φ_{t,y|x},

with fx denoting the number of subjects who at the first occasion are in latent state x, gw,x denoting the number of transitions from latent state w to latent state x and ht,x,y denoting the number of subjects who at occasion t are in latent state x and provide response y.

Remark 1. Since we assume that all the probabilities λ_x and φ_{t,y|x} are strictly positive, we may always compute l_1(α) and l_3(γ). In contrast, since we may have one or more transition probabilities π_{x|w} equal to 0, computing l_2(β) requires special care. Therefore, we use the convention that, if π_{x|w} = 0 by assumption, g_{w,x} log(π_{x|w}) ≡ 0 for all possible values of g_{w,x}, whereas, if π_{x|w} = 0 but not by assumption, g_{w,x} log(π_{x|w}) is set equal to 0 when g_{w,x} = 0 and to −∞ when g_{w,x} > 0. Thus, a necessary condition for a vector β̃ to maximize l_2(β) is that the corresponding probabilities π̃_{x|w} that are not constrained to 0 by assumption are greater than 0 for any w and x such that g_{w,x} > 0.

Because of decomposition (10), the vector θ̃ = (α̃, β̃, γ̃) which maximizes the complete-data log-likelihood may be found by maximizing its three components separately as follows.

  • (a)

    Maximization of l_1(α): simply set α̃ equal to the vector {log(f_{x+1}/f_1), x=1,…,c−1}.

  • (b)
    Maximization of l_2(β): let π̄ be the subvector of π whose elements are not constrained to 0 under assumption (7), let ḡ be the corresponding vector of frequencies g_{w,x} and note that π̄ = Bπ, where B is obtained by selecting a suitable subset of rows from the matrix I_{c²}. Consequently, because of remark 1, we can express the log-likelihood at issue as

    l_2(β) = ḡ′ log(π̄) = ḡ′ log{B(a + AWβ)}.

    Provided that all the elements of ḡ are strictly positive, we also have that l_2(β) tends to −∞ when one or more elements of π̄ approach 0; therefore, the vector β̃ that maximizes this function must be an interior point of the space ℬ that has been defined in Section 2.3. Moreover, for any interior point β of ℬ, the second derivative matrix of l_2(β), equal to

    ∂²l_2(β)/∂β∂β′ = −W′A′B′ diag(π̄)^{−1} diag(ḡ) diag(π̄)^{−1} BAW,

    is negative definite and therefore this function has a unique maximum. This maximum may be found by means of a Fisher scoring algorithm that at any step updates the estimate of β as

    β^{(1)} = β^{(0)} + F_2(β^{(0)})^{−1} s_2(β^{(0)}),

    where β^{(0)} denotes the estimate of β at the end of the previous step and

    s_2(β) = W′A′B′ diag(π̄)^{−1} ḡ,    F_2(β) = W′A′B′ diag(π̄)^{−1} diag{B(ġ ⊗ 1_c)} BAW,

    with ġ = {ḡ_{w,·}, w=1,…,c} and ḡ_{w,·} = Σ_x ḡ_{w,x}, are respectively the score vector and the Fisher information matrix for l_2(β). When one or more elements of ḡ are equal to 0, the maximum of l_2(β) might be attained at a β̃ having one or more elements equal to 0 and therefore the algorithm above must be appropriately modified. A simpler solution consists in adding a small positive number, e.g. 10^{−8}, to all the elements of ḡ, so that the same algorithm may be used without further adjustments. We observed that the effect on the solution is usually negligible and no extra time is required for the implementation.

  • (c)
    Maximization of l_3(γ): to take into account the presence of inequality constraints of the type Kγ ≥ 0_k on the parameter vector γ, this maximization is performed by means of an algorithm that is based on the solution of a series of constrained least squares problems. At any step of this algorithm, the estimate of γ is updated by solving

    min_{γ: Kγ ≥ 0_k} (y* − γ)′ F_3(γ^{(0)}) (y* − γ),

    where

    y* = γ^{(0)} + F_3(γ^{(0)})^{−1} s_3(γ^{(0)})

    is the working dependent variable, γ^{(0)} is the estimate of γ at the end of the previous step and s_3(γ) and F_3(γ) are respectively the score vector and the Fisher information matrix for l_3(γ). These are given by

    s_3(γ) = Z′ D(η)′ diag(φ)^{−1} h,    F_3(γ) = Z′ D(η)′ diag(ḣ ⊗ 1_d) diag(φ)^{−1} D(η) Z,

    where the vector h has elements h_{t,x,y} arranged as in φ, D(η) is the derivative matrix of φ with respect to η and ḣ is a column vector with elements h_{t,x,·} = Σ_y h_{t,x,y}, t=1,…,s, x=1,…,c, arranged by letting the index x run faster than t. For a detailed description of this constrained maximization algorithm in similar contexts, see Dardanoni and Forcina (1998) and Bartolucci and Forcina (2000). When there are no inequality constraints on γ, the algorithm reduces to the usual Fisher scoring algorithm. Moreover, when we have no restrictions on the conditional distribution of any Y_t given X_t, i.e. Z = I_{sc(d−1)}, we have an explicit solution for γ̃ or, equivalently, for η̃.
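The update of step (b) can be made concrete. The following minimal sketch (function and variable names ours, not the paper's) builds a and A and maximizes l_2(β) by Newton-type steps for the case in which no transition probability is constrained to 0, so that B reduces to the identity and is omitted; the negative second-derivative matrix is used as weight matrix:

```python
import numpy as np

def build_aA(c):
    """pi = a + A rho: rho holds the off-diagonal elements of Pi (taken
    row by row) and each diagonal element is 1 minus its row's sum."""
    off = [(w, x) for w in range(c) for x in range(c) if w != x]
    a = np.zeros(c * c)
    A = np.zeros((c * c, len(off)))
    for w in range(c):
        a[w * c + w] = 1.0
    for j, (w, x) in enumerate(off):
        A[w * c + x, j] = 1.0    # the off-diagonal element itself
        A[w * c + w, j] = -1.0   # subtracted from the diagonal
    return a, A

def mstep_beta(g, W, tol=1e-10, max_iter=100):
    """Newton-type maximization of l2(beta) = g' log(a + A W beta),
    assuming no transition probability is constrained to 0 (B = I)."""
    c = g.shape[0]
    a, A = build_aA(c)
    gbar = g.reshape(-1).astype(float)          # counts in vec(Pi) order
    AW = A @ W
    beta = np.full(W.shape[1], 0.1 / (c - 1))   # interior starting point
    for _ in range(max_iter):
        pibar = a + AW @ beta
        score = AW.T @ (gbar / pibar)           # first derivative
        info = AW.T @ np.diag(gbar / pibar**2) @ AW  # minus second derivative
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

For the equal off-diagonal model of example 4 (W = 1_{c(c−1)}), it is easy to verify that the maximizer is the ratio of the total off-diagonal count to (c−1) times the total number of transitions, which provides a convenient check of the code.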

Since the frequencies f_x, g_{w,x} and h_{t,x,y} in equation (10) are unknown, the EM algorithm maximizes l(θ) as above (the M-step), once these frequencies have been substituted with the corresponding conditional expected values given the observed data and the current value of the parameters (the E-step). This process is iterated until convergence in l(θ). At the E-step, in particular, the conditional expected values at issue are obtained as

f̂_x = Σ_{y∈𝒴} n_y r_1(x|y),    ĝ_{w,x} = Σ_{t=2}^{s} Σ_{y∈𝒴} n_y r_t(w,x|y),    ĥ_{t,x,y} = Σ_{z∈𝒴} n_z I(z_t = y) r_t(x|z),

where I(·) is the indicator function, rt(x|y)=p(Xt=x|Y=y) and rt(w,x|y)=p(Xt−1=w,Xt=x|Y=y). Let Rt(y), t=2,…,s, be the c×c matrix with elements rt(w,x|y) arranged by letting w run by row and x by column. This matrix may be easily computed through the following backward recursion which is well known in the hidden Markov chain literature (see Baum et al. (1970), Levinson et al. (1983) and MacDonald and Zucchini (1997), Section 2.2):

R_t(y) = diag{q(y_{t−1})} Π diag(φ_{t,y_t}) diag{r(y_{>t})} / p(y),  t = 2,…,s,    (11)

where y>t={yj,j=t+1,…,s} and

r(y_{>s}) = 1_c,    r(y_{>t−1}) = Π diag(φ_{t,y_t}) r(y_{>t}),  t = s,…,2.    (12)

When s is very large, this recursion may require a dummy renormalization. This may again be performed by multiplying the quantity that is computed at any step of expression (12) by a suitable constant a. The resulting vector that is computed at the tth step, r*(y_{>t}), may be used instead of r(y_{>t}) in equation (11), provided that q(y_{t−1}) and p(y) are replaced by the corresponding renormalized quantities q*(y_{t−1}) and p*(y).
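Combining the forward vectors q(y_t) of recursion (3) with the backward vectors r(y_{>t}) of recursion (12) yields the posterior probabilities r_t(x|y) needed at the E-step. A minimal sketch without renormalization, adequate for short sequences (function name ours):

```python
import numpy as np

def posteriors(lam, Pi, phi):
    """r_t(x|y) = p(X_t = x | Y = y) from the forward vectors q(y_t)
    and the backward vectors r(y_{>t}); phi[t][x] holds
    p(Y_t = y_t | X_t = x) for the observed response configuration y."""
    s, c = len(phi), len(lam)
    q = np.zeros((s, c))
    q[0] = phi[0] * lam                          # q(y_1)
    for t in range(1, s):
        q[t] = phi[t] * (Pi.T @ q[t - 1])        # forward recursion (3)
    r = np.zeros((s, c))
    r[s - 1] = np.ones(c)                        # r(y_{>s}) = 1_c
    for t in range(s - 2, -1, -1):
        r[t] = Pi @ (phi[t + 1] * r[t + 1])      # backward recursion (12)
    p_y = q[s - 1].sum()                         # p(y)
    return q * r / p_y                           # row t holds r_{t+1}(.|y)
```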

Finally, note that any M-step consists in solving a series of simple maximization problems which have a unique solution. Therefore, along the same lines as Shi et al. (2005), it is possible to prove that the observed log-likelihood l(θ) has increased after any EM step and that the above algorithm converges to a local maximum of this function. However, this local maximum cannot be guaranteed to correspond to the global maximum since, as for any other latent variable model, the likelihood may be multimodal. As usual, this problem may be addressed by trying different initializations of the algorithm and then choosing θ^, the parameter value which at convergence gives the highest value of l(θ), as the maximum likelihood estimate of θ. However, a small simulation study, the results of which are not reported here, showed us that the chance of there being more than one local maximum is usually low when the number of observations is large in comparison with the number of parameters and the model assumed holds.

3.1.1. Initialization of the algorithm

We observed that an effective strategy to initialize the above EM algorithm is by performing the first E-step with probabilities φt,y|x, λx and πx|w chosen as follows.

  • (a)

    φ_{t,y|x} = e_{x,y} n_{t,y} / Σ_j e_{x,j} n_{t,j}, where n_{t,y} is the observed frequency of Y_t = y and the coefficient e_{x,y} increases with y when x > (c+1)/2, decreases with y when x < (c+1)/2 and is constant in y when x = (c+1)/2;

  • (b)

    λx=1/c for any x;

  • (c)

    πx|w=1−τ when w=x and πx|w=τ/(c−1) otherwise, where τ is a suitable constant between 0 and 1.

If it is necessary to try different initializations of the EM algorithm, these probabilities may be generated at random by drawing any block of them summing to 1 from a Dirichlet distribution with parameters that are defined according to the rules that are given above. For instance, the starting value of λ could be drawn from a Dirichlet distribution with parameter vector proportional to 1c, the first row of Π could be drawn from a Dirichlet distribution with parameter vector proportional to ((1−τ),τ/(c−1),…,τ/(c−1)) and so on.
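Such random starting values can be drawn as follows (a sketch; the concentration parameter `scale`, which controls how close the draws stay to rules (a)–(c), is an arbitrary choice of ours):

```python
import numpy as np

def random_start(c, s, d, tau=0.2, scale=10.0, rng=None):
    """Random starting probabilities drawn from Dirichlet distributions
    whose parameter vectors are proportional to the deterministic rules
    (a)-(c); larger `scale` keeps the draws closer to those rules."""
    rng = np.random.default_rng(rng)
    # (b) initial probabilities centred on the uniform vector 1/c
    lam = rng.dirichlet(scale * np.ones(c))
    # (c) transition rows centred on (1 - tau) on the diagonal
    Pi = np.zeros((c, c))
    for w in range(c):
        mean = np.full(c, tau / (c - 1))
        mean[w] = 1.0 - tau
        Pi[w] = rng.dirichlet(scale * c * mean)
    # conditional response probabilities, here simply uniform-centred
    phi = rng.dirichlet(scale * np.ones(d), size=(s, c))
    return lam, Pi, phi
```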

3.2. Fisher information matrix, standard errors and local identifiability

The Fisher information matrix of the observed data may be expressed as

F(θ) = n Q(θ)′ diag(p)^{−1} Q(θ),    (13)

where p is a vector with elements p(y) for any y arranged in lexicographical order and Q(θ) is the derivative matrix of this vector with respect to θ (see  Appendix A.2). In particular, we are usually interested in F(θ^), i.e. the information matrix computed at the maximum likelihood estimate of θ, which may be used for computing standard errors and checking identifiability; this matrix is also used within the testing procedure that is illustrated in Section 4. When s is large, F(θ^) cannot be computed as in equation (13). In this case, we can validly use the empirical variance–covariance matrix of the score, instead of F(θ^); see also McLachlan and Peel (2000), Section 2.15.

The standard error for a parameter estimate may be computed as the square root of the corresponding diagonal element of F(θ̂)^{−1}. However, if the true value of θ is close to the boundary of the parameter space Θ, in this way we may considerably overestimate the true standard error. Knowledge of the standard errors is, however, relatively less important in this situation, since the distribution of the estimators may depart significantly from normality and therefore, even when these standard errors are known, they cannot be validly used to construct confidence intervals and to test hypotheses in the usual way; for a discussion of this issue see Silvapulle and Sen (2004), Section 4.9.

The information matrix can also be used for checking local identifiability of the model at θ^; this is a weaker condition than global identifiability, on which literature on latent variable models has focused for a long time (see, for example, McHugh (1956) and Goodman (1974)). In particular, a commonly accepted procedure for checking that the model is locally identifiable at θ^ consists in checking that F(θ^) is of full rank. This is essentially equivalent to checking that the Jacobian Q(θ^) is of full rank; see also Rothenberg (1971).

3.3. Estimation of class membership probabilities and classification

Once a certain LM model has been fitted, the element λ̂_{t,x}, x=1,…,c, of the vector λ̂_t = (Π̂′)^{t−1} λ̂ is an estimate of the proportion of subjects who are in latent class x at the tth occasion. When t=1, this vector equals λ̂; the same occurs, regardless of t, when the transition matrix is diagonal. Moreover, when we are dealing with an LM model that is based on parameterization (5), we can estimate the average ability of the population at occasion t as

ψ̄_t = Σ_x λ̂_{t,x} ψ̂_x.

From the EM algorithm we also obtain, for any x and t and any response configuration y that has been observed at least once, the estimate r̂_t(x|y) of the conditional probability of being in latent class x at occasion t given y. By using these posterior estimates we can assign a subject to a given latent class on the basis of the response configuration that he or she provided or, when the model is based on a Rasch parameterization, estimate his or her ability at occasion t as Σ_x r̂_t(x|y) ψ̂_x.

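Both quantities of this subsection reduce to simple matrix operations, since λ̂_t = (Π̂′)^{t−1} λ̂; for instance (function names ours):

```python
import numpy as np

def class_sizes(lam, Pi, t):
    """lambda_t = (Pi')^{t-1} lambda: latent class sizes at occasion t."""
    return np.linalg.matrix_power(Pi.T, t - 1) @ lam

def mean_ability(lam, Pi, psi, t):
    """Average ability at occasion t under a Rasch parameterization (5)."""
    return class_sizes(lam, Pi, t) @ psi
```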
4. Testing linear hypotheses on the transition probabilities

In this section, we deal with the asymptotic distribution of the LR statistic for testing a hypothesis of the type

H_0: Mβ = 0_m,

with M denoting a full rank matrix of dimension m×b, on the latent process parameters. Note that we are not formulating a hypothesis directly on the transition probabilities, but on a reduced vector of parameters on which they depend through a linear model of type (7). This gives more flexibility to the approach. We can test, for instance, the hypothesis that the transition matrix Π is diagonal, and therefore an LC formulation holds, against the alternative that Π is symmetric or has all the off-diagonal elements equal to each other (see example 4). We also recall that the parameter space of β, ℬ, is specified by means of inequality constraints (see Section 2.3) which, under hypothesis H_0, may imply additional equality constraints on β with respect to those formulated as Mβ = 0_m. This happens, for instance, when M = 1′_b. In this case, only β = 0_b jointly satisfies β ≥ 0_b and Mβ = 0_m, even if M has only one row. To avoid these situations, and without loss of generality, we require the dimension of the space ℬ_0 = ℬ ∩ {β: Mβ = 0_m} to be equal to b−m. It is also convenient to reformulate more precisely the testing problem that we are dealing with as H_0: θ ∈ Θ_0 against H_1: θ ∈ Θ∖Θ_0, where Θ is the parameter space of the model assumed and Θ_0 = 𝒜 × ℬ_0 × 𝒞.

The LR statistic for testing H0 against H1 may be expressed as

D = 2{l(θ̂) − l(θ̂_0)},    (14)

where θ̂_0 is the maximum likelihood estimate of θ under the constraint Mβ = 0_m and θ̂ is the unconstrained estimate. Note that θ̂_0 may be computed through the same EM algorithm as that which has been illustrated in the previous section for computing θ̂; the only difference with respect to the original formulation is that we must use WM̄ instead of W as the design matrix for the linear model on the transition probabilities, where M̄ is a b×(b−m) full rank matrix spanning the null space of M. Now let θ_0 = (α_0, β_0, γ_0) be the true value of θ under H_0 and assume that TWβ_0 < 1_c and Kγ_0 > 0_k. If we also have that β_0 > 0_b, θ_0 is an interior point of Θ and therefore it follows from standard inferential results that D has an asymptotic χ²_m-distribution. This occurs, for instance, when we are testing an LM model that is based on a transition matrix of type (8) against the same LM model in which this matrix is unconstrained; in this case, D has an asymptotic χ²_{c(c−1)−1}-distribution under the null hypothesis. In contrast, if some elements of β_0 are equal to 0, θ_0 is on the boundary of Θ and therefore we are in a non-standard case. This occurs, for instance, when we are testing an LC model against an LM model.

The asymptotic distribution of an LR test statistic when the parameters are on the boundary of the parameter space has been studied by many researchers, such as Chernoff (1954), Shapiro (1985) and Self and Liang (1987); for a review see also Silvapulle and Sen (2004). In our case, let H(θ) = F(θ)/n be the average information matrix of the assumed LM model and let G be the block of g rows of the matrix I_b such that the vector Gβ contains the elements of β which are constrained to 0 under hypothesis H0; note that GM̄ = O, with M̄ the matrix spanning the null space of M introduced above. Also let J be the block of the remaining b−g rows of I_b and assume that the elements of Jβ0 are strictly positive under H0. Thus, among all the inequality constraints involved in the specification of the parameter space, only those formulated as Gβ ≥ 0_g are active, in the sense that only these constraints have some chance of being violated as n grows indefinitely. Consequently, following Silvapulle and Sen (2004), proposition 4.8.2 (see also Self and Liang (1987), theorem 3), we find that D is asymptotically distributed as
Q = min_{θ∈ℒ} (V − θ)′H(θ0)(V − θ) − min_{θ∈𝒫} (V − θ)′H(θ0)(V − θ),
where ℒ = {θ : Mβ = 0_m} is a linear space nested in the polyhedral convex cone 𝒫 = {θ : Gβ ≥ 0_g} and V ∼ N{0, H(θ0)⁻¹}. It turns out that the distribution of Q, and therefore the asymptotic distribution of D, is of χ̄² type, which is well known in constrained statistical inference (for a review see Shapiro (1988) and Silvapulle and Sen (2004), chapter 3). This result is stated more precisely in the following theorem.

Theorem 1. Provided that H(θ0) is of full rank and that hypothesis H0 holds with Jβ0 > 0_{b−g}, TWβ0 < 1_c and Kγ0 > 0_k, the asymptotic distribution of the LR statistic (14) is the same as that of Z1 + Z2, where Z1 and Z2 are two independent random variables with
Z1 ∼ χ²_{m−g},   Z2 ∼ χ̄²(Σ0, 𝒪g),
where Σ0 = GH(θ0)⁻¹G′ and 𝒪g denotes the non-negative orthant of dimension g.

The distribution function of Z1+Z2 may be expressed as

pr(Z1 + Z2 ≤ z) = ∑_{j=0,…,g} wj(g, Σ0) pr(χ²_{m−g+j} ≤ z),   (15)

where wj(g, Σ0), j = 0,…,g, are suitable weights which may be computed through a simple Monte Carlo simulation procedure, such as that outlined in Silvapulle and Sen (2004), Section 3.5. By exploiting this expression, we can easily compute an asymptotic p-value for D, once θ0 in Σ0 has been replaced by its consistent estimate θ̂0. Note also that, even though the weights in equation (15) need to be computed by simulation, the resulting procedure for computing a p-value for D is much faster than a parametric bootstrap, which requires estimating the same models several times and is therefore impractical in most real applications.
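A minimal sketch of such a Monte Carlo procedure, assuming the mixture representation of theorem 1: draw z ∼ N(0, Σ0), project it onto the non-negative orthant in the Σ0⁻¹ metric (a non-negative least squares problem) and estimate wj as the proportion of draws whose projection has j positive components. The function names and defaults below are illustrative.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.stats import chi2

def chibar_weights(sigma0, reps=20000, seed=0):
    """Monte Carlo estimate of the chi-bar-square weights w_j(g, Sigma0)."""
    rng = np.random.default_rng(seed)
    g = sigma0.shape[0]
    # (z - b)' Sigma0^{-1} (z - b) = ||A z - A b||^2 with A'A = Sigma0^{-1},
    # so projecting z onto the orthant is a non-negative least squares fit.
    A = np.linalg.cholesky(np.linalg.inv(sigma0)).T
    counts = np.zeros(g + 1)
    for z in rng.multivariate_normal(np.zeros(g), sigma0, size=reps):
        b, _ = nnls(A, A @ z)
        counts[int(np.sum(b > 1e-10))] += 1
    return counts / reps

def chibar_pvalue(d, m, sigma0, reps=20000, seed=0):
    """Asymptotic p-value pr(Z1 + Z2 >= d), Z1 ~ chi2_{m-g}, per equation (15)."""
    g = sigma0.shape[0]
    w = chibar_weights(sigma0, reps, seed)
    p = 0.0
    for j, wj in enumerate(w):
        df = m - g + j
        p += wj * (chi2.sf(d, df) if df > 0 else 0.0)  # chi2_0 has mass at 0
    return p
```

As a sanity check, with Σ0 = I2 the projection acts componentwise and the weights approach (1/4, 1/2, 1/4), the orthant probabilities of a bivariate standard normal.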

As a particular case of theorem 1, we have the case in which no elements of β are constrained to 0 under hypothesis H0 (g = 0), so that the LR statistic D has a χ²_m asymptotic distribution. Two other interesting cases are dealt with in the following corollary.

Corollary 1. Under the same regularity conditions as for theorem 1, the LR statistic D for testing an LC model based on parameterization (4) against the corresponding LM version with unconstrained transition matrix Π has asymptotic distribution

χ̄²(Σ0, 𝒪_{c(c−1)}),   (16)

with Σ0 equal to the block of H(θ0)−1 corresponding to the parameters in β. When instead the transition matrix of the LM model is parameterized through a single parameter (b=1), D has asymptotic distribution

(1/2)χ²₀ + (1/2)χ²₁.   (17)

The latter case, in which the weights of the χ¯2-distribution are explicitly given, occurs, for instance, when the larger model that is involved in the comparison is an LM model based on a transition matrix of type (8).

4.1. A simulation study

To assess the accuracy of the finite sample inference based on the asymptotic results that have been illustrated above, we carried out a simulation study of the following LR statistics:

  • (a)

    statistic D1 between an unconstrained LC model with c=2 classes and its LM version,

  • (b)

    statistic D2 between an LC Rasch model with c=3 classes and its LM version with transition matrix constrained as in equation (8) and

  • (c)
    statistic D3 between an LM model with c=3 states and transition matrix of the type
    and the same model with transition matrix of the type

    under the assumption that, conditionally on the latent process, the response variables have the same distribution.

In particular, for various values of s and n and in the case of binary response variables, we generated 2000 random samples from a set of suitably chosen LM models. For each of these samples, we computed the value of each of the LR statistics above, together with the corresponding p-value based on the asymptotic null distribution, which is of type (16) for D1, of type (17) for D2 and of type χ²₁ + χ̄²(Σ0, 𝒪₂), with Σ0 suitably defined, for D3.

In the first series of these simulations, random samples were generated from the smaller model considered within any LR statistic (e.g. the LC model with c=2 classes for D1), with parameters chosen as follows:

  • (a)

for statistic D1, ϕ_{t,2|1} = 1/3 and ϕ_{t,2|2} = 2/3 for all t, and π_x = 1/2 for all x;

  • (b)

for statistic D2, ϕ_{t,2|1} = 1/4, ϕ_{t,2|2} = 1/2 and ϕ_{t,2|3} = 3/4 for all t, and π_x = 1/3 for all x;

  • (c)

for statistic D3, ϕ_{t,2|1} = 1/4, ϕ_{t,2|2} = 1/2 and ϕ_{t,2|3} = 3/4 for all t, π_x = 1/3 for all x, and β = 0.1.
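The sampling scheme above can be sketched as follows; the generator reproduces the null setting for D1 (an LC model, i.e. an identity transition matrix), with the helper name and the sample size chosen for illustration.

```python
import numpy as np

def simulate_lm(n, s, lam, Pi, phi, seed=0):
    """Draw n binary response sequences of length s from an LM model:
    lam[x] = initial state probabilities, Pi[w, x] = transition probabilities,
    phi[t, x] = pr(Y_t = 1 | X_t = x)."""
    rng = np.random.default_rng(seed)
    c = len(lam)
    Y = np.empty((n, s), dtype=int)
    for i in range(n):
        x = rng.choice(c, p=lam)                  # initial latent state
        for t in range(s):
            Y[i, t] = int(rng.random() < phi[t, x])
            if t < s - 1:
                x = rng.choice(c, p=Pi[x])        # latent transition
    return Y

# Null model for D1: LC model with c = 2 classes (no transitions),
# phi_{t,2|1} = 1/3, phi_{t,2|2} = 2/3 for all t and pi_x = 1/2.
s = 6
Y = simulate_lm(1000, s, lam=[0.5, 0.5], Pi=np.eye(2),
                phi=np.tile([1 / 3, 2 / 3], (s, 1)), seed=1)
print(Y.mean())   # close to 1/2 by construction
```

Setting `Pi` to a non-identity matrix turns the same generator into a sampler for the alternative LM models used in the power study.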

Thus, it was possible to compare the actual significance level of each test with the nominal level. For nominal levels below 0.25, this comparison is shown in Fig. 1 for s=6, 12, 24 and n=100, 200, 400, and in Fig. 2 for s=100 and n=250, 500, 1000. In the latter case we used β=0.02 for the model under which the null distribution of D3 was simulated.

Fig. 1. Comparison between actual and nominal significance levels for the tests based on the LR statistics D1, D2 and D3 with s=6, 12, 24 and n=100 (·······), n=200 (–––) and n=400 (——)

Fig. 2. Comparison between actual and nominal significance levels for the tests based on the LR statistics D1, D2 and D3 with s=100 and n=250 (·······), n=500 (–––) and n=1000 (——)

On the basis of the results from these simulations, we can conclude that, provided that s is moderate (Fig. 1), the asymptotic null distribution of D1, D2 and D3 is a reasonable approximation of the finite sample distribution, even when the sample size is small in comparison with the number of possible configurations of the response variables. With s large (Fig. 2), the quality of the approximation of the asymptotic null distribution is still good for statistic D3, but becomes worse for D1 and D2. This difference is because the models that were fitted for computing D1 and D2 have a number of parameters that increase with s, whereas this does not occur for D3. Note, however, that, even though the quality of the approximation of the asymptotic distribution is not as good for D1 and D2, the actual significance level is smaller than the nominal level and therefore the asymptotic distribution may safely be used in this case also.

To investigate the power of the LR tests in question, we also considered the following models:

  • (a)

for statistic D1, an LM model with c=2 states and transition matrix of type (8) with parameters ϕ_{t,2|1} = 1/3 and ϕ_{t,2|2} = 2/3 for all t, π_x = 1/2 for all x, and β between 0 and 0.05 (for s=6, 12, 24) and between 0 and 0.0005 (for s=100);

  • (b)

for statistic D2, an LM model with c=3 states and transition matrix of type (8) with parameters ϕ_{t,2|1} = 1/4, ϕ_{t,2|2} = 1/2 and ϕ_{t,2|3} = 3/4 for all t, π_x = 1/3 for all x, and β between 0 and 0.05 (for s=6, 12, 24) and between 0 and 0.0005 (for s=100);

  • (c)
    for statistic D3, an LM model with c=3 states and transition matrix

with parameters ϕ_{t,2|1} = 1/4, ϕ_{t,2|2} = 1/2 and ϕ_{t,2|3} = 3/4 for all t, π_x = 1/3 for all x, and β1 = 0.1 and β2 between 0 and 0.05 (for s=6, 12, 24), and β1 = 0.02 and β2 between 0 and 0.0005 (for s=100).

The estimated rate of rejection of each test at the 5% nominal level is shown in Fig. 3 for s=6, 12, 24 and in Fig. 4 for s=100. As may be expected, the power of each test increases with the number of response variables (s) and the sample size (n). With s=6 and n=100, for instance, the test based on the LR statistic D2 has a probability of about 1/4 of rejecting the null hypothesis when β=0.05 (Fig. 3). This probability becomes equal to about 3/4 when s=12 and is equal to 1 when s=24. Moreover, when s=100 (Fig. 4), this test has a very large probability of detecting even very small deviations from the null hypothesis, such as β=0.0001.

Fig. 3. Rejection rates of the tests based on the LR statistics D1, D2 and D3 for s=6, 12, 24 and n=100 (·······), n=200 (–––) and n=400 (——)

Fig. 4. Rejection rates of the tests based on the LR statistics D1, D2 and D3 for s=100 and n=250 (·······), n=500 (–––) and n=1000 (——)

5. Empirical illustration

To illustrate the approach that is proposed in this paper, we describe the analysis of two data sets, the first of which concerns the responses of a group of students to an educational test, and the second the use of marijuana among young people.

5.1. Educational testing data set

The educational testing data set concerns a sample of n=1510 examinees who responded to a set of s=12 items on mathematics, extrapolated from a larger data set that was collected in 1996 by the Educational Testing Service within a US project called the National Assessment of Educational Progress; for a more detailed description, see Bartolucci and Forcina (2005). The interest here is in testing for the presence of phenomena, such as those described in example 1 of Section 2.1, which may cause a violation of the basic assumptions of item response theory models. Since the items were administered to all the examinees in the same order, it is appropriate to use an LM model for studying phenomena of this type. Note also that, since the contingency table on which the data may be displayed is very sparse, a direct comparison of an LC model that is based on a parameterization of type (4) with the saturated model is not as reliable as a comparison with an LM model that is based on the same parameterization. Thus, the latter may be seen as a valid alternative to the saturated model in assessing the goodness of fit of an LC model.

For this data set we chose c=3 latent states, corresponding to the number of classes for which the basic LC model has the smallest Bayes information criterion (Schwarz, 1978). With this number of states, the basic LM model has a deviance of 1899.16 with respect to the saturated model. This deviance may be considered adequate if compared with the number of degrees of freedom, equal to 4051. The estimated logits ηt|x under this model are shown in Table 1, and the estimated initial probabilities λx and transition probabilities πx|w are shown in Table 2.

Table 1

Estimated logits η̂_{t|x} under the basic LM model and the corresponding LC model

          η̂_{t|x} (LM)                η̂_{t|x} (LC)
t      x=1      x=2      x=3       x=1      x=2      x=3
1    −0.728    1.133    2.250    −0.711    1.114    2.241
2    −1.092    1.180    2.794    −1.087    1.168    2.783
3    −0.995    0.149    1.817    −1.012    0.153    1.801
4     0.078    2.297    3.509     0.085    2.292    3.508
5    −1.346   −0.508    0.594    −1.344   −0.500    0.587
6    −0.378    0.907    2.084    −0.361    0.911    2.081
7    −0.816    0.177    1.578    −0.807    0.190    1.577
8    −2.410   −0.571    1.772    −2.303   −0.536    1.774
9    −1.533    0.431    2.844    −1.415    0.453    2.846
10   −1.440    0.384    1.854    −1.306    0.410    1.854
11   −2.804   −1.784    0.027    −2.743   −1.656    0.033
12   −3.217   −1.411   −0.248    −3.010   −1.309   −0.251
Table 2

Estimated initial probabilities λ̂_x and transition probabilities π̂_{x|w} under the basic LM model and the corresponding LC model

                                    π̂_{x|w} (LM)               π̂_{x|w} (LC)
x   λ̂_x (LM)   λ̂_x (LC)    w     x=1     x=2     x=3       x=1     x=2     x=3
1    0.178      0.166       1    0.982   0.018   0.000     1.000   0.000   0.000
2    0.444      0.435       2    0.000   0.987   0.013     0.000   1.000   0.000
3    0.378      0.400       3    0.000   0.003   0.997     0.000   0.000   1.000

According to the estimates in Table 1, the first latent class may be interpreted as that of the least capable subjects, whereas the third class may be interpreted as that of the most capable subjects. Moreover, according to the results in Table 2, the second class contains the largest number of subjects and there is high persistence of the latent states, i.e. the probability of a transition from one latent state to another is always very low. The largest off-diagonal element of the estimated transition matrix, π̂_{2|1}, is in fact equal to 0.018.

To simplify this model, we first tried to impose an additive structure of type (5) on the logits η_{t|x}. In this way, however, the deviance increases by 115.02 with a p-value, computed on the basis of the χ²₂₂-distribution, that is very close to 0; this restriction must therefore be rejected. We then tried to impose some restrictions on the latent structure. In particular, we first imposed the constraint that all the off-diagonal elements of Π are equal to each other, as in equation (8). The estimate of the common transition probability is β̂ = 0.001, whereas the LR statistic D with respect to the initial LM model is equal to 2.24 with a p-value, computed on the basis of the χ²₅-distribution, equal to 0.814. Consequently, this restriction cannot be rejected. Finally, we imposed the restriction of no transitions between latent states. For the LC model resulting from this restriction, we have an LR statistic of 0.61 with respect to the LM model with constant transition probabilities. According to corollary 1, this statistic has a null asymptotic distribution of type (17), so that the corresponding p-value is 0.216; consequently, the LC model cannot be rejected and there is no evidence of violations of the local independence assumption. A similar conclusion may be reached by directly comparing this model with the basic LM model, in which the transition matrix is unconstrained. In this case, D = 2.86 with a p-value of 0.261 computed on the basis of distribution (16). The estimates of the parameters of the LC model are shown in Tables 1 and 2 together with those of the basic LM model.
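As a quick check, the p-value for the LC model can be recomputed directly under the single-parameter null distribution (17), an equal-weight mixture of a point mass at zero and a χ²₁ variate; the small discrepancy with the reported 0.216 is attributable to rounding of D.

```python
from scipy.stats import chi2

# LR statistic for the LC model against the LM model with constant
# transition probabilities, tested against (1/2)chi2_0 + (1/2)chi2_1:
d = 0.61
p = 0.5 * chi2.sf(d, df=1)   # the chi2_0 component contributes 0 for d > 0
print(round(p, 3))           # about 0.217
```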

5.2. Marijuana consumption data set

The marijuana consumption data set has been taken from five annual waves (1976–1980) of the National Youth Survey (Elliot et al., 1989) and is based on n=237 respondents who were aged 13 years in 1976. The use of marijuana is measured through s=5 ordinal variables, one for each annual wave, with d=3 levels equal to 1 for never in the past year, 2 for no more than once a month in the past year and 3 for more than once a month in the past year. A variety of models which may be used for the analysis of this data set has been illustrated by Vermunt and Hagenaars (2004). In particular, they mention

  • (a)

    marginal models, which represent an extension of log-linear models,

  • (b)

    transitional models, which include Markov chain models, and

  • (c)

    random-effect models, which include the LC model.

However, as mentioned by Vermunt and Hagenaars (2004), an LM approach is desirable since it combines many characteristics of the approaches above and may be appreciated for its flexibility and easy interpretation. An analysis of this type is illustrated below.

Also for this data set, we used LM models with c=3 latent states. With this number of states, the basic LM model has a deviance of 85.80 with 204 degrees of freedom, with respect to the saturated model. The corresponding parameter estimates are shown in Tables 3 and 4. To simplify this model, we assumed the following parameterization for the conditional global logits of any response variable Yt given the latent state Xt=x:

η_{t,y|x} = ψ_x + δ_y,   y = 1, 2,   (18)

where ψ_x may be interpreted as the tendency to use marijuana for a subject in latent class x, and the parameter δ_y is common to all the response variables Y_t because they are repeated measurements of the same phenomenon under the same circumstances. The deviance of the resulting model with respect to the initial model is 23.58 with a p-value equal to 0.600, and therefore this restriction cannot be rejected. We then tried to add some restrictions on the latent structure. In particular, for the hypothesis π_{3|1} = π_{1|3} = 0, which implies that the transition matrix is tridiagonal, the LR statistic with respect to the previous model is equal to 2.02 with a p-value of 0.172, whereas this statistic is equal to 4.67 with a p-value of 0.059 for the hypothesis that the transition matrix is upper triangular. Finally, for the null hypothesis of no transitions, the LR statistic is equal to 233.73 with a p-value less than 10⁻⁴. In the light of these results, we chose as final model that based on parameterization (18) and on the hypothesis that the transition matrix is tridiagonal; the latter implies that a transition from latent state w to latent state x is possible only when x = w−1 or x = w+1. The parameter estimates of this model are shown in Tables 5 and 6.
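To see what the estimates in Table 5 imply, the conditional response probabilities can be recovered by inverting the global logits. The sketch below assumes the additive form of parameterization (18), with the three levels coded 0, 1 and 2 as in the response configuration y = (0 0 1 2 2) used later.

```python
import numpy as np
from scipy.special import expit  # inverse logit

# Final-model estimates from Table 5; under the additive global logit form,
# pr(Y_t >= y | X_t = x) = expit(psi_x + delta_y) for y = 1, 2.
psi = np.array([0.000, 5.751, 10.876])
delta = np.array([-5.011, -8.177])

for x in range(3):
    ge = expit(psi[x] + delta)                    # pr(Y >= 1), pr(Y >= 2)
    probs = [1 - ge[0], ge[0] - ge[1], ge[1]]     # pr(Y = 0), pr(Y = 1), pr(Y = 2)
    print(f"state {x + 1}:", np.round(probs, 3))
```

The output orders the states by increasing tendency to use marijuana: the first state corresponds to near-certain non-use and the third to frequent use, matching the interpretation of ψ_x in the text.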

Table 3

Estimated global logits η̂_{t,y|x} under the basic LM model

             x=1                   x=2                   x=3
t        y=1      y=2          y=1      y=2          y=1      y=2
1      −3.548   −5.236       −3.672  −13.816        1.128   −1.187
2      −4.066  −13.816       −0.393  −13.816        3.082    0.445
3     −13.122  −13.816        2.694   −2.650        2.752    1.805
4      −2.942   −4.092        1.025  −13.816       13.816    3.090
5      −2.018   −3.580        1.140  −13.816       13.816    4.989
Table 4

Estimated initial probabilities λ̂_x and transition probabilities π̂_{x|w} under the basic LM model

                       π̂_{x|w}
x     λ̂_x     w     x=1     x=2     x=3
1    0.791     1    0.911   0.068   0.021
2    0.137     2    0.090   0.746   0.163
3    0.072     3    0.000   0.128   0.872
Table 5

Estimates of the parameters ψ_x and δ_y for the final LM model

x      ψ̂_x      y      δ̂_y
1     0.000     1    −5.011
2     5.751     2    −8.177
3    10.876
Table 6

Estimated initial probabilities λ̂_x and transition probabilities π̂_{x|w} for the final LM model

                       π̂_{x|w}
x     λ̂_x     w     x=1     x=2     x=3
1    0.896     1    0.835   0.165   0.000
2    0.089     2    0.070   0.686   0.244
3    0.015     3    0.000   0.082   0.918

In Fig. 5 we also show, for t = 1,…,s, the estimates λ̂_{t,x} of the marginal probabilities of the latent states and the estimate ψ̄_t of the average tendency to use marijuana, computed as described in Section 3.3. Finally, in Fig. 6 we show, again for t = 1,…,s, the posterior estimates r̂_t(x|y) and ψ̂_t(y), given the response configuration y = (0 0 1 2 2). Note that ψ̂_t(y) increases much faster over time than ψ̄_t and therefore we can conclude that, for a subject with this response configuration, the tendency to use marijuana increases much faster than the average. At occasion t, the distance from the average may simply be measured by ψ̂_t(y) − ψ̄_t.
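The posterior probabilities r̂_t(x|y) follow from the standard forward–backward recursions of the hidden Markov literature (Baum et al., 1970); the sketch below is generic, with illustrative inputs rather than the fitted marijuana model.

```python
import numpy as np

def posterior_states(lam, Pi, B):
    """Forward-backward recursions for one observed sequence:
    B[t, x] = pr(Y_t = y_t | X_t = x); returns r_t(x | y) for all t, x."""
    s, c = B.shape
    f = np.zeros((s, c))
    b = np.ones((s, c))
    f[0] = lam * B[0]
    for t in range(1, s):                    # forward pass
        f[t] = (f[t - 1] @ Pi) * B[t]
    for t in range(s - 2, -1, -1):           # backward pass
        b[t] = Pi @ (B[t + 1] * b[t + 1])
    post = f * b                              # joint up to a constant
    return post / post.sum(axis=1, keepdims=True)

# Illustrative two-state example.
lam = np.array([0.5, 0.5])
Pi = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
print(posterior_states(lam, Pi, B).round(3))
```

Quantities such as ψ̂_t(y) are then posterior averages of the state-specific parameters weighted by these probabilities.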

Fig. 5. Estimated probabilities λ̂_{t,x} and estimated average tendency to use marijuana ψ̄_t

Fig. 6. Estimated probabilities r̂_t(x|y) and estimated tendency to use marijuana ψ̂_t(y) for a subject with response configuration y = (0 0 1 2 2)

6. Conclusions

A general framework has been proposed for the formulation and testing of linear hypotheses on the transition probabilities of a class of LM models for discrete variables having a longitudinal structure. These hypotheses may be combined with restrictions on the conditional distribution of the response variables given the latent process, which may be formulated as in a generalized linear model. To test linear hypotheses on the transition probabilities, we recommend the use of an LR statistic and we prove that the null asymptotic distribution of this statistic is of χ̄² type, i.e. a mixture of χ²-distributions with weights that may be computed simply by simulation. This is one of the main findings of the paper. Note that we have dealt explicitly only with hypotheses on the latent process, since hypotheses on the conditional distribution of the response variables given the latent process may be tested on the basis of standard inferential results when this distribution is parameterized as in our approach.

The approach proposed here covers the case of independent and identically distributed observations and therefore it is directly applicable in a contingency table setting. However, it may be easily extended to the case in which individual covariates are present. For instance, we could parameterize the conditional distribution of any response variable given the corresponding latent variable as in a generalized linear model with coefficients depending on the latent variable. In practice, this results in a model with occasion-specific random effects which follow a first-order Markov chain. The results concerning the distribution of the LR statistic for testing hypotheses on the transition matrix of the latent process hold for these models as well, but with minor adjustments.

Another possible extension of the approach is to the case of linear inequality constraints on the transition probabilities. Estimating an LM model under constraints of this type is not a difficult task and may be performed by means of a modified version of the EM algorithm illustrated in Section 3. In particular, to take these constraints into account, the maximization of l2(β) during the M-step of this algorithm must be performed by solving a series of constrained least squares problems similar to those solved for the maximization of l3(γ). However, the asymptotic distribution of the LR statistic for testing hypotheses of this type is not in general of χ̄² type and may be difficult to use in practice for computing p-values. For this reason, and since hypotheses of particular interest do not seem expressible through inequality constraints on the transition probabilities, we explicitly considered only the case of equality constraints.
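Each such subproblem is a least squares fit under linear inequality constraints, which a bounded least squares solver handles directly; the design matrix and working response below are illustrative, not taken from any model in the paper.

```python
import numpy as np
from scipy.optimize import lsq_linear

# One constrained least squares subproblem of the kind solved repeatedly in
# the M-step: minimize ||W b - z||^2 subject to b >= 0 (made-up inputs).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
z = np.array([0.2, -0.3, 0.1])

res = lsq_linear(W, z, bounds=(0.0, np.inf))
print(res.x)   # nonnegative solution; the second component hits the bound
```

Here the unconstrained solution has a negative second component, so the bounded solver returns a solution with that component pinned at zero.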

Acknowledgements

I thank the Joint Editor, an Associate Editor and three referees for their stimulating comments.

References

Archer, G. E. B. and Titterington, D. M. (2002) Parameter estimation for hidden Markov chains. J. Statist. Planng Inf., 108, 365–390.
Auranen, K., Arjas, E., Leino, T. and Takala, A. K. (2000) Transmission of pneumococcal carriage in families: a latent Markov process model for binary longitudinal data. J. Am. Statist. Ass., 95, 1044–1053.
Bartolucci, F. and Forcina, A. (2000) A likelihood ratio test for MTP2 within binary variables. Ann. Statist., 28, 1206–1218.
Bartolucci, F. and Forcina, A. (2005) Likelihood inference on the underlying structure of IRT models. Psychometrika, 70, 31–43.
Baum, L. E., Petrie, T., Soules, G. and Weiss, N. (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41, 164–171.
Bijleveld, C. C. J. H. and Mooijaart, A. (2003) Latent Markov modelling of recidivism data. Statist. Neerland., 57, 305–320.
Chernoff, H. (1954) On the distribution of the likelihood ratio. Ann. Math. Statist., 25, 573–578.
Cook, R. J., Ng, E. T. M. and Meade, M. O. (2000) Estimation of operating characteristics for dependent diagnostic tests based on latent Markov models. Biometrics, 56, 1109–1117.
Dardanoni, V. and Forcina, A. (1998) A unified approach to likelihood inference on stochastic orderings in a nonparametric context. J. Am. Statist. Ass., 93, 1112–1123.
De Boeck, P. and Wilson, M. (2004) Explanatory Item Response Models: a Generalized Linear and Nonlinear Approach. New York: Springer.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B, 39, 1–38.
Elliot, D. S., Huizinga, D. and Menard, S. (1989) Multiple Problem Youth: Delinquency, Substance Use, and Mental Health Problems. New York: Springer.
Goodman, L. A. (1974) Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.
Hambleton, R. K. and Swaminathan, H. (1985) Item Response Theory: Principles and Applications. Boston: Kluwer Nijhoff.
Humphreys, K. (1998) The latent Markov chain with multivariate random effects—an evaluation of instruments measuring labor market status in the British Household Panel Study. Sociol. Meth. Res., 26, 269–299.
Kelderman, H. and Rijkes, C. P. M. (1994) Loglinear multidimensional IRT models for polytomously scored items. Psychometrika, 59, 147–177.
Langeheine, R., Stern, E. and Van de Pol, F. (1994) State mastery learning: dynamic models for longitudinal data. Appl. Psychol. Measmnt, 18, 277–291.
Lazarsfeld, P. F. and Henry, N. W. (1968) Latent Structure Analysis. Boston: Houghton Mifflin.
Levinson, S. E., Rabiner, L. R. and Sondhi, M. M. (1983) An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell Syst. Tech. J., 62, 1035–1074.
Lindsay, B., Clogg, C. C. and Grego, J. (1991) Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. J. Am. Statist. Ass., 86, 96–107.
MacDonald, I. L. and Zucchini, W. (1997) Hidden Markov and Other Models for Discrete-valued Time Series. London: Chapman and Hall.
Mannan, H. R. and Koval, J. J. (2003) Latent mixed Markov modelling of smoking transitions using Monte Carlo bootstrapping. Statist. Meth. Med. Res., 12, 125–146.
McHugh, R. B. (1956) Efficient estimation and local identification in latent class analysis. Psychometrika, 21, 331–347.
McLachlan, G. and Peel, D. (2000) Finite Mixture Models. New York: Wiley.
Poulsen, C. S. (1982) Latent Structure Analysis with Choice Modeling Applications. Aarhus: Aarhus School of Business Administration and Economics.
Poulsen, C. S. (1990) Mixed Markov and latent Markov modelling applied to brand choice behaviour. Int. J. Res. Marktng, 7, 5–19.
Rasch, G. (1961) On general laws and the meaning of measurement in psychology. In Proc. 4th Berkeley Symp. Mathematical Statistics and Probability, vol. 4, pp. 321–333. Berkeley: University of California Press.
Rothenberg, T. J. (1971) Identification in parametric models. Econometrica, 39, 577–591.
Samejima, F. (1996) Evaluation of mathematical models for ordered polytomous responses. Behaviormetrika, 23, 17–35.
Schwarz, G. (1978) Estimating the dimension of a model. Ann. Statist., 6, 461–464.
Scott, S. L. (2002) Bayesian methods for hidden Markov models: recursive computing in the 21st century. J. Am. Statist. Ass., 97, 337–351.
Self, S. G. and Liang, K.-Y. (1987) Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Statist. Ass., 82, 605–610.
Shapiro, A. (1985) Asymptotic distribution of test statistics in the analysis of moment structures under inequality constraints. Biometrika, 72, 133–144.
Shapiro, A. (1988) Towards a unified theory of inequality constrained testing in multivariate analysis. Int. Statist. Rev., 56, 49–62.
Shi, N.-Z., Zheng, S.-R. and Guo, J. (2005) The restricted EM algorithm under inequality restrictions on the parameters. J. Multiv. Anal., 92, 53–76.
Silvapulle, M. J. and Sen, P. K. (2004) Constrained Statistical Inference: Inequality, Order, and Shape Restrictions. New York: Wiley.
Van de Pol, F. and Langeheine, R. (1990) Mixed Markov latent class models. In Sociological Methodology (ed. C. C. Clogg), pp. 213–247. Oxford: Blackwell.
Vermunt, J. K. and Hagenaars, J. A. (2004) Ordinal longitudinal data analysis. In Methods in Human Growth Research (eds R. C. Hauspie, N. Cameron and L. Molinari), pp. 374–393. Cambridge: Cambridge University Press.
Vermunt, J. K., Langeheine, R. and Böckenholt, U. (1999) Discrete-time discrete-state latent Markov models with time-constant and time-varying covariates. J. Educ. Behav. Statist., 24, 179–207.
Visser, I., Raijmakers, M. E. J. and Molenaar, P. C. M. (2002) Fitting hidden Markov models to psychological data. Scient. Programmng, 10, 185–199.
Wiggins, L. M. (1973) Panel Analysis: Latent Probability Models for Attitude and Behavior Processes. Amsterdam: Elsevier.

Appendix A

A.1. Relationship between π and ρ

We have that π = a + Aρ, where a is obtained by stacking the vectors a1,…,ac one below another, with ax denoting a c-dimensional column vector of 0s apart from the xth element equal to 1, and A is a block diagonal matrix with blocks A1,…,Ac, where Ax = Ī_{c,x} − ax1′_{c−1} and Ī_{c,x} is obtained by removing the xth column from the matrix Ic.
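The construction can be verified numerically; a sketch for c = 3 with illustrative off-diagonal values follows (the function name is ours, not the paper's).

```python
import numpy as np

def pi_from_rho(rho_rows):
    """Rebuild each row of the transition matrix from its off-diagonal
    elements via pi_x = a_x + A_x rho_x, with A_x = Ibar_{c,x} - a_x 1'_{c-1}."""
    c = len(rho_rows)
    Pi = np.zeros((c, c))
    for x, rho_x in enumerate(rho_rows):
        a_x = np.zeros(c)
        a_x[x] = 1.0
        Ibar = np.delete(np.eye(c), x, axis=1)       # I_c with column x removed
        A_x = Ibar - np.outer(a_x, np.ones(c - 1))
        Pi[x] = a_x + A_x @ np.asarray(rho_x)
    return Pi

# Each row has off-diagonal entries rho_x; the diagonal becomes 1 - sum(rho_x).
Pi = pi_from_rho([[0.1, 0.2], [0.05, 0.15], [0.3, 0.1]])
print(Pi)
```

The matrix Ī_{c,x} scatters the off-diagonal elements into their positions, while the term a_x1′_{c−1} subtracts their sum from the diagonal, so each row automatically sums to 1.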

A.2. Derivative of p with respect to θ

The derivative matrix Q(θ) has elements

arranged in the appropriate order, which may be computed on the basis of the same recursion as that defined in Section 2 to compute p(y). Let λ^(αj) denote the derivative of the vector λ with respect to αj, Π^(βj) that of Π with respect to βj and φ_{t,y}^(γj) that of φ_{t,y} with respect to γj. Consequently, p^(θ)(y) = q^(θ)(ys)′1, with θ corresponding to αj, βj or γj, where

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)