## Abstract

**Motivation:** Classification of tissues using static gene-expression data has received considerable attention. Recently, a growing number of expression datasets are measured as a time series. Methods that are specifically designed for this temporal data can both utilize its unique features (temporal evolution of profiles) and address its unique challenges (different response rates of patients in the same class).

**Results:** We present a method that utilizes hidden Markov models (HMMs) for the classification task. We use HMMs with less states than time points leading to an alignment of the different patient response rates. To focus on the differences between the two classes we develop a discriminative HMM classifier. Unlike the traditional generative HMM, discriminative HMM can use examples from both classes when learning the model for a specific class. We have tested our method on both simulated and real time series expression data. As we show, our method improves upon prior methods and can suggest markers for specific disease and response stages that are not found when using traditional classifiers.

**Availability:** Matlab implementation is available from http://www.cs.cmu.edu/~thlin/tram/

**Contact:**zivbj@cs.cmu.edu

## 1 INTRODUCTION

Several methods have been developed for classifying tissues using gene-expression data. Starting with the seminal work of Golub *et al.* (1999) which used ‘ideal profiles’ to classify acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) cancer samples, researchers have been developing and applying classification methods to a wide range of diseases using expression data (Alizadeh *et al.*, 2000; Baranzini *et al.*, 2005). Recently, some of these methods have been commercialized, creating expression-based diagnostic and treatment suggestion tools (van 't Veer *et al.*, 2002).

To date most of the research on classifying expression data focused on static (snapshot) datasets. While these are appropriate for many cases (most notably diagnostics) they are less appropriate for longer term follow-up. Consider for example transplant patients. For these patients physicians need to determine if and when their body starts rejecting the new organ in order to start treatment with immunosuppressing drugs. Another example are patients who have been admitted to the hospital following an accident and are monitored for organ failures. In these and other scenarios classification is improved if one can take into account not only the current state of the patient but also its past state and the changes that have occurred over time. Indeed, large scale efforts are under way to collect and analyze such time series expression datasets so that they can be better utilized in clinical settings (Inflammation, 2008).

A unique challenge for clinical time series expression classification is to account for the patient-specific rate of disease development or treatment response (Kaminski and Bar-Joseph, 2007; Sterrenburg *et al.*, 2004; Weinstock-Guttman *et al.*, 2003). While the overall trajectory of the expression profile may be similar between patients, different patients may progress at different speeds. Thus, a classifier for these time series datasets should be able to take into account the varying response rates. This makes methods that treat the input data as static, such as support vector machines (SVM) with default kernels, less appropriate for this task.

To address these issue we present a method that can both classify the time series expression datasets and account for the differences in patient rates. Our method uses hidden Markov models (HMMs) to represent the expression profiles of the two classes. The HMMs we use contain fewer states for each class than the actual number of time points. Using the probabilistic transitions between these states results in alignment of patients to the model and can account for the varying rates of progress. We further extend the model to learn discriminative HMMs (Gopalakrishnan *et al.*, 1991; Normandin *et al.*, 1994; Woodland and Povey, 2002) in which the parameters are chosen to maximize the difference between the two classes. Finally, we use feature selection to reduce the model complexity. The two resulting models are then used to classify new time series expression data based on the likelihood of the data given in the models.

We have tested our method on simulated and real time-series expression datasets. For all cases we show that our HMM-based classifiers achieve large improvements over methods that have been suggested in the past for this task.

### 1.1 Related work

There has been a lot of previous work on classifying static expression datasets. In addition to the work of Golub *et al.* (1999) mentioned above, many other classifiers including SVMs (Furey *et al.*, 2000), principle component analysis (Bicciato *et al.*, 2003) and K nearest neighbor (KNN; Nutt *et al.*, 2003) were suggested for this task.

More recently a few methods for classifying time series expression data were presented. Baranzini *et al.* (2005) used an exhaustive search strategy to identify genes for a Bayes classifier of time series expression data of multiple sclerosis (MS) patients’ response to interferon-β (IFNβ). While their goal was to classify time series data, their method only used the first time point, so it could not take advantage of the full set of data available. Borgwardt *et al.* (2006) used SVM with specialized kernels that accounted for temporal data. A Kalman Filter was trained generatively for each class, and then the kernel was computed using the trained parameters of the two Kalman Filters. While their methods utilized the entire time series it does not account for the rate differences discussed above, which may lead to inaccuracies.

A number of methods have been suggested for aligning time series datasets to overcome these rate differences. These either rely on dynamic programming (Aach and Church, 2001) or on continuous representation of expression data (Bar-Joseph *et al.*, 2003). However, these methods were primarily developed for clustering expression data. It is not clear how to use these methods for classification of time series data.

HMMs have been used to cluster time series expression data (Schliep *et al.*, 2005), to model dynamic regulatory networks (Ernst *et al.*, 2007), to align Liquid Chromatography—Mass Spectrometry time series (Listgarten *et al.*, 2004), and to identify timing differences in time series expression data (Yoneya and Mamitsuka, 2007). However, we are not aware of any method that used these models for classifying time series data. Interestingly, each of these different applications used a different number of states w.r.t. the number of time points measured. Schliep *et al.*'s clustering model used fewer states than time points. Ernst *et al.*'s regulatory networks model used the same number of states as time points and Listgarten *et al.* used more states than time points for modeling Mass Spectrometry time series data. In this article we have investigated all three options as we discuss in Section 4. On the other hand, Yoneya and Mamitsuka's timing difference model used two type of states. Control states have self-loops similar to Schliep *et al.*'s model; feature states can jump over control states, similar to Listgarten *et al.*'s model. This special state space is designed to infer the ordering between conditions, but not designed to model a gene–expression profile.

HMMs are typically trained generatively using maximum likelihood estimation (MLE). For classification tasks, generative training only utilizes the positive examples for each class, while discriminative training can utilize positive and negative examples. Here we extended a discriminative training method that was originally developed for speech recognition: the maximum mutual information estimate (MMIE; Gopalakrishnan *et al.*, 1991). We discuss this method in more detail in the following Section.

## 2 HMMS FOR ALIGNING TIME SERIES GENE EXPRESSION

To account for different and varying response rate of each patient, we use HMMs. Using HMMs we align a patient's time series gene expression to a common profile. For a classification task we generate two such HMMs, one for good responders and one for poor responders. Conceptually, a hidden state in our HMM correspond to a phase in the treatment response. Since different patients progress at different rates, they enter these states at different times and may stay in one state for more than one time point.

The emission distribution of gene expression in each state is modeled by a multivariate Gaussian distribution whose dimension equals the number of genes used by the classifier. To avoid overfitting, the covariance matrix is assumed to be diagonal. We considered three state space topologies, all being left-right models that conform to the temporal ordering as shown in Figure 1. The first topology is a left-right model with self-loops, i.e. a state indexed *i* has transitions to *i* or *i*+1. The second is a simple left-right model without loops or jumps, so each state exactly matches one time point. The third is a left-right model with jumps, i.e. a state *i* has transitions to *i*+1,*i*+2,…,*i*+*J*, where the maximum jump step *J* is a fixed constant. In the first topology the number of states is less than the number of time points, while in the second topology these two are equal, and in the third topology the number of states is larger than the number of time points. The first and third topologies can be used to align patients by modifying their transition probabilities based on the observed expression data.

We will use the following notations. We are given the time series gene expressions of *K* patients, {*O*_{1},*O*_{2},…,*O*_{K}}. We measure the expression of *G* genes for each patient at *T* time points, represented by a *G*-dimensional multivariate time series *O*_{k} = (*O*_{k1},*O*_{k2},…,*O*_{kT}). For gene selection, *O*_{ktg} denotes the expression of gene *g* at time *t* for patient *k*. The class patient *k* belongs to is denoted as *c*_{k}, *c*_{k}∈{1,2}. For notational simplicity, we assume the patients are from two classes, which is true for many clinical patient classification tasks. However, the algorithm discussed below is applicable for multiclass classification as well.

A HMM λ^{(m)} with multivariate Gaussian emission probability is trained for each class *m*, *m*∈{1,2}. Let

*i*to

*j*and are mean and SD for the Gaussian distribution of state

*j*. The mean and SD of gene

*g*in state

*j*is denoted as and , when it is necessary to specify which gene. We fix the first state for each topology to represent pre-treatment levels. The hidden states of time series

*O*

_{k}are denoted as

*x*

_{k}=(

*x*

_{k1},

*x*

_{k2},…,

*x*

_{kT}).

We also use the standard notation for the sufficient statistics of HMM: is the posterior probability of state *j* at time *t* of observation *O*_{k}, conditioned on the model λ^{(m)}. is the probability of a transition from state *i* to state *j* at time *t* of observation *O*_{k}, conditioned on the model λ^{(m)}.

### 2.1 Generative training of HMM

Given labeled expression data we can learn the parameters of a HMM using the Baum–Welch algorithm for each of the three topologies mentioned above. The resulting models are generative as training only utilize data from expression experiments of patients which belong to that class (good or poor responders). Using such a generative model we can classify a new dataset by building one model for each class. Class assignment is based on maximum conditional likelihood.

It has been shown that MLE is optimal if the true model is indeed the assumed HMM and there is infinite data (Nadas, 1983). Unfortunately, it is unlikely that the data is truly generated by a HMM. Furthermore, the number of training examples for clinical time series classification is very small. Thus, it would be beneficial if we could take advantage of both positive and negative data when building the models for each of the classes. This would allow the models to focus on the differences between the two sets of expression datasets, rather than on the most visible features (which could be the same for all groups of patients, for example stress response which is a common feature in disease response but may not be a useful feature for discriminating good and bad responders).

### 2.2 Discriminative training of HMM

To model the difference between positive and negative examples, we need to optimize a discriminative criteria, such as the conditional likelihood of the true classes given the data. This criteria is also called conditional maximum likelihood estimation (CMLE), often used in discriminative training methods, e.g. logistic regression. Here we use the MMIE technique which was originally developed for speech recognition, to trains HMMs in order to optimize this discriminative criteria. The standard training algorithm for MMIE is an extended version of the Baum–Welch algorithm (Gopalakrishnan *et al.*, 1991). However, unlike generative training, the HMMs for both classes are learned concurrently and parameters in one of the models are affected by the parameters estimated for the other model. The MMIE objective function can be written as,

Where *c*_{k} is the class (1 or 2) of patient *k*.

That is, our goal is to find parameters that will maximize the probability ratio of the good and poor responders models.

The denominator in Equation (1) will be represented by the likelihood of a combined HMM, λ^{den}, such that

λ^{den} is called the denominator model. In practice we learn new models for *p*(λ^{(1)}) and *p*(λ^{(2)}) in each iteration and use them to revise *p*(*O*_{k}|λ^{den}). Thus the denominator model is constructed by combining the state space of the two HMMs λ^{(1)} and λ^{(2)}, and assigning initial probability to the beginning states according to the priors *p*(λ^{(1)}) and *p*(λ^{(2)}). During training, the denominator model is constructed in each iteration after the HMMs λ^{(1)} and λ^{(2)} are updated. While updating one class, the HMM for that class is called the numerator model.

We first discuss the E-step in MMIE which involves the estimation of expected counts summarizing the current parameter settings. This estimation is similar to the ones in the Baum–Welch algorithm and the counts are collected for both the numerator and denominator models. For example, when λ^{(1)} is being updated, is the expected count of state *j* in the positive examples according to the numerator model λ^{(1)}; is the expected count of transition from state *i* to state *j* in the positive examples according to λ^{(1)}. are weighted sums of expression values in the positive examples, and are weighted sums of squared values, where the weightings are the posterior probability of state *j*. Similarly, and are expected counts in all examples according to the denominator model λ^{den}. The calculations when updating λ^{(2)} is similar. These expected counts are obtained from the dynamic matrices of the forward–backward algorithm. Formally,

*away*from the denominator model. This leads to greater focus on emission and transition probabilities that differ between the two models (either across states or at specific states for each gene) contributing to increased discrimination between the two models. However, such subtraction may generate negative transition probabilities or negative variances in emission probabilities. A

*smoothing constant*needs to be added to both the numerator terms and the denominator terms to avoid this. Hence the reestimation formulas of MMIE are, where

*D*

_{E}and

*D*

_{T}are smoothing constants for emission and transition probabilities, respectively.

It has been shown that Equation (2)–(4) will converge to a local maximum of the MMIE objective function, given sufficiently large smoothing constants *D*_{E} and *D*_{T} (Normandin *et al.*, 1994). However, it is not known how large they must be for the objective function to converge. If the smoothing constants are too small, update may not increase the (discriminative) objective function, but if they are too large, convergence will be too slow. A useful lower bound is the minimal values that ensures that the HMM parameters remain valid. Empirically, setting the smoothing constants to twice the lower bound leads to fast convergence (Woodland and Povey, 2002). Thus we set *D*_{E} to twice the minimal value that makes all variances positive; *D*_{T} is set to twice the minimal value that makes all transition probabilities positive. See Appendix A for details on how these values can be computed.

The HMM parameters are updated by a weighted average of the previous parameters and the reestimations. We follow previous works and set the *learning rate*, the weight of reestimations, as the error rate (Normandin *et al.*, 1994). Hence the learning rate is larger in the beginning and smaller when nearing convergence.

## 3 GENE SELECTION FOR TIME SERIES EXPRESSION CLASSIFICATION

Gene selection is critical in clinical gene expression classification for several reasons. First, the number of patients (data points) is small compared to the number of genes (features), resulting in overfitting. It is expected that restricting to a subset of relevant genes will improve classification accuracy. Second, a small subset of genes that discriminate between the classes can lead to biomarker discovery. The selected genes can be further examined by more experiments to find out the causal factors of different response to a treatment.

We consider the problem of gene selection as a feature selection problem and will use the terms ‘gene’ and ‘features’ interchangeably. There are two primary approaches for feature selection; the ‘wrapper’ approach and the ‘filter’ approach (Xing, 2002). The wrapper approach evaluates the classifier on different feature subset, and searches in the space of all possible feature subsets using the specific classification strategy. The filter approach does not rely on the underlying classifier, but instead uses a simpler criteria to filter out irrelevant features. Typically the filter approach is faster, while the wrapper approach can fit the specific need of a classifier and obtain better performance. In pursuit of higher classification accuracy, the feature selection method we used here is a wrapper method.

We used a backward stepwise feature selection method that utilizes the alignment to the HMM profiles based on recursive feature elimination (RFE) algorithm, termed *HMM–RFE* (Guyon *et al.*, 2002). The basic procedure of RFE is as follows: train the classifier, eliminate the feature whose contribution to the discrimination is minimal, and repeat iteratively until the stopping criteria is met. It is also possible to eliminate several features in one step, especially when the number of features is large.

To estimate the contribution to discrimination of a specific gene, we note that since the covariance matrix is diagonal, gene-expression levels are independent given the hidden states. Thus, if the states are known, the likelihood can be decomposed into terms involving each gene separately. However, such a decomposition does not exist when the hidden states are unknown. Instead we use a heuristic to approximate this decomposition.

We define the contribution to log odds of a gene *g*, *d*_{g}, as

*c*

_{k}=1) is 1 if

*c*

_{k}=1 and 0 otherwise. See Appendix B for a detailed derivation. Briefly, the equation above uses an estimate of the states ( and ) to compute the discriminative contribution of genes for the two classes.

In each selection step, genes are ranked by the contribution *d*_{g}, and the gene with lowest score is eliminated. Following the elimination step, new HMMs are trained using the remaining genes and the gene selection step is repeated.

In order to determine the final number of selected genes we use internal cross-validation within the training data. Note that internal cross-validation does not utilize the test data in any way. The HMM-RFE algorithm is summarized in the following procedure:

Given

*G*genes, define gene sets with a decreasing number of genes,*G*=*G*_{0}>*G*_{1}>*G*_{2}>···>*G*_{N}, such that*G*_{i}genes are selected at the*i*-th iteration. Initially, the active gene set includes all genes, and*i*=0.At

*i*-th selection, train the HMMs λ^{(1)}and λ^{(2)}using genes in the active gene set.Calculate the discrimination score

*d*_{g}for each gene*g*, using Equation (5). Select*G*_{i}genes with highest scores.Record the cross-validation accuracy of the active gene set.

Set

*i*←*i*+1, repeat Step 2 until*i*=*N*−1.Choose the optimal gene number

*G*^{*}leading to the highest cross-validation accuracy.

Although we use internal cross-validation to determine the optimal gene set, it is not used in gene ranking. The reason is computational. Internal cross-validation for genes would increase the complexity by a factor of *G* (total number of genes) which could be a substantial increase for microarray expression data measuring thousands of genes.

## 4 RESULTS

We first tested our method on simulated data. Next, we applied our method to a clinical dataset measuring MS patients’ response to IFNβ, one of the most common treatments to control MS.

### 4.1 Simulated dataset

The expression of genes in response to treatments often follows a bifurcating pattern diverging as time progresses (Ernst *et al.*, 2007). This is also the case in clinical settings, as can be seen in Figure 2. In that figure we plot the average expressions of four genes in MS patients treated with IFNβ (Baranzini *et al.*, 2005). The patients are divided into two groups, good and poor responders (red and green curves, respectively). As the figure indicates, while these genes display similar levels at the early time points they diverge at the later time points. A classifier that only utilizes the first time point is likely to perform much worse when compared to a classifier that utilizes the entire time series.

To generate the simulated data we have also tried to mimic this type of expression pattern as we describe below. We generated expression profiles for 100 patients. Of these, 50 patients were in class 1 (‘good responders’) and 50 in class 2 (‘poor responders’). 100 genes were measured for each patients, with a maximum of 8 time points per patient. For each gene *g*, we generated the Class 1 response profile by randomly selecting a segment of a sine wave, of length 1.5π between 0 to 4π. Denote this profile as a function, . We selected 10 out of the 100 genes to be differential, the other 90 where assigned the same values for Class 2. For differential genes, the gene-expression profile of poor responders is the good responders’ profile curve plus a piecewise linear function,

*a*

_{g}and the offset

*b*

_{g}are gene-specific parameters.

*a*

_{g}is +5 or −5, and

*b*

_{g}is uniformly selected at random between −0.1 and 0.3.

After the profiles are generated, we simulate patient-specific response rate by randomly choosing a scaling value *s*_{k} between 0.5 to 1.5 for patient *k*. *s*_{k} is used to transform the time series for patient *k* by stretching or shrinking the curves. Thus, the time series profile for each genes of patient *k* is the linearly scaled profile time series. Finally, we add Gaussian noise, with mean 0 and gene-specific variance . Formally,

We tried a number of different values for based on real datasets and all resulted in similar performance.

We compared our HMM-based classification with two baseline classifiers: linear SVM (default parameters of SVM light is used), and the Integrated Bayesian Inference System (IBIS) method of Baranzini *et al.* (2005). IBIS only uses the first time point as we discussed in Section 1.1. We note that we have tried to obtain the code for the Kalman filter SVM of Borgwardt *et al.* (2006) for direct comparison. Unfortunately, the code is not available online. Despite several e-mail requests we were unable to obtain their code and we thus cannot present direct comparison of the two methods.

Classification accuracy of different methods are shown in Figure 3. Due to large amount of noise and limited information at earlier time points, classification using data from these points is close to random, causing difficulties for IBIS. With more time points, HMMs provide satisfactory results. HMMs using the three topologies all outperform SVM from 6 to 8 time points. Overall, the loop model results in highest accuracy with more time points.

For the loop HMM, we performed discriminative training on the HMMs trained generatively after gene selection. Although it is possible to incorporate discriminative training in HMM–RFE, the computation would be much heavier because discriminative training requires more iterations (e.g. 500 iterations) to converge comparing to generative training (about 20 iterations). As can be seen, accuracy is higher except for 2 time points where classification is close to random. Hence discriminative training can further improve the performance as it utilize both positive and negative data.

We also verified whether gene selection found the true differential genes. Since different splits results in different gene selection, we listed the median number of selected genes, and verified the correctness of overlapping genes: genes selected in 90% of the splits, in Table 1. Note that all overlapping genes are correct after 5 time points, and the number of selected genes is small after 6 time points, especially when using 7 and 8 time points resulting in better accuracy.

Time | Accuracy (%) | Median | Number of overlapping genes | Precision of selected genes (%) |
---|---|---|---|---|

2 | 44 | 19 | 1 | 0 |

3 | 47 | 18 | 2 | 0 |

4 | 56 | 16 | 2 | 0 |

5 | 58 | 17 | 2 | 100 |

6 | 73 | 7 | 2 | 100 |

7 | 87 | 2 | 1 | 100 |

8 | 86 | 4 | 2 | 100 |

Time | Accuracy (%) | Median | Number of overlapping genes | Precision of selected genes (%) |
---|---|---|---|---|

2 | 44 | 19 | 1 | 0 |

3 | 47 | 18 | 2 | 0 |

4 | 56 | 16 | 2 | 0 |

5 | 58 | 17 | 2 | 100 |

6 | 73 | 7 | 2 | 100 |

7 | 87 | 2 | 1 | 100 |

8 | 86 | 4 | 2 | 100 |

Time is the number of time points used to construct the HMM model. Accuracy is the best accuracy using these time points (loop HMM using discriminative training). Median is the median number of selected genes. Number of overlapping genes presents the number of genes selected in at least 90% of the training–testing splits. Precision is the percent of overlapping genes that were indeed part of the 10 assigned differentially expressed genes.

### 4.2 MS dataset

We next tested our model using a clinical expression dataset. This dataset contains time series expression data for 70 genes in 52 MS patients treated with IFNβ Baranzini *et al.* (2005). Of the 52 patients, 33 responded well to the treatment (‘good responders’) and 19 did not respond well (‘poor responders’). The 70 genes included were preselected by experts based on relevance to MS. For each patient there are 7 time points, measured every 3 month in the first year following treatment and every 6 month in the second year. Some patients miss certain measurements, especially at the 7th time point, causing an entire measurement to be a missing value.

As we did with the simulated data, we compared our method to the original IBIS algorithm and to linear SVM. We note again that we were unable to compare our method to the Kalman filter SVM of Borgwardt *et al.* (2006) since we could not obtain their code.

The classification accuracy is evaluated using 4-fold cross-validation. For each possible number of time points (2, 3, etc.) we train a new instance of each potential classification model (SVM, HMMs, etc.). Figure 4 presents the results. In that figure we plot the accuracy versus. the number of time points for the different classifiers.

As can be seen, HMM with equal number of states and time points (equivalent to Gaussian Naive Bayes classifier) achieves classification accuracy that is similar to SVM with default parameters (tuning of SVM parameters improves the accuracy, but it is still significantly lower than the other HMMs). In contrast, the other two HMMs that allow for alignment perform much better on this clinical dataset, indicating the alignment is indeed an important issue for time series classification. The loop HMM model performs better than the jump model from 3 time points; a possible explanation is that fewer emission parameters reduced overfitting. The best results when using all data (7 points) were obtained by the loop HMM after discriminative training (85%). In addition, when using 2 or 3 time points the discriminative HMM outperformed the generative model. However, for the other sets of time points the two models (discriminative and generative) achieved similar results.

It is important to note that we do not use the test data in gene selection during training, so that the evaluation would be closer to the performance on new data. The results presented in Figure 4 differ from the results in the original IBIS paper (Baranzini *et al.*, 2005) even though we have followed exactly the same (and simple) procedure as described in that paper. We believe that the reason for this discrepancy is that, while exhaustive search of all triplets and full covariance matrix in IBIS gives very good results on the training data, it may lead to overfitting of the validation accuracy, which becomes much higher than test accuracy.

### 4.3 Selected genes and the advantages of patient alignment

We next examined the selected genes for the different classifiers (trained with different number of time points). Table 2 lists the selected genes for models constructed from the different sets of time points. Again we only list genes selected in at least 90% of the training–testing splits. With more time points the classifiers stabilize between splits leading to more selected genes. We compared our list to a previous list of 12 genes selected by Baranzini *et al.* (2005) based on expression values prior to treatment (first time point). Caspase 10 and Caspase 3 which were also listed in the original paper, are almost always selected regardless of how many time points are being used. However Jak2, IL12Rb2 and RAIDD are only selected using more time points, and are not on the list of 12 genes in that paper.

Time | Accuracy (%) | Median | Selected genes |
---|---|---|---|

2 | 78 | 15 | Caspase 3, Caspase 10, IL-4Ra |

3 | 77 | 13.5 | Caspase 3, Caspase 10, Jak2 |

4 | 81 | 11.5 | Caspase 10, Caspase 2, Jak2 |

5 | 85 | 26 | Caspase 10, MAP3K1, IRF8, Caspase 3, |

Caspase 2, Jak2, IL-4Ra, IL12Rb2 | |||

6 | 84 | 13.5 | Caspase 10, Caspase 3, Jak2, IRF4, |

Caspase 2 | |||

7 | 85 | 23.5 | Caspase 10, Caspase 3, Jak2, IL-4Ra, |

MAP3K1, RAIDD, Caspase 2 |

Time | Accuracy (%) | Median | Selected genes |
---|---|---|---|

2 | 78 | 15 | Caspase 3, Caspase 10, IL-4Ra |

3 | 77 | 13.5 | Caspase 3, Caspase 10, Jak2 |

4 | 81 | 11.5 | Caspase 10, Caspase 2, Jak2 |

5 | 85 | 26 | Caspase 10, MAP3K1, IRF8, Caspase 3, |

Caspase 2, Jak2, IL-4Ra, IL12Rb2 | |||

6 | 84 | 13.5 | Caspase 10, Caspase 3, Jak2, IRF4, |

Caspase 2 | |||

7 | 85 | 23.5 | Caspase 10, Caspase 3, Jak2, IL-4Ra, |

MAP3K1, RAIDD, Caspase 2 |

See Table 1 for description of the time, accuracy and median columns. Selected genes are genes selected in at least 90% of the training–testing splits. Accuracy is based on loop HMM using discriminative training.

These uniquely selected genes are due to the ability of our method to consider later time points, as shown in Figure 5. The left column in Figure 5 plots the mean and variance of gene expression at different time points. Some genes like Jak2 differs in later time points more strongly between the two classes when compared to the first time point. For some genes, the divergence is visible only in the aligned expression models, shown in the right column of Figure 5. To obtain the aligned expression, we use the Viterbi algorithm to align the time points of a patient to the most likely states of the HMM. IL12Rb2 and RAIDD, for example, are more separated between classes in the aligned expression. The large variances (and hence overlap) in unaligned time series could be due to poor responders entering the third state earlier or good responders staying in the second state longer, which is resolved after the alignment. Note that the alignment is on the patient level, based on the expression of all genes. To isolate the effect of alignment we applied a linear SVM to the alignment model determined by the HMM. Unlike the regular SVM that uses the measured values, the alignment SVM uses the average expression of time points aligned to each of the HMM states. As Figure 6 shows, such a classifier leads to much higher accuracy than SVM based on unaligned expressions. However, while this classifier considers the alignment, it ignores the temporal ordering of the states which is why it is outperformed by discriminative HMM, at least in some cases. These results highlight the importance of alignment when working with clinical expression data.

Some of the genes we uniquely identified are also validated by recent complementary studies. IL12RB2, an important autoimmunity gene is expressed in activated T-cells and is a marker of TH1 inflammatory response. Its consistent increase in poor responders suggests lack of response to treatment. This becomes more evident as time goes by leading to maximal difference after a year (Fig. 5). A recent paper found that it was a good marker for lack of response to glatiramer acetate in MS (Grossman *et al.*, 2007), and as our results indicate it might be a good marker to the IFNβ treatment. Another genes we identified, JAK2, is phosphorylated by the activation of the IL12 receptor. Again, this might cause a delayed response leading to stronger differences at later time points.

## 5 DISCUSSION

A major challenge in classifying time series clinical expression data is the varying response rates of individuals. In this article we propose the use of HMMs for this task. HMMs can naturally model non-linear patient-specific response rates. The hidden states represent the temporal clustering of gene expression, and can be interpreted as disease phases. Transition probabilities allows different patients to progress at different rates. To overcome the small number of training examples we have used discriminative HMMs. Unlike generative HMMs, discriminative HMMs can utilize both positive and negative examples when generating a model for each class.

Using both simulated data and clinical expression data of MS patients, we show that HMMs outperforms classifiers that do not take the temporal ordering into account. We further compared three left-right HMM models: the loop model, the equal-length model, and the jump model. Of these, the loop model performs best which is likely the result of the small training data for these types of experiments. Discriminative HMMs improve upon generative HMMs in most, though not all, cases.

In addition to learning discriminative models we also carry out gene selection. The selected genes in simulated datasets correctly contain the truly differential genes. While we do not have the ground truth for the MS dataset, many of the selected genes can be explained based on current knowledge of disease progression.

As more time series expression data accumulates we would like to test our method on additional types of response data. We would also like to extend our model to better represent the interactions between genes. The current diagonal covariance emission model ignores such interactions. When more data becomes available, models that compute more covariance terms can be learned from data leading to better models and improved accuracy. We are also interested in other clinical applications of our HMM, e.g. predicting rejection events for transplant patients. Alignment of time series gene expression between group of genes or species could also be important in other biological experiments.

## ACKNOWLEDGEMENTS

This work was supported in part by NIH grant NO1 AI-5001 and NSF CAREER award 0448453 to ZBJ.

*Conflict of Interest*: none declared.

## REFERENCES

*Proceedings of Pacific. Symposium on Biocomputing (PSB)*

### APPENDIX A

For discriminative training of HMM using MMIE, the smoothing constants are set to twice the minimal value that ensures the probabilities to be valid. Here we show how to calculate the *D*_{T} and *D*_{E} smoothing constants for transition and emission probabilities, respectively. By setting in Equation (4) to be positive, *D*_{T} can be calculated as

By plugging Equation (2) into Equation (3) and setting to be positive, we have a quadratic inequality of *D*_{E},

This quadratic inequality can be solved to obtain a lower bound, and *D*_{E} is set to twice the lower bound.

### APPENDIX B

In gene selection, we need to estimate the contribution to discrimination of each gene. Because the covariance matrix is diagonal, the gene expressions are independent given the hidden states. That is, the probability of a time series gene expression *O*_{k} given a HMM λ^{(1)} and the hidden states *x*_{k} can be decomposed as,

What we need is to decompose the marginalized likelihood *p*(*O*_{k}|λ^{(1)}) into terms involving each gene only, :

Unfortunately, expressions levels for individual genes are not independent once the hidden state are not known, so the above decomposition does not exist. We will use a heuristic to approximate this decomposition. Since the hidden states are unknown, we approximate it by the posterior probabilities, . We approximate the term as the weighted average of the Gaussian emission probabilities, the weights being :

We can approximate the contribution of each gene to the log odds,

Then the total log odds of the correct model versus the incorrect model can be expressed as the sum of contribution of each gene, *d*_{g},

where δ(*c*_{k}=1) is 1 if *c*_{k}=1 and 0 otherwise. Thus we have contribution to log likelihood ratio of a gene *d*_{g}, defined as

## Comments