Bayesian profile regression for clustering analysis involving a longitudinal response and explanatory variables

The identification of sets of co-regulated genes that share a common function is a key question of modern genomics. Bayesian profile regression is a semi-supervised mixture modelling approach that makes use of a response to guide inference toward relevant clusterings. Previous applications of profile regression have considered univariate continuous, categorical, and count outcomes. In this work, we extend Bayesian profile regression to cases where the outcome is longitudinal (or multivariate continuous) and provide PReMiuMlongi, an updated version of PReMiuM, the R package for profile regression. We consider multivariate normal and Gaussian process regression response models and provide proof-of-principle applications to four simulation studies. The model is applied to budding yeast data to identify groups of genes co-regulated during the Saccharomyces cerevisiae cell cycle. We identify four distinct groups of genes associated with specific patterns of gene expression trajectories, along with the bound transcription factors likely involved in their co-regulation.


Introduction
Understanding gene regulation is a major research challenge in molecular biology (Jacob and Monod, 1961). To this end, a number of model organisms are studied for their relative simplicity in exploring the underlying regulatory mechanisms, including the budding yeast Saccharomyces cerevisiae (Tong et al., 2004). It has been shown that integrating different types of omic data may allow us to further deepen our understanding of the transcription process from DNA to messenger RNA (Savage et al., 2010), including allowing us to identify transcriptional modules (Ihmels et al., 2002): sets of functionally related genes that are regulated by the same transcription factor(s). In this paper, we integrate yeast gene expression time course data with chromatin immunoprecipitation microarray data in order to identify such gene sets. This analysis motivates a new clustering approach, adapted to this setting.
Clustering is typically formulated as an unsupervised method for identifying homogeneous subgroups in a heterogeneous population (Jain et al., 1999). In addition to the perennial problem of how to determine the appropriate number of clusters (Fraley and Raftery, 2002, Rousseeuw, 1987, Sugar and James, 2003, Tibshirani and Walther, 2005), a key challenge is how to validate a given clustering structure (Brock et al., 2008, Kerr and Churchill, 2001). A common approach is to make use of left-out information to assess whether the identified clustering structure provides a relevant stratification of the population (Handl et al., 2005, Yeung et al., 2001). For example, if we had clustered patients on the basis of their blood plasma proteome profiles, then we might check to see if patients allocated to the same cluster also tended to have responses to treatment that are more similar than those of patients in different clusters. If this were the case, we might conclude that the identified clustering structure was relevant for the particular aim of identifying clusters associated with differential treatment responses. Of course, assessment of relevance is necessarily context- and application-specific. The left-out information (which we will refer to here as an outcome or response) is typically very closely related to the true aim of the clustering analysis, and implicitly defines the criterion by which a given clustering structure is assessed to be relevant or irrelevant.
When clustering datasets with many variables - some of them possibly not defining a clustering structure (Law et al., 2003, 2004, Tadesse et al., 2005), or when subsets of variables define a variety of different clustering structures (Cui et al., 2007, Guan et al., 2010, Kirk and Richardson, 2021, Li and Shafto, 2011, Niu et al., 2010, 2014) - it may be desirable to make use of (potentially highly informative) outcome information directly, in order to guide inference toward the most relevant clustering structures. That is, we may wish to use the outcome information during the clustering analysis itself, rather than during post-analysis validation. This is one of the principal motivations for Bayesian profile regression (Molitor et al., 2010), a semi-supervised approach for model-based clustering that allows outcome information to be taken into account for the determination of the clustering. Previous applications of profile regression have considered univariate continuous, categorical, or count outcomes (Liverani et al., 2015). In this work, we extend Bayesian profile regression to cases where the outcome is longitudinal (or multivariate continuous) and provide PReMiuMlongi, an updated version of PReMiuM, the R package for profile regression (Liverani et al., 2015).
In the following, we recall the Bayesian profile regression modelling framework in Section 2 and present two extensions to handle a longitudinal outcome in Section 3. In Section 4, we demonstrate through different simulation studies the benefit of integrating a longitudinal outcome in the clustering algorithm, and assess the two methods that we propose in terms of clustering recovery and computation time. Section 5 presents the results of the methods applied to budding yeast data, uncovering groups of genes co-regulated during the cell cycle, followed by a discussion in Section 6.

Bayesian profile regression: a semi-supervised clustering model
We suppose that we have data comprising observations on a vector of covariates, x, and outcomes, y. Molitor et al. (2010), whom we follow, permitted their model for y to depend upon additional covariates, w, referred to as fixed effects. These fixed effects are covariates that may be predictive of the outcome, but do not directly contribute to the clustering, and which come with associated "global" (i.e. not cluster-specific) parameters, β, for the model linking y and w. The general model considered in Molitor et al. (2010) for C mixture components is then:

p(x, y | w) = Σ_{c=1}^{C} π_c f_x(x | φ_c) f_y(y | θ_c, β, w),    (1)

with π_c the mixture weights, φ_c the cluster-specific parameters for the density f_x, θ_c the cluster-specific parameters for the density f_y and β the vector of regression parameters in the outcome model. The original formulation of this model, which we also adopt here, is in terms of infinite (specifically Dirichlet process) mixture models; however, we note that the model is equally applicable in the case of finite C. A stick-breaking construction (Pitman, 2002) is adopted for the Dirichlet process prior set upon the mixture distribution, also known as a GEM or Griffiths-Engen-McCloskey prior (Pitman, 2002), constructed as follows:

π_c = V_c ∏_{l < c} (1 − V_l),   V_c ∼ Beta(1, α),   c = 1, 2, . . .,    (2)

where α is a positive parameter, for which we adopt a gamma prior. See, for example, Ishwaran and James (2001) and Kalli et al. (2009) for further background, and Liverani et al. (2015) and Hastie et al. (2015) for details of how inference of the π_c's is performed in the Bayesian profile regression model. We then define the component allocation variable z taking values in 1, . . ., C, so that π_c = P(z = c).
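To illustrate the stick-breaking construction described above, the following is a minimal Python sketch (PReMiuMlongi itself is an R package; the function name is hypothetical). It draws a truncated set of GEM weights, with the last stick absorbing the remaining mass so that the weights sum to one.

```python
import numpy as np

def stick_breaking_weights(alpha, C, seed=None):
    """Truncated GEM / stick-breaking weights:
    V_c ~ Beta(1, alpha), pi_c = V_c * prod_{l<c} (1 - V_l)."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=C)
    v[-1] = 1.0  # truncation: the last stick takes all remaining mass
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

pi = stick_breaking_weights(alpha=2.0, C=20, seed=0)
```

Smaller values of α concentrate mass on the first few sticks, favouring fewer occupied clusters a priori.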
Profile regression can accommodate various types of covariates, by specifying the corresponding cluster-specific distributions in Equation (1). In the case of Q categorical covariates, the probability distribution in cluster c for x_i = (x_{i,1}, . . ., x_{i,Q}) for individual i, i = 1, . . ., N, is written as

f_x(x_i | φ_{z_i}) = ∏_{q=1}^{Q} ∏_{e=1}^{E_q} φ_{z_i,q,e}^{1{x_{i,q} = e}},    (3)

with z_i = c if the i-th individual is allocated to the c-th cluster, φ_{z_i} = {φ_{z_i,q,1}, . . ., φ_{z_i,q,E_q}}, where φ_{z_i,q,e} is the probability that covariate q ∈ {1, . . ., Q} takes the value e ∈ {1, . . ., E_q} in cluster z_i = c, and 1{x_{i,q} = e} = 1 if x_{i,q} = e and 0 otherwise. Conjugate Dirichlet priors are adopted for the parameters, φ_{z_i,q} ∼ Dirichlet(a_{q,1}, . . ., a_{q,E_q}). Multivariate Gaussian densities can also be used to handle a set of correlated Gaussian covariates, using a multivariate normal prior for the mean vector and an inverse Wishart prior for the covariance matrix. Finally, a product of both types of probability distributions accommodates mixtures of Gaussian and discrete covariates, assuming independence given the cluster allocations.
Several distributions are proposed in the current profile regression model for the outcome, such as Bernoulli, Poisson and Binomial. In the binary case, for example, the probability distribution f_y for individual i is given by

P(y_i = 1 | θ_{z_i}, β, w_i) = expit(θ_{z_i} + β^T w_i),

where expit(x) = exp(x) / (1 + exp(x)), and θ_{z_i} as well as the regression parameters β follow t location-scale distributions. The profile regression model described in Equation (1) implicitly assumes that the covariates, x, and outcomes, y, are linked via a shared clustering structure, but are otherwise conditionally independent. The inclusion of fixed effects w in Equation (1) allows for the possibility that there are additional nuisance covariates that might be predictive of the observed y, but which we do not wish to include in the clustering analysis. For example, in an analysis of data from a genome-wide association study of lung cancer considered in Papathomas et al. (2012), x comprises genetic covariates (SNPs), while w comprises potentially confounding covariates including age, sex, and smoking status. Adopting the terminology of Bair and Tibshirani (2004), we refer to this as a semi-supervised mixture model, since clustering is guided by an outcome variable.
A variable selection procedure selects the covariates x that contain clustering information. The parameters in Equation (3) are replaced by composite ones, defined as:

φ*_{z_i,q} = γ_{z_i,q} φ_{z_i,q} + (1 − γ_{z_i,q}) φ_{0,q},   γ_{z_i,q} ∼ Bernoulli(ρ_q),    (4)

where φ_{0,q} is the empirical proportion of genes with x_q = 1 in the whole population and γ_{z_i,q} is a relevance indicator for covariate q in cluster z_i, with a sparsity-inducing hyperprior ρ_q ∼ 1{ω_q = 0} δ_0(ρ_q) + 1{ω_q = 1} Beta(0.5, 0.5) and ω_q ∼ Bernoulli(0.5). Values of ρ_q close to 1 indicate a difference between the cluster-specific distributions of variable q and its distribution in the whole sample. See Liverani et al. (2015) for more details about the alternative "soft variable selection" procedure, also implemented in PReMiuM and PReMiuMlongi.
The profile regression model is fitted via Markov chain Monte Carlo (MCMC) using a blocked Gibbs sampler. A representative clustering structure is identified using post-processing techniques applied to the MCMC output, based on the posterior similarity matrix, which represents the estimated co-clustering probabilities for each pair of individuals and also quantifies the uncertainty associated with the cluster allocations. We refer the reader to Liverani et al. (2015) for further details.

Clustering guided by a longitudinal outcome
In previous applications of Bayesian profile regression, the outcome y has been assumed to be univariate (Molitor et al., 2010, Papathomas et al., 2012). Here we extend the original model to consider cases in which y comprises longitudinal continuous data. We present the multivariate normal (MVN) and Gaussian process (GP) regression response models in Sections 3.1 and 3.2, respectively. In Section 3.1, we restrict attention to situations in which we have a longitudinal outcome measured on all individuals at a common set of time points. In Section 3.2, individuals may have different numbers of outcome measurements, taken at individual-specific time points.

Multivariate normal outcome model
We propose to model the outcome as multivariate normal, referring to this as the MVN response model, when all individuals/units have response measurements at a common set of time points. Such a model may also be appropriate when each individual's longitudinal response is summarised through a collection of statistics, as in Hathaway and D'Agostino (1993). For flexibility, we make no assumptions about the structure of the covariance matrix in our model, and defer until Section 3.2 the discussion of a model in which we explicitly try to capture the time-ordering of the data. In this section, the vector of M time points, t = {t_1, . . ., t_M}, is assumed to be the same for all individuals. The outcome for the i-th individual is then of the form y_i = [y_{i,1}, . . ., y_{i,M}] ∈ R^M, where y_{i,s} denotes the value of the response measured on the i-th individual at the s-th time point t_s, and it is assumed that there are no missing outcome values.
For the MVN response model, the component-specific parameters θ_c in Equation (1) comprise a mean vector, μ_c, and a covariance matrix, Σ_c. In the simple case where there are no confounding covariates w, the density f_y(y | θ_c) is simply a multivariate normal density with mean μ_c and covariance matrix Σ_c. When we have confounding covariates, we assume that they act on the response via a constant shift of the mean. In particular, if w_i ∈ R^R, then β ∈ R^R, and the MVN response model is as follows:

y_i | z_i = c, β, w_i ∼ N(μ_c + (β^T w_i) 1_M, Σ_c),    (5)

where 1_M denotes a vector of ones of length M and we note that β^T w_i is a scalar. To perform inference, we adopt conjugate normal-inverse Wishart priors (Gelman et al., 2013).
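The MVN response density with its scalar fixed-effect shift can be evaluated directly. Below is an illustrative Python sketch (not PReMiuMlongi code; the function name is hypothetical) that computes the log-density of one individual's trajectory given a candidate cluster's parameters.

```python
import numpy as np

def mvn_response_loglik(y_i, w_i, beta, mu_c, Sigma_c):
    """Log-density of y_i under y_i | z_i = c ~ N(mu_c + (beta' w_i) 1_M, Sigma_c)."""
    y_i, mu_c = np.asarray(y_i, float), np.asarray(mu_c, float)
    M = y_i.size
    resid = y_i - mu_c - float(np.dot(beta, w_i))  # the fixed effects shift the mean by a scalar
    _, logdet = np.linalg.slogdet(Sigma_c)
    quad = resid @ np.linalg.solve(Sigma_c, resid)
    return -0.5 * (M * np.log(2.0 * np.pi) + logdet + quad)

ll = mvn_response_loglik([0.0], [1.0], [0.0], [0.0], [[1.0]])
```

Using `slogdet` and `solve` (rather than an explicit inverse and determinant) keeps the evaluation numerically stable for ill-conditioned Σ_c.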

Gaussian process outcome model: a Bayesian functional data approach
In this section, we consider a more general setting in which the number of observations and the time points vary across individuals. In this case, we write y_i = y_i(t_i) = [y_i(t_{i,1}), . . ., y_i(t_{i,M_i})], with t_i the vector of the M_i time points associated with the i-th individual, and y_i(t) the value of the response for the i-th individual at time t. In the outcome model, we assume that the complete longitudinal trajectory of the i-th individual's response takes the following form:

y_i(t) = g_i(t) + β^T w_i + ε_i(t),    (7)

where g_i is a continuous function of time; ε_i(t) ∼ N(0, σ_i²) is assumed to be i.i.d. Gaussian noise; and β and w_i are the vectors of regression parameters and of the individual covariates, respectively, defined as in the MVN specification. In practice, of course, we only observe y_i(t) at a discrete set of time points, {t_{i,1}, . . ., t_{i,M_i}}; however, this functional approach proves useful when dealing with individuals who have observations at different time points.
If individuals i and j are both in cluster c, then we assume that g_i(t) = g_j(t) := g^(c)(t), where g^(c) denotes the function associated with the c-th cluster. Moreover, σ_i² = σ_j² := σ_c², where σ_c² denotes the noise variance associated with the c-th cluster. The conditional distribution of y_i at time t, given that the i-th individual is allocated to the c-th component, is then:

y_i(t) | z_i = c ∼ N(g^(c)(t) + β^T w_i, σ_c²).    (8)

In order to proceed, we could specify a parametric form for the functions g^(c) (e.g. we might specify that g^(c) is a polynomial of degree d), whose component-specific parameters we would need to infer in order to fit the model. However, as we now describe, here we adopt a Bayesian non-parametric approach, and take a Gaussian process prior for the unknown function g^(c).

Gaussian process priors for unknown functions
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams, 2006). This definition simply means that a (potentially infinite) collection v_1, v_2, . . . of random variables defines a Gaussian process if and only if any finite subcollection of the variables is jointly distributed according to a Gaussian distribution.
In GP regression, a GP prior is assumed for the outputs of an unknown function g^(c) (Rasmussen and Williams, 2006). This means that we assume a priori that {g^(c)(t_1), g^(c)(t_2), . . ., g^(c)(t_T)} are jointly distributed according to a T-variate Gaussian distribution for any vector of times {t_1, t_2, . . ., t_T} of finite length T. To fully specify a GP prior, we require a mean function, m^(c), and a covariance function, k^(c), which define the mean vectors and covariance matrices of the Gaussian distributions associated with each finite subcollection of the variables. We write g^(c) ∼ GP(m^(c), k^(c)) to indicate that we have assumed a Gaussian process prior with mean function m^(c) and covariance function k^(c) for the function g^(c), so that g^(c) ∼ GP(m^(c), k^(c)) if and only if, for any finite collection {t_1, t_2, . . ., t_T} of times, we have

[g^(c)(t_1), . . ., g^(c)(t_T)]^T ∼ N([m^(c)(t_1), . . ., m^(c)(t_T)]^T, K),   with K_{i,j} = k^(c)(t_i, t_j).

There are many possibilities for m^(c) and k^(c); see, for example, Rasmussen and Williams (2006). In practice, we take the standard default choice of setting m^(c) to be the zero function, so that m^(c)(t_j) = 0 for all t_j. Note that if one prefers to use a non-zero mean function, the specification with a null mean function can then be applied to the difference between the observations and the specified non-zero mean function. We also take k^(c) to be the widely-used squared exponential covariance function, or Gaussian kernel, which is infinitely differentiable (i.e. very smooth), is characterized by only two parameters, and has previously provided robust performance in the context of clustering (Strauss et al., 2019). Thus we have:

k^(c)(t, s) = a_c exp(−(t − s)² / (2 l_c²)),

where the hyperparameters a_c and l_c are respectively the signal variance and length-scale of the squared exponential covariance function k^(c). Likewise, we define the corresponding function on time vectors, K^(c)(t, s), whose output is a matrix with (i, j)-th element equal to k^(c)(t_i, s_j). All hyperparameters (a, l and the variance of the measurement error σ²) must be positive. Following Neal (1999), we deal throughout with the logarithms of these quantities, eliminating the positivity constraint (and thereby making sampling of these quantities more straightforward). For convenience, we adopt independent, standard normal priors for each of log(a), log(l) and log(σ²), as in Kirk et al. (2012a) and McDowell et al. (2018).
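The squared exponential covariance function on time vectors can be sketched as follows (an illustrative Python implementation, assuming the standard parameterisation with the length-scale in the denominator 2l²; the function name is hypothetical).

```python
import numpy as np

def sq_exp_kernel(t, s, a, l):
    """Squared exponential covariance matrix:
    K[i, j] = a * exp(-(t_i - s_j)^2 / (2 l^2))."""
    d = np.subtract.outer(np.asarray(t, float), np.asarray(s, float))
    return a * np.exp(-0.5 * (d / l) ** 2)

t = np.linspace(0.0, 1.0, 5)
K = sq_exp_kernel(t, t, a=1.5, l=0.3)
```

On identical inputs the result is a symmetric positive semi-definite matrix with the signal variance a on the diagonal, as required of a valid GP covariance.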
In non-parametric Bayesian clustering, the tuning of the variance prior is of particular importance. Depending on the application at hand, standard lognormal priors may be too vague, and the initialization may lead to the algorithm getting stuck in states with too few or too many clusters, leading to spurious results. Ross and Dy (2013) proposed to set the hyperparameters using empirical Bayes strategies. The lognormal hyperpriors adopted for the covariance functions K^(c) can be tuned with an empirical variogram of the data (Diggle, 1990), which quantifies the correlation between observations as a function of the time interval and provides empirical estimates of the signal variance and length-scale parameters. In addition, and based on our experience, it may be necessary to constrain the variance of the measurement error, as large values may lead to a small number of clusters and, conversely, small values may lead to a high number of clusters. This can be diagnosed by unstructured posterior similarity matrices (i.e. with unclear separations). We thus propose a specification in which the measurement error variance is proportional to the signal variance in each cluster, a_c = r · σ_c², where the ratio r is fixed by the user. Sensitivity analyses are then necessary to validate this variance ratio value.

Likelihood
We adopt a GP prior, with component-specific hyperparameters a_c and l_c, for each unknown function g^(c). Suppose there are n_c individuals associated with component c. For notational convenience, we assume that these are individuals 1, . . ., n_c, i.e. we assume that z_i = c for i = 1, . . ., n_c and z_i ≠ c for all other i. Recalling that y_i = [y_i(t_{i,1}), . . ., y_i(t_{i,M_i})] and that t_i = [t_{i,1}, . . ., t_{i,M_i}] denotes the vector of the M_i time points associated with the i-th individual, we define the "corrected" vector of observations for the i-th individual to be ỹ_i = y_i − (β^T w_i) 1_{M_i}. It then follows from our model for the response (Equations (7) and (8)) and the GP prior that:

g^(c) := [g^(c)(t_1)^T, . . ., g^(c)(t_{n_c})^T]^T ∼ N(0_{M_c}, K^(c)),    (10)

ỹ_i | g^(c), z_i = c ∼ N(g^(c)(t_i), σ_c² I_{M_i}),    (11)

where K^(c) and K_{t_i,t_j} = K^(c)(t_i, t_j) are matrices with dimensions M_c × M_c and M_i × M_j, respectively, for individuals i and j in cluster c; K^(c) has (i, j)-th block K_{t_i,t_j}, with M_c = Σ_{i=1}^{n_c} M_i.
This defines the conditional likelihood of [ỹ_1, . . ., ỹ_{n_c}] given the function g^(c). Alternatively, marginalising g^(c), we see from Equations (10) and (11) that

ỹ^(c) := [ỹ_1^T, . . ., ỹ_{n_c}^T]^T ∼ N(0_{M_c}, K^(c) + σ_c² I_{M_c}),    (15)

from which we can express the marginal likelihood of [ỹ_1, . . ., ỹ_{n_c}].
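The marginal likelihood obtained after integrating out g^(c) is a zero-mean Gaussian with covariance K + σ²I, and can be sketched as follows for a single vector of corrected observations (illustrative Python, assuming the squared exponential kernel; not the package implementation).

```python
import numpy as np

def gp_log_marginal(y, t, a, l, sigma2):
    """Log marginal likelihood after integrating out g:
    y ~ N(0, K + sigma2 * I), with K the squared exponential kernel matrix."""
    y, t = np.asarray(y, float), np.asarray(t, float)
    d = np.subtract.outer(t, t)
    K = a * np.exp(-0.5 * (d / l) ** 2) + sigma2 * np.eye(t.size)
    _, logdet = np.linalg.slogdet(K)
    quad = y @ np.linalg.solve(K, y)
    return -0.5 * (t.size * np.log(2.0 * np.pi) + logdet + quad)

ll = gp_log_marginal([0.0], [0.0], a=1.0, l=1.0, sigma2=1.0)
```

The cost is dominated by solving with the M_c × M_c matrix K + σ²I, which is exactly why the inversion tricks discussed below matter.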

Model inference
The model is estimated via MCMC. The DP parameter α is updated using a Metropolis-Hastings step and the cluster-specific covariate parameters φ_c are updated using Gibbs steps. For the MVN specification of the outcome model, the mean vectors μ_c and covariance matrices Σ_c are updated using Gibbs steps.
Regarding the inference of the model with the GP specification, the logarithms of the hyperparameters of the squared exponential covariance function are updated in a Metropolis-within-Gibbs step. Details on the complete Gibbs samplers are given in Section A of the Supplementary Materials.
For the GP specification, we implemented two algorithms for sampling the hyperparameters θ_c = (log(a_c), log(l_c), log(σ_c²)), either marginalising or explicitly sampling the function g^(c).
Marginal algorithm
In the first case, the hyperparameters are sampled from the posterior distribution

p(θ_c | ỹ^(c), t^(c), w^(c), β) ∝ p(ỹ^(c) | t^(c), θ_c) p(θ_c),

with y^(c) and t^(c) the vectors of observations and time points, respectively, of all the individuals in cluster c, and w^(c) the corresponding covariate matrix. The likelihood is defined in Equation (15) and involves inverting an M_c × M_c covariance matrix at each update of z, β or θ_c. To reduce computation time and, to a greater extent, handle large datasets, we used the Woodbury matrix identity (Woodbury, 1950) to update the allocation variable z in the corresponding Gibbs sampling step (Strauss, 2019) (see Sections B.1 and B.2 of the Supplementary Materials). We also implemented an approximation to the inverses of the cluster-specific covariance matrices K^(c) using an intermediate grid of regular time points (Snelson and Ghahramani, 2006) (Section B.3 of the Supplementary Materials). This approximation is used if either the grid of time points or its size is pre-specified by the user; in the latter case, the grid step is defined from the range of the observed time points.
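The Woodbury matrix identity converts an n × n inversion into a k × k one whenever the update to the matrix is low-rank. The following Python fragment numerically checks the identity with a diagonal base matrix (illustrative only; it is not the package's implementation of the allocation update).

```python
import numpy as np

# Woodbury matrix identity:
# (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
rng = np.random.default_rng(1)
n, k = 8, 2
A = np.diag(rng.uniform(1.0, 2.0, n))   # base matrix with a cheap inverse
U = rng.normal(size=(n, k))
C = np.eye(k)
V = U.T

A_inv = np.diag(1.0 / np.diag(A))
small = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)  # only a k x k inverse
woodbury_inv = A_inv - A_inv @ U @ small @ V @ A_inv
direct_inv = np.linalg.inv(A + U @ C @ V)
```

When reallocating one individual changes only a few rows and columns of the cluster covariance, this kind of update avoids re-inverting the full M_c × M_c matrix.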

Conditional algorithm
In the case where this approximation is not sufficient to reduce computation times (either the size of the time grid or the number of time points for all individuals in the same cluster is high), we implemented a second algorithm in which the function g^(c) is sampled, and the conditional posterior distributions of the hyperparameters θ_c^{1,2} = (log(a_c), log(l_c)) and θ_c^3 = log(σ_c²) are given by:

p(θ_c^{1,2} | g^(c), t^(c)) ∝ p(g^(c) | t^(c), θ_c^{1,2}) p(θ_c^{1,2}),
p(θ_c^3 | ỹ^(c), g^(c)) ∝ p(ỹ^(c) | g^(c), θ_c^3) p(θ_c^3),

where the likelihood corresponds to an M_c-variate normal density with null mean and diagonal covariance matrix, with diagonal elements equal to σ_c². The posterior distribution of g^(c)(t*), with t* any input time vector, is a multivariate Gaussian density with mean and covariance matrices:

E[g^(c)(t*)] = K^(c)(t*, t^(c)) (K^(c) + σ_c² I_{M_c})^{-1} ỹ^(c),
Cov[g^(c)(t*)] = K^(c)(t*, t*) − K^(c)(t*, t^(c)) (K^(c) + σ_c² I_{M_c})^{-1} K^(c)(t^(c), t*).    (16)

When the cluster-specific covariance matrices have low dimensions (equal to the sum of the numbers of time points of all individuals allocated to the given cluster), the marginal algorithm will be faster, as the functions g^(c) are not sampled. However, if these dimensions are high, solving the necessary matrix inversions will become prohibitively time-consuming or even encounter numerical difficulties. The computational complexity of matrix inversion being at least O(n^2.37) for state-of-the-art blockwise inversion algorithms (Alman and Williams, 2021), as a rule of thumb we recommend using the conditional algorithm when the mean number of time points per individual multiplied by the number of individuals is high relative to the available computing power (e.g. higher than 1000 on a laptop with a 2.2 GHz dual-core processor). When, in addition, time points are irregular across individuals, we recommend using the approximation mentioned above to compute the inverses of the covariance matrices.
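The standard GP posterior mean and covariance at arbitrary input times can be sketched as follows (illustrative Python, assuming a zero-mean prior and the squared exponential kernel; the function name is hypothetical).

```python
import numpy as np

def gp_posterior(t_star, t, y, a, l, sigma2):
    """Posterior mean and covariance of g(t_star) under a zero-mean GP prior
    with squared exponential kernel, given noisy observations y at times t."""
    k = lambda u, v: a * np.exp(-0.5 * (np.subtract.outer(u, v) / l) ** 2)
    t_star, t, y = (np.asarray(x, float) for x in (t_star, t, y))
    Kn = k(t, t) + sigma2 * np.eye(t.size)   # noisy covariance of the data
    Ks = k(t_star, t)
    mean = Ks @ np.linalg.solve(Kn, y)
    cov = k(t_star, t_star) - Ks @ np.linalg.solve(Kn, Ks.T)
    return mean, cov

mean, cov = gp_posterior([0.0, 1.0, 2.0], [0.0, 1.0, 2.0],
                         [1.0, 2.0, 3.0], a=1.0, l=1.0, sigma2=1e-8)
```

With near-zero noise, the posterior mean interpolates the observations at the training times, which is a convenient sanity check.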
We recommend using at least as many iterations as data points for the sampling phase, and 10% as many iterations for the burn-in phase. In general, one should keep increasing the number of iterations until convergence of the MCMC chain is reached. This convergence (or lack thereof) can be checked in such mixture models via the traces of the global parameters (α, β and the number of non-empty clusters) and the marginal partition posterior, defined as P(Z|D) by Hastie et al. (2015).
Finally, the MVN specification does not handle missing outcome values, and all individuals should have the same number of observations, collected at the same observation times. However, the GP specification handles missing outcome values and loss to follow-up under the assumption that data are missing either completely at random or at random. The repository page for PReMiuMlongi, https://github.com/premium-profile-regression/PReMiuMlongi, contains both the package and detailed documentation.

Estimated cluster trajectories and individual predictions
A representative partition is obtained using post-processing methods: for each number of clusters from 2 up to a specified threshold, a Partitioning Around Medoids (PAM) algorithm is applied to the posterior dissimilarity matrix 1 − S, where S is the N × N matrix estimating the posterior probability of co-clustering for each pair of individuals. The optimal partition is then chosen by maximising the average silhouette width across the best PAM partitions, as in Liverani et al. (2015).
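The average silhouette width used to compare the candidate PAM partitions can be computed directly from a dissimilarity matrix; the following is an illustrative Python sketch of that criterion (not the package's implementation, and the PAM step itself is omitted).

```python
import numpy as np

def average_silhouette(D, labels):
    """Average silhouette width for a hard partition, computed from an
    N x N dissimilarity matrix D (e.g. 1 - S in the text)."""
    D, labels = np.asarray(D, float), np.asarray(labels)
    n = labels.size
    widths = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False
        a = D[i, same].mean() if same.any() else 0.0   # within-cluster dissimilarity
        b = min(D[i, labels == c].mean()               # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        widths[i] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return widths.mean()

D = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0],
              [1, 1, 0, 0]], dtype=float)
asw = average_silhouette(D, [0, 0, 1, 1])
```

A perfectly block-structured dissimilarity matrix with matching labels yields an average silhouette width of 1, while mismatched labels drive it negative.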
The uncertainty in the cluster allocations is then integrated into the estimation of the cluster-specific parameters, following Molitor et al. (2010). The GP hyperparameters of the c-th cluster of the optimal partition are estimated by considering the N_c individuals allocated to it, and averaging over the sampled hyperparameters corresponding to the clusters these individuals were allocated to at each MCMC iteration:

θ̂_c = (1 / (H N_c)) Σ_{h=1}^{H} Σ_{i: z*_i = c} θ^{(h)}_{z_i^{(h)}},

with z*_i the cluster of individual i in the optimal partition and θ^{(h)}_{z_i^{(h)}} the hyperparameters corresponding to the cluster z_i^{(h)} individual i was allocated to at iteration h. Following the same procedure, we can compute individual longitudinal predictions (or predictions for a given profile of individuals) while accounting for the uncertainty in the cluster allocations: at each iteration, the given individual is assigned to a cluster based on his/her posterior membership probabilities given the data; the final longitudinal predictions Ŷ_i are then computed by averaging over his/her cluster-specific predictions across the MCMC iterations. In the MVN specification, the predicted trajectory is obtained by

Ŷ_i = (1/H) Σ_{h=1}^{H} [ μ^{(h)}_{z_i^{(h)}} + (β^{(h)T} w_i) 1_M ],

with β^{(h)} the vector of estimated fixed effects at iteration h and μ^{(h)}_{z_i^{(h)}} the estimated mean corresponding to the cluster z_i^{(h)} the individual was allocated to at iteration h. In the GP specification, the predicted trajectory is obtained as the point-wise median, over the H MCMC iterations, of the posterior cluster-specific mean functions computed on a specified time grid t of length n_t, each given by the posterior mean function (Equation 16) in the cluster z_i^{(h)} the individual was allocated to at iteration h. The 90% credible intervals are computed by considering the 5th and 95th percentiles. Section F of the Supplementary Materials describes how these models have been incorporated into PReMiuMlongi, and the outputs that are generated.

Simulation studies
In this section, we present four simulation studies. The first two demonstrate our motivation for using outcomes in profile regression to guide clustering: first, where the clustering structure in the covariates is weak and, second, where there are multiple clustering structures in the covariates, of which only one has clinical relevance. The third demonstrates the MVN and GP model formulations, comparing them with each other in terms of duration and efficacy in recovering the generated clustering structure. The methods are also assessed as the number of observation times is varied, and as the clustering structures overlap. The fourth study assesses the GP model on unequal time points.
Efficacy in recovering the target clustering structure is assessed via Rand indices (Rand, 1971). The Rand index is a metric ranging from 0 to 1 that reflects the similarity between two clustering structures; the index is 1 for identical structures. The adjusted Rand index corrects for the extent of agreement that would be expected by chance, given the sizes of the clusterings (Hubert and Arabie, 1985); hence, it is possible for the adjusted Rand index to take a negative value. In this section, we use the posterior expectation adjusted Rand (PEAR) criterion (Fritsch and Ickstadt, 2009), which corresponds to the adjusted Rand index between the optimal partition obtained from post-processing the fitted model and the generated partition. This index is computed using the R package mcclust (Fritsch and Ickstadt, 2009).
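The adjusted Rand index can be computed by pair counting, as in the following illustrative Python sketch (the analyses in the paper use the R package mcclust; this fragment only shows the arithmetic behind the index).

```python
from itertools import combinations

def adjusted_rand_index(z1, z2):
    """Adjusted Rand index between two partitions via pair counting."""
    pairs = list(combinations(range(len(z1)), 2))
    n11 = sum(z1[i] == z1[j] and z2[i] == z2[j] for i, j in pairs)  # agree in both
    s1 = sum(z1[i] == z1[j] for i, j in pairs)   # co-clustered pairs in z1
    s2 = sum(z2[i] == z2[j] for i, j in pairs)   # co-clustered pairs in z2
    expected = s1 * s2 / len(pairs)              # chance agreement
    max_index = 0.5 * (s1 + s2)
    if max_index == expected:                    # degenerate partitions
        return 0.0
    return (n11 - expected) / (max_index - expected)

ari = adjusted_rand_index([0, 0, 1, 1, 2], [0, 0, 1, 1, 2])
```

Identical partitions score 1, and partitions that agree less than expected by chance score below 0, as noted in the text.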

Inference
In the following simulation studies, we adopt a GEM prior for the mixture distribution (Equation 2), with α ∼ Gamma(2, 1). The covariate model is defined by Equation (3), with φ_{c,q} ∼ Dirichlet(1, . . ., 1) for each cluster c and covariate q. In Simulations 1, 2 and 3, we use the MVN specification for the outcome model, defined in Equation (5), with κ_0 = 0.01 and ν_0 equal to the number of time points M. In Simulations 3 and 4, the outcome model with the GP specification is defined by Equations (10) and (11), adopting a null mean function for the GP prior and a squared exponential covariance function, with standard lognormal priors for the hyperparameters, [log(a_c), log(l_c), log(σ_c²)]^T ∼ N(0, I_3). The initial number of clusters is set to 20.

Simulation study 1: Semi-supervised vs. unsupervised clustering
Here we demonstrate the importance of using outcome data in identifying latent clustering structures. We define two clusters, each with 50 individuals, and simulate covariate data together with 0 to 4 repeated outcome measurements. We use PReMiuMlongi with an MVN response model and compare the estimated clustering structures with the generated one.
Our focus is on the difference between inference using no outcome data (M = 0) and inference using M ≥ 1 observation times. Inference using no outcome data corresponds to studies in which clustering is first performed on covariate data, and cluster memberships are then leveraged to explain some outcome. The rationale of profile regression is that, when the purpose of clustering is to group individuals according to their outcomes, using outcome data leads to more meaningful cluster labels.

Design
The covariates are discrete variables generated from a categorical distribution. We set the number of covariates to 10, denoted q = 1, . . ., 10. Each covariate q has R_q = R = 3 categories, with multinomial R-parameter φ_{c,q} generated as follows:

φ_{c,q} = v δ_{c,q} + (1 − v) (1/R) 1_R,   δ_{c,q} ∼ Dirichlet(α_0 1_R),

where δ_{c,q} and 1_R are vectors of length R, and α_0 = 0.01. The v parameter influences the cluster separability in terms of covariate profiles. A value of v close to 0 leads to similar profiles, as the probability φ_{c,q} for each covariate q tends towards 1/R for all clusters c. A value of v close to 1 leads to distinct profiles, as the small value of α_0 leads to most individuals in cluster c having the same value for covariate q, which is unlikely to be the same between the clusters for all ten covariates. We choose v = 0.4 as an intermediate value to demonstrate the added benefit of integrating repeated outcome measures for recovering the clustering.
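One reading of this design is that each φ_{c,q} mixes a near-degenerate Dirichlet(α_0 1_R) draw with the uniform vector 1_R / R. The Python sketch below implements that reading, replacing the Dirichlet draw with a random one-hot vector, to which it is nearly degenerate for α_0 = 0.01; the function name and this simplification are ours, not the paper's.

```python
import numpy as np

def simulate_covariate_probs(C, Q, R=3, v=0.4, seed=None):
    """Cluster- and covariate-specific category probabilities, mixing a
    near-degenerate profile (here a random one-hot vector, standing in for
    a Dirichlet(alpha_0 1_R) draw with tiny alpha_0) with the uniform 1_R / R."""
    rng = np.random.default_rng(seed)
    spike = np.eye(R)[rng.integers(R, size=(C, Q))]  # one-hot "spiky" profiles
    return v * spike + (1.0 - v) / R

phi = simulate_covariate_probs(C=2, Q=10, R=3, v=0.4, seed=0)
```

With v = 0.4 and R = 3, every category probability is either 0.2 or 0.6, so the clusters differ in their covariate profiles but with substantial overlap.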
The outcome is generated from a multivariate normal distribution with mean vectors 1_M and 4 · 1_M for the first and second clusters, respectively, and covariance matrices 0.5 I_M, where I_M is the identity matrix of dimension M, the number of observation times.

Results
We ran PReMiuMlongi with the MVN response model and max(2000, 2000M) iterations each for the burn-in and sampling phases. The effect of the number of observation times on the resulting PEAR indices is shown in Figure 1. The x-axis represents 5 simulation settings with all individuals having 0 to 4 time points. Each setting is simulated 1000 times. When the outcome is not accounted for, the PEAR indices are low, as the separability parameter between the covariate profiles is relatively low (v = 0.4). The major PEAR improvement is from zero to one observation time. This is due to the way the longitudinal pattern is simulated: the clustering structure is weak in the covariates and stronger in the outcomes. It is possible to recover, to some extent, the clustering structure in the covariates without use of outcome data. However, inclusion of outcome data improves the recovery of the cluster structure (Figure 1).

Simulation study 2: Semi-supervised clustering and variable selection
Here we adapt the simulation study of Section 4.2 to a case in which there are two sets of covariates that correspond to two different clustering structures of the population. The outcome data share the clustering structure of one set of covariates and are independent of the other one (which we refer to as the "alternative clustering"). Moreover, we use the variable selection feature of PReMiuMlongi in order to select the covariates that guide the final clustering. In the absence of outcome data, there is no reason for one set of covariates to be selected over the other. Outcome data are therefore required for identifying meaningful clustering structures when more than one structure is present in the data according to different covariate subsets.

Design
Covariate data consist of four variables, two of which correspond to a first clustering structure (which we describe as the "alternative clustering" and denote č) and two of which correspond to a second clustering structure, which retains the label c. We simulate covariates as in Simulation 1, while choosing ν = 1 for all covariates, so that each clustering structure has a strong signature when its covariates are viewed alone (see Supplementary Figure 10). Extending the notation from Section 4.2, where each individual has a cluster assignment ž_i for č and a cluster assignment z_i for c, covariate x_{q,i} | ž_i = č has parameter φ_{č,q} for q = 1, 2, and x_{q,i} | z_i = c has parameter φ_{c,q} for q = 3, 4.
Response data are simulated as in Section 4.2, according to clustering structure c. The response variable is aligned with clustering structure c and is independent of the alternative clustering structure č. The two covariates with clustering structure c therefore align with the response variable.
Again we vary the number of observation times from 0 to 4 in five simulation scenarios. We use this model to illustrate how inclusion of outcome data influences variable selection, and how these together affect PEAR indices.

Results
Inclusion of response variables enables recovery of the target (generated) clustering structure (Figure 2). The top row of Figure 2 shows density plots of variable selection for each number of observation times. The bottom row represents the corresponding 5th, 50th and 95th percentiles of PEAR indices for the 1000 simulations of each scenario, when comparing the estimated clustering structure with the one that aligns with the outcome. When the response is excluded (the number of observation times is zero), covariates 1-4 are not differentially selected, and correspondence between the inferred and generated clustering structures is low (median PEAR index around 0.5). When four time points (or MVN outcomes) are included, covariates one and two are deselected in favour of covariates three and four, resulting in identification of the clustering structure that aligns with the response.
Here we have shown that, for a dataset in which there is more than one clustering structure, use of covariate variable selection with outcome data in profile regression improves recovery of the target clustering structure.
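The PEAR indices reported throughout these studies can be sketched as the average adjusted Rand index between sampled partitions and a reference partition. The Python sketch below is illustrative; the function names and the averaging-over-samples reading of PEAR are our own (PReMiuMlongi computes these indices internally).

```python
from math import comb
from collections import Counter

def adjusted_rand(z1, z2):
    """Adjusted Rand index between two partitions (lists of labels)."""
    n = len(z1)
    pairs = Counter(zip(z1, z2))     # contingency-table counts
    a = Counter(z1)
    b = Counter(z2)
    sum_ij = sum(comb(v, 2) for v in pairs.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:        # degenerate partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def pear(samples, reference):
    """Posterior expected adjusted Rand: mean ARI of MCMC partition
    samples against a reference partition."""
    return sum(adjusted_rand(z, reference) for z in samples) / len(samples)
```

The index is invariant to label permutations, equals 1 for identical partitions, and is near 0 (possibly negative) for unrelated ones, which is why a median PEAR around 0.5 indicates only partial recovery.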

Simulation study 3: Comparison of MVN and longitudinal specifications
To draw out the differences between the two response models, we use each in turn to simulate data, and use both to infer clusters and profiles. For the sake of comparison, we simulate data that are longitudinal, with all individuals having observations at the same time points, so that both the MVN and GP model formulations may be applied to the generated data. We compare the two methods in terms of PEAR indices and the time taken by the algorithm, and then assess the effects of a) the data-generating model, b) the number of observation times, and c) the extent to which the generated clusters are identifiable.

Design
In this section, we generate two clusters of 30 individuals. Covariate data consist of two discrete variables. In order to better highlight the difference between the MVN and GP methods, and to avoid the covariates aiding cluster identification, this simulation study features no cluster structure in the covariate data.
Outcome data for the two clusters are simulated with mean vectors 10 · 1_M for the first cluster and (10 + ξ t_j)_{j=1,...,M} for the second, where t_1, ..., t_M are the M equally spaced time points over [0, 10] and ξ is a gradient (or separability) parameter ranging from −1 to 0; the covariance matrices are the same for both clusters. We consider several scenarios by varying the data-generating model (MVN or GP), the number of observation times M (3 to 6) and the gradient ξ (−1 to 0). Some examples of simulated data are given in Figure 3. Both response models are estimated on each scenario, with 5000(M − 2) iterations each for the burn-in and sampling phases for the MVN response model, and 5000 iterations per phase for the GP one. Each scenario is repeated 1000 times.

Results
Figure 4 summarises the effects of the three variables of interest on the two response models. A small gradient (|ξ| < 0.5) between the two generated clusters has a strong negative impact on the performance of the MVN response model. Also, we observe that the target clustering structure is better recovered by the MVN response model when there are fewer observation times, for a budgeted computation of 10000(M − 2) iterations. Since the model dimension increases with more observation times, more iterations of the sampler are required to reach convergence, as illustrated in Supplementary Figure 11.
For the GP response model, the generated clustering structure is well recovered for the GP-generated data, as these data have an inherent structure that the response model can uncover. This response model has only three parameters to infer per cluster regardless of the number of observation times, in comparison with M(M − 1)/2 for the MVN response model, so fewer iterations are required for the GP response model. Both models perform better when the clusters are more separable, as we would expect.
Figure 4 also summarises the time taken for PReMiuMlongi to run. The MVN response model is quicker than the GP model by two orders of magnitude, despite the larger number of burn-in and sampling iterations for datasets with more than three observation times.
Figure 5 displays the individual outcome trajectories for the 200 individuals generated for the following simulation study, coloured according to their cluster allocations. A set of 5 categorical covariates, with 3 categories each, was generated for each individual given the cluster allocations, with the cluster-specific parameters shown in Table 1.

Simulation study 4: Handling irregularly spaced time points
The model was then estimated using cluster-specific Gaussian processes with standard normal hyperpriors for the log-parameters: log(a_c, l_c, σ²_c)ᵀ ∼ N(0, I_3). We adopted a Gamma prior for the concentration parameter of the DP, α ∼ Gamma(2, 1), and Dirichlet priors for the cluster-specific parameters of the covariate models, φ_{c,q} ∼ Dirichlet(1, ..., 1) for cluster c and covariate q, so that each covariate level e receives unit concentration. The initial number of clusters was set to 20.
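A minimal sketch of the cluster-specific covariance implied by these hyperpriors, assuming a squared-exponential kernel with signal variance a_c, length-scale l_c, and measurement-error variance σ²_c added on the diagonal (the exact parameterisation of Equation (9) is assumed here; Python for illustration):

```python
import random
from math import exp

def se_cov(times, a, l, sigma2):
    """Squared-exponential covariance a*exp(-(t - t')^2 / (2 l^2)),
    with measurement-error variance sigma2 added on the diagonal."""
    n = len(times)
    return [[a * exp(-((times[i] - times[j]) ** 2) / (2.0 * l ** 2))
             + (sigma2 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

random.seed(3)
# standard normal hyperpriors on the log-parameters, as described above
log_a, log_l, log_s2 = (random.gauss(0.0, 1.0) for _ in range(3))
K = se_cov([0.0, 1.0, 2.5, 4.0], exp(log_a), exp(log_l), exp(log_s2))
```

Working on the log scale keeps all three variance parameters positive, and the kernel accepts arbitrary (irregular) time points, which is the point of the GP specification in this simulation study.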

Results
We ran 7 chains on the generated dataset, with 10000 iterations each, and compared the traces of α, the number of clusters (nClus) and the number of non-empty clusters (Figure 6), as well as the posterior similarity matrices in Supplementary Figure 12. The traces of α overlap and show convergence of the 7 chains. The traces of the number of clusters across chains also overlap but have different minimal values, bounded by the number of non-empty clusters, shown on the third plot. Even though the traces of non-empty clusters vary between 3 and 7 across the iterations, the similarity matrices (Supplementary Figure 12) show good recovery of the generated clustering, represented in the first column of each plot. Posterior similarity matrices are highly contrasted because of the overall clear separability of the clusters. However, the third and fifth chains merge clusters 1 and 5, whose longitudinal trajectories actually overlap (Figure 5). This may imply that these chains are stuck in local modes, a common challenge in mixture models; the algorithm could be improved by altering the proposal distributions. Split and merge moves could also be added to the PReMiuMlongi sampler, as in Jain and Neal (2007), but this is beyond the scope of the current work.
We present a similar simulation study with 500 individuals in Section D in the Supplementary Materials, to mimic the size of the data application.

Application
We applied our methodology to budding-yeast data. Yeast is widely studied in genetic research because, like human cells, yeast cells are eukaryotic, with their DNA enclosed in a cell nucleus. Additionally, yeast is unicellular and therefore easier to study. The genome of the yeast species Saccharomyces cerevisiae has been entirely sequenced and is publicly available. During the cell cycle, gene expression is regulated by transcription factors, proteins inducing or repressing the transcription of DNA into mRNA by binding to specific DNA sequences. Similar expression patterns across genes suggest a co-regulation process, indicating that these genes may be regulated by a common set of transcription factors. The protein-DNA interactions can be investigated using chromatin immunoprecipitation (ChIP) microarrays, which identify the DNA binding sites. The objective was to jointly analyse ChIP and gene expression data, to uncover groups of genes co-regulated during the budding yeast cell cycle and to identify the transcription factors involved in this process, in order to better understand the complex relationships between genes.

Data
We analysed gene expression time course data collected by Granovskaia et al. (2010) and ChIP data provided by Harbison et al. (2004), also analysed in Kirk et al. (2012a) and Zurauskiene et al. (2016). Gene expression was quantified over time by collecting high-resolution, strand-specific tiling microarray profiles of RNA expression every 5 minutes over three Saccharomyces cerevisiae cell cycles, for a total of 41 measures. The gene measurements were then standardised at the gene level, so that the mean and variance of the expression measures of each gene were 0 and 1, respectively. The binary ChIP data contained binding information for 117 transcription factors, indicating which transcription factors bound to which genes during the cell cycle. We selected the 551 genes with periodic expression patterns identified in Granovskaia et al. (2010) for which ChIP data were available, and considered the transcription factors which bound to at least one of these genes, totalling 80. To alleviate the computational burden, we reduced the set of repeated gene expression measurements to a set of 11 regularly spaced time points. The corresponding trajectories for the 551 genes are depicted in Figure 7.
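The preprocessing described above can be sketched as follows (illustrative Python; the exact thinning rule, keeping every 4th of the 41 measures, i.e. one every 20 minutes, is an assumption about how the 11 regularly spaced points were obtained):

```python
def standardise(y):
    """Centre and scale one gene's expression series to mean 0, variance 1."""
    n = len(y)
    m = sum(y) / n
    var = sum((v - m) ** 2 for v in y) / n
    sd = var ** 0.5
    return [(v - m) / sd for v in y]

def thin(y, keep=11):
    """Keep `keep` regularly spaced points: for 41 points measured every
    5 minutes this retains every 4th measure (one every 20 minutes)."""
    step = (len(y) - 1) // (keep - 1)
    return [y[i * step] for i in range(keep)]

gene = standardise([float(i % 7) for i in range(41)])  # toy periodic series
reduced = thin(gene)
```

Per-gene standardisation puts all 551 trajectories on a common scale, so that clustering reflects the shape of the expression pattern rather than its absolute level.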

Model
We analysed the gene expression time course Y and the transcription factors X_q, q = 1, ..., 80, specifying the following model: with Y_i and t_i the vectors of individual repeated measures and time points, respectively; X_{i,q} the binary variable indicating whether transcription factor q binds to gene i; α the concentration parameter of the Dirichlet process prior, influencing the number of clusters; Θ the vector of all parameters; and φ_{c,q} the probability that transcription factor q equals 1 in cluster c. Hyperpriors were defined as g_c ∼ GP(m_c, K_c), with log(a_c) ∼ N(1.48, 0.5) and log(l_c) ∼ N(5, 0.1), where m_c is the null function and K_c a squared exponential covariance function (defined in Equation (9)).
The lognormal hyperpriors adopted for the covariance functions K_c were chosen based on an empirical variogram of the data (Diggle, 1990). We constrained the variance of the measurement error by defining it as proportional to the cluster signal variance, a_c = r σ²_c with r = 4. Finally, a variable selection approach was used to identify the transcription factors likely involved in the co-regulation process of the genes in each group.

Results
The model was estimated by MCMC with 10000 iterations. Based on the posterior similarity matrix (Figure 9), we identified a final partition of 4 clusters of 109, 206, 113 and 123 genes, respectively. Based on a 10% lower threshold for the variable-specific relevance indicator ρ_q (Equation 4), the variable selection identified 10 transcription factors which had different distributions across the clusters and were thus likely involved in the co-regulation process: MCM1, MBP1, SWI5, SWI6, SWI4, NDD1, ACE2, FKH2, FKH1 and STE12. The gene expression trajectories in these 4 clusters are depicted in Figure 8, exhibiting distinct patterns with different peak times occurring in different phases of the cycle. The cell cycle is divided into 4 phases: Gap 1 (G1), during which the cell grows and prepares for duplication; synthesis (S), where DNA replication occurs; Gap 2 (G2), during which the cell keeps growing to prepare for division; and mitosis (M). The 4 clusters seem to be involved in the M/G1, G1, G1 and G2 phases, respectively. The description of the same clusters with regard to all the transcription factors is depicted in Supplementary Figure 17. The right panels of Figures 8 and 13 were produced with the R package premiumPlots (https://github.com/simisc/premiumPlots).
Figure 8: Trajectories of gene expression and covariate profiles for the 4 clusters of the final partition. On the left panels, trajectories of the genes in the colour associated with the cluster they are allocated to. Dashed vertical black lines delimit the three cell cycles and dashed vertical grey ones delimit the different phases within the first cycle: M/G1, G1, S, G2, G2/M, M and M/G1 again. On the right panels, cluster-specific profiles with regard to the 10 selected transcription factors: MCM1, MBP1, SWI5, SWI6, SWI4, NDD1, ACE2, FKH2, FKH1 and STE12. In each cell, associated with cluster k = 1, ..., 4 (rows) and transcription factor q = 1, ..., 10 (columns), the black pip represents the empirical proportion of genes with x_q = 1 in the whole sample. The turquoise and red filled squares represent the estimated cluster-specific proportions P(x_{c,q} = 1) and P(x_{c,q} = 0), respectively. Squares are dark-filled if the 95% credibility interval of P(x_{c,q} = 1) does not contain the empirical proportion, meaning that the considered category differs significantly in the cluster compared to the whole sample: red squares are dark-filled if the 5th percentile is above the empirical mean, and blue ones if the 95th is below.
We assessed the relevance of the clustering solution using Gene Ontology Term Overlap (GOTO) scores (Mistry and Pavlidis, 2008). GOTO scores quantify the homogeneity of Gene Ontology annotations within clusters of genes: they represent the average number of shared GO annotations for a pair of genes from the same cluster. Obtaining similar values for our clustering compared to previous results on this dataset (see Supplementary Table 2) thus validates the biological function homogeneity of our identified clusters.
Finally, we analysed the same dataset using the MVN specification in Section E.2 of the Supplementary Materials. The results show that the GP specification, in contrast to the MVN outcome model with an unstructured covariance matrix, allows the apparent underlying clustering structure of the gene expression data to be recovered, notably thanks to its flexibility and its reduced number of parameters to estimate.

Convergence diagnosis and sensitivity analysis
We ran two additional chains to assess the convergence of the model, and compared the trace of the concentration parameter α of the Dirichlet process, as well as the number of clusters, across the MCMC iterations, shown in Supplementary Figure 18. The cluster-specific parameters cannot be traced as the number of clusters varies from one iteration to another. Supplementary Figure 19 presents the two posterior similarity matrices, which are very similar to the one in Figure 9, further supporting the convergence of the model.
Figure 9: Posterior similarity matrix (PSM) representing the probability of pairwise co-clustering for the 551 genes. Each element (i, j) of this 551 × 551 matrix is proportional to the number of times genes i and j were allocated to the same cluster over the 10000 MCMC iterations.
We estimated the same model with different values of the ratio r = a_c/σ²_c between the signal variance parameter and the measurement-error variance in a sensitivity analysis. All models were specified in the same way as described above. We compared the partitions obtained with different values of r in terms of PEAR indices (Supplementary Table 3). We observe very stable results when r varies between 2 and 6, showing little sensitivity to the setting of this parameter.

Discussion
We extended the semi-supervised profile regression approach to uncover the intrinsic clustering structure in a heterogeneous population from a longitudinal outcome and a set of correlated covariates. Outcome patterns and covariate profiles, defined as combinations of covariate values, are linked through a nonparametric clustering structure. A Dirichlet process prior is used on the mixing distribution to handle an unconstrained number of clusters, also allowing us to quantify the uncertainty in cluster allocations. We implemented two modelling approaches to describe the repeated outcome, based on either a multivariate Gaussian distribution, to handle regular time points for all individuals, or Gaussian process regression, accommodating irregular time points and/or a large number of time points. The cluster-specific parameters are estimated while incorporating the uncertainty in the cluster allocations, and individual marginal predictions can be computed, accounting for the population heterogeneity. The simulation study showed that the MVN modelling better recovered the clustering structure and was robust with a low number of time points, while the GP modelling performed better with more than 5 or 6 time points, even though it required a longer computation time regardless of the number of time points and the separability of the clusters. Applying the model to budding yeast data, we integrated gene expression time course measurements and transcription factor binding data to identify clusters of co-regulated genes and associated transcriptional modules.
In the current version of the algorithm, we adopted an Inverse Wishart prior distribution for the variance-covariance matrix in the MVN specification, which is advantageous for its conjugacy. However, with a high number of time points, a Hierarchical Inverse Wishart prior may provide more flexibility to the model (Alvarez et al., 2014). We also implemented a squared exponential kernel for the GP prior covariance function. This is a common choice for such priors, but can lead to some instability when fitting longitudinal data with highly overlapping trajectories. Indeed, the variance hyperprior has an important impact on the final number of non-empty clusters, with high signal variances leading to a small number, and vice versa. In our application, we fixed the variance ratio to ensure that the signal variance is larger than (or at least equal to) the measurement variance for each cluster, in conjunction with weakly informative hyperpriors for the signal variance. Additionally, we performed a sensitivity analysis showing the stability of our results under different values for this variance ratio. Duvenaud et al. (2013) considered other base kernels, such as linear, quadratic, rational quadratic or periodic ones, and proposed composite kernels defined as combinations of the base ones. Such an extension of profile regression could potentially improve the recovery of known structures in the dataset. Moreover, conditionally on the cluster allocations, the variability of the outcome is entirely modelled by this covariance function. Cluster-specific fixed effects could refine the cluster identification and help disentangle the effect of a covariate across the different clusters. Finally, hierarchical Gaussian process regression (Hensman et al., 2013) would allow the handling of multiple longitudinal markers, all correlated through an underlying cluster-specific process. However, the estimation of such models would require a large amount of data to estimate all the levels of the hierarchical GP modelling.
The dataset analysed in this paper contained no missing values, but our model with the GP specification could also be applied to incomplete data, such as epidemiological cohorts. However, it relies on the assumption that missing outcome data are missing at random. The MVN specification does not handle missing outcome data, but multiple imputation methods (Rubin, 1987; van Buuren and Groothuis-Oudshoorn, 2011) could be considered for a future extension. For handling not-at-random missing data, we intend to extend the algorithm to jointly model a longitudinal outcome and a time-to-event, using a cluster-specific proportional hazards model. The missingness process would then be linked to the outcome through the clusters, which would entirely capture the correlation between the two quantities. Multiple endpoints, such as time to disease onset and disease progression, could also be accommodated by combining a multi-state model, associating specific risks of disease with the different clusters, as in Rouanet et al. (2016).
Finally, profile regression assumes a clustering structure common to all data types integrated in the analysis. More flexible approaches consider different structures for the longitudinal outcome and the covariates, with possible common clusters (Kirk et al., 2012b; Kirk and Richardson, 2021; Savage et al., 2010). This enables the identification of the meaningful clusters with regard to all the biomarkers, while discarding irrelevant stratifications in the population, possibly linked to the inherent structure of a given set of variables. This would also be an interesting area for development.
Table 2 compares the GOTO scores obtained with PReMiuMlongi and iCluster (Shen et al., 2009), a clustering method allowing the integration of multiple datasets (here, the gene expression time course data collected by Granovskaia et al. (2010) and the ChIP data provided by Harbison et al. (2004)) using a joint latent variable model. Shen et al. (2009) recommend choosing the number of clusters that minimises the proportion of deviation score, indicating stronger cluster separability. On this dataset, the proportion of deviation score was minimised with 2 clusters (as shown in the Supplementary Materials of Kirk et al. (2012b)). The comparison shows very similar results, the advantage of PReMiuMlongi being that it automatically infers the number of clusters instead of using heuristic approaches a posteriori. In this case, we found 4 clusters as best describing the population heterogeneity. Finally, Table 3 presents the comparison of the clustering structures obtained by iCluster and PReMiuMlongi with the GP specification. The correspondence between the clusters is not clear, as clusters 1-4 of the latter method are scattered across both clusters 1 and 2 of the former. The individual evolutions by cluster obtained by iCluster (Supplementary Figure 20) are much more heterogeneous, which demonstrates the improved fit of the PReMiuMlongi clustering to the observed trajectories.

Supplementary Materials for Bayesian profile regression for clustering analysis involving a longitudinal response and explanatory variables
A Conditionals for Gibbs sampling
The full joint posterior distribution can be written as in Equation (20). Of note, the parameters of the prior distributions are omitted in the remainder of the text for the sake of brevity. Recall the form of the MVN response; this corresponds to the fourth line of Equation (20). When marginalising g^(c), the individual outcome variables are not independent from one another given the parameters [θ, z, β]. Hence, the individual likelihood in cluster c is computed by conditioning on all the data points in the cluster, with the dimension of K^(c) being equal to the total number of time points of all individuals allocated to cluster c. The step sizes s_{θ_j} are updated in the same way as the fixed-effect coefficients in PReMiuMlongi. All step sizes are initially set to 1 and we aim for an acceptance rate of 0.44 (Liverani et al., 2015).
The hyperparameters are resampled via Metropolis-within-Gibbs steps: each proposed value θ*_{c,j} is accepted with probability A, so that the chain keeps the current value θ_{c,j} with probability 1 − A. Steps a to e (in Section A.2.1) sample g^(c).
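A Metropolis-within-Gibbs update with step sizes adapted towards a 0.44 acceptance rate can be sketched as below. This is an illustrative Python sketch on a toy standard-normal conditional, using a Robbins-Monro style adaptation of our own choosing; the actual PReMiuMlongi scheme follows Liverani et al. (2015) and may differ in detail.

```python
import random
from math import exp, log

random.seed(7)

def log_target(theta):
    """Toy log conditional density: standard normal (a stand-in for
    log p(theta_c) + log p(y | t, w, theta_c, beta))."""
    return -0.5 * theta * theta

def adaptive_rw(n_iter=5000, target_rate=0.44):
    """Random-walk Metropolis with step size adapted towards target_rate."""
    theta, step, n_acc = 0.0, 1.0, 0
    samples = []
    for it in range(1, n_iter + 1):
        prop = theta + random.gauss(0.0, step)
        accept = log(random.random()) < log_target(prop) - log_target(theta)
        if accept:
            theta = prop
            n_acc += 1
        # Robbins-Monro adaptation: nudge log(step) by a decaying amount
        step *= exp(((1.0 if accept else 0.0) - target_rate) * it ** -0.6)
        samples.append(theta)
    return samples, n_acc / n_iter

samples, rate = adaptive_rw()
```

The decaying adaptation leaves the step size essentially fixed in the long run, so the sampler targets the correct distribution while its acceptance rate settles near the desired 0.44.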

B Matrix inversion using Woodbury matrix identity
In this section, we describe an efficient inversion technique for updating the cluster-specific variance-covariance matrices (Equation 9) needed in the likelihood computation (Equation 10) of the Gibbs sampling step for the allocation variable z. In the following, M_0 = K(τ, τ) denotes the current variance-covariance matrix of cluster c, with τ the vector of all the observation time points of the individuals in cluster c, in ascending order, and M_new the updated cluster-specific variance-covariance matrix, either after adding individual i (sub-section B.1) or removing it (sub-section B.2). The vector of the n_i ordered observation time points of individual i is denoted by τ_i. In this section, we drop all subscripts/indices on c for notational convenience.
The Woodbury matrix identity states that, given a square invertible n × n matrix M, an n × p matrix U and a p × n matrix V, provided that (I_p + V M^{-1} U) is invertible:
(M + U V)^{-1} = M^{-1} − M^{-1} U (I_p + V M^{-1} U)^{-1} V M^{-1}.
If M^{-1} is known, this equation only requires inverting a matrix of dimensions p × p, with p < n.
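The identity can be checked numerically on a small example. The sketch below (plain Python, no external libraries; the matrix values are arbitrary) applies it with n = 2 and p = 1, so the inner inversion reduces to a scalar:

```python
def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

M = [[4.0, 1.0], [1.0, 3.0]]
U = [[2.0], [0.5]]            # n x p
V = [[1.0, 1.5]]              # p x n
Minv = inv2(M)

# scalar (I_p + V M^{-1} U), then the Woodbury correction term
s = 1.0 + matmul(matmul(V, Minv), U)[0][0]
corr = matmul(matmul(matmul(Minv, U), V), Minv)
wood = [[Minv[i][j] - corr[i][j] / s for j in range(2)] for i in range(2)]

# direct inverse of M + U V, for comparison
MpUV = [[M[i][j] + U[i][0] * V[0][j] for j in range(2)] for i in range(2)]
direct = inv2(MpUV)
```

The two inverses agree to machine precision; in the sampler the same trick updates the cluster covariance inverse at the cost of inverting only an n_i × n_i block.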

B.1 Adding a new individual to a cluster
We aim to invert the updated variance-covariance matrix M_new once individual i has been allocated to cluster c (z_i = c). We permute M_new such that it can be written in block form. By applying Equation (22) and defining B = M_0^{-1} K_{i,0} and A = K_i − K_{i,0} B, we obtain the updated inverse.

B.2 Removing an individual from a cluster
In this case, we aim to invert M_new after removing individual i from M_0. We permute M_0 accordingly, and obtain M_new^{-1} by applying Equation (22) twice. We also obtain det(M_new) = det(M_0)/det(A).

B.3 Sparse Gaussian Process
When the number of time points is large, or when the time points are too irregular across individuals, we propose to approximate the inverses and determinants of the GP covariance matrices using a time grid (Snelson and Ghahramani, 2006), in order to reduce computation time. We denote by u a vector of n_u regularly spaced time points. The variance-covariance matrix of the vector τ of time points in ascending order can then be approximated by Q_{τ,τ} = K_{τ,u} K_{u,u}^{-1} K_{u,τ}. Based on Equation (22) and Sylvester's theorem (Akritas et al., 1996), and defining A_{uτ} = (chol(K_{u,u})ᵀ)^{-1} K_{u,τ} so that A_{uτ}ᵀ A_{uτ} = Q_{τ,τ}, we obtain the required inverses and determinants. An example of this approximation can be found in Reid and Wernisch (2016).
Supplementary Figure 11: The data-generating model is MVN, the response model is MVN, the number of time points is six, and the gradient of the second cluster's response mean is -1 (i.e. the responses of the two clusters are most clearly separable). Inference is trialled with 1000, 2000, 4000, 8000, 16000, 32000, 64000 and 128000 iterations in each of the burn-in and sampling phases. As the number of iterations increases, the true underlying clustering structure is better recovered, as evidenced by the increase in PEAR index.
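The grid approximation of Section B.3 can be illustrated with a small Nyström-style computation. The sketch below is plain Python: the kernel parameters and time points are arbitrary, and K_{u,u} is inverted directly rather than via its Cholesky factor, for brevity.

```python
from math import exp

def k(s, t, a=1.0, l=1.5):
    """Squared-exponential kernel (noise-free part)."""
    return a * exp(-((s - t) ** 2) / (2.0 * l ** 2))

def inv2(M):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

tau = [0.3, 1.7, 2.9, 4.2]   # irregular observation times
u = [1.0, 3.5]                # small regular inducing grid

K_tt = [[k(s, t) for t in tau] for s in tau]
K_tu = [[k(s, t) for t in u] for s in tau]
K_uu_inv = inv2([[k(s, t) for t in u] for s in u])

# Nystrom-style approximation Q = K_{tau,u} K_{u,u}^{-1} K_{u,tau}
Q = [[sum(K_tu[i][p] * K_uu_inv[p][q] * K_tu[j][q]
          for p in range(2) for q in range(2))
      for j in range(4)] for i in range(4)]
```

Since K − Q is the GP conditional covariance of the τ-values given the grid values, it is positive semi-definite, so the approximation never overstates the diagonal of K.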

D Supplementary simulation with 500 individuals
We simulated a dataset of 500 individuals to mimic the size of the dataset in the application. The sizes of the generated clusters were 50, 75, 125, 150 and 100, respectively. A set of 5 categorical covariates, with 3 categories each, was generated for each individual given the cluster allocations, with the cluster-specific parameters shown in Supplementary Table 4. The outcome was generated from a cluster-specific Gaussian process with mean m_c and hyperparameters θ_c for each cluster c, presented in Supplementary Table 4. Regarding the model inference, we adopted the same priors as in Section 4.5.1. We ran 15 chains on the generated dataset, with 10000 iterations each, and compared the traces of α, the number of clusters (nClus) and the number of non-empty clusters (Supplementary Figure 13). The traces of these three quantities overlap across the 15 chains, and the posterior similarity matrices in Supplementary Figure 14 show good recovery of the generated clustering structure. Supplementary Figure 15 presents the generated individual longitudinal data, coloured by the clusters individuals were allocated to in the first chain. Supplementary Figure 16 presents the clusters obtained in the first chain, in terms of cluster sizes, empirical outcome means and covariate profiles. We observe that the latter correspond to the generated profiles described in Supplementary Table 4.

E.2 MVN specification
We analysed the same dataset using the MVN specification. We adopted a gamma prior for α, α ∼ Gamma(2, 1). For the covariate model, we adopted Dirichlet priors φ_{c,q} ∼ Dirichlet(1, ..., 1) for each cluster c and covariate q. For the outcome model, we set κ_0 = 0.01 and ν_0 = 11 as hyperparameters of the inverse Wishart prior. The initial number of clusters was 20. We ran 3000 iterations for burn-in and 30000 for sampling.
Figure 22: Trajectories of gene expression and covariate profiles for the 4 clusters of the final partition estimated using the MVN specification. On the left panels, trajectories of the genes in the colour associated with the cluster they are allocated to. On the right panels, cluster-specific profiles with regard to the selected transcription factors. In each cell, associated with cluster k = 1, ..., 4 (rows) and transcription factor q = 1, ..., 10 (columns), the black pip represents the empirical proportion of genes with x_q = 1 in the whole sample. The turquoise and red filled squares represent the estimated cluster-specific proportions P(x_{c,q} = 1) and P(x_{c,q} = 0), respectively. Squares are dark-filled if the 95% credibility interval of P(x_{c,q} = 1) does not contain the empirical proportion, meaning that the considered category differs significantly in the cluster compared to the whole sample: red squares are dark-filled if the 5th percentile is above the empirical mean, and blue ones if the 95th is below.
The model is then run on these simulated data using standard lognormal priors for the hyperparameters (a, l, σ²) of the GP outcome model, Dirichlet(1, ..., 1) priors for the covariate model and a Gamma(2, 1) prior for α. Below we show default output plots from PReMiuMlongi using data generated with these functions.

F.2.2 C++ back end
Within the C++ code for PReMiuMlongi, functions have been added to evaluate likelihoods for the longitudinal outcomes, together with a Metropolis-within-Gibbs step that proposes new parameter values and evaluates them through comparison of their conditional distributions, which have the form p(θ_c) · p(y^(c) | t^(c), w^(c), θ_c, β).

F.2.3 Outputs
An additional output file is written for the θ parameters, which are sampled at each iteration of the algorithm. These values are used to reconstruct the posterior distributions over g^(c), which are visualised as a function of time with 90% credible intervals shown on either side. The standard deviation is a sum of the standard deviation about the mean throughout all samples and the mean standard deviation of all samples. These plots are produced with and without the data in the background. See Figure 25 for examples.
The same procedure also forms the basis for prediction of a new individual's outcome based on their covariate data.These plots have a similar form; see Figure 26 for an example.The other graphical output, the covariate profile, is unchanged with the exception of the omission of the 'risk' plot (see Figure 27).

Figure 1: The effect of the number of observation times on the PEAR index. Points are median PEAR indices over 1000 simulations, with error bars showing the 0.05 and 0.95 quantiles. In each simulation, data are generated for 100 individuals belonging to one of two clusters. Each individual has 10 covariates and 0 to 4 time points (or MVN outcomes).
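The PEAR index used throughout these simulations is the posterior expected adjusted Rand index: the mean adjusted Rand index between partitions sampled by the MCMC and the true allocation. A minimal numpy sketch, with function names of our own choosing:

```python
import numpy as np

def adjusted_rand(a, b):
    """Adjusted Rand index between two partitions given as label vectors."""
    comb2 = lambda x: x * (x - 1) / 2.0
    a, b = np.asarray(a), np.asarray(b)
    # Contingency table of co-assignments
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    ct = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(ct, (ia, ib), 1.0)
    sum_cells = comb2(ct).sum()
    sum_rows = comb2(ct.sum(axis=1)).sum()
    sum_cols = comb2(ct.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(len(a))
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)

def pear(sampled_partitions, truth):
    """Posterior expected adjusted Rand: mean ARI over MCMC partitions."""
    return float(np.mean([adjusted_rand(p, truth) for p in sampled_partitions]))
```

An index of 1 indicates perfect recovery of the true clustering (up to relabelling); values near 0 indicate agreement no better than chance.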

Figure 2: The effect of the number of observation times on PEAR indices when there are two clustering structures in the covariate data. Top: density plots showing selection for the four covariates. Bars show the weights given to the covariate in each of 1000 simulations, where 0 means unselected and 1 selected. Darker blue indicates higher density. Covariates 3 and 4 align with the outcome data. Bottom: points are median PEAR indices over 1000 simulations, with error bars showing the 0.05 and 0.95 quantiles.

Figure 3: Simulated data. Examples of data generated with the MVN model (left) and the GP model (right). Top row: three data points across the time period 0-10; the mean gradient is -1 for the second (purple) cluster. Bottom row: five data points across the time period 0-10; the mean gradient is -0.5 for the second (purple) cluster.

Figure 4: Simulation study results and timings. The first two columns of red heatmaps show PEAR indices following inference with the MVN response model (top row) and the GP response model (bottom row). Data were generated with an MVN model (first column) and a GP model (second column). The last two columns of heatmaps show the corresponding duration, in seconds, of inference. Each square within each heatmap represents a single experimental set-up: the gradient (which indicates how separable the clusters are) varies on the y axis, and the number of observation times varies on the x axis. The colour in the first two columns of heatmaps corresponds to the average PEAR index over 1000 repetitions, where darker red indicates a higher index. In the last two columns, the colour corresponds to the average duration over 1000 repetitions, where darker blue/green indicates a longer time. PReMiuMlongi was run with 5000 iterations each for burn-in and sampling for the GP response model. For the MVN response model, the numbers of burn-in and sampling iterations were each 5000(M − 2), where M is the number of observation times.

Figure 5:

Figure 6: From left to right: traces of the DP parameter α, the number of clusters, and the number of non-empty clusters for the 7 chains.

Figure 7: Standardised gene expression for 551 genes of budding yeast observed over time, in minutes.

Figure 8: Trajectories of gene expression and covariate profiles for the 4 clusters of the final partition. On the left panels, trajectories of the genes in the color associated with the cluster they are allocated to. Dashed vertical black lines delimit the three cell cycles and dashed vertical gray ones delimit the different phases, in the first cycle: M/G1, G1, S, G2, G2/M, M and M/G1 again. On the right panels: cluster-specific profiles with regard to the 10 selected transcription factors: MCM1, MBP1, SWI5, SWI6, SWI4, NDD1, ACE2, FKH2, FKH1 and STE12. In each cell, associated with cluster k = 1, ..., 4 (represented by rows) and transcription factor q = 1, ..., 10 (represented by columns), the black pip represents the empirical proportion of genes with x_q = 1 in the whole sample. The turquoise filled square and the red one represent the estimated cluster-specific proportions P(x_{c,q} = 1) and P(x_{c,q} = 0), respectively. Squares are dark-filled if the 95% credibility interval of P(x_{c,q} = 1) does not contain the empirical proportion, meaning that the considered category differs significantly in the cluster compared to the whole sample. Red squares are dark-filled if the 5th percentile is above the empirical proportion, and blue ones are dark-filled if the 95th is below it.

Figure 10: Simulated outcome and covariate data for the second simulation study (Section 4.3). Each row represents five measurements of one person. Of the four variables, the first two do not align with the outcome (columns 2 & 3); the last two align with the outcome (columns 4 & 5).

Figure 11: Efficacy of the MVN model in recovering a true clustering structure as the number of iterations in PReMiuMlongi changes. This plot corresponds to a single square in Figure 4: the top-right square of the top-left sub-figure. Each point represents the median of 1000 instances of inference of the clustering structure given data generated as described in Section 4.3, and error bars show the 0.05 and 0.95 quantiles. The data-generating model is MVN, the response model is MVN, the number of time points is six, and the gradient of the second cluster's response mean is -1 (i.e. the responses of the two clusters are most clearly separable). Inference is trialled with 1000, 2000, 4000, 8000, 16000, 32000, 64000 and 128000 iterations in each of the burn-in and sampling phases. As the number of iterations increases, the true underlying clustering structure is better recovered, as evidenced by the increase in the PEAR index.

Figure 12: Simulation study 4: posterior similarity matrices of the 7 chains. The individuals are in the same order to facilitate visual comparison. The true cluster allocations are represented in the left column, next to each posterior similarity matrix.

Figure 13: From left to right: traces of the DP parameter α, the number of clusters, and the number of non-empty clusters for the 15 chains.

Figure 14: Additional simulation study: posterior similarity matrices of the 15 chains. The individuals are in the same order to facilitate visual comparison. The true cluster allocations are represented in the left column, next to each posterior similarity matrix.

Figure 15: Generated longitudinal trajectories for 500 individuals (thin lines), coloured by the clusters they are allocated to in the first chain, and cluster-specific estimated mean trajectories (thick lines).

Figure 16: Cluster profiles estimated by the first chain. The first column displays the cluster sizes (bottom), the cluster-specific empirical outcome means (dots in the top panel) and the empirical outcome mean in the whole sample (solid line in the top panel). Columns 2 to 6 show boxplots of the estimated probabilities for covariates 1 to 5 to equal 0 (bottom row), 1 (middle row) and 2 (top row) for each cluster. The solid lines represent the empirical proportions of individuals with covariates equal to 0, 1 or 2 in the whole sample. The boxplots are red if the 5th percentile is above the solid line, blue if the 95th is below, and green otherwise.

Figure 17: Trajectories of gene expression and covariate profiles for the 4 clusters of the final partition. On the left panels, trajectories of the genes in the color associated with the cluster they are allocated to. Dashed black lines delimit the three cell cycles and gray ones delimit the different phases, in the first cycle: M/G1, G1, S, G2, G2/M, M and M/G1 again. On the right panels: cluster-specific profiles with regard to the 80 transcription factors.

Figure 18: Trace of the concentration parameter α (left panel) and of the number of non-empty clusters (right panel) across the 10000 iterations for the three chains.

Figure 19: Posterior similarity matrices of chains 2 and 3 obtained by PReMiuMlongi on the Saccharomyces cerevisiae budding yeast data. Rows and columns are ordered in the same way for both matrices.

Figure 21: Posterior similarity matrix from the model with the MVN specification applied to the data presented in the Application section.

Figure 23: Default PReMiuMlongi output summarising outcomes for the MVN response model. There is one column per outcome, with the mean shown in the top row and the standard deviation in the bottom row. Within each panel, there is one box per cluster. Green indicates that the 90% credible interval overlaps the mean value for all individuals, represented by the solid black horizontal line. Red indicates that the 90% credible interval is above the mean value, and blue that it is below. NB: this is the same output format as is used for MVN covariates.

Figure 24: Default PReMiuMlongi output for individual outcome predictions from the MVN response model. There is one column per outcome: the top row shows box plots of the posterior distributions of the means for individuals 1 and 2, and the bottom row shows box plots of the posterior distributions of the corresponding standard deviations. Within each panel, there is one box and one colour per individual for whom a prediction is made.

Figure 25: Default PReMiuMlongi output summarising estimated cluster-specific outcome trajectories, with the outcome data on the y axis and time on the x axis. Colours identify the separate clusters. Bold lines show the posterior mean function and thin lines show the 90% credible intervals. The right panel shows the individual trajectories (thin gray lines) together with the cluster patterns.

Table 1: Hyperparameters for data generation (with

Table 2: GOTO scores associated with the biological process (bp), molecular function (mf), and cellular component (cc) ontologies, for the PReMiuMlongi and iCluster methods.

Table 3: Cross-table of the clustering structures obtained by PReMiuMlongi with the GP specification (rows) and iCluster (columns).