Bayesian Nonparametric Functional Mixture Estimation for Time-Series Data, With Application to Estimation of State Employment Totals

The U.S. Bureau of Labor Statistics (BLS) uses monthly, by-state employment totals from the Current Population Survey (CPS) as a key input to develop employment estimates for counties within the states. The monthly CPS by-state totals, however, exhibit high levels of volatility that compromise the accuracy of the resulting county-level estimates. Typically-employed models for small area estimation produce de-noised, state-level employment estimates by borrowing information over the survey months, but assume independence among the collection of by-state time series, an assumption that is typically violated due to similarities in the states' underlying economies. We construct Gaussian process and Gaussian Markov random field alternative functional prior specifications, each in a mixture of multivariate Gaussian distributions with a Dirichlet process (DP) mixing measure over the parameters of their covariance or precision matrices. Our DP mixtures of functions allow the data to simultaneously estimate dependence among the months and among the states. A feature of our models is that functions assigned to the same cluster are drawn from a distribution with the same covariance parameters, so that they are similar, but need not be identical. We compare the performances of our two alternatives on synthetic data and apply them to recover de-noised, by-state CPS employment totals for data from 2000-2013.


Introduction
The Current Population Survey (CPS), a household survey administered by the Census Bureau for the Bureau of Labor Statistics (BLS), publishes time-indexed, state-level estimates of variables relating to employment, such as the employment total. The time-indexed employment estimates express a high degree of volatility, so BLS applies independently-specified, state-level regression models to reduce the volatility before proportioning these estimates to labor market areas or counties as part of its Local Area Unemployment Statistics (LAUS) program. This approach assumes independence among the state-indexed collection of time series.
State administrators frequently express concern that the resulting county-level estimates composed from the state-level regression models are overly volatile, in that they believe much of the month-over-month movement is noise, not signal that they would be able to attribute to underlying economic drivers. So the LAUS program seeks a more accurate means to extract de-noised, state-level estimates of employment totals such that state administrators are readily able to ascribe economic drivers to the resulting county-level employment trends.
Bayesian nonparametric alternatives, such as the Gaussian process construction implemented by Vanhatalo et al. (2014), treat the collection of time series as sums of latent functions and a noise process. The structure of the model is focused on the latent functions, which are generated from a multivariate Gaussian distribution under covariance constructions that regulate the smoothness properties of the resulting functions (Rasmussen and Williams, 2006). Vanhatalo et al. (2014) specify their Gaussian process prior as independent over the states.
Modeling dependence among the set of noisy state functions under a Gaussian process prior formulation would require vectorizing the functions into a single function, or vector, of length equal to the number of states multiplied by the number of time-indexed measurement waves. This approach would be computationally prohibitive were one to sample the full joint posterior distribution, as computation time scales with the cube of the resulting vector length. The rendering of sets of functions into a single vector is also not a natural way to perform inference, because the ordering of the states may notably impact both global and local smoothness properties of the resulting vector, and this approach assumes a continuous transition between state functions.
We offer an approach that builds on the independent nonparametric Bayesian functional models by probabilistically tying the set of latent functions together under a marginal nonparametric mixture prior, where the mixing measure receives a Dirichlet process (DP) prior (Ferguson, 1973). Our modeling framework is similar to Gelfand et al. (2005), but their approach imposes the Dirichlet process prior directly on the functions, so that co-clustered functions must be exactly the same, which we find to be too restrictive for working with our CPS employment data application. We will, instead, impose the DP prior on the covariance matrix parameters of the generating Gaussian distribution, such that state functions which are co-clustered are drawn from a Gaussian distribution with the same parameters, but are not required to be exactly equal.
We formulate our nonparametric mixture approach using two alternative prior constructions to parameterize the covariance matrix of a multivariate Gaussian prior imposed on the set of latent functions: (1) a Gaussian process prior; (2) an intrinsic Gaussian Markov random field prior. These alternative formulations are developed and their properties contrasted in Section 2. Schemes to sample the posterior distributions under each of our two functional prior alternatives are sketched in Section 3, where we comment on their computational properties. We compare their performance, both in fitting the functions and in uncovering a clustering structure used to generate the functions, in Section 4. Our nonparametric mixture models are applied for estimation of state-level employment totals obtained from the CPS in Section 5, followed by a concluding discussion in Section 6.

CPS Employment Count Data
Our data are composed of T = 158 months of direct estimates of employment totals for each of N = 51 states (including the District of Columbia) published by the BLS in the Current Population Survey. Our modeling interest is focused on reducing estimation volatility to improve the precision of the monthly, state-level employment totals. The employment totals are influenced by the underlying economic conditions of the states. We, therefore, expect a correlation of trends in the mix of industrial, service and agricultural economic activities among states, on the one hand, with those expressed in their employment total time series, on the other hand. State policy makers seek context around their reported employment and unemployment levels in order to understand the drivers for the estimated trends. We will later demonstrate that performing inference on clusters of distributions generating the set of state-indexed functions provides such context.
We denote the set of (T = 158) × 1 vectors of survey direct estimates for a collection of N = 51 states with {y i } i=1,...,N . We would like to estimate de-noised, smooth functions, {f i }, from the {y i }. We estimate the dependence among the collection of state employment totals through a prior construction that permits a grouping or clustering of covariance (under a Gaussian process (GP) prior specification) or precision parameters (under an intrinsic Gaussian Markov random field (iGMRF) prior specification), that we index by state. These state-indexed covariance or precision parameters are used to generate the latent {f i }.
We first outline GP and iGMRF formulations with simpler, global covariance and precision parameters, respectively, (θ, κ), not indexed by state, to illustrate the properties of functions generated under these prior specifications. Employing a single set of covariance or precision parameters assumes the functions are exchangeably drawn. We will subsequently extend both of these models using nonparametric mixture formulations over the states for {θ i } and {κ i } under the GP and iGMRF models, respectively, to allow for dependence among the functions.

Gaussian Process Model
Our Gaussian process (GP) model is parameterized through the covariance matrix, C(θ), where θ are estimated by the data and control the trend, length scale (or wavelength) and variation in the generated functions, {f_i}. We specify a GP formulation in the following probability model:

y_i | f_i, τ ~ N_T(f_i, τ^{-1} I_T),  f_i | θ ~ N_T(0, C(θ)),  τ ~ Ga(a_τ, b_τ),   (1)

where θ = (θ_1, . . . , θ_P). The GP model specifies a formula for the covariance matrix, C(θ), with entries, for P = 3,

C_{jℓ}(θ) = θ_1^{-1} (1 + (t_j − t_ℓ)^2 / (θ_3 θ_2))^{−θ_3}.

The particular covariance formula we use is known as the rational quadratic, which may be derived as a scale mixture of the more commonly-used squared exponential kernels, θ_1^{-1} exp(−(t_j − t_ℓ)^2 / θ), with the inverse of the length scale parameter, θ^{-1}, which controls function periodicity, distributed under a Gamma distribution with shape θ_3 and rate θ_3 θ_2 (Rasmussen and Williams, 2006). The vertical magnitude of a function drawn under the rational quadratic covariance formula is controlled by θ_1, while θ_2 controls the mean length scale or period of the function, and θ_3 controls smooth deviations from θ_2. As θ_3 ↑ ∞, this formulation converges to a single squared exponential kernel with length scale θ_2. Figure 1 displays a randomly-generated function under each of 3 distinct value settings for θ = (θ_1, θ_2, θ_3) of the rational quadratic covariance formula to provide insight into how θ produces distinct features in f. Our choice of the rational quadratic covariance formulation is intended as a parsimonious specification that uses a single covariance matrix able to discover multiple length scales or wavelengths in a function, in lieu of using a set of squared exponential covariance terms (e.g. in an additive formulation) with each term specialized to a particular length scale.
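As a concrete sketch, the rational quadratic covariance and a single function draw can be simulated as follows; the exact parameterization below, chosen so that θ_3 → ∞ recovers a squared exponential with length scale θ_2, is our assumption rather than the paper's verbatim formula:

```python
import numpy as np

def rational_quadratic(t, theta):
    """Rational quadratic covariance over time points t.

    theta = (theta1, theta2, theta3): theta1 sets the vertical scale,
    theta2 the mean (squared) length scale, and theta3 the degree of
    mixing over length scales; theta3 -> infinity recovers a squared
    exponential kernel with length scale theta2 (assumed parameterization).
    """
    theta1, theta2, theta3 = theta
    r2 = (t[:, None] - t[None, :]) ** 2          # squared time lags
    return (1.0 / theta1) * (1.0 + r2 / (theta3 * theta2)) ** (-theta3)

t = np.arange(158, dtype=float)
C = rational_quadratic(t, (1.0, 400.0, 2.0))
# Draw one smooth latent function; the jitter stabilizes the Cholesky factor.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(C + 1e-8 * np.eye(len(t)))
f = L @ rng.standard_normal(len(t))
```

Varying theta3 between small values and very large ones interpolates between a heavy mixture of length scales and a single squared exponential scale.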
The rational quadratic formula also produces smooth functions, {f_i}, because they are constrained to be differentiable at all orders. We prefer covariance formulas that render smooth functions, {f_i}, because the smoothness property more identifiably separates signal from rough, non-differentiable noise. A primary goal for this work is to extract de-noised functions from noisy time series. The GP of Equation 1 under the rational quadratic covariance formula is very flexible, allowing the estimated functions, {f_i}, to be non-linear of any order within the space of smooth functions. This GP is equivalent to an infinite-dimensional polynomial spline, even though the parameterization of the GP is through the covariance matrix, rather than the mean. A linear mean model may be re-parameterized as a zero-mean GP by marginalizing over the regression coefficients, which demonstrates that linear surfaces are a subset of the space of smooth functions parameterized in Equation 1 (Neal, 2000a). The structure of Equation 1 is contained in f_i, so that the {y_i} are assumed to be conditionally independent, given {f_i}. Standard errors for the monthly CPS direct estimates were not available to us. We treat the model noise precision, τ, as unknown, so that it is estimated by the data, because this parameter influences the amount of signal, represented by f, extracted from the noisy response, y, and is a feature of the GP model. Our probability model in Equation 1, which specifies a Gamma prior for τ, is fully identified.

Intrinsic Gaussian Markov Random Field Model
An iGMRF prior may be viewed as a probabilistic local smoother composed from differences in function values, which in our case are indexed by time. A typical iGMRF specification is placed on the first difference approximation to the first derivative,

f_{i,j} − f_{i,j−1} | κ ~ N(0, κ^{-1}),  j = 2, . . . , T,

where κ is the earlier-discussed precision parameter that determines the (vertical) scale of variation among the first differences. This approach uses nearest neighbors defined from first differences, (f_{i,j−1}, f_{i,j+1}), to encode the time-indexed dependence structure among the f_{ij}, j = 1, . . . , T, for each domain, i ∈ (1, . . . , N). Using first differences may produce an excessively rough (though continuous) surface that over-fits the data (by making it difficult to separately model signal and noise processes), so we employ a prior construction based on the second difference approximation to the second derivative,

f_{i,j} − 2f_{i,j+1} + f_{i,j+2} | κ ~ N(0, κ^{-1}),  j = 1, . . . , T − 2.   (2)

This prior on second differences produces the joint distribution,

π(f_i | κ) ∝ κ^{(T−2)/2} exp(−(κ/2) f_i' Q f_i),

with R = κQ a band-diagonal precision matrix with non-zero entries, (Q_{j,j−2}, Q_{j,j−1}, Q_{j,j}, Q_{j,j+1}, Q_{j,j+2}) = (1, −4, 6, −4, 1), for non-boundary parameters, 2 < j < T − 1 (Rue and Held, 2005). Under this construction, R is rank-deficient (of rank T − 2) as the rows sum to 0, since it is composed from second differences, so that the joint distribution is improper; in particular, the prior for the T × 1, f_i, is invariant to the addition of any first order (linear) polynomial because the prior supplies no information about such polynomials. We may view the joint distribution as the product of a proper distribution on the space of T − 2 differences (by employing the Moore-Penrose pseudo-inverse, R^−, and the generalized determinant computed as the product of the T − 2 non-zero eigenvalues of R) and an improper, noninformative prior on the linear polynomials.
These Gaussian Markov random field priors specified through a precision matrix have the property that f_{ij} ⊥ f_{ik} | f_{i,−jk} ⇔ R_{jk} = 0, which allows for a parsimonious Q, from which we specify a proper set of full conditional distributions that we use for posterior sampling,

f_{ij} | f_{i,−j}, κ ~ N(−(1/Q_{jj}) Σ_{k: k ∼ j} Q_{jk} f_{ik}, (κ Q_{jj})^{-1}),

where {k : k ∼ j} denotes the set of neighboring time points, {k}, of time, j. The prior mean for each f_{ij} is composed as a weighted average of its order 2 nearest neighbors. Rue and Held (2005) refer to this construction as a random walk prior of order 2, or RW2(κ). One may equivalently parameterize, R = κ(D − ρΩ), for diagonal weight matrix, D, with (diagonal) entries equal to those of Q. Element, j, of D counts the number of neighbors of time, j, where the function value at a time point with more neighbors will have a relatively higher precision. The adjacency matrix, Ω, is specified with 0's on the diagonal, and its off-diagonal elements are filled with the negatives of the off-diagonal elements of Q and encode adjacency relationships among the time points (Banerjee et al., 2003). Under this parameterization, −1 < ρ < 1 is the autocorrelation parameter that shrinks the mean to 0, but produces a proper joint prior for f_i. The RW2(κ) prior construction, however, sets ρ = 1. The GP construction of Equation 1 includes parameters, (θ_2, θ_3), for the length scale that are estimated from the data, while the iGMRF prior hard-codes the length scale in Q, suggesting more estimation flexibility for the GP, which we assess in Section 4.
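The banded stencil and rank deficiency of the RW2 structure matrix can be verified numerically. The sketch below builds Q from the second-difference operator; this is the standard construction, not code from the growfunctions package:

```python
import numpy as np

T = 10
# Second-difference operator: row j computes f[j] - 2 f[j+1] + f[j+2].
D2 = np.zeros((T - 2, T))
for j in range(T - 2):
    D2[j, j:j + 3] = [1.0, -2.0, 1.0]

# RW2 structure matrix; the full precision is R = kappa * Q.
Q = D2.T @ D2

# Interior rows carry the banded stencil (1, -4, 6, -4, 1).
print(Q[4, 2:7])                    # [ 1. -4.  6. -4.  1.]
# Q has rank T - 2: its rows sum to zero and it annihilates linear trends.
print(np.linalg.matrix_rank(Q))     # 8
```

Because Q annihilates constants and linear trends, the implied prior is improper exactly as described above, and a proper density exists only on the space of the T − 2 second differences.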

Accounting for Dependence Among Functions
We introduce an extension of Equation 1, which indexes the GP covariance function parameters, {θ_i = (θ_{i1}, . . . , θ_{iP})}, by state, i ∈ (1, . . . , N), to permit their probabilistic clustering with,

f_i | θ_i ~ N_T(0, C(θ_i)),  θ_i | G ~ G,  G ~ DP(α, G_0),   (3)

where {θ_i}_{i=1,...,N} receive a random distribution prior, G, drawn from a Dirichlet process (DP), specified with a concentration parameter, α, a precision parameter that controls the amount of variation in G around the prior mean, G_0. The base or mean distribution, G_0, is typically constructed as parametric; in our case, G_0 = ∏_{p=1}^{P} Ga(a_p, b_p), and G is the mixing measure over the P × 1 covariance parameters, θ. We examine the clustering property of the DP by expressing it in the discrete, stick-breaking form of Sethuraman (1994), a countably infinite mixture of weighted point masses,

G = Σ_{h=1}^{∞} p_h δ_{θ*_h},   (4)

where the "locations", θ*_1, . . . , θ*_M, index the unique values for the {θ_i}, with M ≤ N (states), which we interpret as clusters. The maximum number of clusters assigns each state to its own cluster, and this maximum countably increases with N. We record state cluster memberships with s = (s_1, . . . , s_N), where s_i = m denotes assignment of state i to cluster m; (s, {θ*_m}) provides an equivalent parameterization to {θ_i}, and we recover θ_i = θ*_{s_i}. We conduct posterior sampling with the cluster locations and assignments, rather than directly sampling {θ_i}, as the former produces notably better mixing because it separates reassignments to clusters from updates to the location values. The weight, p_h ∈ (0, 1), is composed as p_h = v_h ∏_{ℓ<h} (1 − v_ℓ), where v_h is drawn from the beta distribution, Be(1, α). This construction provides a prior penalty on the number of mixture components. A higher value for α, however, will generate smaller values for {v_h}, and hence "breaks" off more clusters (with unique locations) from the unit stick. We place a further prior, α ~ Ga(1, 1), to allow posterior updating, in recognition of the relatively strong influence α conveys on the number of clusters formed (Escobar and West, 1995).
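A minimal simulation of the stick-breaking construction illustrates how α governs the number of clusters broken from the unit stick; the function and variable names below are our own:

```python
import numpy as np

def stick_breaking(alpha, H, rng):
    """Truncated stick-breaking draw of DP weights p_1, ..., p_H.

    v_h ~ Be(1, alpha); p_h = v_h * prod_{l<h} (1 - v_l).
    """
    v = rng.beta(1.0, alpha, size=H)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining

def effective_clusters(p, tol=0.01):
    """Count sticks carrying non-negligible prior weight."""
    return int(np.sum(p > tol))

rng = np.random.default_rng(1)
p_small = stick_breaking(alpha=0.5, H=200, rng=rng)   # few large sticks
p_large = stick_breaking(alpha=5.0, H=200, rng=rng)   # many smaller sticks
```

With H = 200 sticks the truncation error is negligible for either α, and repeated draws show that larger α spreads prior mass over more clusters.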
The DP construction assumes exchangeability of the (θ i ), a priori, given random measure, G, but the almost surely discrete construction of the DP produces estimates which are not exchangeable, a posteriori. The prior specification for cluster assignments, (s i |s −i ), (under our re-parameterization to {s, (θ * m ) m=1,...,M } achieved by marginalizing over G), induces a uniform probability for co-clustering among the states.
If the likelihood values for two states, j and k, N_T(y_j | 0, C_τ(θ*_s)) and N_T(y_k | 0, C_τ(θ*_s)), are both relatively high for assignment to cluster, s_j = s_k = s, due to underlying similarities in their economies, or due to other factors related to their geographic closeness, then the posterior probability for co-clustering states j and k will be relatively high (versus uniform, a priori). This posterior estimation mechanism conducts unsupervised (probabilistic) clustering. Sharper (lower posterior variance) estimates over the space of clusterings may be obtained, particularly when the number of observations is higher (than our N = 51), by indexing either the weights or locations in Equation 4 with predictors, in lieu of our unsupervised formulation.
An analogous extension is specified from Equation 2 with,

π(f_i | κ_i) ∝ κ_i^{(T−2)/2} exp(−(κ_i/2) f_i' Q f_i),  κ_i | G ~ G,  G ~ DP(α, G_0),   (5)

for states i = 1, . . . , N. We specify the base distribution, G_0 = Ga(1, 0.1), (which generates the locations, {κ*_m}), as weakly informative with a large variance. We select these hyperparameter settings because they produce a larger prior variance for κ in Equation 5 than for Θ under Equation 3. The parameterization of the precision matrix of each function using only a single parameter (κ) makes the posterior sensitive to the prior specification. As with Equation 3, we sample the {κ_i} indirectly through cluster assignments, s, for the N × 1 states, and location values, {κ*_m}. The stick-breaking formulation illustrated in Equation 4 to induce the unknown measure, G, may be easily generalized beyond the DP by varying the form of the beta prior assigned to v_h; for example, the Poisson-Dirichlet (PD) alternative of Pitman and Yor (1997) includes an additional "discount" parameter, δ, along with the DP concentration parameter, α, to specify the hyperparameters of the beta distribution for v_h, and tends to generate more locations, a priori, inducing a less informative prior for the number of clusters (which is random under both the DP and PD constructions). Large clusters, however, receive more prior weight under the PD than under the DP. The DP may be extracted as a special case of the PD, and both may be located within a larger class of generalized gamma processes (Lijoi et al., 2007). The PD and DP turn out to produce nearly identical posterior results under our data application, due to a high level of information in the data, so we focus our exposition on the simpler model.

Computation
The mixtures of Gaussian processes formulation of Equation 3 is far more computationally-intensive than the mixtures of iGMRFs model presented in Equation 5 because computing the Cholesky decomposition of the T × T covariance matrix is O(T^3). We report computation time comparisons in Section 4, where we will, however, also demonstrate that the mixtures of GPs is more robust in discovering the correct number of generating clusters than is the mixtures of iGMRFs. So we are motivated to mitigate the computational burden associated with drawing posterior samples under a GP prior on the functions. A typically-used approach to improve computation for a GP probability model employs a sparse approximation to the GP that utilizes some subset of the time points as "inducing" inputs (Vanhatalo et al., 2014). We expect the functions, {f_i}, for our CPS data to each express multiple length scales because our estimation period encompasses over one hundred months, such that a sparse approximation may fail to capture key features. So we adapt a recently developed posterior sampling algorithm of Wang and Neal (2013) to our mixtures of GPs that generates a sequence of proposals for the locations, {θ*_m}, in a lower-dimensional space, using a subset of the T time points to build the GP covariance matrix, C, which reduces computation time. Yet, sample draws under this method will be from the exact posterior distribution, rather than from a sparse approximation.
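To make the O(T^3) bottleneck and the subset-based coarse evaluations concrete, the sketch below computes an exact zero-mean GP marginal log likelihood via a Cholesky factor, once on the full T = 158 grid and once on a 60-point subset; the squared exponential kernel here is a stand-in assumption, not the paper's rational quadratic:

```python
import numpy as np

def gp_loglik(y, t, theta, tau):
    """Exact zero-mean GP marginal log likelihood, O(len(t)^3) via Cholesky.

    Stand-in squared exponential kernel theta[0] * exp(-r^2 / theta[1]),
    plus noise variance 1/tau on the diagonal (assumed toy model).
    """
    r2 = (t[:, None] - t[None, :]) ** 2
    K = theta[0] * np.exp(-r2 / theta[1]) + (1.0 / tau) * np.eye(len(t))
    L = np.linalg.cholesky(K)                 # the O(T^3) step
    z = np.linalg.solve(L, y)
    return (-0.5 * z @ z - np.sum(np.log(np.diag(L)))
            - 0.5 * len(t) * np.log(2.0 * np.pi))

T = 158
t = np.arange(T, dtype=float)
rng = np.random.default_rng(2)
y = rng.standard_normal(T)

# Coarse, cheap evaluation on a subset of time points, of the kind used
# to build the tempered proposal distributions.
idx = np.linspace(0, T - 1, 60).astype(int)
ll_full = gp_loglik(y, t, (1.0, 50.0), tau=4.0)
ll_coarse = gp_loglik(y[idx], t[idx], (1.0, 50.0), tau=4.0)
```

The coarse evaluation costs roughly (60/158)^3, about 5%, of the full Cholesky, which is the source of the speed-up exploited by the sampler described in the Appendix.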
Please see Appendix A for an exposition of the posterior sampling algorithms used for both the mixtures of GPs and iGMRFs models, which we have implemented in the growfunctions package for R Core Team (2014), using C++ to speed computation. The package is fully documented and available on CRAN for downloading and includes the data used in this paper.

Simulation Study
We conduct a simulation study with two goals: 1) assess and compare the accuracy of estimating de-noised functions between our GP and iGMRF formulations; 2) assess and compare the accuracy of detecting clusters or sub-groups of de-noised functions. We generate each latent synthetic function, f (from which we render the observed time series, y), under different parameterizations than either of our models, as a means to assess the relative adaptability and robustness of our estimation models to real-world data generating processes.
Our first simulation focuses on generating relatively complicated surfaces from a mixture of K = 2 length scales specific to each of M = 3 clusters. We randomly (and disjointly) allocate N = 100 domains to M = 3 clusters for T = 158 time points, and generate de-noised functions, {f_i}, in each cluster using the addition of two squared exponential covariance terms, where each squared exponential term is given by,

C_{jℓ}(k, m) = θ*_{k,m,1} exp(−(t_j − t_ℓ)^2 / θ*_{k,m,2}),

where (j, ℓ) is the cell indicator in the T × T covariance matrix, C. Term indicator, k ∈ {1, 2}, denotes the covariance term with associated locations, {θ*_{k,m,1}, θ*_{k,m,2}}, for a squared exponential covariance (which has two location parameters). Label (k, m, 1) indicates a location parameter that controls the vertical scale for covariance term k in cluster m, whereas (k, m, 2) denotes a location parameter that controls the length scale or degree of smoothness. We recall that the vector, s = (s_1, . . . , s_N), s_i ∈ (1, . . . , M), assigns domains to clusters, so that the parameter controlling the vertical scale in covariance term, k = 1, for domain i is θ*_{1,s_i,1}. Examining the length scale location parameters, θ*_{·,·,2}, reveals that the surfaces in each cluster are formulated from the sum of a long length-scale or smooth term and a short length-scale or rough term, so that each surface is drawn from a mixture of two scales. The long length-scale locations were generated from θ*_{·,·,2} ~ Ga(3, 2), while the short length-scale locations were drawn from θ*_{·,·,2} ~ Ga(2, 5). The vertical scale parameters for both terms were drawn from θ*_{·,·,1} ~ Ga(3, 3).
The resulting observed surface for domain i is y_i ~ N_T(f_i, τ^{-1} I_T), where the global noise precision parameter, τ, is set to produce a 20% noise-to-signal ratio (based on the average variance of the {f_i}).
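The generating process for this first simulation can be sketched as follows; reading the Gamma parameters as shape-rate pairs and normalizing time to the unit interval are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, M = 100, 158, 3
t = np.linspace(0.0, 1.0, T)          # time normalized to the unit interval
s = rng.integers(0, M, size=N)        # random cluster assignments

def se_term(scale, length):
    """Squared exponential covariance term: scale * exp(-(tj - tl)^2 / length)."""
    r2 = (t[:, None] - t[None, :]) ** 2
    return scale * np.exp(-r2 / length)

f = np.empty((N, T))
for m in range(M):
    # One long (smooth) and one short (rough) length scale per cluster;
    # Ga(3,2), Ga(2,5), Ga(3,3) read as shape-rate pairs (assumption).
    C = (se_term(rng.gamma(3.0, 1.0 / 3.0), rng.gamma(3.0, 1.0 / 2.0))
         + se_term(rng.gamma(3.0, 1.0 / 3.0), rng.gamma(2.0, 1.0 / 5.0)))
    L = np.linalg.cholesky(C + 1e-6 * np.eye(T))
    for i in np.where(s == m)[0]:
        f[i] = L @ rng.standard_normal(T)

# Set the noise precision tau for a 20% noise-to-signal ratio.
tau = 1.0 / (0.20 * f.var(axis=1).mean())
y = f + rng.normal(scale=tau ** -0.5, size=(N, T))
```

All functions in a cluster share one covariance matrix but are distinct draws from it, mirroring the paper's "similar but not identical" co-clustered functions.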
The first of our two primary inferential goals is to uncover the de-noised latent functions, {f_i}, so we want to avoid over-fitting the observed noisy surfaces. We perform estimation after randomly setting 10% of the N × T observations to missing, as presented to the estimation models, to use them as a test set. The GP and iGMRF models will estimate function values, {f_ij}, for each of these missing values, and we compute a mean-squared prediction error (MSPE) based on the squared difference between the estimated and true function values for the test set. We then normalize the MSPE by the variance of the test-set latent functions to produce a normalized MSPE. The figure comparing the two models on this measure shows that the GP fits notably better, precisely because it avoids over-fitting.
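The normalized MSPE computation can be sketched as follows, with hypothetical toy inputs standing in for the model output:

```python
import numpy as np

def normalized_mspe(f_hat, f_true, test_mask):
    """MSPE over held-out cells, normalized by the variance of the
    true latent function values on the test set."""
    err = (f_hat[test_mask] - f_true[test_mask]) ** 2
    return err.mean() / f_true[test_mask].var()

rng = np.random.default_rng(4)
f_true = rng.standard_normal((5, 50))          # toy latent functions
test_mask = rng.random((5, 50)) < 0.10         # ~10% of cells held out
f_hat = f_true + 0.1 * rng.standard_normal((5, 50))   # toy estimates
score = normalized_mspe(f_hat, f_true, test_mask)
```

A score near 0 indicates accurate recovery of the held-out latent values, while a score near 1 means the estimates do no better than predicting the test-set mean.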
Our second inferential goal is to utilize the clustering to provide context about how the functions compare among the observed domains (e.g. geographic or industry labels). Both the GP and iGMRF models employ a DP mixture formulation under which a clustering is generated on each posterior sampling iteration that assigns the covariance or precision parameters, respectively, to clusters. Both the number of clusters, M , and the allocations of domains to those clusters, may change on each posterior sampling iteration, so that these samples, taken together, formulate draws from the posterior distribution over the space of clusterings (or partitions). The relative concentration of this distribution may be assessed by examining the degree of similarity in number of clusters and assignment of the domains to those clusters among the posterior draws.
We select one clustering of domains from among the MCMC draws by using the least-squares algorithm of Dahl (2006), which builds an N × N matrix of pairwise clustering probabilities to summarize the posterior draws and selects the sampled clustering that is "closest" to this matrix. We see that the GP model well reproduces the generating clusters, with a mis-clustering error of only 2%. The iGMRF model discovers the 3 clusters, but also creates echoes of the third cluster, producing a total of 5 clusters and a resulting mis-clustering error of 20%. Perhaps we expect the GP model with a rational quadratic covariance formulation to perform relatively better than the iGMRF model, since the rational quadratic is designed for estimation under varying scales. So our next simulation procedure generates latent functions from a model closer to the iGMRF. Recall that the T × T precision matrix of a GMRF may be expressed as, R_i = κ*_{s_i}(D − ρΩ), where we set the strength-of-correlation parameter, ρ = 1 (which produces a rank-deficient precision matrix), to produce our iGMRF formulation.
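A compact sketch of Dahl's least-squares selection over MCMC partition draws follows; the toy draws and function names are our own, not the growfunctions implementation:

```python
import numpy as np

def dahl_least_squares(partitions):
    """Select a representative clustering from MCMC partition draws.

    partitions: (S, N) integer array; row s assigns each of N items to a
    cluster. Builds the N x N posterior pairwise co-clustering probability
    matrix and returns the sampled draw minimizing squared distance to it
    (Dahl, 2006).
    """
    S, N = partitions.shape
    pihat = np.zeros((N, N))
    for s in range(S):
        pihat += (partitions[s][:, None] == partitions[s][None, :])
    pihat /= S
    losses = [
        np.sum(((partitions[s][:, None] == partitions[s][None, :]) - pihat) ** 2)
        for s in range(S)
    ]
    return partitions[int(np.argmin(losses))]

draws = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 1, 1, 1],
                  [0, 0, 1, 1]])
best = dahl_least_squares(draws)    # -> [0, 0, 1, 1]
```

Because the selected partition is itself one of the posterior draws, it is guaranteed to be a clustering the model actually visited, unlike a thresholded summary of the pairwise matrix.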
If we set 0 < ρ < 1, we recover a full rank precision matrix, but at the expense of shrinking the conditional mean values to 0. We draw from this construction to generate our next synthetic dataset, under a second order specification for {D, Ω}, as described in Section 2.3, using the same N, T, M as for the first simulation. Locations, {κ*_m}_{m=1,...,3}, are generated from a Ga(1, 1). As before, we set τ to produce a 20% noise-to-signal ratio. The resulting functions will be similar to those that would be generated under an iGMRF. Figure 4 demonstrates that the GP formulation (under the rational quadratic covariance formulation) tends to produce a smoother fit, making it easier to separate signal from noise. The fit performances are very similar between the GP and iGMRF models, with normalized MSPE values of 0.157 and 0.159, respectively. Even though both models deliver similar fit accuracy, Figure 5 reveals that the GP formulation continues to do a better job of discovering the true clustering.
Again, both models discover the framework of the true clustering, though the iGMRF model tends to echo one of the clusters. The mis-clustering rates here are 0% and 12% for the GP and iGMRF models, respectively, indicating that they both do a good job, but the GP model does well even when the underlying true functions are generated from a different process than a GP. The GP employs 3 parameters in its covariance formula, giving it the flexibility both to detect the large vertical scale differences that differentiate the clusters and to model length scale variations within a function. The iGMRF does better here than under the first simulation precisely because the clusters are primarily differentiated by (vertical) scale.
We conclude our simulation study by estimating the GP formulation of Equation 1, where we exclude the modeling of clusters among {f i }. We seek to examine the out-of-sample fit performance when we ignore the dependence of the time-indexed collection of functions among the domains. We compare MSPE fit performance between including and excluding the clus-

CPS Employment Application
We return to our motivating application for which we focus on reducing the noise-induced volatility in the T = 158 month series of employment totals for N = 51 states (including the District of Columbia) reported from the CPS. Each state's observed employment series is standardized to remove the overall magnitude in order to facilitate comparisons across states based on similarities in the shapes and patterns expressed. States seek context for changes in employment totals as they filter down to the county-level estimates, so we will perform inference on the distribution over clusterings to understand which states express similarities in their employment patterns.
We fit the GP mixture model of Equation 3 to the CPS data. Figure 6 presents the allocation of the estimated latent functions, {f_i}, into the assigned clusters of the clustering selected under the least squares criterion of Dahl (2006), which chooses the posterior draw closest to the matrix of pairwise clustering probabilities. Table 1 lists the 38 states assigned to the left-hand cluster of Figure 6 and the 13 states assigned to the right-hand cluster. Each panel in Figure 7 presents

Discussion
Our task was to estimate latent, state-indexed functions that separate away noise-induced volatility present in our observed CPS data to improve the efficiency and interpretability of resulting county-level employment statistics constructed from the state estimates. We developed nonparametric mixture formulations that simultaneously estimate the latent, state-indexed functions and allow the data to discover a general dependence structure among them. Our simulation study results demonstrated that failing to account for a dependence structure among states in the estimation model deteriorates the ability to uncover the true latent functions.
Our DP mixtures of GPs or iGMRFs, outlined in Equations 3 and 5, employ an unsupervised clustering formulation. Our methodology may be extended by employing domain-level predictors to help determine the distribution over clusterings, with an approach such as the probit stick-breaking process of Rodríguez and Dunson (2011). This approach would replace the single unknown measure, G,

A.1 Gaussian Process Sequential Scan
We sample the parameters of the mixtures of GPs model, λ = (Θ*, s, α, τ), in a sequential scan from the full conditionals, after marginalizing over the latent, de-noised functions, {f_i}, which is easy to do and leads to slightly better mixing. We must, however, co-sample the {f_i} in the case of intermittent missingness-at-random in {y_i}, in order to allow the sampling of missing values from their posterior predictive distribution. We first outline our sequential scan in the case where we marginalize out the latent functions and conclude this section by highlighting the changes required in the case where we co-sample them. The mixtures of GPs model specification is highly non-conjugate, so we carefully configure blocks of parameters to permit application of posterior sampling approaches designed to produce robust chain mixing under non-conjugate specifications.
We adapt a Metropolis-Hastings algorithm of Wang and Neal (2013), designed to speed computation for sampling each θ*_{pm} (and also τ), by introducing a lower-dimensional temporary space where the likelihood (e.g. the T × T Gaussian process covariance matrix, C) is approximated using a subset of the T time points. Starting with the previously sampled value, x_0, (for θ*_{pm} or τ) from the full-dimensional or exact space, the sampling algorithm builds a sequence of proposed transitions by first "stepping up" in the temporary space using increasingly coarse, "tempered" transitions, Ẑ_1, Ẑ_2, . . . , Ẑ_n, that generate computationally-fast approximations for C. These approximations use fewer observations in each step; for example, we apply n = 2 transitions in the lower-dimensional space and use 100 of the T = 158 time points to formulate Ẑ_1 and then 60 time points for Ẑ_2. This sequence of coarser transitions is followed by transition steps that employ progressively finer distributions (in reverse order), Ž_2, Ž_1, that "step down" to guide the chain back towards the full-dimensional space, until we conclude the sequence by proposing x̌_0 from Ž_1 to be evaluated in the full-dimensional space. We use the univariate slice sampling of Neal (2000b) to accomplish these (reversible) transitions in the lower-dimensional space. The proposal (reversible sequence of steps) is then accepted with probability of move (for n = 2 tempered transitions),

min{1, [π_1(x̂_0)/π_0(x̂_0)] × [π_2(x̂_1)/π_1(x̂_1)] × [π_1(x̌_2)/π_2(x̌_2)] × [π_0(x̌_1)/π_1(x̌_1)]},

where "x" denotes the proposals, and "π" the posterior kernel evaluations, for each θ*_{pm}, with subscript 0 pertaining to the exact space and (1, 2) to the successively coarser transition distributions in the temporary space, in the sequential order of application. If our lower-dimensional approximations are relatively good, this approach will speed chain convergence by producing draws of lower autocorrelation, since each proposal includes a sequence of moves generated in the temporary space.
Since the moves in the temporary space are executed with fast approximations, Wang and Neal (2013) show that this algorithm has the potential to substantially reduce computation time, as compared to the usual Metropolis-Hastings algorithm, for drawing an equivalent effective sample size.
The probability of move formulation evaluates the proposals on the full space of size T , however, so that the resulting sampled draws are from the exact posterior distribution, rather than from a sparse approximation.
Our adaptation configures the Wang and Neal (2013) algorithm to apply to clusters of Gaussian processes (rather than to just a single GP) by exploiting the between-cluster posterior independence of the {θ*_pm}.
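The tempered-transition mechanics can be sketched as below. This is a generic, minimal version under stated simplifications: random-walk Metropolis moves stand in for the paper's univariate slice sampler, and the π_k are supplied as arbitrary log-posterior callables (π_1, π_2 would be built from, e.g., 100- and 60-point subsets of the time series):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_step(x, logp, step=0.5):
    """One random-walk Metropolis step leaving exp(logp) invariant."""
    prop = x + step * rng.standard_normal()
    if np.log(rng.uniform()) < logp(prop) - logp(x):
        return prop
    return x

def tempered_transition(x0, log_posts):
    """One tempered-transition proposal with n = 2 approximation levels.

    log_posts = [pi_0, pi_1, pi_2]: pi_0 is the exact log posterior kernel
    (all T time points); pi_1, pi_2 are increasingly coarse subset
    approximations (an illustrative simplification of Wang and Neal 2013).
    """
    x, log_ratio = x0, 0.0
    for k in (1, 2):                       # "step up" through coarser levels
        log_ratio += log_posts[k](x) - log_posts[k - 1](x)
        x = metropolis_step(x, log_posts[k])
    for k in (2, 1):                       # "step down" back to the exact space
        x = metropolis_step(x, log_posts[k])
        log_ratio += log_posts[k - 1](x) - log_posts[k](x)
    accepted = np.log(rng.uniform()) < log_ratio
    return (x if accepted else x0), accepted
```

When the coarse kernels track the exact posterior closely, the accumulated log ratio stays near zero and the long composite moves are usually accepted, which is the source of the reduced autocorrelation.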
2. Sample cluster assignments, s = (s_1, . . . , s_N): We marginalize over G in Equation 3, which results in the Pólya urn representation of Blackwell and MacQueen (1973), to sample s from its full conditionals, where n_{−i,s} = Σ_{i′≠i} 1(s_{i′} = s) is the number of states, excluding state i, assigned to cluster s, so that states are assigned to an existing cluster with probability proportional to its "popularity". The posterior assigns a state (through s_i) to a new cluster with probability proportional to α d_0 under the mixture prior in the case of a conjugate formulation. The conjugate specification requires the likelihood to be integrable in closed form with respect to the base distribution, G_0, to compute d_0 = ∫ N(y|θ, . . .) G_0(dθ), which is not the case under a (non-conjugate) GP construction. So we employ the auxiliary Gibbs sampler formulation of Neal (2000a) and sample c* ∈ ℕ locations from the base distribution, G_0, ahead of any assigned observations (i.e. not yet linked to any state), to define h = M^− + c* candidate clusters in an augmented space, where M^− denotes the number of occupied clusters after excluding state i. We then draw s_i from this augmented space, where any location not assigned states (over a set of draws for s) is dropped. This procedure effectively performs a Monte Carlo integration of the likelihood with respect to the base distribution. See Savitsky and Vannucci (2010) for a detailed example of a DP implementation on a GP. The larger the tuning parameter, c*, the lower the autocorrelation of the resulting posterior draws (though computation time increases because we are sampling with respect to more cluster locations). Good mixing is typically achieved with c* set equal to 2 or 3. We note that the number of clusters, M, is random and changes over the draws of s. 3. Sample DP concentration parameter, α: We use the method of Escobar and West (1995), which formulates the posterior for α as a mixture of two Gamma distributions with the mixture component, η, drawn from a beta distribution.
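The auxiliary Gibbs step can be sketched as follows. This is a generic implementation of Neal's (2000a) auxiliary-variable scheme; the names `sample_assignment`, `log_lik`, and `base_draw` are illustrative, and in our setting `log_lik` would wrap the GP marginal likelihood of series i under cluster parameters θ*_m:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_assignment(i, s, theta, log_lik, base_draw, alpha, c_star=2):
    """Resample s[i] over existing clusters plus c_star auxiliary candidates.

    Existing clusters are weighted by popularity n_{-i,m} times the likelihood;
    each of the c_star fresh draws from G_0 gets prior weight alpha / c_star,
    effecting a Monte Carlo integration of the likelihood against G_0.
    """
    others = [s[j] for j in range(len(s)) if j != i]
    labels = sorted(set(others))
    log_w, cands = [], []
    for m in labels:
        n_m = others.count(m)                       # n_{-i,m}: cluster popularity
        log_w.append(np.log(n_m) + log_lik(i, theta[m]))
        cands.append((m, theta[m]))
    for _ in range(c_star):                         # auxiliary (empty) clusters
        th = base_draw()                            # fresh location from G_0
        log_w.append(np.log(alpha / c_star) + log_lik(i, th))
        cands.append((None, th))
    log_w = np.array(log_w)
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    m, th = cands[rng.choice(len(cands), p=p)]
    if m is None:                                   # opened a new cluster
        m = (max(labels) + 1) if labels else 0
        theta[m] = th
    s[i] = m
    return s, theta
```

Unused auxiliary locations are simply discarded after the draw, matching the description above.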
The algorithm is facilitated by the conditional independence of α from the data, Y, given the cluster assignments, s (which imply the number of clusters, M).
4. Draw f_i as a post-processing step from its posterior predictive distribution: Use the joint Gaussian distribution for (f_i, y_i) to formulate the conditional posterior predictive Gaussian density, f_i | y_i, θ*_{s_i}, τ ~ N_T(C_{θ*_{s_i}}(C_{θ*_{s_i}} + τ^{−1} I_T)^{−1} y_i, C_{θ*_{s_i}} − C_{θ*_{s_i}}(C_{θ*_{s_i}} + τ^{−1} I_T)^{−1} C_{θ*_{s_i}}). 5. In the case of missing data values, we co-sample {f_i} (rather than marginalizing over them) after the cluster locations, in a Gibbs scan from f_i | y_i, θ*_{s_i}, τ ~ N_T(φ_i^{−1} e_i, φ_i^{−1}), where the T × 1 vector e_i = τ y_i and φ_i = τ I_T + C_{θ*_{s_i}}^{−1}. The model noise precision, τ, is then drawn from its Gamma full conditional with shape hyperparameter a_1 = 0.5TN + a and rate hyperparameter b_1 = 0.5 Σ_{i=1}^{N} (y_i − f_i)′(y_i − f_i) + b, where (a, b) are the prior hyperparameter settings. Lastly, we replace y_i with f_i under co-sampling of f_i in Equations 6 and 7 for sampling the cluster locations, Θ*, and cluster memberships, s.
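The posterior predictive step rests on standard Gaussian conditioning under y_i = f_i + ε_i, ε_i ~ N(0, τ^{−1} I_T), f_i ~ N_T(0, C); a minimal sketch:

```python
import numpy as np

def denoised_mean_cov(y, C, tau):
    """Posterior moments of f | y when y = f + eps, eps ~ N(0, I/tau), f ~ N(0, C).

    Standard Gaussian conditioning: E[f|y] = C (C + I/tau)^{-1} y and
    Cov[f|y] = C - C (C + I/tau)^{-1} C.
    """
    T = len(y)
    A = C @ np.linalg.inv(C + np.eye(T) / tau)  # smoothing matrix C (C + I/tau)^{-1}
    return A @ y, C - A @ C
```

As τ grows, the smoothing matrix approaches the identity and the de-noised mean collapses onto the observed series, as expected.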

A.2 Intrinsic Gaussian Markov Random Field Gibbs Scan
Posterior sampling for the mixtures of iGMRFs employs a Gibbs scan over λ = ({f_i}, {κ*_m}, s, α, τ) from a set of conjugate full conditionals, in the following steps: 1. Sample the latent functions, {f_ij}_{i=1,...,N; j=1,...,T}, in Gibbs steps from the full conditionals, f_ij | y_ij, κ*_{s_i}, τ ~ N(φ_ij^{−1} e_ij, φ_ij^{−1}), where e_ij = τ y_ij + κ*_{s_i} Q_jj f̃_ij, with f̃_ij = −(1/Q_jj) Σ_{k: k∼j} Q_jk f_ik, and φ_ij = τ + κ*_{s_i} Q_jj. 2. Sample the cluster precisions, {κ*_m}, and the cluster assignments, s, from their conjugate Gamma-based full conditionals, where a_2 is as defined above and b_{2,i} is term i in the sum that composes b_2, defined above.
Finally, we sample the DP concentration parameter, α, and the model noise precision, τ, in the same fashion as for the GP model. Unlike the mixtures of GPs, we specify this model in a conjugate formulation that allows for fast sampling.
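The site-wise update in step 1 of the iGMRF scan can be sketched as below; the first-order random-walk (RW1) precision structure used in the test is an illustrative choice of Q:

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_update_f(f_i, y_i, Q, kappa, tau):
    """One Gibbs sweep over f_{ij}, j = 1..T, for series i under an iGMRF prior.

    Q is the iGMRF precision structure matrix, kappa the cluster precision,
    tau the observation noise precision.  Each site is drawn from
    N(e_ij / phi_ij, 1 / phi_ij), with
      f~_ij  = -(1/Q_jj) * sum_{k ~ j} Q_jk f_ik   (conditional prior mean)
      e_ij   = tau * y_ij + kappa * Q_jj * f~_ij
      phi_ij = tau + kappa * Q_jj.
    """
    T = len(f_i)
    for j in range(T):
        neigh = np.delete(np.arange(T), j)
        f_tilde = -(Q[j, neigh] @ f_i[neigh]) / Q[j, j]
        e = tau * y_i[j] + kappa * Q[j, j] * f_tilde
        phi = tau + kappa * Q[j, j]
        f_i[j] = e / phi + rng.standard_normal() / np.sqrt(phi)
    return f_i
```

Because each full conditional is univariate Gaussian, a sweep costs only the neighbor sums implied by the sparsity of Q, which is the source of the fast sampling noted above.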