-
PDF
- Split View
-
Views
-
Cite
Cite
Louisa A. Nolan, Markus O. Harva, Ata Kabán, Somak Raychaudhury, A data-driven Bayesian approach for finding young stellar populations in early-type galaxies from their ultraviolet–optical spectra, Monthly Notices of the Royal Astronomical Society, Volume 366, Issue 1, February 2006, Pages 321–338, https://doi.org/10.1111/j.1365-2966.2005.09868.x
- Share Icon Share
Abstract
Efficient predictive models and data analysis techniques for the analysis of photometric and spectroscopic observations of galaxies are not only desirable, but also required, in view of the overwhelming quantities of data becoming available. We present the results of a novel application of Bayesian latent variable modelling techniques, where we have formulated a data-driven algorithm that allows one to explore the stellar populations of a large sample of galaxies from their spectra, without the application of detailed physical models. Our only assumption is that the galaxy spectrum can be expressed as a linear superposition of a small number of independent factors, each a spectrum of a stellar subpopulation that cannot be individually observed. A probabilistic latent variable architecture that explicitly encodes this assumption is then formulated, and a rigorous Bayesian methodology is employed for solving the inverse modelling problem from the available data. A powerful aspect of this method is that it formulates a density model of the spectra, based on which we can handle observational errors. Further, we can recover missing data both from the original set of spectra which might have incomplete spectral coverage of each galaxy, or from previously unseen spectra of the same kind.
We apply this method to a sample of 21 ultraviolet-optical spectra of well-studied early-type galaxies, for which we also derive detailed physical models of star formation history (i.e. age, metallicity and relative mass fraction of the component stellar populations). We also apply it to synthetic spectra made up of two stellar populations, spanning a large range of parameters. We apply four different data models, starting from a formulation of principal component analysis (PCA), which has been widely used. We explore alternative factor models, relaxing the physically unrealistic assumption of Gaussian factors, as well as constraining the possibility of negative flux values that are allowed in PCA, and show that other models perform equally well or better, while yielding more physically acceptable results. In particular, the more physically motivated assumptions of our rectified factor analysis enable it to perform better than PCA, and to recover physically meaningful results.
We find that our data-driven Bayesian modelling allows us to identify those early-type galaxies that contain a significant stellar population that is ≲1-Gyr old. This experiment also concludes that our sample of early-type spectra showed no evidence of more than two major stellar populations differing significantly in age and metallicity. This method will help us to search for such young populations in a large ensemble of spectra of early-type galaxies, without fitting detailed models, and thereby to study the underlying physical processes governing the formation and evolution of early-type galaxies, particularly those leading to the suppression of star formation in dense environments. In particular, this method would be a very useful tool for automatically discovering various interesting subclasses of galaxies, for example, post-starburst or E+A galaxies.
1 Introduction
The determination of the star formation history of early-type galaxies has important implications for the still-controversial issue of their formation and evolution. Are the bulk of their stars assembled at high redshift, followed by predominantly passive evolution, or are they formed from low-redshift disc-disc merging? Or, as it seems likely from the variety of detailed elliptical morphology which is being revealed via projects, e.g. those involving integral field spectrographs such as SAURON, does the class of ellipticals contain subclasses, and are these dependent on the method of their formation, or on their local environment (Bacon et al. 2001; Emsellem et al. 2004)? We would like to know how these subclasses depend on local environment. In particular, we wish to know what mechanisms are responsible for the well-known morphology-density relationship (e.g. Dressler 1980), which shows that the proportion of early-type galaxies increases as the proportion of late-type galaxies decreases in regions with higher galaxy density. Furthermore, the recent interest in E+A galaxies (e.g. Tran et al. 2003), in which the signature of a recent (≲1-Gyr) starburst is seen, but no significant ongoing star formation, suggests that the relation between the evolution of these galaxies and their constituent stellar populations is not well understood.
However, until recently, only small ensembles of galaxy spectra, spanning a large useful range, were available and the analysis of a large statistical sample of stellar populations of galaxies was not possible. With the recently completed two-degree Field Galaxy Redshift Survey (2dFGRS) and the ongoing Sloan Digital Sky Survey (SDSS) or 6dF Galaxy Survey (6dFGS), a few million galaxy spectra will soon be available. The detailed physical modelling of stellar populations is numerically expensive, so the timely extraction of useful knowledge, such as the characteristics of the star formation history of galaxies, from these data will largely depend on developing appropriate data analysis tools that are able to complement more specialized astrophysical studies, in particular by automating parts of the analyses. Some work has already been carried out using an approach which incorporates assumptions from physical models to a greater or lesser extent (e.g. Heavens, Jimenez & Lahav 2000; Kauffmann et al. 2003; Fernandes et al. 2005; Ocvirk et al. 2005a,b; Solorio et al. 2005).
With the increase of data availability, significant research has recently been focusing on the use of data-driven methods of astrophysical data analysis. Some of these attempt to analyse galaxy properties from spectra, independently from any physical model. The usefulness of the application of principal component analysis (PCA), a classical multivariate statistical analysis technique, has been exploited and demonstrated in various scenarios (see e.g. Connolly et al. 1995; Folkes, Lahav & Maddox 1996; Ronen, Aragon-Salamanca & Lahav 1999; Madgwick et al. 2003a, b). Indeed, PCA is an efficient tool for data compression, and as such, suitable for providing a compressed representation for subsequent physical analysis. In a different line of research, the automated partitioning of spectra based on their morphological shapes (Slonim, Somerville & Lahav 2001) has also been explored with reasonable success.
Here, we explore the application of a similar technique to a different problem, namely that of the data-driven identification of the spectra of distinct stellar subpopulations in the spectrum of a galaxy. Our only assumption is that the galaxy spectrum can be expressed as a linear superposition of a small number of independent factors, each a spectrum of a stellar subpopulation that cannot be individually observed. We would like to implement an optimal feature transformation that would reveal interpretable and meaningful structural properties (factors) from a large sample of galaxy spectra. This would allow fast, automated extraction of some key physical properties from potentially very large data sets of the kind the Virtual Observatory would offer.
Preliminary results based on a simple non-orthogonal projection method have indicated that an identification of interpretable component spectra may be achievable in a purely data-driven manner (Kabán, Nolan & Raychaudhury 2005). In this work, we follow a more rigorous methodology, explicitly encoding the above hypothesis into appropriate probabilistic data models. We then employ a Bayesian formalism to solve the inverse modelling problem, as well as to objectively assess to what extent such a hypothesis is supported by the data, in terms of both physical interpretability and the ability for predictive generalization to previously unseen data of the same type.
The Bayesian framework allows us to automate several functional requirements within a single statistically sound formal framework, where all modelling assumptions may be explicitly encoded. Tasks ranging from inference and prediction to inverse modelling and data explanation, data compression, as well as model comparison are all accomplished in a flexible manner, even in conditions of missing values and measurement uncertainties.
In this study, we start from a set of synthetic spectra of stellar populations of known age and metallicity from the single stellar population models of Jimenez et al. (2004). We also have assembled a set of observed ultraviolet (UV)-optical spectra of early-type galaxies, to which we have fitted two-component stellar population models, and thus we have some knowledge of their star formation history (i.e. ages, metallicities and mass fractions of the component stellar populations). We then perform our data-driven Bayesian analysis on both synthetic and observed spectra, and compare the derived parameters with those obtained from detailed astrophysical modelling.
In Section 2, we describe the set of 21 observed early-type galaxy spectra that we analyse here, and describe the synthetic stellar population spectra. Section 3 details the Bayesian modelling of the data. Our results are presented and discussed in Section 4. In Section 5, we discuss the significance of this work on characterizing E+A galaxies. Our conclusions are outlined in Section 6. Appendix A gives the detailed mathematical framework required to implement our methodology, specifically for the data model estimation of the variational Bayesian rectified factor analysis — a flexible data model for the factorization of positive data that we developed.
2 The Data
2.1 Synthetic stellar population models



2.2 Observed early-type galaxy spectra
We have also compiled a set of 21 early-type galaxy spectra from the archives for this study, across the wavelength range 2500-8000 Å, although not all objects have data covering the complete range. The spectra of the galaxies in the sample are shown in Fig. 1. As the galaxy sample was assembled from those nearby early-type galaxies with both UV and optical spectra, from various archives, it does not represent a statistical sample. The sources for the data are listed in Table 1, together with the telescopes with which the optical spectra were observed. The UV spectra were observed using the International Ultraviolet Explorer (IUE), except for NGC 3605 and 5018, which were observed with the Faint Object Spectrograph (FOS) on the Hubble Space Telescope (HST).

Table of references and optical telescopes for the observed early-type galaxy spectra.
Since these spectra are compiled from sources with varying spectral coverage, the resulting data matrix has some missing values. The regions of incomplete wavelength coverage are obvious in Fig. 1. It is one of the strengths of our analysis that these missing data regions can be recovered; this is described in Sections 3.1.3, 3.4 and 3.5. Another of the advantages of the probabilistic framework adopted here is that, in contrast with classical non-probabilistic methods, the observational errors on each data point can be taken into account (Section 3.1.2).
As the data matrix on which the analysis is performed must have the same wavelength binning for all the galaxies, the spectra are de-redshifted and binned to the same wavelength resolution as the synthetic stellar population spectra (∼20 Å; the observed spectra have wavelength resolution which varies from ∼5-18 Å). For each galaxy, the various spectral sections, taken in turn, are normalized to unity in the overlap regions, and spliced, to create a single continuous spectrum for each object. Although the different spectral pieces for any one object were observed separately, and with different telescopes, it should be noted that the apertures used were similar, and therefore each section probes a similar part of each galaxy, and the shape of the continua in the overlap regions agree well.
3 Data Modelling
Models are simplifications of reality representing relevant aspects while glossing over irrelevant details. Data models are constructed from experimental evidence (data) and available prior knowledge. Once a data model is estimated (which can be done off-line), it offers a tool for making inferences and predictions in a consistent framework.
In Bayesian data modelling, all information is encoded in probability distributions (Bernardo & Smith 1994). The joint probability distribution of the observed data and other unobservable variables of interest (such as factors or component stellar populations of the representation as well as the parameters of the mixing process, etc.) need to be designed. The unobservable variables in the model are often referred to as hidden or latent variables.
3.1 Designing and building an appropriate latent variable architecture
3.1.1 The independent factor representation model of the spectrum
The spectrum of a galaxy or of a stellar subpopulation is essentially a vector, over a binned wavelength range, representing the flux per unit wavelength bin. Consider a set of N spectra of early-type galaxies, each measured at T different wavelength bins. We denote by the vector formed by the flux values at the wavelength bin indexed by t, as measured across the N spectra. The N×T matrix formed by these vectors may be referred to as X and single elements of this matrix will be denoted as xnt. Similar notational convention will also apply to other variables used.



The K components are assumed to be statistically independent, which is a standard assumption of independent factor models. The role of the function ƒ here is a technically convenient way of imposing certain constraints on the components ƒ(s) in order to facilitate the estimation of interpretable factors. This will be detailed in Section 3.1.4
As in nearly all data modelling applications of factor models, we hope that the feature transformation produced by the mapping A will reveal important structural characteristics of the collection of spectra. It is part of our data modelling design to investigate various ways of facilitating this, as will be detailed in Section 3.1.4
3.1.2 Dealing with measurement errors in the data model

3.1.3 Dealing with missing values in the data model

Having completed the general structural specification of the proposed probabilistic model, in the following subsection, we will make more specific assumptions about distributional form of the latent spectral representation that emerges from such probabilistic models.
3.1.4 The distributional forms employed in the data model
In order to be able to infer the parameters of the factor model (2), the form of the distributions of the involved variables, which may have a key impact on the subsequent interpretability of the resulting probabilistic model (MacKay 2003), need to be defined. That such assumptions are made explicit is indeed an important feature of the Bayesian framework adopted here. An important advantage of this approach is that of the several alternative forms that could be assumed for these distributional forms (we could call these ‘hypotheses’), the one best supported by the available data can be determined automatically from the data evidence. More details can be found below, in Section 3.3.
Before proceeding, we note that such assumptions are present in all automated analysis techniques, although in the case of non-probabilistic methods they are implicit and fixed. In particular, it has been shown that the widely used technique of principal component analysis (PCA) implicitly assumes a standard Gaussian representation and isotropic Gaussian noise (Tipping & Bishop 1999). Although PCA has been useful in a number of ways in astrophysical data analysis (Connolly et al. 1995; Folkes et al. 1996; Ronen et al. 1999; Madgwick et al. 2003a, b), there is no guarantee that the assumptions it makes are appropriate in our case. In particular, the assumption of a Gaussian distribution for the spectral elements in our present problem would imply that they are equally likely to have positive and negative flux values. This may make the physical interpretation of the spectra of the individual stellar populations (factors) problematic. Indeed, the physical validity of the isotropic structure of the Gaussian noise imposed by PCA has been questioned by Madgwick et al. (2003b).
It is therefore worthwhile to consider alternative factor models to relax such unrealistic constraints, and to introduce more justifiable constraints (such as the positivity of the factors). This is what we do in this section.
In this study we consider four factor models with differing assumptions made regarding the distributional forms employed. These are summarized in Table 2. The first three of these refer to factor models existing in the literature, and the fourth one has been recently developed by us (Harva & Kabán 2005). All factor models have been implemented in a Bayesian formulation (the so-called variational Bayesian framework, see details in Section 3.3) in order to ensure a consistent formal framework. The four factor models are described below.

VB-PCA The first row of Table 2 describes the variational Bayesian PCA (VB-PCA) data model (Bishop 1999). Due to the widespread use of PCA, we will consider it as the baseline hypothesis. As already mentioned, it assumes that the spectral factors of the representation follow a Gaussian distribution and there is no positivity constraint. This is summarized in the columns p(s) and ƒ(s), respectively, of the first row of Table 2. Further, it assumes that the mixing coefficients of the factors (i.e. the elements of the matrix A of equation 2) follow a standard Gaussian and finally that the noise term has an isotropic variance structure (i.e. is the same for all n).
VB-FA In Table 2, the variational Bayesian factor analysis (VB-FA) data model (Bishop 1999) relaxes the constraint of the isotropic structure of the Gaussian noise imposed by PCA. Indeed, such a constraint may be physically unjustified (Madgwick et al. 2003b). All other assumptions made are in turn identical with those of VB-PCA.
VB-PFA The above two models do not require the individual factors, which are spectra of the stellar subpopulations, to be non-negative, which they need to be in order to be physically acceptable. The remaining two models considered here do so. The third row of Table 2 describes the variational Bayesian positive factor analysis (VB-PFA), a data model developed by Miskin (2000), that, in contrast to VB-PCA and VB-FA, imposes the positivity constraint. It does so by assuming that each individual spectrum p(s) (factor) of the representation follows a zero-mode rectified Gaussian distribution (essentially a Gaussian restricted to its positive half and renormalized, defined in Section A4). VB-PFA also restricts the mixing coefficients p(a) to be positive, assumed to be standard rectified Gaussians, and maintains the same structural form for the noise variance as VB-FA.
The positivity constraint just introduced is meant to facilitate the interpretability of the factors as component spectra. Let us observe, however, that in VB-PFA, the 0 mode of p(s) would mean that the recovered components would have a low flux close to zero occupying most of the spectrum. Although there has been a great deal of interest in factor models that have such sparsely distributed factors for a variety of other purposes (Lee & Seung 2001; Miskin & Mackay 2001; Hinton & Ghahramani 1997; Socci, Lee & Seung 1998; Frey & Hinton 1999), including image- and text-related applications as well as neurobiological modelling, such a sparsity assumption may be restrictive and often unjustified for the modelling of stellar population spectra. However, as shown in Harva & Kabán (2005), modifying VB-PFA to relax the 0-mode assumption is technically impossible. Thus, we developed a different way of imposing the positivity constraint, which also permits the mode of p(s) to be estimated from the data in a flexible manner. We did not discard the hypothesis of VB-PFA either, since it has been applied to stellar population spectra in the literature (Miskin 2000) and we do not wish to discard any existing hypotheses based on possibly subjective intuitions.
VB-RFA The last row of Table 2 describes the variational Bayesian rectified factor analysis (VB-RFA) data model (Harva & Kabán 2005), where the positivity constraint over the factors is ensured through the non-differentiable function ƒ(s) = max(0, s). With this solution, the factors are now ƒ(s) and they are guaranteed to be positive. Then the variables s become some dummy arguments of ƒ(), which are now safely modelled as Gaussians (column p(s) of fourth row of Table 2). Therefore, both the means and variances of these Gaussians can now be estimated from the data. The resulting model allows more flexibility regarding the shape of the factors compared with VB-PFA, while ensuring the positivity of the factors. Finally, the mixing coefficients p(a) are assumed to be distributed as standard rectified Gaussians. Note that in this context the zero-mode constraint is appropriate since the sparsity of the mixing matrix A translates to assuming a smooth basis transformation. The technical novelty and computational advantages of the VB-RFA approach are detailed in Harva & Kabán (2005).
As it should now be clear, our scope is neither to apply any off-the-shelf data analysis method in an unjustified manner, nor to rely on subjective intuitions, as both of these approaches may easily lead to biased results. Instead, we have constructed four different hypotheses which we consider equally likely a priori. The Bayesian framework allows us to decide which hypothesis is best supported by the data.
3.2 The overall model architecture
The architecture for any of the four data models defined above can be graphically represented as in Fig. 2. This is the joint density of the observational data and all other variables of interest. Nodes denote random variables, rectangles refer to observed variables (i.e. derived from observational data), and circles and hexagons refer to unobservable variables that we are hoping to infer from the data through the formulated probabilistic model. Hexagons denote hidden variables on the top of the hierarchy.1 Arrows signify generative causal dependency relations.

Graphical model of the proposed latent variable architecture (i.e. the data model). Nodes represent random variables in the data model, and arrows indicate causal relationships between variables. Out of the nodes, rectangles refer to observed variables: measured fluxes, y and known measurement errors, vy. Circles and hexagons refer to latent (hidden) variables: clean flux, x, along with their variance parameters, vx, mixing coefficients, a, and components, s, along with their component-wise means, ms, and variance parameters, vs). Hexagons indicate the latent variables on the top of this hierarchy, for which further hierarchical priors may be considered.
We now outline the variational Bayesian estimation procedure associated with this architecture.
3.3 Estimation of a data model


The model estimation algorithm then consists of iteratively updating each parameter's posterior statistics in turn, while keeping all other posterior statistics fixed, until a convergence criterion is achieved. For each update, the evidence bound (5) is guaranteed not to decrease (Jordan et al. 1999; MacKay 2003), and convergence to a local optimum is guaranteed. It can also be shown that due to the fully factorial posterior approximation (6), all updates are local, i.e. requiring posterior statistics of the so-called Markov blanket only. That is, for updating any of the variable nodes, the posterior statistics of only its children, parents and coparents are needed. This has been exploited in the efficient implementation realized in Valpola et al. (2003), Bishop et al. (2002) and Winn & Bishop (2005), and has also been adopted in this work. The scaling of the resulting variational Bayesian estimation algorithm is thus multilinear in the number of factors, observations and wavelength bins of the model. These can also be easily followed from the further details given in Section A1.
The evidence bound (5) can be used both for monitoring the convergence of the algorithm and, more importantly, for comparison of different solutions or models. The quantities required for computing this are given in Section A2.
3.4 Using a data model to recover missing elements of the data

3.5 Using a data model for recovering missing elements from new measurements
Due to the fully generative nature of the adopted framework, once the data model parameters are estimated from a training data set, inferences and predictions can be made for new, previously unseen measurements. This involves performing the inference for those terms of the variational posterior q(θ) that are conditional on the type of new measurements (new galaxy spectra in the same range or various subranges of wavelength for the same galaxies). Using these, as well as the remainder of terms of q(θ) as estimated from the training set, we can then proceed to computing (7). Results will be presented in Section 4.3.
4 Results
One of the strengths of the Bayesian evidence framework employed in this study is that the selection of the best-supported hypothesis can be automatically achieved, in a quantitative manner, based on the data evidence. This is known as model selection.
4.1 Model selection results: how many major stellar populations do we need?
Here, we seek a criterion to examine which data model performs the best, as well as to find the dimensionality of the representation space. We would like to have a compact model (small number of factors) both to avoid over-fitting the data, and also for the results to be physically interpretable. At the same time, we would like to keep an adequate number of factors to guarantee a good approximation of the data. Apart from the interpretability issue, the Bayesian evidence framework can be used to automatically select the optimal model and its order from those investigated. Our preferences, motivated by the purpose of the application (e.g. data interpretation, missing value prediction), expressed as prior belief, would then be used to weight the evidence obtained from the data, in order to make a final Bayesian decision on the order of the model.
Fig. 3 shows the negative of the log evidence for the four data models (strictly, the negative log evidence bound), versus the model order, K (i.e. the number of components), for the observed spectra (Fig. 1). It can be seen that the VB-RFA model outperforms the other three models. We also see that increasing the number of factors beyond two does not significantly increase the evidence, which suggests that the bulk of spectra of an early-type galaxy can be reasonably well and compactly described by two factors or components, each interpretable as a stellar subpopulation. This is useful for input into more detailed physical models of early-type galaxy spectra. We will also see in Section 4.2 that the K= 2 models produce two factors that are both physically meaningful, while increasing the model order to K= 3 does not bring any further interpretable factors.

The negative log evidence for the variational Bayesian analyses of the observed spectra, as a function of the number of factors (components), i.e. stellar subpopulations. The performance of VB-PCA is clearly inferior. VB-RFA has the highest log evidence, with no significant improvement in increasing K, the number of components, from two to three. This shows that the spectra of early-type galaxies used here can be safely modelled as the sum of the spectra of two distinct major stellar components, and the inclusion of more components is not required by the data.
4.2 Physical interpretation of the results
Here, we compare the results of the data-driven method outlined so far with an entirely independent determination of the star formation history of our 21 early-type galaxies.
In this more conventional approach, we have fitted two-component detailed physical models, constructed from the single stellar population models of Jimenez et al. (2004), to the observed long-baseline spectra of our sample, as shown in Fig. 1. The ages, metallicities and relative mass fractions of the component stellar populations are allowed as free parameters. We construct the two-component spectra as the linear superposition of the two-component spectra (1), and allow ages 0.01-14.0 Gyr, and metallicities Z= 0.01, 0.2, 0.5, 1.0, 1.5, 2.5 and 5.0 Z⊙. Again, the relative mass fractions have to be jointly constrained, mi+mj= 1, but mi is allowed to vary from 0 to 1, in steps of 0.02. A minimum-?2 fit was used to determine the best-fitting values of mi, Zi, ti, mj, Zj and tj. The whole parameter space was searched, with the best-fitting parameter values corresponding to the point on the agei-agej-Zi-Zj-mi-mj grid with the minimum calculated ?2. Details of the fitting process are given in Nolan (2002) and Nolan et al. (2003).
4.2.1 Analysis of the observed spectra with the two-component data model
In Fig. 4, synthetic spectra (Jimenez et al. 2004) of young (0.7 Gyr, left-hand panel) and old (10 Gyr, right-hand panel) stellar populations are shown, for metallicities 0.2, 1.0 and 2.5 Z⊙, in each of the upper panels, from top to bottom, respectively. Below these are shown the spectra of the individual components, recovered from the four analysis methods described above, for both the observed galaxy spectra (middle panels), and linear superpositions of pairs of synthetic spectra (bottom panels).

Analysis of the two-component data model. Top: Synthetic stellar population spectra (Jimenez et al. 2004). Top left-hand panel: Spectra of a population of age 0.7 Gyr, where metallicity, from bottom to top, is Z= 0.2, 1.0 and 2.5 Z⊙; top right-hand panel: age = 10 Gyr, same metallicities. The dotted lines mark some of the absorption features in the spectrum which are typically strong in young stellar populations, and the dashed lines mark some of the absorption features which are typically strong in old, metal-rich stellar populations. From left- to right-hand side, the absorption line species are: Mg ii (2799 Å), H? (3970 Å), Hδ (4102 Å), Hγ (4340 Å), Hβ (4861 Å), Mgb (5175 Å), NaD (5893 Å), Hα (6563 Å), TiO (7126 Å). Middle: The two components found from the various different linear independent basis transformation analyses, with the number of components K= 2, of the observed early-type galaxy spectra. The methods are, from bottom to top: VB-FA, VB-PFA, VB-PCA and VB-RFA. The recovered spectra are convincingly disentangled into one component (middle left-hand panel) with young stellar population features (Mg ii, H?, Hδ, Hγ, Hβ, Hα: dotted lines) and shape, and a second (middle right-hand panel) with the features (Mgb, NaD, TiO: dashed lines) and shape of an old, high-metallicity stellar population. Bottom: The two components found from the same analyses, but from the linear superposition of two synthetic stellar populations. The various curves refer to the same parameters as in the middle panels. The UV spectrum of the ‘young’ component recovered here rises more steeply than for the same component recovered from the observed spectra, due to the inclusion, in the models, of a very young (0.03-Gyr) synthetic stellar population, which is not seen in our observed spectra.
In general, the similarity between the recovered components, and the corresponding model spectra, is striking, for all four methods. In the younger components displayed on the left-hand panel of Fig. 5, the 4000-Å break, typical of a mature stellar population (≳1 Gyr), is not seen, but strong Balmer absorption line features [H? (3970 Å), Hδ (4102 Å), Hγ (4340 Å), Hβ (4861 Å)] can be clearly identified. These are typical features of the A stars which dominate the emission from young (<1-Gyr) stellar populations. In the second (older) component, however, the 4000-Å break is apparent, as are other features typical of mature, metal-rich stellar populations, e.g. absorption features corresponding to Mgb (5175 Å), NaD (5893 Å) and TiO (7126 Å). We can rule out either component being representative predominantly of either active galactic nucleus emission or dust absorption as the characteristic shapes and features of these spectra would be very different from those we recover.

Scatter plots showing the correlation of (top) the age of the younger stellar population and (bottom) the mass fraction of the smaller stellar population determined from two-component fitting to the observed spectra with the weight of the first component (a1) of the VB-PCA (left-hand panel) and VB-RFA (right-hand panel) linear basis transformation analyses of the observed spectra, with K= 2. A high value of a1 clearly corresponds to a substantial young (≲1-Gyr) stellar population. The open circles are for those galaxies with a secondary population of age less than 1 Gyr. These plots also show that the worst-performing (VB-PCA) and best-performing (VB-RFA) algorithms in this case yield almost identical results.
However, the VB-PCA algorithm seems to ‘find’ a spurious subcomponent with younger features (middle left-hand panel) that has some features from old, metal-rich stellar populations (notably NaD).
In Fig. 5, we show results for VB-RFA method (right-hand panels), which has the highest log evidence values, together with the VB-PCA method (left-hand panels), which has the lowest, for comparison. We plot the correlation between the first (‘younger’) component (a1) and the age of the younger stellar population (age2, top panels) and the mass fraction of the secondary stellar population component (m2/Mgal), as determined from the two-component model-fitting described above (bottom panels). The open points are those galaxies which contain a significantly young stellar population (≲1 Gyr).
Fig. 6 plots the results of the Spearman's rank correlation test for the weight of the first component a1 and the age, metallicity and mass fraction of the two fitted stellar populations for the observed spectra with K= 2. A value of the Spearman's rank correlation coefficient |RS| > 0.5 indicates a strong correlation between the parameters. It can be seen that there is a strong correlation between a1 and the ages of the younger stellar populations, and also between a1 and the stellar mass fractions (m1/Mgal, m2/Mgal) of the component stellar populations. The correlation between a1 and age1 is weak, and there is no correlation between a1 and the metallicities of the component stellar populations. The trends of the correlations with a2 follow similar, but opposite, trends to those of a1. This is not generally true for factor analysis, but arises here as a consequence of the nature of our data set.

Spearman's rank correlation coefficient (RS) for the first component (a1) of the four linear basis transformation analyses with K= 2, of the observed spectra and the best-fitting parameters of the two-component synthetic stellar population fits to the data: solid — age of the older population; outline — age of the younger population; light hatched — metallicity of the older population; light cross-hatched — metallicity of the younger population; dark hatched — mass fraction of the smaller population; dark cross-hatched — mass fraction of the dominant population. mod RS > 0.5 indicates a strong correlation.
The rest-frame spectra of four of the galaxies in our sample, together with the reconstructed data model and the two constituent data model components are presented in Fig. 7. The data model convincingly reproduces the observed spectra.

Rest-frame spectra of four of the 21 galaxies in our sample (thick black line), with the reconstructed data model and the two constituent data model components (thin lines) superimposed. The data model convincingly reproduces the observed spectra.
4.2.2 Analysis of the synthetic spectra with the two-component data model
Here, we present the results of the analysis of the two-component synthetic spectra, constructed as described in Section 2.1. They represent every possible combination of the assigned ages, metallicities and relative stellar mass fractions, and thus allow us to investigate a wider range of known age and metallicity combinations than the observed spectra. It should be noted that not all combinations (for example, those with two components <1-Gyr old) represent what we might expect to find in the total population of real early-type galaxies, although there could be local regions within these galaxies where, for example, young stars dominate the stellar population. We wanted to cover the complete range of possibilities in order to avoid making any assumptions about the stellar populations of early-types, which, as revealed even in our small sample of 21 spectra, can be quite varied.
In this experiment, no noise has been added to the synthetic spectra. This is because the noise from variations in the spectral shape caused by differences in the metallicity is higher than any noise contributions which represent measurement errors.
The bottom panels of Fig. 4 show the spectra of the components recovered using the same four analysis methods as used for the observed spectra. Again, the first component has a shape and features that are similar to those of a young stellar population, and the second component, those of a mature population. The shape of the UV part of the spectrum, together with the featureless nature of the spectrum above ∼5000 Å, are typical of very young stellar populations (around a few ×10 Myr). These populations are present in the synthetic data set, but not in significant quantities in the observed data set, hence the differences between the recovered data components.
In Fig. 8, as in Fig. 5, we see that a high value of a1 (the weight of the ‘younger’ component from the Bayesian data model) corresponds to the presence of a <1-Gyr stellar population. Fig. 9 plots the Spearman's rank correlation test results for a1 and the age, metallicity and mass fraction of the synthetic populations. Here, there is an even stronger correlation between a1 and the age of the younger population than in the case of the observed spectra, and also a strong correlation between a1 and the age of the older population, but no significant correlation between a1 and the mass fraction of the secondary population. However, in this exercise with synthetic spectra, we allowed only two values for m2/Mgal (0.25 and 0.5), which does not really allow for a meaningful determination of the correlation coefficient.

Scatter plots showing the correlation of (top) the age of the younger stellar population and (bottom) the mass fraction of the smaller stellar population determined from the synthetic spectra fitting with the weight of the first component of the VB-PCA and VB-RFA linear basis transformation analyses methods (a1), and K= 2. The open circles are for those galaxies with a secondary population of age less than 1 Gyr. A high value of a1 clearly corresponds to a significant young (≪1-Gyr) stellar population.

Spearman's rank correlation coefficient (RS) for the first component (a1) of the various linear basis transformation analyses with K= 2, of the two-component synthetic spectra and the best-fitting parameters of the two-component synthetic stellar population fits to the data: solid — age of the older population; outline — age of the younger population; light hatched — metallicity of the older population; light cross-hatched — metallicity of the younger population; dark hatched — mass fraction of the smaller population; dark cross-hatched — mass fraction of the dominant population. mod RS > 0.5 indicates a strong correlation.
These results confirm that the value of a1 is a clear indicator of the presence of a significant young stellar population, and is particularly unambiguous in discovering components with very young (<1-Gyr) stellar populations, as we found in Section 4.2.1
4.2.3 Analysis of the three-component data model
Fig. 3 shows that increasing the number of component spectra in the Bayesian models from two to three does not significantly increase the evidence for any data model. Indeed, for the positive factor analysis, the evidence decreases as K increases from 2 to 3. These results suggest that the bulk of the stellar content of early-type galaxies can be represented by two components, one which represents contributions from old, metal-rich stellar populations, the other representing contributions from young populations (Fig. 10). It is not easy to find a physical interpretation for the third component. Indeed, the third component is probably a combination of residuals from the sum of the spectra of two populations, combined with correlated components of observational errors.

Analysis of the three-component data model. Top panel: Synthetic stellar population spectra (Jimenez et al. 2004); top left-hand panel: For a stellar population of age 0.7 Gyr, with from bottom to top, metallicity Z= 0.2, 1.0 and 2.5 Z⊙; top right-hand panel: age = 10 Gyr, same metallicities. The dotted lines mark some of the absorption features in the spectrum which are typically strong in young stellar populations, and the dashed lines mark some of the absorption features which are typically strong in old, metal-rich stellar populations. From left- to right-hand side, the absorption line species are: Mg ii (2799 Å), H? (3970 Å), Hδ (4102 Å), Hγ (4340 Å), Hβ (4861 Å), Mgb (5175 Å), NaD (5893 Å), Hα (6563 Å), TiO (7126 Å). Middle and bottom panels: The three components found from the various different linear independent basis transformation analyses of the model stellar populations, with K= 3. The methods are, from bottom to top: VB-FA, VB-PFA, VB-PCA and VB-RFA. The first two recovered spectra are broadly identified with the features of a young and old stellar population in the same way as the K= 2 components. The third component (bottom) possibly contains ‘residuals’, resulting from the fact that the first two components are not a complete detailed reconstruction of the true stellar population components.
The results of the analysis performed on the observed spectra are presented in Fig. 11. The components a1 and a2 correlate significantly with the age of the younger population (age2), except for the VB-RFA, where the correlation is with a3 rather than a1. The relative mass fraction of the two stellar population components (m1 and m2) also correlate with a1 and a2, and, for the VB-RFA, with a3 as well.

Spearman's rank correlation coefficient (RS) for the first component (a1) of the various linear basis transformation analyses with K= 3, of the observed spectra and the best-fitting parameters of the two-component synthetic stellar population fits to the data: solid — age of the older population; outline — age of the younger population; light hatched — metallicity of the older population; light cross-hatched — metallicity of the younger population; dark hatched — mass fraction of the smaller population; dark cross-hatched — mass fraction of the dominant population. mod RS > 0.5 indicates a strong correlation.
For the synthetic two-population spectra, age2 correlates strongly with a1 and a2, for all the analysis methods, and also with a3 for the VB-RFA. However, there is no strong correlation (mod RS > 0.5) for any of the data models with the relative mass fraction of the two populations, although the metallicity of the younger population (Z2) correlates strongly with a3 in the case of the VB-PCA and VB-FA methods.
The differences between the correlation coefficients for the observed and synthetic spectra most likely reflect the different distributions of age, metallicity and mass fraction in the two spectral samples. For example, the observed spectra are all dominated by high-metallicity components (2.5 and 5.0 Z⊙), with small contributions from very low-metallicity populations (0.01 Z⊙). Only one of the galaxies contains stellar populations with intermediate metallicities. In contrast, the synthesized spectra can take the whole range of combinations of the seven allowed metallicities.
4.2.4 Interpretation of the eigencomponents of PCA
Although we have managed to identify components with PCA here, it is a well-known property of PCA that the solution it provides is rotation invariant. That is, the eigencomponents which have equal variances can be multiplied by an arbitrary orthogonal matrix to produce another equally good solution (in terms of satisfying the same objective function). Therefore PCA is not guaranteed to achieve a successful separation of the underlying independent source signals that we are looking for. Methods that take into account higher-order dependencies need to be used for that purpose (Hyvärinen, Karhunen & Oja 2001).
The only ‘lucky’ case where PCA can identify components is if the underlying components have suitably different variances. This is true for the work presented here, but there is no reason why this would hold for data sets of galaxy spectra in general. Therefore, even if PCA has been successful here, it comes with no guarantees.
Another way of looking at this problem is through the probabilistic formalism. PCA's implicit assumption (as shown in detail in Tipping & Bishop 1999) is that the components it will find are Gaussian. Making this assumption explicit, by the probabilistic formalism that we adopted, highlights the lack of plausibility of this assumption: the spectra that we are looking for are not Gaussian, instead they are positive valued only, and have a rather skewed distribution. This renders the use of PCA very problematic in applications that require a clear interpretation of the components.
In turn, by adopting the probabilistic formalism where any assumption in the data model must be made explicit, we are able to specify the distribution that corresponds to components that we are looking for. Therefore, the chances that we can find what we are looking for, and not something else, are considerably increased. The ultimate conclusion is of course made based on visual inspection of the recovered components, made by human experts.
4.3 The prediction of missing values
As a more objective criterion for the assessment of our data modelling compared to the results obtained by the fitting of detailed spectral models, here we investigate the predictive ability of the modelling approach when faced with missing elements of the spectra.
We begin with demonstrating the recovered values for the missing elements along with their predicted uncertainties (posterior variances), as obtained from VB-RFA on the small real data set, compared with those from the best-fitting physical model. This is shown on the scatter plot in Fig. 12. The correlation coefficient between the posterior mean predictions of the data model and the values imputed from the best-fitting physical model is 0.989, the signal-to-noise ratio being 9.3. This is a strong indication that the data-driven analysis and the physical analysis are well matched.

Predictions for the missing values provided by the VB-RFA model versus the physical model. Values of the x-radii of the ellipses represent errors (one standard deviation) obtained from data imputation using synthetic physical models. The y-radius values correspond to one standard deviation of the posteriors of x, as obtained using VB-RFA. There are no missing values in the observed spectra in the central region of the wavelength range considered here, which accounts for the gap in the plotted data.
However, in the case of the real data, the true values of the missing entries are unknown. Therefore, in the following section, we employ synthetic data to perform a more thorough and controlled objective evaluation of the predictive capabilities of our probabilistic data model model. Missing entries were simulated, and the true values were used for evaluation only. The synthetic data set was more diverse than the real one, as a variety of parameter combinations that can be simulated, in this process, may not appear in real spectra. Therefore, a wealth of situations were created on which the generalization abilities of the proposed method were tested. A global and fixed noise level of σ2nt= 0.01 was used in the data models in these experiments.3 We studied the prediction performance for both two- and three-component models (K= 2, K= 3) on new, previously unseen measurements, in the two scenarios as reported below. In this setting, the evidence bound peaked at K= 3, as can be seen in Figs 13 and 14.

Predictive performance from previously unseen spectra using a VB-RFA model with K= 2 (bottom) and K= 3 (top), plotted against the percentage of missing entries from the test set.

Predictive performance from previously unseen wavelength bins using a VB-RFA model with K= 2 (bottom) and K= 3 (top). As before, the x-axis indicates the percentage of missing entries from the test set.
4.3.1 Predictions for previously unseen spectra in the same wavelength range
As the spectra of more and more galaxies become available and are added to public archives, the ability to make predictions for previously unseen spectra, that have a small percentage of missing values within the same overall wavelength range, is an important practical requirement.
It is therefore desirable to be able to infer reliable estimates of missing values, based on a few measurements and the probabilistic representation already created from previously seen spectra. In this scenario, Fig. 13 shows the prediction results obtained by VB-RFA when varying the percentage of missing entries in the test set. The predictions are surprisingly accurate up to a high percentage of missing entries among the previously unseen spectra. Effectively, the high redundancy across the spectra, due to the fact that mature galaxies are very similar to each other, makes this scenario relatively easy to handle.
4.3.2 Predictions for previously unseen wavelength bins
The second scenario investigated here is when the predictions are to be made for missing elements for previously unseen values of wavelength, albeit within the same overall wavelength range.
This is a more difficult task, since the flux varies significantly across different wavelength ranges. We conducted experiments in which the overall wavelength range for each spectrum was split randomly into training sets (regions without missing values in our case, though they need not be so) and test sets, and each model was trained on the training set. Then, to ensure that we retain the missing-at-random (Ghahramani & Jordan 1994) assumption, for randomly picked ranges in the test set, ‘missing’ values were predicted based on the trained model and the prediction errors were averaged over all missing values. This procedure was repeated 10 times with 50 different combinations of training and test sets for each spectrum.
The results are shown in Fig. 14. The percentage of missing values was varied between experiments as is indicated on the x-axis. With a complete training set, the predictions are still pretty accurate even when up to 60 per cent of the bins from the test set are missing.
5 Finding E+A Galaxies
So far, we established generic data-driven procedures for finding evidence of more than one stellar subpopulations, significantly differing in age and metallicity-related characteristic features, in the integrated spectra of galaxies. Here, we show an application to a particular data-mining project which would benefit from this method, once large sets of galaxy spectra are available for public use.
The determination of the star formation history of early-type galaxies has important implications for the much-debated issue of their formation and evolution. Observational evidence suggests that at least some elliptical galaxies already have the bulk of their stars at high redshift (z > 2) (e.g. Renzini 1998; Nolan et al. 2003), but in hierarchical structure-formation models, ellipticals can form at lower redshift from late-type merging (e.g. White & Rees 1978; Kauffmann, White & Guiderdoni 1993). A vital link in understanding how early-type evolution is driven, and under what environmental conditions, is the study of E+A galaxies.
The spectra of these galaxies are characterized by strong hydrogen Balmer absorption lines together with a lack of significant [O ii]3272-Å emission, implying a recent (≲1-Gyr) starburst, but no significant ongoing star formation. The overall shape of the spectral energy distribution of these galaxies is that expected for an early-type galaxy. They may represent an intermediate stage between disc-dominated rotationally supported systems and spheroid-dominated, pressure-supported systems.
The physical mechanisms responsible for the triggering and subsequent truncation of star formation in E+A galaxies are currently unclear. They may be the result of disc-disc merging (or galaxy interactions), or cluster-related phenomenon, e.g. ram-pressure stripping, tidal stripping. However, although at intermediate redshift, it has been suggested that E+A-type galaxies are prevalent in clusters (e.g. Tran et al. 2003), which argues for a cluster-related mechanism, at low redshift (z∼ 0.1), E+A galaxies occur in lower-density environments, which is more indicative of merging or interaction processes (Zabludoff et al. 1996; Blake et al. 2004). Statistical studies of the environments, luminosities and detailed morphologies of these galaxies, across a range of redshifts, are necessary in order to uncover the physical processes governing their star formation history.
The possible relationship between early-type evolution and E+A galaxies has not as yet been well studied, as E+A systems are rare (<1 per cent of the overall zero-redshift galaxy population, Zabludoff et al. 1996). However, recent and ongoing large-scale surveys such as the SDSS and the 2dFGRS offer the opportunity to study these galaxies in a statistically meaningful way, if they can be reliably identified.
In this paper we have developed a rapid, automated method which allows us to identify those early-type galaxies containing a significant young (<1-Gyr old) population; these galaxies are E+A galaxies. Fig. 15 shows a typical E+A galaxy spectrum, together with its constituent population spectra. Unsurprisingly, given the diagnostic abilities of our components analyses, the component spectra look very similar to the spectra recovered from the components analyses.

Top: the optical spectrum of an A3V star; second from top: spectrum of a 12-Gyr stellar population; third from top: combined spectrum of the A star plus 12-Gyr stellar population, a typical E+A spectrum. Bottom: The observed spectrum of NGC 7252, from our sample, which contains a very young (<1-Gyr) stellar population.
Formiggini & Brosch (2004) have analysed the IUE UV spectra of a sample of normal galaxies using PCA. One of their components was associated with a young stellar population, and another with a mature population, consistent with our results. However, they were unable to correlate the principal components with the optical morphology of the galaxies. As we have prior knowledge of the component stellar populations in our galaxy sample, we have been able to make a more quantitative interpretation of our results: if the value of a1 (the weight of the ‘young’ component) for a particular early-type galaxy is >0.5, there is a significant <1-Gyr-old stellar population present in that galaxy.
Hence, our components analysis is able to rapidly identify post-starburst galaxies, without the uncertainties inherent in the measurement of the equivalent widths of the Balmer absorption and [O ii] emission lines. [O ii] emission in disc-dominated galaxies may suffer from dust obscuration, and Hγ and Hβ are subject to emission filling. These, together with the high signal-to-noise-ratio requirements, introduce significant uncertainties in the measurement of these lines. In contrast, as our method utilizes all the information present in the long-baseline spectra, i.e. both the features present and the overall shape of the spectra, it is a more robust diagnostic of recent star forming activity.
Using the wealth of data available in the SDSS and the 2dFGRS, with signal-to-noise ratio as good or better than the data we analyse here, we may map precisely in what environments star formation is suppressed (e.g. in groups or clusters), by rapidly and robustly identifying those galaxies hosting only mature (>1-Gyr) stellar populations. This allows us to investigate whether galaxy transformation (from star forming to passive galaxies) occurs via, e.g. ram-pressure stripping, tidal stripping, harassment or suppression of galaxy-galaxy interactions. Although the wavelength ranges of the SDSS and 2dF spectra are shorter than those we have used, we can use a training set of data to identify the linear components (as suggested below), which can then be applied to the SDSS and 2dF data.
With the higher-resolution spectra of these surveys, and higher-resolution model spectra that are available (e.g. Bruzual & Charlot 2003), we expect the accuracy of our statistical estimates to increase, as we have more independent samples (wavelength bins) of the mixed components. Whether or not the number of components, K, would increase would depend on the data set used. Both the existence of additional components and the interpretation of any further components can only be explored by carrying out the experiment. However, we would not expect the increase in information that we obtain from higher-resolution spectra to lead to a decrease in K.
We intend to extend our Bayesian modelling technique to identify other classes of galaxies, by using appropriately selected training sets of galaxy spectra and wavelength ranges. For example, starburst galaxies may be identified by the Wolf-Rayet ‘bump’ around ∼4600-4900 Å, which is a characteristic feature of the emission from the Wolf-Rayet stars in starburst galaxies. By first applying the Bayesian analysis to an appropriate training set of spectra, which include the spectral range of this feature, identification of these galaxies in a large sample should be possible.
6 Conclusions
Using rigorous Bayesian modelling, we have successfully developed a time-efficient method for the analysis of the spectra of early-type galaxies. At around a few seconds per galaxy for this analysis, its speed compares very well with work by other authors (e.g. Heavens et al. 2000; Fernandes et al. 2005). By comparing the results of this analysis with the results of our unique two-component synthetic stellar population fitting to the long-baseline data, and with two-component synthetic spectra, we have shown that the independent components analysis is physically interpretable.
Two linear components are sufficient to describe the bulk of the stellar content of early-type galaxies. One of the components represents a young (<1-Gyr) stellar population, and the other an old, metal-rich population. The relative contribution of these two components allows us to robustly identify which early-type galaxies host significant young stellar populations, i.e. which are post-starburst galaxies.
Of the four variational Bayesian methods investigated here, we find the variational Bayesian rectified factor analysis to be the most useful, since it has the greatest flexibility in the shape of the allowed distribution, whilst retaining the astronomically motivated requirement that the flux values in each component spectrum are non-negative. This method achieved the highest evidence and the highest correlation between the component weights and the population parameters determined from spectral fitting. The method allows the observational errors to be included in the formulation, and has the additional powerful attribute that it can be used to recover missing regions of spectral coverage. The strength of the correlation between the posterior mean predictions of the data model, and the values imputed from the best-fitting physical model, is a strong indication that the data-driven analysis and the physical analysis are well matched.
We intend to apply our analysis to the early-type galaxies, in the 2dF, 6dF and SDSS catalogues, as a useful tool to investigate their evolution with environment and redshift. We have shown that in addition to efficiently finding early-type galaxies with significant young stellar populations, this method would also be an efficient tool to find other distinctive subclasses, such as E+A galaxies, which is an important missing link in current population scenarios of the formation of early-type galaxies.
In this work, we use model and observed spectra between 2500 and 8000 Å. However, to be useful as a data-mining tool for z∼ 0 galaxies in the SDSS, 2dF and 6dF archives, the Bayesian analysis has to work efficiently in the 3700-8000 Å range equally well. Our next step will be to estimate data models that will learn, from the current data sets that include UV spectra, to work on much larger, solely optical data sets.
In future studies, we therefore intend to:
- (i)
estimate the data models for the 3700-8000 Å range;
- (ii)
apply our technique to 2dF, 6dF, SDSS catalogued spectra, to identify E+A galaxies and their environments;
- (iii)
experiment with identifying starburst galaxies;
- (iv)
consider the robustness of our data models with respect to uncertainties in the calibration and response of the 2dF/6dF/SDSS fibres, and the effect of random continuum errors, via analyses of simulated data with additional noise factors;
- (v)
extend the models to deal with stochastic censoring, i.e. non-ignorable (versus random) missing data.
Acknowledgment
MOH was supported by a Paul & Yuanbi Ramsay research award at the School of Computer Science of the University of Birmingham.
References
In the bayesblocks-based implementation (Valpola et al. 2003) adopted here, a maximal flexibility is ensured by further formulating hierarchical priors on all variables and
, as follows:
, with parameters common to all components of the variable, and the parameters
and
will then have vague priors in the form of a zero-mean Gaussian having a large value for variance. The definition is analogous for
and
. These additional variables have been omitted from the diagram in Fig. 2 for simplicity, as they are part of the relatively standard technicalities concerning a full Bayesian approach (Raiko et al. submitted) rather than an essential part of the application-specific model design.
All algorithms investigated here have been implemented as part of the bayesblocks software (Valpola et al. 2003; Raiko et al. submitted) and may be made freely available upon publication.
Here, σnt can be used to both control the compactness of the representation and simulate a level of measurement uncertainly similar to that encountered in real data.
Appendix
Appendix A: Implementation Details of the Vb-Rfa Data Model
A1 Data model estimation
Posterior updates for q(skt)









Note the non-trivial variational posterior (A2) that resulted from the free-form approximation, a bimodal distribution (A6), which is essentially a mixture of two rectified Gaussians.




Posterior updates forq(msk)








Posterior updates forq(ank)



Posterior updates forq(xnt)


These provide the required posterior expectation . Here, the indicator variable bnt= 1 if ynt is observed and zero if it is missing.
Posterior updates for variance parameters

Here, and
are computed by numerical optimization (Valpola et al. 2004; Raiko et al. submitted). Alternatively, a direct parameterization may be used with Gamma priors and Gamma posteriors may be sufficient for this model.
A2 The log evidence bound

The p and q terms for Gaussians will be given only once, while those that are different will be detailed separately.
p and q terms of Gaussian variables




p and q terms of ank

The q term of skt

A3 Missing value prediction






A4 Required properties of ![graphic]()
