Abstract

Efficient predictive models and data analysis techniques for photometric and spectroscopic observations of galaxies are not only desirable, but also required, in view of the overwhelming quantities of data becoming available. We present the results of a novel application of Bayesian latent variable modelling techniques, where we have formulated a data-driven algorithm that allows one to explore the stellar populations of a large sample of galaxies from their spectra, without the application of detailed physical models. Our only assumption is that the galaxy spectrum can be expressed as a linear superposition of a small number of independent factors, each a spectrum of a stellar subpopulation that cannot be individually observed. A probabilistic latent variable architecture that explicitly encodes this assumption is then formulated, and a rigorous Bayesian methodology is employed for solving the inverse modelling problem from the available data. A powerful aspect of this method is that it formulates a density model of the spectra, based on which we can handle observational errors. Further, we can recover missing data, both from the original set of spectra, which might have incomplete spectral coverage of each galaxy, and from previously unseen spectra of the same kind.

We apply this method to a sample of 21 ultraviolet-optical spectra of well-studied early-type galaxies, for which we also derive detailed physical models of star formation history (i.e. age, metallicity and relative mass fraction of the component stellar populations). We also apply it to synthetic spectra made up of two stellar populations, spanning a large range of parameters. We apply four different data models, starting from a formulation of principal component analysis (PCA), which has been widely used. We explore alternative factor models, relaxing the physically unrealistic assumption of Gaussian factors, as well as excluding the negative flux values that PCA allows, and show that other models perform equally well or better, while yielding more physically acceptable results. In particular, the more physically motivated assumptions of our rectified factor analysis enable it to perform better than PCA, and to recover physically meaningful results.

We find that our data-driven Bayesian modelling allows us to identify those early-type galaxies that contain a significant stellar population that is ≲1-Gyr old. This experiment also shows that our sample of early-type spectra exhibits no evidence of more than two major stellar populations differing significantly in age and metallicity. This method will help us to search for such young populations in a large ensemble of spectra of early-type galaxies, without fitting detailed models, and thereby to study the underlying physical processes governing the formation and evolution of early-type galaxies, particularly those leading to the suppression of star formation in dense environments. In particular, this method would be a very useful tool for automatically discovering various interesting subclasses of galaxies, for example, post-starburst or E+A galaxies.

1 Introduction

The determination of the star formation history of early-type galaxies has important implications for the still-controversial issue of their formation and evolution. Are the bulk of their stars assembled at high redshift, followed by predominantly passive evolution, or are they formed from low-redshift disc-disc merging? Or, as seems likely from the variety of detailed elliptical morphology being revealed by projects such as those involving integral field spectrographs, e.g. SAURON, does the class of ellipticals contain subclasses, and are these dependent on the method of their formation, or on their local environment (Bacon et al. 2001; Emsellem et al. 2004)? We would like to know how these subclasses depend on local environment. In particular, we wish to know what mechanisms are responsible for the well-known morphology-density relationship (e.g. Dressler 1980), which shows that the proportion of early-type galaxies increases as the proportion of late-type galaxies decreases in regions with higher galaxy density. Furthermore, the recent interest in E+A galaxies (e.g. Tran et al. 2003), in which the signature of a recent (≲1-Gyr) starburst is seen, but no significant ongoing star formation, suggests that the relation between the evolution of these galaxies and their constituent stellar populations is not well understood.

However, until recently, only small ensembles of galaxy spectra, spanning a large useful range, were available, and the analysis of a large statistical sample of stellar populations of galaxies was not possible. With the recently completed Two-degree Field Galaxy Redshift Survey (2dFGRS) and the ongoing Sloan Digital Sky Survey (SDSS) and 6dF Galaxy Survey (6dFGS), a few million galaxy spectra will soon be available. The detailed physical modelling of stellar populations is numerically expensive, so the timely extraction of useful knowledge, such as the characteristics of the star formation history of galaxies, from these data will largely depend on developing appropriate data analysis tools that are able to complement more specialized astrophysical studies, in particular by automating parts of the analyses. Some work has already been carried out using an approach which incorporates assumptions from physical models to a greater or lesser extent (e.g. Heavens, Jimenez & Lahav 2000; Kauffmann et al. 2003; Fernandes et al. 2005; Ocvirk et al. 2005a,b; Solorio et al. 2005).

With the increase of data availability, significant research has recently been focusing on the use of data-driven methods of astrophysical data analysis. Some of these attempt to analyse galaxy properties from spectra, independently from any physical model. The usefulness of the application of principal component analysis (PCA), a classical multivariate statistical analysis technique, has been exploited and demonstrated in various scenarios (see e.g. Connolly et al. 1995; Folkes, Lahav & Maddox 1996; Ronen, Aragon-Salamanca & Lahav 1999; Madgwick et al. 2003a, b). Indeed, PCA is an efficient tool for data compression, and as such, suitable for providing a compressed representation for subsequent physical analysis. In a different line of research, the automated partitioning of spectra based on their morphological shapes (Slonim, Somerville & Lahav 2001) has also been explored with reasonable success.

Here, we explore the application of a similar technique to a different problem, namely that of the data-driven identification of the spectra of distinct stellar subpopulations in the spectrum of a galaxy. Our only assumption is that the galaxy spectrum can be expressed as a linear superposition of a small number of independent factors, each a spectrum of a stellar subpopulation that cannot be individually observed. We would like to implement an optimal feature transformation that would reveal interpretable and meaningful structural properties (factors) from a large sample of galaxy spectra. This would allow fast, automated extraction of some key physical properties from potentially very large data sets of the kind the Virtual Observatory would offer.

Preliminary results based on a simple non-orthogonal projection method have indicated that an identification of interpretable component spectra may be achievable in a purely data-driven manner (Kabán, Nolan & Raychaudhury 2005). In this work, we follow a more rigorous methodology, explicitly encoding the above hypothesis into appropriate probabilistic data models. We then employ a Bayesian formalism to solve the inverse modelling problem, as well as to objectively assess to what extent such a hypothesis is supported by the data, in terms of both physical interpretability and the ability for predictive generalization to previously unseen data of the same type.

The Bayesian framework allows us to automate several functional requirements within a single statistically sound formal framework, where all modelling assumptions may be explicitly encoded. Tasks ranging from inference and prediction to inverse modelling and data explanation, data compression, as well as model comparison are all accomplished in a flexible manner, even in conditions of missing values and measurement uncertainties.

In this study, we start from a set of synthetic spectra of stellar populations of known age and metallicity from the single stellar population models of Jimenez et al. (2004). We also have assembled a set of observed ultraviolet (UV)-optical spectra of early-type galaxies, to which we have fitted two-component stellar population models, and thus we have some knowledge of their star formation history (i.e. ages, metallicities and mass fractions of the component stellar populations). We then perform our data-driven Bayesian analysis on both synthetic and observed spectra, and compare the derived parameters with those obtained from detailed astrophysical modelling.

In Section 2, we describe the set of 21 observed early-type galaxy spectra that we analyse here, and describe the synthetic stellar population spectra. Section 3 details the Bayesian modelling of the data. Our results are presented and discussed in Section 4. In Section 5, we discuss the significance of this work on characterizing E+A galaxies. Our conclusions are outlined in Section 6. Appendix A gives the detailed mathematical framework required to implement our methodology, specifically for the data model estimation of the variational Bayesian rectified factor analysis — a flexible data model for the factorization of positive data that we developed.

2 The Data

2.1 Synthetic stellar population models

We perform the linear independent transformation analyses on a set of two-component model stellar population spectra. We use the single stellar population models of Jimenez et al. (2004), which have a Salpeter initial mass function, across the same wavelength range (2500-8000 Å) as the observed spectra, and create a two-component model flux, F_{2pop,λ}. This is defined as
F_{2pop,λ} = m_i F_{Z_i,t_i,λ} + m_j F_{Z_j,t_j,λ}    (1)
where F_{2pop,λ} is the model flux per unit wavelength in the bin centred on wavelength λ, which is the sum of the two single-metallicity, single-age model fluxes F_{Z_i,t_i,λ} and F_{Z_j,t_j,λ}, which have metallicities and ages Z_i, t_i and Z_j, t_j, respectively. Here m_i, m_j are the fractional contributions (by stellar mass) of each component to the total model population. We set m_i+m_j= 1, where m_i can take values 0.25 or 0.5. Ages assigned for t_i, t_j are 0.03, 1, 3, 7 and 12 Gyr, and the metallicities are 0.004, 0.01, 0.02, 0.03 and 0.05, making a total of 50 two-component model spectra.
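
To make the construction of equation (1) concrete, the following minimal sketch (in Python) assembles a two-component model flux from two single-population spectra on a common wavelength grid. The toy spectra here are hypothetical stand-ins for the Jimenez et al. (2004) model grid, not the actual models.

```python
import numpy as np

# Common wavelength grid, ~20 Å bins over 2500-8000 Å (as used for the analysis)
wavelengths = np.arange(2500.0, 8001.0, 20.0)

def two_component_flux(flux_i, flux_j, m_i):
    """Equation (1): mass-weighted sum of two single-population spectra,
    with fractional mass contributions m_i and m_j = 1 - m_i."""
    return m_i * flux_i + (1.0 - m_i) * flux_j

# Toy stand-ins for single-age, single-metallicity model spectra; a real analysis
# would take these from the synthetic stellar population model grid instead.
flux_young = np.exp(-(wavelengths - 2500.0) / 4000.0)            # blue, declining continuum
flux_old = 1.0 - 0.8 * np.exp(-(wavelengths - 2500.0) / 4000.0)  # red continuum

f_2pop = two_component_flux(flux_young, flux_old, m_i=0.25)
```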

2.2 Observed early-type galaxy spectra

We have also compiled a set of 21 early-type galaxy spectra from the archives for this study, across the wavelength range 2500-8000 Å, although not all objects have data covering the complete range. The spectra of the galaxies in the sample are shown in Fig. 1. As the galaxy sample was assembled from those nearby early-type galaxies with both UV and optical spectra, from various archives, it does not represent a statistical sample. The sources for the data are listed in Table 1, together with the telescopes with which the optical spectra were observed. The UV spectra were observed using the International Ultraviolet Explorer (IUE), except for NGC 3605 and 5018, which were observed with the Faint Object Spectrograph (FOS) on the Hubble Space Telescope (HST).

Figure 1. Rest-frame spectra of the 21 galaxies in our sample. See Table 1 and Section 2 for details. The spectra are arbitrarily shifted along the flux axis for the sake of clarity.

Table 1. References and optical telescopes for the observed early-type galaxy spectra.

Since these spectra are compiled from sources with varying spectral coverage, the resulting data matrix has some missing values. The regions of incomplete wavelength coverage are obvious in Fig. 1. It is one of the strengths of our analysis that these missing data regions can be recovered; this is described in Sections 3.1.3, 3.4 and 3.5. Another of the advantages of the probabilistic framework adopted here is that, in contrast with classical non-probabilistic methods, the observational errors on each data point can be taken into account (Section 3.1.2).

As the data matrix on which the analysis is performed must have the same wavelength binning for all the galaxies, the spectra are de-redshifted and binned to the same wavelength resolution as the synthetic stellar population spectra (∼20 Å; the observed spectra have wavelength resolution which varies from ∼5-18 Å). For each galaxy, the various spectral sections, taken in turn, are normalized to unity in the overlap regions, and spliced, to create a single continuous spectrum for each object. Although the different spectral pieces for any one object were observed separately, and with different telescopes, it should be noted that the apertures used were similar, and therefore each section probes a similar part of each galaxy, and the shape of the continua in the overlap regions agree well.
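
The pre-processing just described (de-redshifting, rebinning to a common ~20 Å grid, normalizing in the overlap regions and splicing) can be sketched as follows. The function names and the simple interpolation and rescaling scheme are illustrative assumptions rather than the exact procedure used for the archival spectra.

```python
import numpy as np

def to_rest_frame(wavelength, z):
    """De-redshift observed wavelengths to the rest frame."""
    return wavelength / (1.0 + z)

def rebin(wavelength, flux, grid):
    """Interpolate a spectral section onto the common wavelength grid;
    bins outside the section's coverage are left as NaN (missing values)."""
    return np.interp(grid, wavelength, flux, left=np.nan, right=np.nan)

def splice(sections, z, grid):
    """Combine several spectral sections of one galaxy into a single spectrum.
    Each new section is rescaled to match the running spectrum in the overlap
    region before the two are merged."""
    combined = np.full(grid.size, np.nan)
    for wavelength, flux in sections:
        piece = rebin(to_rest_frame(wavelength, z), flux, grid)
        overlap = ~np.isnan(combined) & ~np.isnan(piece)
        if overlap.any():
            piece = piece * np.nanmean(combined[overlap]) / np.nanmean(piece[overlap])
        combined = np.where(np.isnan(combined), piece, combined)
    return combined
```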

3 Data Modelling

Models are simplifications of reality representing relevant aspects while glossing over irrelevant details. Data models are constructed from experimental evidence (data) and available prior knowledge. Once a data model is estimated (which can be done off-line), it offers a tool for making inferences and predictions in a consistent framework.

In Bayesian data modelling, all information is encoded in probability distributions (Bernardo & Smith 1994). The joint probability distribution of the observed data and other unobservable variables of interest (such as the factors or component stellar populations of the representation, as well as the parameters of the mixing process, etc.) needs to be designed. The unobservable variables in the model are often referred to as hidden or latent variables.

3.1 Designing and building an appropriate latent variable architecture

3.1.1 The independent factor representation model of the spectrum

The spectrum of a galaxy or of a stellar subpopulation is essentially a vector, over a binned wavelength range, representing the flux per unit wavelength bin. Consider a set of N spectra of early-type galaxies, each measured at T different wavelength bins. We denote by x_t the vector formed by the flux values at the wavelength bin indexed by t, as measured across the N spectra. The N×T matrix formed by these vectors may be referred to as X and single elements of this matrix will be denoted as x_{nt}. Similar notational convention will also apply to other variables used.

As stated above, the hypothesis of our model is that each of the N observations can be represented as a superposition of K < N latent underlying component spectra or factors s_t, which are not observable by direct measurements but only through an unknown linear mapping A. Formally, this can be summarized as follows.
x_t = A f(s_t) + ε_t    (2)
In (2), the first term of the right-hand side is the so-called systematic component and ε_t is the noise term or the stochastic component of this model, both of which will be further detailed below. The noise term ε_t is assumed to be a zero-mean independent and identically distributed Gaussian, with variance v_{x_n} for the nth spectrum. The diagonal structure of the covariance accounts for the notion that all dependencies that exist in x_t should be explained by the underlying hidden factors.

The K components are assumed to be statistically independent, which is a standard assumption of independent factor models. The role of the function ƒ here is to provide a technically convenient way of imposing certain constraints on the components ƒ(s) in order to facilitate the estimation of interpretable factors. This will be detailed in Section 3.1.4.

As in nearly all data modelling applications of factor models, we hope that the feature transformation produced by the mapping A will reveal important structural characteristics of the collection of spectra. It is part of our data modelling design to investigate various ways of facilitating this, as will be detailed in Section 3.1.4.

3.1.2 Dealing with measurement errors in the data model

Classical non-probabilistic data analysis tools do not offer the flexibility required for taking into consideration the uncertainty that is associated with all real measurements. It is thus an important practical advantage of the probabilistic framework we adopt here that it allows us to take such measurement errors into account as an integral part of the data model. This is achieved simply by making the ‘clean’ vectors x_t in (2) become hidden variables of the additional error model
y_{nt} = x_{nt} + e_{nt}    (3)
where e_{nt} is a zero-mean Gaussian noise term with known standard deviations σ_{nt}, for each individual measurement n= 1 : N, t= 1 : T.
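
As a sanity check of the model structure, the following sketch draws synthetic data from equations (2) and (3) by forward sampling. The array sizes, variances and the choice f(s) = max(0, s) (the VB-RFA rectification introduced in Section 3.1.4) are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

N, T, K = 21, 275, 2                     # spectra, wavelength bins, latent factors

# Latent factors and mixing matrix (positive, as in the rectified models)
S = rng.normal(loc=1.0, scale=1.0, size=(K, T))
F = np.maximum(S, 0.0)                   # f(s) = max(0, s): positive factor spectra
A = np.abs(rng.normal(size=(N, K)))      # positive mixing coefficients, one row per galaxy

# Equation (2): x_t = A f(s_t) + eps_t, with per-spectrum noise variance v_x
v_x = 0.01 * np.ones(N)
X = A @ F + rng.normal(scale=np.sqrt(v_x)[:, None], size=(N, T))

# Equation (3): observed fluxes = 'clean' fluxes plus measurement noise of known sigma_nt
sigma = 0.05 * np.ones((N, T))
Y = X + rng.normal(scale=sigma)

print(Y.shape)                           # (N, T) data matrix, columns being the vectors y_t
```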

3.1.3 Dealing with missing values in the data model

Our probabilistic framework also allows us to treat missing values in a principled and straightforward manner under the assumption that they are missing at random (Ghahramani & Jordan 1994; Chan, Lee & Sejnowski 2003). Splitting each datum vector into missing and observed parts, i.e. y_t^⊤ = (y_t^{o⊤}, y_t^{m⊤}), the missing entries simply marginalize out:
p(y_t^o) = ∫ p(y_t^o, y_t^m) dy_t^m    (4)
The superscript ‘o’ will be omitted hereafter for simplicity, noting that unless otherwise stated, y_t will refer to y_t^o and missing entries are discounted in all the update equations that follow.

Having completed the general structural specification of the proposed probabilistic model, in the following subsection we will make more specific assumptions about the distributional form of the latent spectral representation that emerges from such probabilistic models.

3.1.4 The distributional forms employed in the data model

In order to be able to infer the parameters of the factor model (2), the forms of the distributions of the variables involved, which may have a key impact on the subsequent interpretability of the resulting probabilistic model (MacKay 2003), need to be defined. That such assumptions are made explicit is indeed an important feature of the Bayesian framework adopted here. An important advantage of this approach is that of the several alternative forms that could be assumed for these distributions (we could call these ‘hypotheses’), the one best supported by the available data can be determined automatically from the data evidence. More details can be found below, in Section 3.3.

Before proceeding, we note that such assumptions are present in all automated analysis techniques, although in the case of non-probabilistic methods they are implicit and fixed. In particular, it has been shown that the widely used technique of principal component analysis (PCA) implicitly assumes a standard Gaussian representation and isotropic Gaussian noise (Tipping & Bishop 1999). Although PCA has been useful in a number of ways in astrophysical data analysis (Connolly et al. 1995; Folkes et al. 1996; Ronen et al. 1999; Madgwick et al. 2003a, b), there is no guarantee that the assumptions it makes are appropriate in our case. In particular, the assumption of a Gaussian distribution for the spectral elements in our present problem would imply that they are equally likely to have positive and negative flux values. This may make the physical interpretation of the spectra of the individual stellar populations (factors) problematic. Indeed, the physical validity of the isotropic structure of the Gaussian noise imposed by PCA has been questioned by Madgwick et al. (2003b).

It is therefore worthwhile to consider alternative factor models to relax such unrealistic constraints, and to introduce more justifiable constraints (such as the positivity of the factors). This is what we do in this section.

In this study we consider four factor models with differing assumptions made regarding the distributional forms employed. These are summarized in Table 2. The first three of these refer to factor models existing in the literature, and the fourth one has been recently developed by us (Harva & Kabán 2005). All factor models have been implemented in a Bayesian formulation (the so-called variational Bayesian framework, see details in Section 3.3) in order to ensure a consistent formal framework. The four factor models are described below.

Table 2. Distributional assumptions for the various data models.

VB-PCA The first row of Table 2 describes the variational Bayesian PCA (VB-PCA) data model (Bishop 1999). Due to the widespread use of PCA, we will consider it as the baseline hypothesis. As already mentioned, it assumes that the spectral factors of the representation follow a Gaussian distribution and there is no positivity constraint. This is summarized in the columns p(s) and ƒ(s), respectively, of the first row of Table 2. Further, it assumes that the mixing coefficients of the factors (i.e. the elements of the matrix A of equation 2) follow a standard Gaussian and finally that the noise term has an isotropic variance structure (i.e. v_{x_n} is the same for all n).

VB-FA In Table 2, the variational Bayesian factor analysis (VB-FA) data model (Bishop 1999) relaxes the constraint of the isotropic structure of the Gaussian noise imposed by PCA. Indeed, such a constraint may be physically unjustified (Madgwick et al. 2003b). All other assumptions made are in turn identical with those of VB-PCA.

VB-PFA The above two models do not require the individual factors, which are spectra of the stellar subpopulations, to be non-negative, which they need to be in order to be physically acceptable. The remaining two models considered here do so. The third row of Table 2 describes the variational Bayesian positive factor analysis (VB-PFA), a data model developed by Miskin (2000) that, in contrast to VB-PCA and VB-FA, imposes the positivity constraint. It does so by assuming that the distribution p(s) of each individual spectrum (factor) of the representation is a zero-mode rectified Gaussian (essentially a Gaussian restricted to its positive half and renormalized, defined in Section A4). VB-PFA also restricts the mixing coefficients p(a) to be positive, assumed to be standard rectified Gaussians, and maintains the same structural form for the noise variance v_{x_n} as VB-FA.

The positivity constraint just introduced is meant to facilitate the interpretability of the factors as component spectra. Let us observe, however, that in VB-PFA, the 0 mode of p(s) would mean that the recovered components would have a low flux close to zero occupying most of the spectrum. Although there has been a great deal of interest in factor models that have such sparsely distributed factors for a variety of other purposes (Lee & Seung 2001; Miskin & Mackay 2001; Hinton & Ghahramani 1997; Socci, Lee & Seung 1998; Frey & Hinton 1999), including image- and text-related applications as well as neurobiological modelling, such a sparsity assumption may be restrictive and often unjustified for the modelling of stellar population spectra. However, as shown in Harva & Kabán (2005), modifying VB-PFA to relax the 0-mode assumption is technically impossible. Thus, we developed a different way of imposing the positivity constraint, which also permits the mode of p(s) to be estimated from the data in a flexible manner. We did not discard the hypothesis of VB-PFA either, since it has been applied to stellar population spectra in the literature (Miskin 2000) and we do not wish to discard any existing hypotheses based on possibly subjective intuitions.

VB-RFA The last row of Table 2 describes the variational Bayesian rectified factor analysis (VB-RFA) data model (Harva & Kabán 2005), where the positivity constraint over the factors is ensured through the non-differentiable function ƒ(s) = max(0, s). With this solution, the factors are now ƒ(s) and they are guaranteed to be positive. Then the variables s become some dummy arguments of ƒ(), which are now safely modelled as Gaussians (column p(s) of fourth row of Table 2). Therefore, both the means and variances of these Gaussians can now be estimated from the data. The resulting model allows more flexibility regarding the shape of the factors compared with VB-PFA, while ensuring the positivity of the factors. Finally, the mixing coefficients p(a) are assumed to be distributed as standard rectified Gaussians. Note that in this context the zero-mode constraint is appropriate since the sparsity of the mixing matrix A translates to assuming a smooth basis transformation. The technical novelty and computational advantages of the VB-RFA approach are detailed in Harva & Kabán (2005).
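
The difference between the VB-PFA and VB-RFA positivity assumptions can be illustrated with a small sampling sketch (hypothetical parameter values): the zero-mode rectified Gaussian concentrates mass near zero, whereas rectifying a Gaussian whose mean is free to be estimated allows factors that are positive yet not sparse.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_rectified_gaussian(mean, std, n, rng):
    """Rejection sampler for a rectified Gaussian: a Gaussian restricted to
    its positive half and renormalized (adequate here for illustration)."""
    samples = []
    while len(samples) < n:
        draws = rng.normal(mean, std, n)
        samples.extend(draws[draws >= 0.0])
    return np.array(samples[:n])

# VB-PFA-style factor values: zero-mode rectified Gaussian, hence sparse (mass piles up near zero)
s_pfa = sample_rectified_gaussian(mean=0.0, std=1.0, n=10000, rng=rng)

# VB-RFA-style factor values: Gaussian with a data-estimated mean, then f(s) = max(0, s)
s_rfa = np.maximum(rng.normal(loc=2.0, scale=1.0, size=10000), 0.0)

print(np.mean(s_pfa < 0.1), np.mean(s_rfa < 0.1))   # fraction of near-zero values in each case
```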

As it should now be clear, our scope is neither to apply any off-the-shelf data analysis method in an unjustified manner, nor to rely on subjective intuitions, as both of these approaches may easily lead to biased results. Instead, we have constructed four different hypotheses which we consider equally likely a priori. The Bayesian framework allows us to decide which hypothesis is best supported by the data.

3.2 The overall model architecture

The architecture for any of the four data models defined above can be graphically represented as in Fig. 2. This is the joint density of the observational data and all other variables of interest. Nodes denote random variables, rectangles refer to observed variables (i.e. derived from observational data), and circles and hexagons refer to unobservable variables that we are hoping to infer from the data through the formulated probabilistic model. Hexagons denote hidden variables on the top of the hierarchy.1 Arrows signify generative causal dependency relations.

Figure 2. Graphical model of the proposed latent variable architecture (i.e. the data model). Nodes represent random variables in the data model, and arrows indicate causal relationships between variables. Among the nodes, rectangles refer to observed variables: measured fluxes, y, and known measurement errors, v_y. Circles and hexagons refer to latent (hidden) variables: the clean fluxes, x, along with their variance parameters, v_x, the mixing coefficients, a, and the components, s, along with their component-wise means, m_s, and variance parameters, v_s. Hexagons indicate the latent variables at the top of this hierarchy, for which further hierarchical priors may be considered.

We now outline the variational Bayesian estimation procedure associated with this architecture.

3.3 Estimation of a data model

Each of the four data models described is estimated using the variational Bayesian methodology described in this section. The specific expressions obtained for the VB-RFA data model are provided in Appendix A. To make the notation concise, we will refer to the latent variables by θ and to the data by Y. Since handling the posterior distribution p(θ|Y) is intractable, we resort to a variational scheme (Jordan et al. 1999; Lappalainen 1999; Attias 2000; Raiko et al. submitted), where an approximating distribution q(θ) is fitted to the true posterior. This is done by constructing a lower bound of the log evidence, based on Jensen's inequality,
log p(Y) = log ∫ p(Y, θ) dθ ≥ ⟨log p(Y, θ) − log q(θ)⟩_q    (5)
where 〈·〉q denotes expectation with respect to q.
This inequality holds for any distribution q(·) due to the concavity of the logarithm. A fully factorial approximation of the posterior (Bishop 1999; Lappalainen 1999; Attias 2000; Bishop, Spiegelhalter & Winn 2002; MacKay 2003; Raiko et al. submitted) will be employed here, which is also known to lead to estimation algorithms that can be implemented efficiently by using local computations only (Lappalainen 1999; Bishop et al. 2002; Valpola et al. 2003; Raiko et al. submitted). The best fully factorial approximate posterior (referred to as the variational posterior) is computed by functional maximization of (5), which can be shown (Jordan et al. 1999; Lappalainen 1999; MacKay 2003) to be equivalent to minimizing the Kullback-Leibler divergence (Cover & Thomas 1991) of the variational posterior from the true one. The fully factorial variational posterior of the data model in Fig. 2 is simply of the form
q(θ) = ∏_i q(θ_i)    (6)
Therefore, (5) further decomposes into one plus twice as many terms as there are latent variables in the data model and the optimization with respect to the individual variables is separately performed. The variational posteriors thus obtained, along with the statistics that are required for an implementation,2 are given in Section A1 for the VB-RFA model. Some of these posterior computations are non-trivial. Readers interested in more detailed derivations are referred to Harva & Kabán (2005).

The model estimation algorithm then consists of iteratively updating each parameter's posterior statistics in turn, while keeping all other posterior statistics fixed, until a convergence criterion is achieved. For each update, the evidence bound (5) is guaranteed not to decrease (Jordan et al. 1999; MacKay 2003), and convergence to a local optimum is guaranteed. It can also be shown that due to the fully factorial posterior approximation (6), all updates are local, i.e. requiring posterior statistics of the so-called Markov blanket only. That is, for updating any of the variable nodes, the posterior statistics of only its children, parents and coparents are needed. This has been exploited in the efficient implementation realized in Valpola et al. (2003), Bishop et al. (2002) and Winn & Bishop (2005), and has also been adopted in this work. The scaling of the resulting variational Bayesian estimation algorithm is thus multilinear in the number of factors, observations and wavelength bins of the model. These can also be easily followed from the further details given in Section A1.
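
Schematically, the estimation procedure described above amounts to a coordinate-ascent loop over the factors of the variational posterior, monitored through the evidence bound of equation (5). The sketch below shows only the control flow; the actual posterior-statistic updates for VB-RFA are those given in Appendix A, and the update and bound functions here are assumed to be supplied by a model implementation.

```python
def fit_variational_model(update_fns, bound_fn, tol=1e-6, max_iter=500):
    """Generic coordinate ascent for a fully factorial variational posterior.

    update_fns : callables, each refreshing the posterior statistics of one
        latent variable using only its Markov blanket (children, parents and
        co-parents) and mutating a shared model state.
    bound_fn   : callable returning the current evidence bound (equation 5).
    """
    bounds = []
    for iteration in range(max_iter):
        for update in update_fns:          # update each q(theta_i) in turn
            update()
        bounds.append(bound_fn())
        # each sweep cannot decrease the bound; stop once the gain is negligible
        if iteration > 0 and bounds[-1] - bounds[-2] < tol:
            break
    return bounds
```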

The evidence bound (5) can be used both for monitoring the convergence of the algorithm and, more importantly, for comparison of different solutions or models. The quantities required for computing this are given in Section A2.

3.4 Using a data model to recover missing elements of the data

We have already stressed the advantage of the probabilistic Bayesian framework adopted here in terms of the ability of the resulting data models to predict and recover incomplete data. Recovering missing values can be achieved by computing the posterior of the missing entries given the observed entries:
p(y_t^m | y_t^o) = ∫ p(y_t^m | θ) p(θ | y_t^o) dθ    (7)
An approximate mean value of this posterior provides the imputation for the missing entry. This is discussed in Section A3.
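
In practice, such a point imputation reduces to reading off the model's reconstruction at the missing wavelength bins. A minimal sketch, assuming posterior-mean estimates of the mixing matrix and the rectified factors are available from a fitted model (the full treatment of Section A3 also carries the posterior variances):

```python
import numpy as np

def impute_missing(Y, A_mean, F_mean, observed_mask):
    """Replace missing fluxes by the posterior-mean reconstruction A f(s).

    Y             : (N, T) flux matrix with NaN at missing entries.
    A_mean        : (N, K) posterior mean of the mixing coefficients.
    F_mean        : (K, T) posterior mean of the (rectified) factors.
    observed_mask : boolean (N, T) array, True where a flux was measured.
    """
    reconstruction = A_mean @ F_mean
    return np.where(observed_mask, Y, reconstruction)
```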

3.5 Using a data model for recovering missing elements from new measurements

Due to the fully generative nature of the adopted framework, once the data model parameters are estimated from a training data set, inferences and predictions can be made for new, previously unseen measurements. This involves performing the inference for those terms of the variational posterior q(θ) that are conditional on the type of new measurements (new galaxy spectra in the same range or various subranges of wavelength for the same galaxies). Using these, as well as the remainder of terms of q(θ) as estimated from the training set, we can then proceed to computing (7). Results will be presented in Section 4.3.

4 Results

One of the strengths of the Bayesian evidence framework employed in this study is that the selection of the best-supported hypothesis can be automatically achieved, in a quantitative manner, based on the data evidence. This is known as model selection.

4.1 Model selection results: how many major stellar populations do we need?

Here, we seek a criterion to examine which data model performs the best, as well as to find the dimensionality of the representation space. We would like to have a compact model (small number of factors) both to avoid over-fitting the data, and also for the results to be physically interpretable. At the same time, we would like to keep an adequate number of factors to guarantee a good approximation of the data. Apart from the interpretability issue, the Bayesian evidence framework can be used to automatically select the optimal model and its order from those investigated. Our preferences, motivated by the purpose of the application (e.g. data interpretation, missing value prediction), expressed as prior belief, would then be used to weight the evidence obtained from the data, in order to make a final Bayesian decision on the order of the model.

Fig. 3 shows the negative of the log evidence for the four data models (strictly, the negative log evidence bound), versus the model order, K (i.e. the number of components), for the observed spectra (Fig. 1). It can be seen that the VB-RFA model outperforms the other three models. We also see that increasing the number of factors beyond two does not significantly increase the evidence, which suggests that the bulk of the spectrum of an early-type galaxy can be reasonably well and compactly described by two factors or components, each interpretable as a stellar subpopulation. This is useful for input into more detailed physical models of early-type galaxy spectra. We will also see in Section 4.2 that the K= 2 models produce two factors that are both physically meaningful, while increasing the model order to K= 3 does not bring any further interpretable factors.
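
Operationally, the comparison shown in Fig. 3 amounts to re-fitting each data model for a range of K and recording the converged negative evidence bound. A minimal sketch, assuming a fitting routine that returns the bound of equation (5):

```python
def compare_model_orders(fit_model, data, orders=(1, 2, 3, 4, 5)):
    """Fit a data model for each candidate number of factors K and collect the
    negative evidence bound; the order best supported by the data is the one
    with the lowest value (subject to the interpretability considerations
    discussed in the text)."""
    neg_log_evidence = {K: -fit_model(data, K) for K in orders}
    best_K = min(neg_log_evidence, key=neg_log_evidence.get)
    return neg_log_evidence, best_K
```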

Figure 3. The negative log evidence for the variational Bayesian analyses of the observed spectra, as a function of the number of factors (components), i.e. stellar subpopulations. The performance of VB-PCA is clearly inferior. VB-RFA has the highest log evidence, with no significant improvement in increasing K, the number of components, from two to three. This shows that the spectra of early-type galaxies used here can be safely modelled as the sum of the spectra of two distinct major stellar components, and the inclusion of more components is not required by the data.

4.2 Physical interpretation of the results

Here, we compare the results of the data-driven method outlined so far with an entirely independent determination of the star formation history of our 21 early-type galaxies.

In this more conventional approach, we have fitted two-component detailed physical models, constructed from the single stellar population models of Jimenez et al. (2004), to the observed long-baseline spectra of our sample, as shown in Fig. 1. The ages, metallicities and relative mass fractions of the component stellar populations are allowed as free parameters. We construct the two-component spectra as the linear superposition of two single-population spectra, as in equation (1), and allow ages 0.01-14.0 Gyr, and metallicities Z= 0.01, 0.2, 0.5, 1.0, 1.5, 2.5 and 5.0 Z⊙. Again, the relative mass fractions have to be jointly constrained, m_i+m_j= 1, but m_i is allowed to vary from 0 to 1, in steps of 0.02. A minimum-χ2 fit was used to determine the best-fitting values of m_i, Z_i, t_i, m_j, Z_j and t_j. The whole parameter space was searched, with the best-fitting parameter values corresponding to the point on the age_i-age_j-Z_i-Z_j-m_i-m_j grid with the minimum calculated χ2. Details of the fitting process are given in Nolan (2002) and Nolan et al. (2003).
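
The grid search just described can be sketched as a brute-force minimum-χ2 scan. The helper single_pop_flux, which looks up a single-population model spectrum for a given age and metallicity, is a hypothetical stand-in for the model grid.

```python
import numpy as np
from itertools import product

def chi_squared(observed, model, sigma):
    """Chi-squared between an observed spectrum and a model spectrum,
    ignoring wavelength bins with missing data."""
    return np.nansum(((observed - model) / sigma) ** 2)

def fit_two_components(observed, sigma, single_pop_flux, ages, metallicities,
                       mass_fractions=np.arange(0.0, 1.01, 0.02)):
    """Scan the (t_i, Z_i, t_j, Z_j, m_i) grid for the minimum-chi-squared
    two-component model of equation (1)."""
    best_chi2, best_params = np.inf, None
    for t_i, Z_i, t_j, Z_j, m_i in product(ages, metallicities,
                                           ages, metallicities, mass_fractions):
        model = m_i * single_pop_flux(t_i, Z_i) + (1.0 - m_i) * single_pop_flux(t_j, Z_j)
        chi2 = chi_squared(observed, model, sigma)
        if chi2 < best_chi2:
            best_chi2, best_params = chi2, (t_i, Z_i, t_j, Z_j, m_i)
    return best_chi2, best_params
```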

4.2.1 Analysis of the observed spectra with the two-component data model

In Fig. 4, synthetic spectra (Jimenez et al. 2004) of young (0.7 Gyr, left-hand panel) and old (10 Gyr, right-hand panel) stellar populations are shown, for metallicities 0.2, 1.0 and 2.5 Z⊙, in each of the upper panels, from top to bottom, respectively. Below these are shown the spectra of the individual components, recovered from the four analysis methods described above, for both the observed galaxy spectra (middle panels), and linear superpositions of pairs of synthetic spectra (bottom panels).

Figure 4. Analysis of the two-component data model. Top: Synthetic stellar population spectra (Jimenez et al. 2004). Top left-hand panel: Spectra of a population of age 0.7 Gyr, where metallicity, from bottom to top, is Z= 0.2, 1.0 and 2.5 Z⊙; top right-hand panel: age = 10 Gyr, same metallicities. The dotted lines mark some of the absorption features in the spectrum which are typically strong in young stellar populations, and the dashed lines mark some of the absorption features which are typically strong in old, metal-rich stellar populations. From left- to right-hand side, the absorption line species are: Mg ii (2799 Å), Hε (3970 Å), Hδ (4102 Å), Hγ (4340 Å), Hβ (4861 Å), Mgb (5175 Å), NaD (5893 Å), Hα (6563 Å), TiO (7126 Å). Middle: The two components found from the various different linear independent basis transformation analyses, with the number of components K= 2, of the observed early-type galaxy spectra. The methods are, from bottom to top: VB-FA, VB-PFA, VB-PCA and VB-RFA. The recovered spectra are convincingly disentangled into one component (middle left-hand panel) with young stellar population features (Mg ii, Hε, Hδ, Hγ, Hβ, Hα: dotted lines) and shape, and a second (middle right-hand panel) with the features (Mgb, NaD, TiO: dashed lines) and shape of an old, high-metallicity stellar population. Bottom: The two components found from the same analyses, but from the linear superposition of two synthetic stellar populations. The various curves refer to the same parameters as in the middle panels. The UV spectrum of the ‘young’ component recovered here rises more steeply than for the same component recovered from the observed spectra, due to the inclusion, in the models, of a very young (0.03-Gyr) synthetic stellar population, which is not seen in our observed spectra.

In general, the similarity between the recovered components, and the corresponding model spectra, is striking, for all four methods. In the younger components displayed in the left-hand panels of Fig. 4, the 4000-Å break, typical of a mature stellar population (≳1 Gyr), is not seen, but strong Balmer absorption line features [Hε (3970 Å), Hδ (4102 Å), Hγ (4340 Å), Hβ (4861 Å)] can be clearly identified. These are typical features of the A stars which dominate the emission from young (<1-Gyr) stellar populations. In the second (older) component, however, the 4000-Å break is apparent, as are other features typical of mature, metal-rich stellar populations, e.g. absorption features corresponding to Mgb (5175 Å), NaD (5893 Å) and TiO (7126 Å). We can rule out either component being representative predominantly of either active galactic nucleus emission or dust absorption as the characteristic shapes and features of these spectra would be very different from those we recover.

Figure 5. Scatter plots showing the correlation of (top) the age of the younger stellar population and (bottom) the mass fraction of the smaller stellar population determined from two-component fitting to the observed spectra with the weight of the first component (a1) of the VB-PCA (left-hand panel) and VB-RFA (right-hand panel) linear basis transformation analyses of the observed spectra, with K= 2. A high value of a1 clearly corresponds to a substantial young (≲1-Gyr) stellar population. The open circles are for those galaxies with a secondary population of age less than 1 Gyr. These plots also show that the worst-performing (VB-PCA) and best-performing (VB-RFA) algorithms in this case yield almost identical results.

However, the VB-PCA algorithm seems to ‘find’ a spurious subcomponent with younger features (middle left-hand panel) that has some features from old, metal-rich stellar populations (notably NaD).

In Fig. 5, we show results for the VB-RFA method (right-hand panels), which has the highest log evidence values, together with the VB-PCA method (left-hand panels), which has the lowest, for comparison. We plot the weight of the first (‘younger’) component (a1) against the age of the younger stellar population (age2, top panels) and against the mass fraction of the secondary stellar population component (m2/Mgal, bottom panels), as determined from the two-component model-fitting described above. The open points are those galaxies which contain a significant young stellar population (≲1 Gyr).

Fig. 6 plots the results of the Spearman's rank correlation test for the weight of the first component a1 and the age, metallicity and mass fraction of the two fitted stellar populations for the observed spectra with K= 2. A value of the Spearman's rank correlation coefficient |RS| > 0.5 indicates a strong correlation between the parameters. It can be seen that there is a strong correlation between a1 and the ages of the younger stellar populations, and also between a1 and the stellar mass fractions (m1/Mgal, m2/Mgal) of the component stellar populations. The correlation between a1 and age1 is weak, and there is no correlation between a1 and the metallicities of the component stellar populations. The trends of the correlations with a2 follow similar, but opposite, trends to those of a1. This is not generally true for factor analysis, but arises here as a consequence of the nature of our data set.
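
The rank-correlation statistic used throughout this section is the standard Spearman coefficient; a short sketch (with hypothetical values standing in for the fitted weights and population parameters) shows the quantity being computed and the |RS| > 0.5 threshold applied in the text.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)

# Hypothetical per-galaxy quantities: a1 is the weight of the 'young' factor from the
# data model; age_young is the age (Gyr) of the younger fitted stellar population.
a1 = rng.uniform(0.0, 1.0, size=21)
age_young = 10.0 ** rng.uniform(-1.5, 1.1, size=21)

r_s, p_value = spearmanr(a1, age_young)
print(r_s, p_value, abs(r_s) > 0.5)   # |R_S| > 0.5 taken to indicate a strong correlation
```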

Figure 6. Spearman's rank correlation coefficient (RS) for the first component (a1) of the four linear basis transformation analyses with K= 2, of the observed spectra and the best-fitting parameters of the two-component synthetic stellar population fits to the data: solid — age of the older population; outline — age of the younger population; light hatched — metallicity of the older population; light cross-hatched — metallicity of the younger population; dark hatched — mass fraction of the smaller population; dark cross-hatched — mass fraction of the dominant population. |RS| > 0.5 indicates a strong correlation.

The rest-frame spectra of four of the galaxies in our sample, together with the reconstructed data model and the two constituent data model components are presented in Fig. 7. The data model convincingly reproduces the observed spectra.

Figure 7. Rest-frame spectra of four of the 21 galaxies in our sample (thick black line), with the reconstructed data model and the two constituent data model components (thin lines) superimposed. The data model convincingly reproduces the observed spectra.

4.2.2 Analysis of the synthetic spectra with the two-component data model

Here, we present the results of the analysis of the two-component synthetic spectra, constructed as described in Section 2.1. They represent every possible combination of the assigned ages, metallicities and relative stellar mass fractions, and thus allow us to investigate a wider range of known age and metallicity combinations than the observed spectra. It should be noted that not all combinations (for example, those with two components <1-Gyr old) represent what we might expect to find in the total population of real early-type galaxies, although there could be local regions within these galaxies where, for example, young stars dominate the stellar population. We wanted to cover the complete range of possibilities in order to avoid making any assumptions about the stellar populations of early-types, which, as revealed even in our small sample of 21 spectra, can be quite varied.

In this experiment, no noise has been added to the synthetic spectra, because the variations in spectral shape caused by differences in metallicity are larger than any noise contribution representing measurement errors.

The bottom panels of Fig. 4 show the spectra of the components recovered using the same four analysis methods as used for the observed spectra. Again, the first component has a shape and features that are similar to those of a young stellar population, and the second component, those of a mature population. The shape of the UV part of the spectrum, together with the featureless nature of the spectrum above ∼5000 Å, are typical of very young stellar populations (around a few ×10 Myr). These populations are present in the synthetic data set, but not in significant quantities in the observed data set, hence the differences between the recovered data components.

In Fig. 8, as in Fig. 5, we see that a high value of a1 (the weight of the ‘younger’ component from the Bayesian data model) corresponds to the presence of a <1-Gyr stellar population. Fig. 9 plots the Spearman's rank correlation test results for a1 and the age, metallicity and mass fraction of the synthetic populations. Here, there is an even stronger correlation between a1 and the age of the younger population than in the case of the observed spectra, and also a strong correlation between a1 and the age of the older population, but no significant correlation between a1 and the mass fraction of the secondary population. However, in this exercise with synthetic spectra, we allowed only two values for m2/Mgal (0.25 and 0.5), which does not really allow for a meaningful determination of the correlation coefficient.

Figure 8. Scatter plots showing the correlation of (top) the age of the younger stellar population and (bottom) the mass fraction of the smaller stellar population determined from the synthetic spectra fitting with the weight of the first component of the VB-PCA and VB-RFA linear basis transformation analyses methods (a1), and K= 2. The open circles are for those galaxies with a secondary population of age less than 1 Gyr. A high value of a1 clearly corresponds to a significant young (≪1-Gyr) stellar population.

Figure 9. Spearman's rank correlation coefficient (RS) for the first component (a1) of the various linear basis transformation analyses with K= 2, of the two-component synthetic spectra and the best-fitting parameters of the two-component synthetic stellar population fits to the data: solid — age of the older population; outline — age of the younger population; light hatched — metallicity of the older population; light cross-hatched — metallicity of the younger population; dark hatched — mass fraction of the smaller population; dark cross-hatched — mass fraction of the dominant population. |RS| > 0.5 indicates a strong correlation.

These results confirm that the value of a1 is a clear indicator of the presence of a significant young stellar population, and is particularly unambiguous in discovering components with very young (<1-Gyr) stellar populations, as we found in Section 4.2.1.

4.2.3 Analysis of the three-component data model

Fig. 3 shows that increasing the number of component spectra in the Bayesian models from two to three does not significantly increase the evidence for any data model. Indeed, for the positive factor analysis, the evidence decreases as K increases from 2 to 3. These results suggest that the bulk of the stellar content of early-type galaxies can be represented by two components, one which represents contributions from old, metal-rich stellar populations, the other representing contributions from young populations (Fig. 10). It is not easy to find a physical interpretation for the third component. Indeed, the third component is probably a combination of residuals from the sum of the spectra of two populations, combined with correlated components of observational errors.

Figure 10. Analysis of the three-component data model. Top panel: Synthetic stellar population spectra (Jimenez et al. 2004); top left-hand panel: For a stellar population of age 0.7 Gyr, with, from bottom to top, metallicity Z= 0.2, 1.0 and 2.5 Z⊙; top right-hand panel: age = 10 Gyr, same metallicities. The dotted lines mark some of the absorption features in the spectrum which are typically strong in young stellar populations, and the dashed lines mark some of the absorption features which are typically strong in old, metal-rich stellar populations. From left- to right-hand side, the absorption line species are: Mg ii (2799 Å), Hε (3970 Å), Hδ (4102 Å), Hγ (4340 Å), Hβ (4861 Å), Mgb (5175 Å), NaD (5893 Å), Hα (6563 Å), TiO (7126 Å). Middle and bottom panels: The three components found from the various different linear independent basis transformation analyses of the model stellar populations, with K= 3. The methods are, from bottom to top: VB-FA, VB-PFA, VB-PCA and VB-RFA. The first two recovered spectra are broadly identified with the features of a young and old stellar population in the same way as the K= 2 components. The third component (bottom) possibly contains ‘residuals’, resulting from the fact that the first two components are not a complete detailed reconstruction of the true stellar population components.

The results of the analysis performed on the observed spectra are presented in Fig. 11. The components a1 and a2 correlate significantly with the age of the younger population (age2), except for the VB-RFA, where the correlation is with a3 rather than a1. The relative mass fraction of the two stellar population components (m1 and m2) also correlate with a1 and a2, and, for the VB-RFA, with a3 as well.

Figure 11. Spearman's rank correlation coefficient (RS) for the first component (a1) of the various linear basis transformation analyses with K= 3, of the observed spectra and the best-fitting parameters of the two-component synthetic stellar population fits to the data: solid — age of the older population; outline — age of the younger population; light hatched — metallicity of the older population; light cross-hatched — metallicity of the younger population; dark hatched — mass fraction of the smaller population; dark cross-hatched — mass fraction of the dominant population. |RS| > 0.5 indicates a strong correlation.

For the synthetic two-population spectra, age2 correlates strongly with a1 and a2 for all the analysis methods, and also with a3 for the VB-RFA. However, there is no strong correlation (|RS| > 0.5) for any of the data models with the relative mass fraction of the two populations, although the metallicity of the younger population (Z2) correlates strongly with a3 in the case of the VB-PCA and VB-FA methods.

The differences between the correlation coefficients for the observed and synthetic spectra most likely reflect the different distributions of age, metallicity and mass fraction in the two spectral samples. For example, the observed spectra are all dominated by high-metallicity components (2.5 and 5.0 Z⊙), with small contributions from very low-metallicity populations (0.01 Z⊙). Only one of the galaxies contains stellar populations with intermediate metallicities. In contrast, the synthesized spectra can span the whole range of combinations of the seven allowed metallicities.
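As an illustration of how such correlations are quantified, the short Python sketch below computes Spearman's rank coefficient between a component weight and a fitted population parameter, applying the |RS| > 0.5 threshold used in the text. The arrays a1 and age2 are hypothetical placeholders for the component weights and best-fitting ages, not values from our sample.

import numpy as np
from scipy.stats import spearmanr

# Hypothetical placeholders: weight of the first recovered component for each
# of 21 galaxies, and the best-fitting age of the younger population from the
# two-component spectral fits.
rng = np.random.default_rng(0)
a1 = rng.uniform(0.0, 1.0, size=21)
age2 = rng.uniform(0.1, 2.0, size=21)

rs, p_value = spearmanr(a1, age2)

# |RS| > 0.5 is the threshold adopted in the text for a 'strong' correlation.
if abs(rs) > 0.5:
    print(f"strong correlation: R_S = {rs:.2f} (p = {p_value:.3f})")
else:
    print(f"no strong correlation: R_S = {rs:.2f} (p = {p_value:.3f})")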

4.2.4 Interpretation of the eigencomponents of PCA

Although we have managed to identify components with PCA here, it is a well-known property of PCA that the solution it provides is rotation invariant. That is, the eigencomponents which have equal variances can be multiplied by an arbitrary orthogonal matrix to produce another equally good solution (in terms of satisfying the same objective function). Therefore PCA is not guaranteed to achieve a successful separation of the underlying independent source signals that we are looking for. Methods that take into account higher-order dependencies need to be used for that purpose (Hyvärinen, Karhunen & Oja 2001).

The only fortunate case in which PCA can identify the components is when the underlying components have suitably different variances. This happens to be true for the work presented here, but there is no reason why it should hold for data sets of galaxy spectra in general. Hence, even though PCA has been successful here, its success comes with no guarantees.
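The rotational ambiguity is easy to demonstrate numerically: any orthogonal rotation of the components, with the compensating inverse rotation applied to the weights, reproduces the data exactly, so a purely variance-based objective cannot distinguish the two solutions. A minimal sketch, with hypothetical random 'components' standing in for spectra, follows.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: N spectra, T wavelength bins, K latent components.
N, T, K = 50, 200, 2
components = rng.exponential(1.0, size=(K, T))   # stand-ins for source spectra
weights = rng.uniform(0.0, 1.0, size=(N, K))     # stand-ins for mixing weights
X = weights @ components                         # noiseless mixed 'spectra'

# An arbitrary orthogonal (rotation) matrix.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Rotated components with counter-rotated weights give an identical
# reconstruction, so the PCA objective cannot tell the two factorizations apart.
X_rotated = (weights @ R.T) @ (R @ components)
print(np.allclose(X, X_rotated))   # True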

Another way of looking at this problem is through the probabilistic formalism. The implicit assumption of PCA (as shown in detail in Tipping & Bishop 1999) is that the components it finds are Gaussian. Making this assumption explicit, within the probabilistic formalism that we have adopted, highlights its lack of plausibility: the spectra that we are looking for are not Gaussian; rather, they are positive-valued only, and have a rather skewed distribution. This renders the use of PCA very problematic in applications that require a clear interpretation of the components.

In turn, by adopting a probabilistic formalism in which every assumption in the data model must be made explicit, we are able to specify the distribution that corresponds to the components we are looking for. The chances that we find what we are looking for, and not something else, are therefore considerably increased. The ultimate conclusion is, of course, based on visual inspection of the recovered components by human experts.

4.3 The prediction of missing values

As a more objective criterion for assessing our data modelling against the results obtained by fitting detailed spectral models, we investigate here the predictive ability of the modelling approach when faced with missing elements of the spectra.

We begin by demonstrating the recovered values for the missing elements, along with their predicted uncertainties (posterior variances), as obtained from VB-RFA on the small real data set, compared with those from the best-fitting physical model. This is shown in the scatter plot in Fig. 12. The correlation coefficient between the posterior mean predictions of the data model and the values imputed from the best-fitting physical model is 0.989, with a signal-to-noise ratio of 9.3. This is a strong indication that the data-driven analysis and the physical analysis are well matched.
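To make the comparison concrete, the sketch below computes these two summary statistics for a pair of hypothetical arrays. The definition of the signal-to-noise ratio used here (rms of the predictions over the rms of the prediction residuals) is our assumption for illustration only; the text does not spell out its definition.

import numpy as np

# Hypothetical arrays standing in for the data-model posterior means and the
# values imputed from the best-fitting physical model.
rng = np.random.default_rng(5)
physical = rng.exponential(1.0, size=200)
data_model = physical + rng.normal(scale=0.1, size=200)

corr = np.corrcoef(data_model, physical)[0, 1]

# Assumed SNR definition: rms of the predictions divided by the rms of the
# residuals between the two sets of values.
residual = data_model - physical
snr = np.sqrt(np.mean(data_model ** 2) / np.mean(residual ** 2))

print(f"correlation = {corr:.3f}, SNR = {snr:.1f}")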

Figure 12. Predictions for the missing values provided by the VB-RFA model versus the physical model. Values of the x-radii of the ellipses represent errors (one standard deviation) obtained from data imputation using synthetic physical models. The y-radius values correspond to one standard deviation of the posteriors of x, as obtained using VB-RFA. There are no missing values in the observed spectra in the central region of the wavelength range considered here, which accounts for the gap in the plotted data.

However, in the case of the real data, the true values of the missing entries are unknown. Therefore, in the following section, we employ synthetic data to perform a more thorough and controlled objective evaluation of the predictive capabilities of our probabilistic data model. Missing entries were simulated, and the true values were used for evaluation only. The synthetic data set is more diverse than the real one, since a variety of parameter combinations that can be simulated in this process may not appear in real spectra. A wealth of situations was therefore created on which the generalization abilities of the proposed method were tested. A global and fixed noise level of σ²nt = 0.01 was used in the data models in these experiments.3 We studied the prediction performance for both two- and three-component models (K = 2, K = 3) on new, previously unseen measurements, in the two scenarios reported below. In this setting, the evidence bound peaked at K = 3, as can be seen in Figs 13 and 14.

Figure 13. Predictive performance from previously unseen spectra using a VB-RFA model with K = 2 (bottom) and K = 3 (top), plotted against the percentage of missing entries from the test set.

Figure 14. Predictive performance from previously unseen wavelength bins using a VB-RFA model with K = 2 (bottom) and K = 3 (top). As before, the x-axis indicates the percentage of missing entries from the test set.

4.3.1 Predictions for previously unseen spectra in the same wavelength range

As the spectra of more and more galaxies become available and are added to public archives, the ability to make predictions for previously unseen spectra that have a small percentage of missing values, within the same overall wavelength range, is an important practical requirement.

It is therefore desirable to be able to infer reliable estimates of missing values, based on a few measurements and the probabilistic representation already created from previously seen spectra. In this scenario, Fig. 13 shows the prediction results obtained by VB-RFA when varying the percentage of missing entries in the test set. The predictions are surprisingly accurate up to a high percentage of missing entries among the previously unseen spectra. Effectively, the high redundancy across the spectra, due to the fact that mature galaxies are very similar to each other, makes this scenario relatively easy to handle.
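The essential mechanics of this scenario can be sketched as follows: with a set of learned component spectra held fixed, the weights of a new spectrum are estimated from its observed bins only, and the missing bins are then filled in from the linear model. In the paper this is done through the full variational posterior (Appendix A3); the sketch below uses non-negative least squares as a simple stand-in, and all arrays are hypothetical.

import numpy as np
from scipy.optimize import nnls

# Hypothetical trained components: K component spectra over T wavelength bins
# (in the paper these come from the variational Bayesian factorization).
rng = np.random.default_rng(2)
T, K = 300, 2
A = rng.exponential(1.0, size=(T, K))        # stand-in component spectra
true_weights = np.array([0.3, 0.7])
spectrum = A @ true_weights                  # a new, previously unseen spectrum

# Simulate missing entries (here 40 per cent missing at random).
observed = rng.uniform(size=T) > 0.4

# Estimate non-negative weights from the observed bins only; the paper instead
# evaluates the variational posterior of the weights (Appendix A3).
weights_hat, _ = nnls(A[observed], spectrum[observed])

# Predict the missing flux values from the linear model.
prediction = A[~observed] @ weights_hat
rmse = np.sqrt(np.mean((prediction - spectrum[~observed]) ** 2))
print(weights_hat, rmse)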

4.3.2 Predictions for previously unseen wavelength bins

The second scenario investigated here is that in which predictions must be made for missing elements at previously unseen wavelengths, albeit within the same overall wavelength range.

This is a more difficult task, since the flux varies significantly across different wavelength ranges. We conducted experiments in which the overall wavelength range for each spectrum was split randomly into training sets (regions without missing values in our case, though they need not be so) and test sets, and each model was trained on the training set. Then, to ensure that we retain the missing-at-random (Ghahramani & Jordan 1994) assumption, for randomly picked ranges in the test set, ‘missing’ values were predicted based on the trained model and the prediction errors were averaged over all missing values. This procedure was repeated 10 times with 50 different combinations of training and test sets for each spectrum.

The results are shown in Fig. 14. The percentage of missing values was varied between experiments, as indicated on the x-axis. With a complete training set, the predictions remain reasonably accurate even when up to 60 per cent of the bins in the test set are missing.

5 Finding E+A Galaxies

So far, we have established generic data-driven procedures for finding evidence of more than one stellar subpopulation, differing significantly in age- and metallicity-related characteristic features, in the integrated spectra of galaxies. Here, we show an application to a particular data-mining project that would benefit from this method once large sets of galaxy spectra become publicly available.

The determination of the star formation history of early-type galaxies has important implications for the much-debated issue of their formation and evolution. Observational evidence suggests that at least some elliptical galaxies formed the bulk of their stars at high redshift (z > 2) (e.g. Renzini 1998; Nolan et al. 2003), but in hierarchical structure-formation models, ellipticals can form at lower redshift from late-type merging (e.g. White & Rees 1978; Kauffmann, White & Guiderdoni 1993). A vital link in understanding how early-type evolution is driven, and under what environmental conditions, is the study of E+A galaxies.

The spectra of these galaxies are characterized by strong hydrogen Balmer absorption lines together with a lack of significant [O ii] 3727-Å emission, implying a recent (≲1-Gyr) starburst, but no significant ongoing star formation. The overall shape of the spectral energy distribution of these galaxies is that expected for an early-type galaxy. They may represent an intermediate stage between disc-dominated, rotationally supported systems and spheroid-dominated, pressure-supported systems.

The physical mechanisms responsible for the triggering and subsequent truncation of star formation in E+A galaxies are currently unclear. They may be the result of disc-disc merging (or galaxy interactions), or of cluster-related phenomena, e.g. ram-pressure stripping or tidal stripping. At intermediate redshift, it has been suggested that E+A-type galaxies are prevalent in clusters (e.g. Tran et al. 2003), which argues for a cluster-related mechanism; at low redshift (z ∼ 0.1), however, E+A galaxies occur in lower-density environments, which is more indicative of merging or interaction processes (Zabludoff et al. 1996; Blake et al. 2004). Statistical studies of the environments, luminosities and detailed morphologies of these galaxies, across a range of redshifts, are necessary in order to uncover the physical processes governing their star formation history.

The possible relationship between early-type evolution and E+A galaxies has not as yet been well studied, as E+A systems are rare (<1 per cent of the overall zero-redshift galaxy population, Zabludoff et al. 1996). However, recent and ongoing large-scale surveys such as the SDSS and the 2dFGRS offer the opportunity to study these galaxies in a statistically meaningful way, if they can be reliably identified.

In this paper we have developed a rapid, automated method which allows us to identify those early-type galaxies containing a significant young (<1-Gyr old) population; these galaxies are E+A galaxies. Fig. 15 shows a typical E+A galaxy spectrum, together with its constituent population spectra. Unsurprisingly, given the diagnostic abilities of our components analyses, the component spectra look very similar to the spectra recovered from the components analyses.

Figure 15. Top: the optical spectrum of an A3V star; second from top: spectrum of a 12-Gyr stellar population; third from top: combined spectrum of the A star plus 12-Gyr stellar population, a typical E+A spectrum. Bottom: the observed spectrum of NGC 7252, from our sample, which contains a very young (<1-Gyr) stellar population.

Formiggini & Brosch (2004) have analysed the IUE UV spectra of a sample of normal galaxies using PCA. One of their components was associated with a young stellar population, and another with a mature population, consistent with our results. However, they were unable to correlate the principal components with the optical morphology of the galaxies. As we have prior knowledge of the component stellar populations in our galaxy sample, we have been able to make a more quantitative interpretation of our results: if the value of a1 (the weight of the ‘young’ component) for a particular early-type galaxy is >0.5, there is a significant <1-Gyr-old stellar population present in that galaxy.
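Expressed as a selection rule, this criterion is trivial to apply to a large catalogue once the component weights have been estimated. A minimal sketch follows; the weight array is a hypothetical placeholder, not output from our analysis.

import numpy as np

# Hypothetical component weights for a set of early-type galaxies:
# column 0 is a1, the weight of the 'young' component.
rng = np.random.default_rng(4)
weights = rng.uniform(0.0, 1.0, size=(1000, 2))

# Selection rule from the text: a1 > 0.5 flags a significant <1-Gyr-old
# population, i.e. an E+A / post-starburst candidate.
eplusa_candidates = np.flatnonzero(weights[:, 0] > 0.5)
print(len(eplusa_candidates), "candidate E+A galaxies")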

Hence, our components analysis is able to rapidly identify post-starburst galaxies, without the uncertainties inherent in the measurement of the equivalent widths of the Balmer absorption and [O ii] emission lines. [O ii] emission in disc-dominated galaxies may suffer from dust obscuration, and Hγ and Hβ are subject to emission filling. These, together with the high signal-to-noise-ratio requirements, introduce significant uncertainties in the measurement of these lines. In contrast, as our method utilizes all the information present in the long-baseline spectra, i.e. both the features present and the overall shape of the spectra, it is a more robust diagnostic of recent star forming activity.

Using the wealth of data available in the SDSS and the 2dFGRS, with signal-to-noise ratio as good or better than the data we analyse here, we may map precisely in what environments star formation is suppressed (e.g. in groups or clusters), by rapidly and robustly identifying those galaxies hosting only mature (>1-Gyr) stellar populations. This allows us to investigate whether galaxy transformation (from star forming to passive galaxies) occurs via, e.g. ram-pressure stripping, tidal stripping, harassment or suppression of galaxy-galaxy interactions. Although the wavelength ranges of the SDSS and 2dF spectra are shorter than those we have used, we can use a training set of data to identify the linear components (as suggested below), which can then be applied to the SDSS and 2dF data.

With the higher-resolution spectra of these surveys, and higher-resolution model spectra that are available (e.g. Bruzual & Charlot 2003), we expect the accuracy of our statistical estimates to increase, as we have more independent samples (wavelength bins) of the mixed components. Whether or not the number of components, K, would increase would depend on the data set used. Both the existence of additional components and the interpretation of any further components can only be explored by carrying out the experiment. However, we would not expect the increase in information that we obtain from higher-resolution spectra to lead to a decrease in K.

We intend to extend our Bayesian modelling technique to identify other classes of galaxies, by using appropriately selected training sets of galaxy spectra and wavelength ranges. For example, starburst galaxies may be identified by the Wolf-Rayet ‘bump’ around ∼4600-4900 Å, which is a characteristic feature of the emission from the Wolf-Rayet stars in starburst galaxies. By first applying the Bayesian analysis to an appropriate training set of spectra, which include the spectral range of this feature, identification of these galaxies in a large sample should be possible.

6 Conclusions

Using rigorous Bayesian modelling, we have successfully developed a time-efficient method for the analysis of the spectra of early-type galaxies. At around a few seconds per galaxy for this analysis, its speed compares very well with work by other authors (e.g. Heavens et al. 2000; Fernandes et al. 2005). By comparing the results of this analysis with the results of our unique two-component synthetic stellar population fitting to the long-baseline data, and with two-component synthetic spectra, we have shown that the independent components analysis is physically interpretable.

Two linear components are sufficient to describe the bulk of the stellar content of early-type galaxies. One of the components represents a young (<1-Gyr) stellar population, and the other an old, metal-rich population. The relative contribution of these two components allows us to robustly identify which early-type galaxies host significant young stellar populations, i.e. which are post-starburst galaxies.

Of the four variational Bayesian methods investigated here, we find the variational Bayesian rectified factor analysis to be the most useful, since it has the greatest flexibility in the shape of the allowed distribution, whilst retaining the astronomically motivated requirement that the flux values in each component spectrum are non-negative. This method achieved the highest evidence and the highest correlation between the component weights and the population parameters determined from spectral fitting. The method allows the observational errors to be included in the formulation, and has the additional powerful attribute that it can be used to recover missing regions of spectral coverage. The strength of the correlation between the posterior mean predictions of the data model, and the values imputed from the best-fitting physical model, is a strong indication that the data-driven analysis and the physical analysis are well matched.

We intend to apply our analysis to the early-type galaxies in the 2dF, 6dF and SDSS catalogues, as a useful tool to investigate their evolution with environment and redshift. We have shown that, in addition to efficiently finding early-type galaxies with significant young stellar populations, this method would also be an efficient tool for finding other distinctive subclasses, such as E+A galaxies, which are an important missing link in current scenarios for the formation of early-type galaxies.

In this work, we use model and observed spectra between 2500 and 8000 Å. However, to be useful as a data-mining tool for z ∼ 0 galaxies in the SDSS, 2dF and 6dF archives, the Bayesian analysis has to work equally well in the 3700-8000 Å range. Our next step will be to estimate data models that learn, from the current data sets that include UV spectra, to work on much larger, solely optical data sets.

In future studies, we therefore intend to:

  • (i) estimate the data models for the 3700-8000 Å range;

  • (ii) apply our technique to 2dF, 6dF and SDSS catalogued spectra, to identify E+A galaxies and their environments;

  • (iii) experiment with identifying starburst galaxies;

  • (iv) consider the robustness of our data models with respect to uncertainties in the calibration and response of the 2dF/6dF/SDSS fibres, and the effect of random continuum errors, via analyses of simulated data with additional noise factors;

  • (v) extend the models to deal with stochastic censoring, i.e. non-ignorable (versus random) missing data.

Acknowledgment

MOH was supported by a Paul & Yuanbi Ramsay research award at the School of Computer Science of the University of Birmingham.

References

Attias H., 2000, in Solla S., Leen T. K., Muller K.-L., eds, Advances in Neural Information Processing Systems Vol. 12. MIT Press, Cambridge, MA, p. 209
Bacon R. et al., 2001, MNRAS, 326, 23
Bernardo J. M., Smith A. F., 1994, Bayesian Theory. Wiley, New York
Bica E., Alloin D., 1987, A&AS, 70, 281
Bica E., Alloin D., 1987, A&A, 186, 49
Bishop C. M., 1999, in Proc. 9th Int. Conf. on Artificial Neural Networks (ICANN), Vol. 1. IEEE, p. 509
Bishop C. M., Spiegelhalter D., Winn J., 2002, in Becker S., Thrun S., Obermeyer K., eds, Advances in Neural Information Processing Systems Vol. 15. MIT Press, Cambridge, MA, p. 793
Blake C. et al., 2004, MNRAS, 355, 713
Bruzual G., Charlot S., 2003, MNRAS, 344, 1000
Chan K., Lee T.-W., Sejnowski T. J., 2003, Neural Comput., 15, 1991
Connolly A. J., Szalay A. S., Bershady M. A., Kinney A. L., Calzetti D., 1995, AJ, 110, 1071
Cover T., Thomas J., 1991, Elements of Information Theory. Wiley, New York
Dressler A., 1980, ApJ, 236, 351
Emsellem E. et al., 2004, MNRAS, 352, 721
Fernandes R. C., Mateus A., Sodré L., Stasińska G., Gomes J. M., 2005, MNRAS, 358, 363
Folkes S. R., Lahav O., Maddox S. J., 1996, MNRAS, 283, 651
Formiggini L., Brosch N., 2004, MNRAS, 350, 1067
Frey B. J., Hinton G. E., 1999, Neural Comput., 11, 193
Ghahramani Z., Jordan M., 1994, Technical Report CBCL Paper No. 108, Massachusetts Institute of Technology, Cambridge, MA
Harva M., 2004, MSc thesis, Helsinki University of Technology, Helsinki
Harva M., Kabán A., 2005, in Proc. IEEE Int. Conf. Neural Networks, in press, p. 185
Heavens A. F., Jimenez R., Lahav O., 2000, MNRAS, 317, 965
Hinton G. E., Ghahramani Z., 1997, Phil. Trans. R. Soc. B, 352, 1177
Hyvärinen A., Karhunen J., Oja E., 2001, Independent Component Analysis. Wiley, New York
Jansen R. A., Fabricant D., Franx M., Caldwell N., 2000, ApJS, 126, 331
Jimenez R., MacDonald J., Dunlop J. S., Padoan P., Peacock J. A., 2004, MNRAS, 349, 240
Jordan M., Ghahramani Z., Jaakkola T., Saul L., 1999, Mach. Learn., 37, 183
Kabán A., Nolan L. A., Raychaudhury S., 2005, in Kargupta H., Srivastava J., Kamath C., Goodman A., eds, Proc. SIAM Int. Conf. on Data Mining (SIAM DM05). SIAM Press, Newport Beach, CA, p. 183, preprint (astro-ph/0505059)
Kauffmann G., White S. D. M., Guiderdoni B., 1993, MNRAS, 264, 201
Kauffmann G. et al., 2003, MNRAS, 341, 33
Lappalainen H., 1999, in Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA'99), Aussois, France, p. 7
Lee D., Seung S., 2001, Adv. Neural Inf. Process. Syst., 13, 556
MacKay D. J. C., 2003, Information Theory, Inference, and Learning Algorithms. Cambridge Univ. Press, Cambridge
Madgwick D. S., Somerville R., Lahav O., Ellis R., 2003, MNRAS, 343, 871
Madgwick D. S. et al., 2003, ApJ, 599, 997
Miskin J., 2000, PhD thesis, Univ. Cambridge
Miskin J., MacKay D. J. C., 2001, in Roberts S., Everson R., eds, Independent Component Analysis: Principles and Practice. Cambridge Univ. Press, Cambridge, p. 209
Nolan L. A., 2002, PhD thesis, Univ. Edinburgh
Nolan L. A., Dunlop J. S., Jimenez R., Heavens A. F., 2003, MNRAS, 341, 464
Ocvirk P., Pichon C., Lancon A., Thiebaut A., 2005, MNRAS, in press (astro-ph/0505209)
Ocvirk P., Pichon C., Lancon A., Thiebaut A., 2005, MNRAS, in press (astro-ph/0507002)
Peebles J. E. P., 2002, in Metcalfe N., Shanks T., eds, ASP Conf. Ser. Vol. 283, A New Era in Cosmology. Astron. Soc. Pac., San Francisco
Ponder J. M. et al., 1998, AJ, 116, 2297
Raiko T., Valpola H., Harva M., Karhunen J., 2005, J. Mach. Learn. Res., submitted
Renzini A., 1998, in Carollo C. M., Ferguson H. C., Wyse R. F. G., eds, When and How Do Bulges Form and Evolve? Cambridge Univ. Press, Cambridge
Ronen S., Aragon-Salamanca A., Lahav O., 1999, MNRAS, 303, 284
Slonim N., Somerville R., Tishby N., Lahav O., 2001, MNRAS, 323, 270
Socci N. D., Lee D. D., Seung H. S., 1998, in Jordan M. I., Kearns M. J., Solla S. A., eds, Advances in Neural Information Processing Systems 10. MIT Press, Cambridge, MA (http://books.nips.cc/papers/files/nips10/0350.pdf)
Solorio T., Fuentes O., Terlevich R., Terlevich E., 2005, MNRAS, 363, 555
Storchi-Bergmann T., Calzetti D., Kinney A. L., 1998, A database of UV-optical spectra of nearby quiescent and active galaxies (http://www.stsci.edu/ftp/catalogs/nearby_gal/)
Tipping M., Bishop C., 1999, J. R. Statistical Soc., Ser. B, 61, 611
Tran K. H., Franx M., Illingworth G., Kelson D. D., van Dokkum P., 2003, ApJ, 599, 865
Valpola H., Honkela A., Harva M., Ilin A., Raiko T., Östman T., 2003, Bayes Blocks Software Library
Valpola H., Harva M., Karhunen J., 2004, Signal Process., 84, 267
White S. D. M., Rees M. J., 1978, MNRAS, 183, 341
Winn J., Bishop C. M., 2005, J. Mach. Learn. Res., 6, 661
Zabludoff A. I., Zaritsky D., Lin H., Tucker D., Hashimoto Y., Shectman S. A., Oemler A., Kirshner R. P., 1996, ApJ, 466, 104

1

In the bayesblocks-based implementation (Valpola et al. 2003) adopted here, a maximal flexibility is ensured by further formulating hierarchical priors on all variables graphic and graphic, as follows: graphic, with parameters common to all components of the variable, and the parameters graphic and graphic will then have vague priors in the form of a zero-mean Gaussian having a large value for variance. The definition is analogous for graphic and graphic. These additional variables have been omitted from the diagram in Fig. 2 for simplicity, as they are part of the relatively standard technicalities concerning a full Bayesian approach (Raiko et al. submitted) rather than an essential part of the application-specific model design.

2

All algorithms investigated here have been implemented as part of the bayesblocks software (Valpola et al. 2003; Raiko et al. submitted) and may be made freely available upon publication.

3

Here, σnt can be used both to control the compactness of the representation and to simulate a level of measurement uncertainty similar to that encountered in real data.

Appendix

Appendix A: Implementation Details of the VB-RFA Data Model

A1 Data model estimation
Posterior updates for q(skt)
The following simplifying notations will be used in this subsection: s ≡ skt and ms ≡ <msk>. Further, two expressions involving variables from the Markov blanket (MacKay 2003) of the variable s will also be required. Using these notations, and writing ƒ(s) = u(s)s, where u(s) is the step function, the outline of the computation of the free-form posterior q(s) and its sufficient statistics is as follows.
(A1)
(A2)
where
(A3)
(A4)
(A5)
(A6)
For a more detailed derivation, see Harva & Kabán (2005).

Note the non-trivial variational posterior (A2) that results from the free-form approximation: a bimodal distribution (A6), which is essentially a mixture of two rectified Gaussians.

The posterior statistics associated with q(s) that are required for updating the posterior statistics of other variables of the model are the first and second order moments, as follows. Denoting graphic (which is the erfc term in A6), the ith order moments of s and ƒ(s) can be written as
(A7)
where
and
(A8)
The notation Mi(u) introduced in (A7) will also be used in (A11). The remaining integral is the ith moment of a rectified Gaussian density. The exact expressions for these are given in Section A4; more details, along with numerical issues concerning an efficient implementation, can be found in, e.g., Harva (2004).
Posterior updates for q(msk)
The posteriors graphic are of a Gaussian form, where
which provide the required posterior expectation graphic. Note that if a ‘vague prior’ was used on graphic, then the posterior mean estimate graphic becomes similar to a maximum likelihood estimate graphic, as then graphic and graphic.
Posterior updates for q(ank)
The posteriors graphic are rectified Gaussians, where
which provide the required sufficient statistics: graphic.
Posterior updates for q(xnt)
The posteriors graphic are Gaussian, where

These provide the required posterior expectation graphic. Here, the indicator variable bnt= 1 if ynt is observed and zero if it is missing.

Posterior updates for variance parameters
Following Raiko et al. (submitted) and Valpola, Harva & Karhunen (2004), in the exp-parameterization (see Section 3.1.3), Gaussian priors are imposed on vθ, where θ stands for any of the variables x, s or m in Fig. 2. This is then solved by a fixed form Gaussian variational posterior,
(A9)
and this has been employed in the experiments described in this paper.

Here, the mean and variance of this posterior are computed by numerical optimization (Valpola et al. 2004; Raiko et al. submitted). Alternatively, a direct parameterization with Gamma priors and Gamma posteriors may be sufficient for this model.

A2 The log evidence bound
Computing the log of the evidence bound is useful for monitoring the convergence of the iterative estimation process, and more importantly, for comparing different modelling hypotheses. As mentioned, (5) decomposes into a number of terms.
(A10)

The p and q terms for Gaussians will be given only once, while those that are different will be detailed separately.

p and q terms of Gaussian variables
The p and q terms in (A10), in the case of Gaussian variables, can be written as
where graphic and graphic are the posterior mean and variance of θ. If the exp-parameterization is used for the variances, then the term graphic- see Valpola et al. (2004).
p and q terms of ank
The p and q terms of ank, which are the elements of the matrix A in (A1), are given by:
where we have dropped the indices.
The q term of skt
The p term of variable s is computed as in the case of other Gaussian variables. The q term gives the following.
(A11)
For detailed derivation, see Harva & Kabán (2005).
A3 Missing value prediction
A Gaussian approximation of the predictive distribution can be obtained as follows.
(A12)
(A13)
(A14)
(A15)
where graphic denotes hidden variables above xnt in the conditional dependency architecture of the model. The mean of this distribution provides a prediction for the missing entry ynt. Conveniently, for missing entries the mean of (A15) is identical to that of q(xnt) (as bnt= 0), so that the imputation of missing values takes no additional computation. Further, evaluating (A15) when graphic (i.e. no measurement of errors) — in order to test the prediction performance — also happens to be identical to evaluating q(xnt) from (A15) and Section A1.4.
A4 Required properties of the rectified Gaussian distribution
The rectified Gaussian distribution, along with its first- and second-order moments used in Sections A1.1 and A1.3, is listed here for completeness.
(A16)
See Harva (2004) or Miskin (2000) for details.
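As an independent numerical cross-check of the quantities referred to here, the sketch below evaluates the normalizing constant and the first two moments of a rectified Gaussian, i.e. a distribution proportional to N(x; m, v)u(x), using the standard truncated-Gaussian formulae, and verifies them against direct numerical integration. This is an illustration under that standard parameterization, not the paper's implementation.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def rectified_gaussian_moments(m, v):
    """Normalizer and first two moments of a density proportional to N(x; m, v)u(x)."""
    s = np.sqrt(v)
    alpha = m / s
    z = norm.cdf(alpha)                       # normalizing constant
    ratio = norm.pdf(alpha) / z               # inverse Mills ratio at the cut
    ex = m + s * ratio                        # first moment
    ex2 = m * m + v + m * s * ratio           # second moment
    return z, ex, ex2

m, v = -0.3, 0.5
z, ex, ex2 = rectified_gaussian_moments(m, v)

# Cross-check by direct numerical integration over x > 0.
pdf = lambda x: norm.pdf(x, loc=m, scale=np.sqrt(v))
z_num = quad(pdf, 0, np.inf)[0]
ex_num = quad(lambda x: x * pdf(x), 0, np.inf)[0] / z_num
ex2_num = quad(lambda x: x * x * pdf(x), 0, np.inf)[0] / z_num
print(np.allclose([z, ex, ex2], [z_num, ex_num, ex2_num]))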