A supervised Bayesian factor model for the identification of multi-omics signatures

Abstract

Motivation: Predictive biological signatures provide utility as biomarkers for disease diagnosis and prognosis, as well as prediction of responses to vaccination or therapy. These signatures are identified from high-throughput profiling assays through a combination of dimensionality reduction and machine learning techniques. The genes, proteins, metabolites, and other biological analytes that compose signatures also generate hypotheses on the underlying mechanisms driving biological responses, thus improving biological understanding. Dimensionality reduction is a critical step in signature discovery to address the large number of analytes in omics datasets, especially for multi-omics profiling studies with tens of thousands of measurements. Latent factor models, which can account for the structural heterogeneity across diverse assays, effectively integrate multi-omics data and reduce dimensionality to a small number of factors that capture correlations and associations among measurements. These factors provide biologically interpretable features for predictive modeling. However, multi-omics integration and predictive modeling are generally performed independently in sequential steps, leading to suboptimal factor construction. Combining these steps can yield better multi-omics signatures that are more predictive while still being biologically meaningful.

Results: We developed a supervised variational Bayesian factor model that extracts multi-omics signatures from high-throughput profiling datasets that can span multiple data types. Signature-based multiPle-omics intEgration via lAtent factoRs (SPEAR) adaptively determines factor rank, emphasis on factor structure, data relevance and feature sparsity. The method improves the reconstruction of underlying factors in synthetic examples and prediction accuracy of coronavirus disease 2019 severity and breast cancer tumor subtypes.

Availability and implementation: SPEAR is a publicly available R-package hosted at https://bitbucket.org/kleinstein/SPEAR.


A. SPEAR overview
Fixing the model rank K and the weight parameter w, SPEAR considers a weighted likelihood of the form

$P_w(Z, Y \mid U, \Theta) \propto P(Y \mid U, \Theta)\, P(Z \mid U, \Theta)^w \qquad (1)$

for the data given the latent factor U and the model parameters. Here, X and Y represent the high-dimensional multi-omics assays and the low-dimensional response(s), respectively, and Z represents X if it is completely observed, or the fully imputed version of X if there are missing data. We use P(·) to denote a likelihood function for either a continuous or a discrete variable. We model the latent factor U as a noisy realization of a linear function of Z,

$U_{ik} = (Z\beta)_{ik} + E_{ik}, \qquad E_{ik} \sim N(0, 1), \qquad (2)$

where E is introduced to avoid instability that can occur when Zβ becomes very small. Θ = {σ², β, B} is the collection of all model parameters, with β representing the regression coefficients and B representing the projection coefficients of X (rows j ≤ p) and Y (rows j > p). Let Γ(·|a, c) and Beta(·|a, c) represent the Gamma and Beta distributions with parameters (a, c), and let p(·) be a density function. For continuous variables, we use the following priors for the model parameters:

• Inverse-Gamma prior for variances. Let σ²_j be the variance of X_j. For convenience, we let ν_j = 1/σ²_j and work with ν_j from here on:

$p(\nu_j) = \Gamma(\nu_j \mid a_0, c_0). \qquad (3)$

• Grouped spike-and-slab priors for the coefficients. By default, features from the same assay form one group. Consider the case where features can be grouped into G different groups, and let g_j be the group identifier for feature j. For j = 1, ..., p, let β_jk = β̃_jk γ_jk, where β̃_jk ∼ N(0, 1/τ_{g_j k}) and γ_jk ∼ Binary(π_{g_j k}). Different groups are thus allowed different prior non-zero probabilities and different variances when non-zero. Symmetrically, we let B_jk = B̃_jk s_jk, where B̃_jk ∼ N(0, 1/τ_{g_j k}) and s_jk ∼ Binary(π_{g_j k}), for j = 1, ..., p. Gamma and Beta hyperpriors are placed on τ_{gk} and π_{gk}, respectively.
• We do not impose sparsity on B_jk for j > p, which corresponds to the projection coefficients of Y onto the factors, and instead model B_jk ∼ N(0, 1/τ₀). Since Y is low-dimensional, we aim to model all of its components.
The hyperprior modeling allows τ_{gk} and π_{gk} to be tuning-free. As a result, such a model is referred to as an ARD (automatic relevance determination) model, since the feature sparsity and effect size for each group are estimated from the data (Wipf [2007]).
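To make the grouped spike-and-slab construction concrete, the short sketch below simulates loadings under these priors. The hyperparameter values (shape and rate of 1 for both hyperpriors) and the dimensions are illustrative assumptions, not SPEAR defaults.

```r
# Illustrative draw from the grouped spike-and-slab prior described above.
# Hyperparameter values below are placeholders, not SPEAR's defaults.
set.seed(1)
G <- 2; K <- 3; p <- 10
g <- rep(1:G, each = p / G)               # group identifier g_j for each feature

tau <- matrix(rgamma(G * K, 1, 1), G, K)  # slab precision, per (group, factor)
pi  <- matrix(rbeta(G * K, 1, 1),  G, K)  # prior non-zero probability, per (group, factor)

beta <- matrix(0, p, K)
for (j in 1:p) for (k in 1:K) {
  gamma_jk   <- rbinom(1, 1, pi[g[j], k])            # spike/slab indicator
  beta_tilde <- rnorm(1, 0, sqrt(1 / tau[g[j], k]))  # slab draw
  beta[j, k] <- beta_tilde * gamma_jk                # beta_jk = beta~_jk * gamma_jk
}
```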
In this paper, we model the features X as Gaussian but allow Y to be Gaussian, categorical, or ordinal, which are common data types in practice. We could also model X with different data types, although in practice X is usually continuous or binary; in both cases, modeling X as Gaussian is reasonable for signal-extraction purposes. We allow more flexibility in modeling Y for the sake of both better signal extraction and interpretability. When Y is a multi-class categorical response or an ordinal response, modeling Y as Gaussian may lead to worse estimation (if Y is highly non-linear in its nominal values in the ordinal case, or exhibits collinearity in the multinomial case), and the resulting prediction can also be more difficult to interpret.

A.1. Modeling of various response types
In Eq. (1), the response y may be multi-dimensional. For convenience, we discuss only the one-dimensional response setting and denote y_{i1} by y_i; modeling a multi-dimensional response is a simple extension of the one-dimensional problem. By taking proper forms for P(y_i | u, Θ), we can flexibly model different response types. SPEAR supports four response types:

Continuous response:
We consider a Gaussian model for a continuous response y:

$P(y_i \mid u_i, \Theta) = N(y_i \mid u_i^{\top} B,\ \tilde{\sigma}^2),$

where B is the projection coefficient from y onto U, u_i is the i-th row vector of U, and σ̃² is the variance of the noise in y.

Two-class categorical response
We model the two-class categorical response y_i ∈ {0, 1} using logistic regression:

$P(y_i = 1 \mid u_i, \Theta) = \frac{1}{1 + \exp\{-(u_i^{\top} B + \alpha)\}},$

where α is the intercept in the logistic regression.

Ordinal response:
We consider an ordinal logistic model for the ordinal response. Suppose that there are M ordered classes and y_i ∈ {1, ..., M}. Then

$P(y_i \geq m \mid u_i, \Theta) = \frac{1}{1 + \exp\{-(u_i^{\top} B + \alpha_m)\}}, \qquad m = 2, \ldots, M,$

determines the probability of falling into each class via P(y_i = m) = P(y_i ≥ m) − P(y_i ≥ m + 1), with P(y_i ≥ 1) = 1 and P(y_i ≥ M + 1) = 0, so the log likelihood is Σ_m 1{y_i = m} log P(y_i = m), where 1{y_i = m} is the indicator function for y_i = m.

Multi-class categorical response:
We model the multi-class categorical response y_i ∈ {1, ..., M} using multinomial logistic regression. In this case, B ∈ R^{K×M} is a matrix whose m-th column represents the coefficient for class m:

$P(y_i = m \mid u_i, \Theta) = \frac{\exp(u_i^{\top} B_m)}{\sum_{m'=1}^{M} \exp(u_i^{\top} B_{m'})}.$

In Appendix B, we first derive the iterative parameter-estimation procedures for Gaussian data; we then move to the non-Gaussian case in Appendix C.
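As a quick illustration of the four response links above, the following sketch computes predicted responses from a single factor-score vector. All coefficient values are arbitrary, and the ordinal and multinomial parameterizations follow the equations as reconstructed above.

```r
# Predictions from factor scores u (K-vector) under the four response models.
# Coefficient values here are arbitrary illustrations.
set.seed(1)
u <- c(0.5, -1.2, 0.3); K <- length(u); M <- 4

# Continuous: y_hat = u' B
B_cont <- rnorm(K)
y_cont <- sum(u * B_cont)

# Two-class: P(y = 1) = 1 / (1 + exp(-(u' B + alpha)))
B_bin <- rnorm(K); alpha <- 0.1
p_one <- 1 / (1 + exp(-(sum(u * B_bin) + alpha)))

# Ordinal: P(y >= m) = sigmoid(u' B + alpha_m), m = 2..M, alpha_m decreasing
B_ord <- rnorm(K); alpha_m <- c(1.5, 0, -1.5)
p_ge  <- c(1, 1 / (1 + exp(-(sum(u * B_ord) + alpha_m))), 0)
p_ord <- -diff(p_ge)                          # P(y = m) for m = 1..M

# Multi-class: P(y = m) = exp(u' B_m) / sum_m' exp(u' B_m')
B_mult <- matrix(rnorm(K * M), K, M)
p_mult <- exp(u %*% B_mult); p_mult <- p_mult / sum(p_mult)
```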

A.2. Rank selection and weight tuning
In A.1, we described the SPEAR model with fixed weight parameter w and rank K. Here, we give more details on our default choices of w and K.

Tuning the weight parameter w: We tune the weight parameter adaptively from the data based on cross-validation of the response's deviance loss. To reduce the computational burden, we adopt a warm-start strategy: when running the model with a lower weight, we use the model parameters from the preceding higher weight as the starting point. Suppose that we divide the data into T random folds ∪_{t=1}^{T} D_t. Let Θ̂(w; t) be the estimated posterior means of the model parameters using the data excluding fold t, and let D(y_i, x_i | Θ̂(w; t)) be the deviance loss evaluated at sample i using Θ̂(w; t); for example, for a Gaussian response,

$D(y_i, x_i \mid \hat{\Theta}(w; t)) = \big(y_i - x_i \hat{\beta}(w; t)\, \hat{B}(w; t)\big)^2,$

where β̂(w; t) and B̂(w; t) are the estimated posterior means of β and B using the data excluding fold t at weight parameter w. We can then choose the w minimizing the average deviance loss

$\bar{D}(w) = \frac{1}{N} \sum_{t=1}^{T} \sum_{i \in D_t} D(y_i, x_i \mid \hat{\Theta}(w; t)).$

Let w* be the weight that minimizes this cross-validation error. In the 1-standard-deviation rule, we instead choose the largest weight w such that

$\bar{D}(w) \leq \bar{D}(w^*) + \mathrm{sd}\big(\bar{D}(w^*)\big),$

where sd(D̄(w*)) is the estimated standard deviation of D̄(w*), obtained by comparing the average deviance losses across folds. We suggest using w* when the goal is better prediction and using the 1sd rule when more structure in X is preferred.
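A minimal sketch of this tuning loop is given below. `spear_fit()` and `deviance_loss()` are hypothetical stand-ins for the model-fitting and deviance-evaluation routines; only the warm-start ordering, the cross-validation bookkeeping, and the 1sd selection are meant literally.

```r
# Sketch of cross-validated weight tuning with warm starts and the 1sd rule.
# `spear_fit()` and `deviance_loss()` are hypothetical placeholders.
tune_weight <- function(X, Y, weights = seq(1, 0, by = -0.1), T_folds = 5) {
  folds <- sample(rep(1:T_folds, length.out = nrow(X)))
  cv <- matrix(NA, length(weights), T_folds)   # mean deviance per (weight, fold)
  for (t in 1:T_folds) {
    fit <- NULL
    for (i in seq_along(weights)) {            # high -> low weight: warm starts
      fit <- spear_fit(X[folds != t, ], Y[folds != t], w = weights[i], init = fit)
      cv[i, t] <- mean(deviance_loss(fit, X[folds == t, ], Y[folds == t]))
    }
  }
  cv_mean <- rowMeans(cv)
  cv_sd   <- apply(cv, 1, sd) / sqrt(T_folds)  # sd of the fold averages
  i_star  <- which.min(cv_mean)                # w*: minimizes CV deviance
  i_1sd   <- min(which(cv_mean <= cv_mean[i_star] + cv_sd[i_star]))  # largest w
  list(w_star = weights[i_star], w_1sd = weights[i_1sd])
}
```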
Rank selection: SPEAR chooses the weight parameter adaptively after fixing the rank K. When there is strong factor structure in X that is irrelevant to Y, choosing a rank K that is too small can push the model toward small w, which can sometimes be unwanted in practice for interpretation. Here, we draw on recently developed statistical techniques and propose a simple novel approach as the default rank-selection rule for SPEAR. We have assumed the following latent factor model for the features X ∈ R^{N×p}:

$X = U B^{\top} + E,$

where E ∈ R^{N×p} is the unstructured noise matrix with E_{ij} ∼ N(0, σ²_j). Without loss of generality, we always standardize X to have mean 0 and variance 1. Now, regenerate a new noise matrix Ẽ, where Ẽ_{ij} ∼ N(0, σ²_j) follows the same distribution as E_{ij}, and form

$X_{\kappa} = X + \kappa \tilde{E}, \qquad X^{1}_{\kappa} = X - \tilde{E} / \kappa.$

A key observation is that the noise in X_κ and the noise in X¹_κ are marginally independent if we consider both the randomness in E and the randomness in Ẽ. This is a direct result of Gaussianity and the fact that the covariance between the two noise terms is Cov(E_{ij} + κẼ_{ij}, E_{ij} − Ẽ_{ij}/κ) = σ²_j − σ²_j = 0. Following this observation, we propose the following rank-selection approach:

for ℓ = 1, ..., L do
  Generate i.i.d. Gaussian noise Ẽ, and form X_κ and X¹_κ with κ = 0.1.
  Perform PCA on X_κ, and let X̂[k] be the reconstructed feature matrix of rank k, for k = 1, ..., K_max.
  Evaluate the reconstruction loss comparing X̂[k] and X¹_κ: ℓ^{(ℓ)}_k = ‖X̂[k] − X¹_κ‖²_F.
end for
Select the rank with the smallest average reconstruction loss.

This idea is closely related to recent work on randomized data splitting (Tibshirani, 2020; Tian, 2020), but has not previously been applied to rank selection.
In practice, we do not know the true noise level σ²_j. Since we are most interested in finding the strong factors, we can afford to use a larger variance estimate. Replacing σ²_j by σ̂²_j with σ̂²_j > σ²_j introduces negative correlation between X_κ and X¹_κ, which in turn favors a smaller K. However, when a factor is strong, we expect it to be found even in this conservative setting. As a result, in the default SPEAR algorithm, instead of trying to estimate σ²_j accurately, we use an aggressive estimate σ̂²_j ∈ [0, 1]; e.g., σ̂²_j = 0.9 is our default value. In our simulation and empirical studies, we selected the model rank using this default strategy and observed that it works well in both settings: it often selects the true rank in simulations, and it selects ranks that do not push w to 0 in real-data experiments.
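A compact sketch of this default rank-selection rule, under the stated choices κ = 0.1 and σ̂²_j = 0.9, is given below; it illustrates the procedure rather than reproducing the package implementation.

```r
# Noise-regeneration rank selection: split the (assumed Gaussian) noise into
# two independent pieces, fit PCA on one copy, score reconstructions on the other.
select_rank <- function(X, K_max = 10, L = 10, kappa = 0.1, sigma2 = 0.9) {
  X <- scale(X)                                  # mean 0, variance 1
  loss <- matrix(0, L, K_max)
  for (l in 1:L) {
    E_tilde <- matrix(rnorm(length(X), sd = sqrt(sigma2)), nrow(X), ncol(X))
    X_fit   <- X + kappa * E_tilde               # X_kappa: used to fit PCA
    X_test  <- X - E_tilde / kappa               # X^1_kappa: independent noise
    s <- svd(X_fit, nu = K_max, nv = K_max)
    for (k in 1:K_max) {
      X_hat <- s$u[, 1:k, drop = FALSE] %*%
        (s$d[1:k] * t(s$v[, 1:k, drop = FALSE])) # rank-k reconstruction
      loss[l, k] <- sum((X_hat - X_test)^2)      # loss against X^1_kappa
    }
  }
  which.min(colMeans(loss))                      # smallest average loss
}
```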

B. Parameter estimation for Gaussian response

B.1. Review of the mean-field approximation
Let Θ = {θ_1, ..., θ_T} be a collection of T blocks of parameters, and let D represent the data. Our goal is to approximate the posterior distribution of Θ with a simpler function q by minimizing the KL divergence between the two distributions:

$q^{*} = \arg\min_{q} \mathrm{KL}\big(q(\Theta)\,\|\,p(\Theta \mid D)\big).$

Define the evidence lower bound (ELBO) to be the integral of the log ratio between the joint density p(Θ, D) and q(Θ):

$\mathrm{ELBO}(q) = \int q(\Theta) \log \frac{p(\Theta, D)}{q(\Theta)}\, d\Theta.$

The log marginal likelihood of the data, log p(D), is called the evidence; it measures how well our model describes the data and is a constant. Since log p(D) = ELBO(q) + KL(q(Θ) ‖ p(Θ | D)), minimizing the KL divergence between the variational distribution and the posterior is equivalent to maximizing the ELBO. By choosing a proper variational family q such that the integrals of the log joint density and of log q over q are easy to calculate, we can find an optimal or locally optimal solution to this maximization problem efficiently. The space of q often does not contain the actual posterior distribution p(Θ | D). However, variational Bayes has proven successful in many applied problems, and researchers are sometimes willing to sacrifice accuracy for efficiency, especially on large-scale datasets.
One popular variational Bayesian method is the mean-field approximation, which fully decouples the parameters into several independent blocks and considers

$q(\Theta) = \prod_{j} q_{\alpha_j}(\theta_j),$

where α_j is the parameter determining the distribution q_{α_j}(θ_j). We can often iteratively update each q_{α_j}(θ_j) in a computationally efficient manner given the distributions of the other parameters: fixing the others, we choose the parameters α_j that most improve the ELBO,

$q_{\alpha_j}(\theta_j) \propto \exp\Big( \mathbb{E}_{q_{-j}} \big[ \log p(\Theta, D) \big] \Big),$

where the expectation is taken over all blocks except θ_j. We update the variational parameters α_1, ..., α_T iteratively until the ELBO updates are deemed insignificant.
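As a self-contained illustration of these coordinate updates (deliberately simpler than SPEAR's model), the following implements mean-field VB for the conjugate Normal-Gamma model x_i ∼ N(μ, τ⁻¹) with priors μ | τ ∼ N(μ₀, (λ₀τ)⁻¹) and τ ∼ Γ(a₀, b₀), using the standard textbook updates.

```r
# Mean-field VB (CAVI) for x_i ~ N(mu, 1/tau) with a Normal-Gamma prior:
# q(mu, tau) = q(mu) q(tau); alternate the two closed-form updates.
set.seed(2)
x <- rnorm(100, mean = 3, sd = 2)
N <- length(x); xbar <- mean(x)
mu0 <- 0; lambda0 <- 1; a0 <- 1; b0 <- 1      # prior hyperparameters

E_tau <- a0 / b0                               # initialize <tau>
for (iter in 1:50) {
  # q(mu) = N(mu_N, 1/lambda_N)
  mu_N     <- (lambda0 * mu0 + N * xbar) / (lambda0 + N)
  lambda_N <- (lambda0 + N) * E_tau
  # q(tau) = Gamma(a_N, b_N), using <(x_i - mu)^2>, <(mu - mu0)^2> under q(mu)
  a_N <- a0 + (N + 1) / 2
  b_N <- b0 + 0.5 * (sum((x - mu_N)^2) + N / lambda_N +
                     lambda0 * ((mu_N - mu0)^2 + 1 / lambda_N))
  E_tau <- a_N / b_N
}
c(posterior_mean = mu_N, expected_noise_precision = E_tau)
```

Each pass alternates the two closed-form updates, and the ELBO is guaranteed not to decrease, mirroring the general scheme above.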

B.2. Variational distributions for SPEAR
Let q(Θ ∪ E) be the mean-field approximation to the posterior distribution of Θ ∪ E. We assume q to be the product of a series of independent components; the specific forms of each component are given below.

B.3. Formula for different expectations
We use ⟨·⟩ to represent expectation under the variational distribution. We collect some results about the Gamma and Beta distributions below. Let x be a variable with density Γ(x | a, c) or Beta(x | a, c), depending on the context, and let Ψ(a) = d log Γ(a)/da be the digamma function. Then

$x \sim \Gamma(\cdot \mid a, c): \quad \langle x \rangle = a / c, \qquad \langle \log x \rangle = \Psi(a) - \log c;$

$x \sim \mathrm{Beta}(\cdot \mid a, c): \quad \langle \log x \rangle = \Psi(a) - \Psi(a + c), \qquad \langle \log(1 - x) \rangle = \Psi(c) - \Psi(a + c).$

From these formulas we can obtain the expectations ⟨ν_j⟩, ⟨log ν_j⟩, ⟨τ_{gk}⟩, ⟨log τ_{gk}⟩, ⟨π_{gk}⟩, ⟨log π_{gk}⟩, and ⟨log(1 − π_{gk})⟩ straightforwardly.

B.4. Parameter update
Update q(β_jk). For any parameter θ, we use L₀(θ), L₁(θ), and L₂(θ) to denote the expected logarithms of the prior, the data likelihood, and the variational distribution, respectively, throughout this supplement; for example, L₀(β_jk), L₁(β_jk), and L₂(β_jk) denote the expected log prior, log data likelihood, and log variational distribution with respect to β_jk. Rearranging the terms of L₁(β_jk) yields a quadratic function of β_jk in which

$u^{\setminus j}_{ik} = \sum_{j' \neq j} z_{ij'} \beta_{j'k}$

is factor k constructed excluding the j-th feature. The variational parameters (μ_jk, ζ²_jk, ζ²_{0jk}) are chosen to maximize the ELBO, or equivalently, to maximize L₀(β_jk) + L₁(β_jk) − L₂(β_jk). Omitting the parts irrelevant to β_jk:

• μ_jk is chosen as the mean of the resulting quadratic form;
• ζ²_jk and ζ²_{0jk} are chosen from its curvature terms;
• ω_jk, the variational non-zero probability of β_jk, is chosen accordingly.

Update q(B_jk). The expressions for the expected log prior L₀(B_jk) and the expected log variational distribution L₂(B_jk) are similar to those for β_jk, with all quantities replaced by their corresponding components for B_jk; rearranging the likelihood-related part yields the analogous update, and the optimal updating parameters follow. In practice, we consider a constrained solution in which we require ‖μ̃_j‖² ≥ L_{Bj}. This extra constraint alleviates the volatility caused by oscillation between (B, B̃) and β when w, the weight on X, is small. By default, we let L_{Bj} = (1 − w) ∨ 0. This constraint does not affect ζ̃²_jk, but we no longer have a closed-form expression for μ̃_j. However, we can find the optimal solution numerically:

1. Consider the ridge-penalized problem for μ̃_j with ridge level α.
2. Let α* be the largest non-positive value such that ‖μ̃_j(α*)‖²₂ ≥ L_{Bj}; this can be found via binary search (see the sketch below).
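The binary search in step 2 can be sketched as follows. The ridge-solution form μ̃_j(α) = (H + αI)⁻¹g is an assumed generic stand-in for the actual quadratic subproblem, whose coefficients H and g would come from the ELBO terms above.

```r
# Binary search for the largest non-positive ridge level alpha* such that
# ||mu(alpha)||^2 >= L_B. Assumes the subproblem solution has the generic
# ridge form mu(alpha) = solve(H + alpha * I, g); H and g are placeholders.
find_alpha <- function(H, g, L_B, tol = 1e-8) {
  mu_norm2 <- function(alpha) sum(solve(H + alpha * diag(nrow(H)), g)^2)
  if (mu_norm2(0) >= L_B) return(0)       # unconstrained solution is feasible
  lo <- -min(eigen(H, only.values = TRUE)$values) + tol  # keep H + alpha*I p.d.
  hi <- 0
  while (hi - lo > tol) {                 # ||mu(alpha)||^2 decreases in alpha
    mid <- (lo + hi) / 2
    if (mu_norm2(mid) >= L_B) lo <- mid else hi <- mid
  }
  lo
}
```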
Update q(E). The expected logarithm from the ELBO calculation related to E_ik uses the fact that ⟨E_ik⟩ = 0. To avoid instability, we also require e²_jk ≥ C_E, with C_E = 0.1 by default; the optimal solution under this constraint follows directly.

Update q(τ). The parameter τ appears only in the prior and the hyperprior; its related ELBO is evaluated at the variational parameter θ = (ã¹_gk, c̃¹_gk), where G_g is the set of feature indices in group g and |G_g| is its size, and the optimal update rule follows. Intuitively, when ⟨τ_gk⟩ = ã¹_gk / c̃¹_gk is large, the prior is more concentrated around 0, and hence the model is denser with small effects; when ⟨τ_gk⟩ is very small, on the other hand, a feature is allowed to be non-zero only if its effect size is very large. As a result, the model does not favor τ_gk being either too small or too large. Since ã¹_gk = a₀ + |G_g| is fixed, if we want ⟨τ_gk⟩ ∈ [L_low, L_up], we can restrict c̃¹_gk to the range [ã¹_gk / L_up, ã¹_gk / L_low]. By default, we let L_low = ln(p)/n and L_up = 1.

Update q(π). The ELBO related to π_gk is evaluated at the variational parameter θ = (ã²_gk, c̃²_gk), and the optimal solution follows. The average sparsity level is ⟨π_gk⟩ = ã²_gk / (ã²_gk + c̃²_gk). We also allow a user-specified upper bound on this average sparsity level: let α be the specified upper bound, with a default value of 0.5; we then find the optimal solution under the constraint ⟨π_gk⟩ ≤ α. If the unconstrained optimum already satisfies the constraint, no modification is needed; otherwise, we set the parameters so that the constraint holds with equality.

Update q(ν) and q(ν̃).
The optimal solutions follow analogously.

ELBO increase. We can keep track of the ELBO increase by adding up the increase from each updating step. For a given parameter, e.g., β_jk, let L^old(·) denote the evaluation using the variational parameters currently under investigation, and let L(·) denote the evaluation using the updated parameters. The related ELBO increase is then

$\Delta = \big[L_0(\beta_{jk}) + L_1(\beta_{jk}) - L_2(\beta_{jk})\big] - \big[L_0^{\mathrm{old}}(\beta_{jk}) + L_1^{\mathrm{old}}(\beta_{jk}) - L_2^{\mathrm{old}}(\beta_{jk})\big],$

where L(β_jk) is the ELBO contribution of β_jk using θ and L^old(β_jk) is that using θ^old. The total increase in the ELBO is the sum of these per-parameter increases.

C. Parameter estimation with non-Gaussian response
Currently, SPEAR also supports two-class and multi-class classification and ordinal regression. For these response types, the evidence lower bound under the given variational distribution becomes intractable. There are usually two ways to update variational parameters when the expectation is intractable: (1) find a closed-form lower bound of the ELBO and optimize this lower bound, or (2) use a numerical estimate of the gradient and update via stochastic gradient descent. There are pros and cons to both methods: for the former, we no longer optimize the original objective; for the latter, the stochastic nature introduces variability in both the derivative and the objective value, and achieving low variability can incur a large computational cost. Here, we take the lower-bound approach. The lower bounds for all three data types are based on results from Jordan et al. [1999]: the bound from Jordan et al. [1999] is used for the logistic model, and a variant is used for bounding the multinomial logistic regression loss (Bouchard [2007]). We further extend this result to ordinal regression in this paper.

C.1. Lower Bounds of ELBO
If the expectation of the log likelihood under the variational distribution is intractable, one popular way to circumvent this is to consider a good lower bound of this expectation that has a tractable form, with the help of extra augmenting variables, and to optimize in the enlarged parameter space. In general, let f(θ) be the log-likelihood function, and suppose that its expectation is intractable under the variational distribution but that expectations of quadratic functions of θ can be explicitly calculated. If there exist functions a(ξ) and A(ξ) of the augmenting variable ξ such that f(θ) ≥ g(ξ, θ) for all ξ, where g(ξ, θ) is at most quadratic in θ with coefficients a(ξ) and A(ξ) and the bound is tight for some ξ, then E_q f(θ) ≥ E_q g(ξ, θ), and the right-hand side can be computed explicitly. Hence, we can optimize q given ξ and optimize ξ given q; this guarantees that the augmented lower-bound function E_q g(ξ, θ) is non-decreasing across the updating steps. We consider two particular response types: logistic regression (two or multiple classes) and ordinal regression. The key result used to derive these lower bounds is from Jordan et al. [1999]: for any ξ, let f(η) = 1/(exp(−η) + 1) and λ(ξ) = tanh(ξ/2)/(4ξ); then

$\log f(\eta) \geq \log f(\xi) + \frac{\eta - \xi}{2} - \lambda(\xi)\,(\eta^2 - \xi^2). \qquad (14)$

A slightly more relaxed version of this bound introduces a user-specified small constant ε ≥ 0, which encourages numerical stability in the case of perfect separation.
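The bound in Eq. (14) is easy to verify numerically; the short sketch below checks it on a grid and confirms that it is tight at η = ±ξ.

```r
# Numerical check of the Jaakkola-Jordan quadratic bound (Eq. 14) on the
# log-logistic function: log f(eta) >= log f(xi) + (eta - xi)/2
#   - lambda(xi) * (eta^2 - xi^2), with equality at eta = +/- xi.
f      <- function(eta) 1 / (1 + exp(-eta))
lambda <- function(xi) tanh(xi / 2) / (4 * xi)

jj_bound <- function(eta, xi) {
  log(f(xi)) + (eta - xi) / 2 - lambda(xi) * (eta^2 - xi^2)
}

eta <- seq(-6, 6, by = 0.1)
xi  <- 2.5
stopifnot(all(log(f(eta)) >= jj_bound(eta, xi) - 1e-12))  # bound holds
abs(log(f(xi)) - jj_bound(xi, xi))                        # ~0: tight at eta = xi
```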

C.2. Two-class logistic regression
Let η_i = u_i^⊤ B_j + α be the link-function value for response j and sample i in the two-class logistic model, where α is the intercept. Let H_i = (2S_i − 1)η_i, where S_i ∈ {0, 1} indicates the class label, so that the expected log likelihood of sample i is ℓ_i = ⟨log f(H_i)⟩, with f as above; this can be lower bounded with Eq. (14). Following the same argument as in Jordan et al. [1999], despite the additional ε term, the choice of ξ_i maximizing the lower bound given the variational distribution is

$\xi_i = \sqrt{\langle H_i^2 \rangle}. \qquad (15)$

We then approximate the lower bound ℓ_i with the current ξ_i given by Eq. (15), and we update the variational distribution using this lower bound. In other words, when updating the variational distribution we are equivalently considering the working Gaussian response

$\tilde{y}_i = \frac{S_i - 1/2}{2\lambda(\xi_i)} - \alpha,$

with per-sample precision 2λ(ξ_i),
where the coefficient α can be estimated first.
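The resulting Gaussianization can be sketched as follows, assuming the standard Jaakkola-Jordan form in which each sample receives pseudo-precision 2λ(ξ_i); this illustrates the derivation above and is not SPEAR's internal code.

```r
# Jaakkola-Jordan "Gaussianization" of the two-class logistic likelihood:
# given current second moments of the link eta_i, form xi_i and a working
# Gaussian response with per-sample precision 2 * lambda(xi_i).
lambda <- function(xi) tanh(xi / 2) / (4 * xi)

jj_working_response <- function(S, eta2, alpha = 0) {
  # S: 0/1 labels; eta2: <eta_i^2> under the variational distribution
  xi <- sqrt(eta2)                 # optimal auxiliary variable (Eq. 15)
  w  <- 2 * lambda(xi)             # pseudo-precision for each sample
  y  <- (S - 0.5) / w - alpha      # working response for u_i' B
  list(y = y, precision = w, xi = xi)
}
```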

C.3. Ordinal logistic regression
Suppose that the response Y_i takes values in {1, ..., M}, with class 1 as the base class. The probability of observing a label of at least m is modeled as

$P(Y_i \geq m \mid u_i, \Theta) = f(u_i^{\top} B + \alpha_m), \qquad m = 2, \ldots, M,$

with f as above, so that P(Y_i = m) = P(Y_i ≥ m) − P(Y_i ≥ m + 1). After some simple rearrangements, the log likelihood can be written as a sum of log-logistic terms. We now apply Eq. (14) to each of these terms, introducing auxiliary variables ξ_im for m = 2, ..., M. This gives a lower bound on the expected log likelihood given the ξ_im and α_m. As a result, just as in the two-class case, we replace Y_i with a transformed working response when updating the variational distribution.

C.4. Multi-class logistic regression
A third common response type is the multi-class categorical response. SPEAR models this response type as a multi-class logistic regression problem. As before, to obtain a closed-form integral under the variational distribution, we upper-bound the log of the exponential sum by a quadratic function, which in turn provides a lower bound on the log likelihood. The generalization of Jordan et al. [1999] to multi-class logistic regression has been studied in Bouchard [2007]: for any α ∈ R,

$\log \sum_{m=1}^{M} \exp(\eta_m) \leq \alpha + \sum_{m=1}^{M} \log\big(1 + \exp(\eta_m - \alpha)\big),$

and each log(1 + exp(·)) term can then be bounded using the quadratic bound of Eq. (14).

D. Simulation studies

2. Specific factors: 5 true factors were generated (U), each one randomly influencing 2 datasets in X (the first 2 factors were used to generate Y).

3. No factors (XY): X is generated randomly (with varying levels of correlation among the features), and Y is generated randomly directly from X.

In each scenario, X was generated for 500 training samples and 2000 testing samples. We simulated X to contain 4 synthetic multi-omics assays, each with 500 synthetic analytes (for a total of 2000 synthetic analytes across the 4 assays). Each combination of parameters was repeated 10 times across a grid of signal-to-noise ratios (defined below).
The Gaussian scenario with simulated factors (shared factors and specific factors) was performed in the following manner:

1. Let U represent the simulated factor scores with dimensions N × K and let B represent the simulated projection coefficients with dimensions p × K, where N is the number of simulated samples, p is the total number of simulated analytes, K is the total number of simulated factors, and c₁ is the modulated signal-to-noise coefficient. c₁ takes values on a grid from 0.01 to 5; in particular, we pick c₁ ∈ {0.5, 1.0, 3.0} as low, moderate, and high signal. B and U were then simulated, with sparsity implemented in B by nullifying any coefficients corresponding to factors that were not designated as influential to X. U_XY was defined by factors 1 and 2, whereas U_X was defined by all five factors.
3. X was generated in the following manner: let E represent a matrix with the same dimensions as X (N × p) filled with noise drawn from a standard normal distribution (μ = 0, σ² = 1).
4. Y was generated in a similar fashion. Let c₂ represent another coefficient used to control the spread of Y; we kept this parameter fixed at 1 for all our Gaussian simulations. Let 1 represent a vector of all ones utilized to take the row sums, and let K_Y represent the number of factors meant to influence Y, which we set to 2.
5. X was then generated in exactly the same way as in the Gaussian simulation described in the manuscript (a condensed code sketch follows this list).
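A condensed sketch of this generative scheme is given below. The exact forms X = c₁UBᵀ + E and Y built from the row sums of the first K_Y factors are our reading of the steps above, and the random sparsity pattern in B is a simplification of the factor-to-assay nullification described in step 1.

```r
# Sketch of the Gaussian factor simulation: 5 factors, 4 assays of 500
# analytes each; the exact generative equations are assumed, not verbatim.
set.seed(3)
N <- 500; p <- 2000; K <- 5; K_Y <- 2
c1 <- 1.0; c2 <- 1.0                       # moderate signal; spread of Y

U <- matrix(rnorm(N * K), N, K)            # factor scores
B <- matrix(rnorm(p * K), p, K)            # projection coefficients
B[sample(length(B), length(B) / 2)] <- 0   # simplified sparsity pattern

E <- matrix(rnorm(N * p), N, p)            # noise, N(0, 1)
X <- c1 * U %*% t(B) + E                   # features

y <- c2 * rowSums(U[, 1:K_Y]) + rnorm(N)   # response from first K_Y factors
```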
We then trained multiple SPEAR models, each treating the response differently (Gaussian, multinomial, and ordinal). The moderate-signal case demonstrated Gaussian SPEAR's difficulty in handling the nonlinearity of the class signals in a linear representation (Fig. 3c), further seen in the Gaussian model's predictions (Fig. 3d). SPEAR with multinomial and ordinal responses achieved lower balanced misclassification errors, with the ordinal model outperforming the others. Ordinal SPEAR also extracted predictive signals capable of identifying all seven classes, as represented by the correct median probabilities (Fig. 3e).

D.3. Multinomial Simulation
The multinomial simulation was carried out similarly to the Gaussian and ordinal simulations, with varying signal levels to simulate low, moderate, and high signal. Results for the moderate signal are shown below.

E. Data preprocessing

Samples were then split into train (n_train = 379) and test (n_test = 610) sets as described in Singh et al. Finally, all three assays were scaled and centered to be normally distributed (μ = 0, σ² = 1).
MissForest, a non-parametric missing-value imputation method for mixed-type data, was utilized to address missing values in both the proteomics and metabolomics assays (Stekhoven and Bühlmann). The metabolite expression values required quantile normalization and log transformation. Both proteomic and metabolomic expression values were then scaled and centered to be normally distributed (μ = 0, σ² = 1).

E.3. SPEAR Factor Influence on Prediction
To investigate the predictive influence of each factor generated by SPEAR, we iteratively trained multinomial Lasso classifiers using the glmnet R-package with increasing numbers of SPEAR multi-omics factors used as the predictive features.The multi-omics samples were divided into the same training and testing cohorts as described in the SPEAR manuscript.We inspected the balanced misclassification error rate on the training and testing cohorts (Supplementary Fig. 5a, Supplementary Fig. 6a) as well as the coefficient magnitudes for each SPEAR factor for each Lasso model (Supplementary Fig. 5b, Supplementary Fig. 6b).
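This factor-sweep can be sketched with glmnet as follows; `spear_factors` (a matrix of factor scores), `labels`, and `train_idx` are placeholders for the quantities described above, and plain misclassification is used in place of the balanced error for brevity.

```r
# Train multinomial Lasso models on increasing numbers of SPEAR factors and
# record test misclassification. `spear_factors` / `labels` are placeholders.
library(glmnet)

sweep_factors <- function(spear_factors, labels, train_idx) {
  K <- ncol(spear_factors)
  err <- numeric(K)
  for (k in 2:K) {                         # glmnet needs >= 2 predictors
    fit <- cv.glmnet(spear_factors[train_idx, 1:k], labels[train_idx],
                     family = "multinomial", type.measure = "class")
    pred <- predict(fit, spear_factors[-train_idx, 1:k],
                    s = "lambda.min", type = "class")
    err[k] <- mean(pred != labels[-train_idx])  # plain misclassification rate
  }
  err
}
```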
We found that while Lasso models with fewer SPEAR factors were able to predict well in each dataset, the best performance for the prediction of many response classes was achieved using combinations of many SPEAR factors, suggesting that a single phenotype may be driven by multiple complex underlying biological signals.

Figure 1: (A) Grouped violin plots of the simulated samples by true factor scores (y-axis) and their ordinal response class (x-axis). (B) Scatter plot of simulated samples by true factor 1 score (x-axis) and true factor 2 score (y-axis). (C) Grouped boxplots of balanced misclassification error results from SPEAR models on simulated test data; SPEAR models are differentiated by how the response was treated. (D) Histogram of simulated test data ordinal class predictions; plots are sorted by predicted class (x-axis) and colored by true class. (E) Boxplots of ordinal class assignment probabilities derived from the SPEAR ordinal model; samples are separated by their true ordinal class.