TARO: tree-aggregated factor regression for microbiome data integration

Abstract Motivation Although the human microbiome plays a key role in health and disease, the biological mechanisms underlying the interaction between the microbiome and its host are incompletely understood. Integration with other molecular profiling data offers an opportunity to characterize the role of the microbiome and elucidate therapeutic targets. However, this remains challenging to the high dimensionality, compositionality, and rare features found in microbiome profiling data. These challenges necessitate the use of methods that can achieve structured sparsity in learning cross-platform association patterns. Results We propose Tree-Aggregated factor RegressiOn (TARO) for the integration of microbiome and metabolomic data. We leverage information on the taxonomic tree structure to flexibly aggregate rare features. We demonstrate through simulation studies that TARO accurately recovers a low-rank coefficient matrix and identifies relevant features. We applied TARO to microbiome and metabolomic profiles gathered from subjects being screened for colorectal cancer to understand how gut microrganisms shape intestinal metabolite abundances. Availability and implementation The R package TARO implementing the proposed methods is available online at https://github.com/amishra-stats/taro-package.


S1. Weighted adaptive-elastic-net penalty
We use the adaptive elastic net penalty [Zou andHastie, 2005, Zou andZhang, 2009], ρpC; λq " ρpC; W 1 , λ, αq " αλ}W 1 ˝C} 1 `p1 ´αqλ}C} 2 F " αλ Here } ¨}1 denotes the ℓ 1 norm, the operator "˝" stands for the Hadamard product, W 1 " rw ij1 s pˆq is a pre-specified weighting matrix, λ is a tuning parameter controlling the overall amount of regularization, and α P p0, 1q controls the relative weights between the two penalty terms.We set W 1 " | r C 1 | ´γ such that w ij1 " w where r C 1 " r d 1 r u 1 r v T 1 is the first set of unit-rank RRR estimators and γ is a non-negative constant with | ¨|´γ componentwisely defined.

S2. URE-TARO procedure
Here, we provide additional detail on the unit rank estimation procedure for TARO described in Section 2.2 of the main manuscript.In our previous work on sparse factor regression Mishra et al. [2021], we imposed sparsity directly on C. To enable appropriate aggregation of the microbiome features in the factor regression framework, we instead propose to impose sparsity on Γ.
In terms of the parameters tβ, d, u, vu, the optimization problems of URE-TARO is a multi-convex problem.We define the weighted penalty as: where λ is the tuning parameter, α provides a relative weights of ℓ 1 and ℓ 2 penalty.We estimate the model parameters using an iterative procedure that cycles between u-step, v-step and β-step until convergence.
u-step: For the fixed v and β, we jointly update pd, uq satisfying }v} " 1.Let us define ǔ " du.
In terms of ǔ, the optimization problem (3) is equivalent to solving (constrained adaptive elastic-net problem) where y " vecpY ´Zβq, X puq " v b X, λ puq 1 " αλw pdq p ř q j"1 w pvq j |v j |q , and λ puq 2 " p1 ´αqλ ř q j"1 v 2 j .Here vecp¨q is the vectorization operator, and b denotes the Kronecker product.We recover the singular value estimate as d " }ǔ} and singular vector estimate as û " ǔ{ d. v-step: For fixed u and β, we minimize the objective function in terms of the block variable pd, vq such that }u} " 1.Let us define v " dv.In terms of v, the optimization problem (3) is equivalent to solving (an adaptive elastic-net problem) where X pvq " I q b pXuq, λ pvq 1 " αλw pdq p ř p i"1 w puq i |u i |q, and λ pvq 2 " p1 ´αqλ ř p i"1 u 2 i .We recover the singular value estimate as d " }ǔ} and singular vector estimate as û " ǔ{ d. β-step: For fixed {d, u, v}, unique solution minimizing the objective function is given by β " pZ T Zq ´1Z T pY `r XΓq

S2.1 Constrained adaptive elastic-net solution
Consider a genetic form of the linear constrained adaptive elastic net regression, where y P R n , X P R nˆp , A " pa 1 , . . ., a p q P R hˆp , b P R hˆ1 , w " rw 1 , ..., w p s T are some predetermined weights, and λ 1 and λ 2 are tuning parameters.This is an equality-constrained convex optimization problem.The augmented Lagrangian function is where c is the Lagrange multiplier, and µ ą 0 is a penalty parameter.The iterative steps to solve the problem, The method converges under very general conditions.As the iteration proceeds, the residual Aβ s`1 ´b converges to zero, yielding optimality.Following ( 8), the key is to minimize Here the penalty parameter µ psq can be updated along the interactions; let µ Ñ 8 or increase with small increments can in general improve the speed of convergence [Goldstein and Osher, 2009].The above problem can be efficiently minimized by a coordinate descent algorithm.Suppose all the β k s are fixed except β j , and denote r j " y ´řk‰j x k β k .The objective function with respect to β j becomes Then it can be easily verified that where Spm, λq " signpmqp|m|´λq `is the soft-thresholding operator.( 9) can then be solved by iteratively updating each β j , j " 1, . . ., p, by (10) until convergence.Our proposed algorithm is presented in Algorithm 1.

S2.2 Constrained reduced-rank regression
We consider a general form of the linear-constrained reduced rank regression as min where Y P R nˆq , X P R nˆp , A " pa 1 , . . ., a p q P R hˆp and B P R hˆq .To solve the optimization problem, we write the augmented Lagrangian function as where δ P R hˆp is the Lagrange multiplier, and µ ą 0 is a penalty parameter.The iterative steps to solve the problem, The method converges under very general conditions.As the iteration proceeds, the residual }AΓ s`1 ´B} converges to zero, yielding optimality.We simplify the rank constrained L µ pΓ, δ psq q as where r Y " rY T ?µB T `δpsqT ?µ s T and r X " rX T A T s T .An optimal solution of arg min Γ Yṽṽ T where ṽ is the largest eigen-vector of r Y Algorithm 2 Linear Constrained Reduced Rank Regression Initialization: s " 0, Γ p0q P R pˆq , δ p0q " 0, µ p0q " 1, and ρ ě 1. repeat (1) Γ ps`1q " p r X T r Xq ´1 r X T r Yṽṽ T where ṽ is the largest eigen-vector of r Y Y " rY T ?µB T `δpsqT ?µ s T and r X " rX T A T s T .

S2.3 Tuning parameter selection
We have outlined the sequential procedure in Algorithm 1 of the main manuscript.Within each step of this sequential process, we solve a unit-rank estimation (URE-TARO) problem, as described above.With only one tuning parameter, λ, we generate a solution path for various λ values within the range of λ max to λ min (a chosen multiple of λ max ), depicted in the first two subplots of Figure S1.Subsequently, we employ k-fold cross-validation (with k=5 recommended) to select the λ value associated with the lowest error, as demonstrated in the rightmost plot of Figure S1.

S3. Supplementary figures S3.1 Simulation results
Here, we provide additional figures summarizing the simulation results for the settings described in Section 3.1 of the main manuscript.Specifically, we considered simulation settings where the unit-rank components of the coefficient matrix are constructed such that the following feature sets are relevant: a) features with higher variation, b) rare features, c) fine-resolution features (leaf nodes), and d) aggregated features (internal nodes).We have provided more details about the four simulation settings in Table S1.The results for setting a) are given in Figure 2 of the main manuscript.The results for the remaining scenarios are provided in Figure S2.TARO consistently outperforms the alternative methods considered in terms of accuracy in the estimation of the true coefficient matrix, prediction accuracy, and recovery of the true features.

S3.1.1 Robustness to choice of pseudocount
By default, TARO assumes a pseudocount of 1.To assess the robustness of TARO to the choice of pseudocount value, we have compared the default setting of TARO with pseudocount values of 0.5, 0.25, and 2 (see Figure S3 for performance comparison).We found that in all four settings, the pseudocount of 2 significantly increases the error estimate Er(C).
In Setting c), we observe an increase in the error estimate with pseudocounts of 0.25 or 0.5 compared to the default pseudocount of 1. Setting c) aims to showcase the model's efficacy in a scenario where the underlying relationship between the multivariate response and predictors is expressed solely in terms of leaf nodes in the taxonomic tree.
Table S1: Detail on the simulation settings.Further details regarding the implementation can be obtained from the taro sim function within the R package taro.

Description
Setting 1 In this setting, a greater number of leaf taxa in the taxonomic tree affect the outcome.We randomly select 10% of taxa, including both internal nodes and leaves, favoring those with higher variability across samples.Specifically, we include one taxon with four leaf nodes as children, one taxon with three leaf nodes as children, and three taxa with two leaf nodes as children.The remaining relevant features come mainly from leaf nodes in the taxonomic tree.Setting 2 In this setting, a greater number of leaf taxa in the taxonomic tree affect the outcome.We randomly select 10% of taxa, including both nodes and leaves, favoring those with lower variability across samples.Specifically, we include one taxon with four leaf nodes as children, one taxon with three leaf nodes as children, and three taxa with two leaf nodes as children.The remaining relevant features come mainly from leaf nodes in the taxonomic tree.Setting 3 In this setting, only leaf nodes in the taxonomic tree affect the outcome.We randomly select 10% of taxa, favoring those with lower variability across samples.Setting 4 In this setting, a greater number of higher taxa in the taxonomic tree affect the outcome.We randomly select 10% of taxa, including both internal nodes and leaves, favoring those with lower variability across samples.Specifically, we include four taxa with four leaf nodes as children, two taxa with three leaf nodes as children, and ten taxa with two leaf nodes as children are deemed relevant.The remaining relevant features come from leaf nodes in the taxonomic tree.
In summary, our analysis indicates that the choice of a pseudocount of 1 is optimal for multivariate analysis using TARO.

S3.1.2 Robustness to model misspecification
We have conducted a comprehensive series of simulations where we deliberately introduce model misspecifications.This involved systematically varying both the error structure (including different levels of correlation in the errors among the multivariate responses) and the error distribution (considering heavytailed symmetric distributions such as the Laplace, Cauchy, and Student's t distributions).This analysis aimed to provide insights into our method's ability to maintain accuracy and reliability in real-world scenarios where the underlying assumptions may not hold.We summarize the results in Figure S4.
The performance of TARO deteriorates when there is a high correlation among the multivariate responses or when the errors arise from heavy-tailed distributions.This is because TARO assumes that the errors are independent and normally distributed.The alternative methods considered in the main manuscript make similar assumptions, so this challenge is not unique to TARO.

S3.1.3 Evaluation of TARO unobserved true abundance
TARO primarily operates under the assumption that even though true microbial abundance data are not directly observed, the microbial abundance data are compositionally accurate.In this subsection, we consider a simulation framework that contrasts the performance of TARO under the simulation design discussed in Section 3.1 of the main manuscript, where the observed abundances are used to generate Y and also as input to TARO, with a scenario where the true unobserved relative abundances are used to generate Y, while the observed abundances are used as inputs to TARO.To construct the observed abundances from the true unobserved relative abundances, we sample from a multinomial distribution with the the sequencing depth as the number of trials and the true relative abundances as the event probabilities.We summarize the results in Figure S5 for the four settings.Our simulation study suggests that TARO's performance remains comparable to the scenario considered in the main manuscript.

S3.2 Application results
We provide a snapshot of the logistic regression modeling results in the upper left of Figure S6.Rectangular heatmaps showing the contribution of input features to the clinically relevant latent factors are shown for the metabolite variables (upper right) and microbiome features (bottom).We observed that the microbiome features contributing to the clinically relevant latent factors were at the family and genus level: X2 includes 8 family-level and 68 genus-level features, X5 includes 6 family-level and 75 genus-level features, and X7 includes 4 family-level and 76 genus-level features.Across the eight latent factors inferred by TARO, aggregation to higher tree levels was relatively rare, with 1 phylum, 2 classes, and 6 orders selected in total across all latent factors.The results of metabolite set enrichment analysis for the clinically relevant latent factors are shown in Figure S7.

Figure S1 :
Figure S1: Tuning parameter selection using k-fold cross-validation in TARO.The left subplot represents solution paths for û1 , the center subplot represents solution paths for v1 , and the right subplot represents the cross validation error for increasing values of λ.

Figure S4 :
Figure S4: Simulation study: Evaluation of TARO in the presence of correlated errors among the multivariate responses and errors from heavy-tailed distributions like Cauchy, Laplace and Student's t.

Figure S5 :
Figure S5: Simulation study: Evaluation of TARO in the four settings with observed and unobserved microbial abundance data.