Umibato: estimation of time-varying microbial interaction using continuous-time regression hidden Markov model

Abstract Motivation Accumulating evidence has highlighted the importance of microbial interaction networks. Methods have been developed for estimating microbial interaction networks, of which the generalized Lotka–Volterra equation (gLVE)-based method can estimate a directed interaction network. The previous gLVE-based method for estimating microbial interaction networks did not consider time-varying interactions. Results In this study, we developed unsupervised learning-based microbial interaction inference method using Bayesian estimation (Umibato), a method for estimating time-varying microbial interactions. The Umibato algorithm comprises Gaussian process regression (GPR) and a new Bayesian probabilistic model, the continuous-time regression hidden Markov model (CTRHMM). Growth rates are estimated by GPR, and interaction networks are estimated by CTRHMM. CTRHMM can estimate time-varying interaction networks using interaction states, which are defined as hidden variables. Umibato outperformed the existing methods on synthetic datasets. In addition, it yielded reasonable estimations in experiments on a mouse gut microbiota dataset, thus providing novel insights into the relationship between consumed diets and the gut microbiota. Availability and implementation The C++ and python source codes of the Umibato software are available at https://github.com/shion-h/Umibato. Supplementary information Supplementary data are available at Bioinformatics online.

n,i , ..., y the growth rate of i-th microbe of the n-th observation point of the s-th subject Ns , n,i , ..., σ the variance of the growth rate of i-th microbe of the n-th observation point of the s-th subject (s) n the one-hot vector that indicates the state of the n-th observation point of the s-th subject the stochastic process of states between the n-th and (n + 1)-th observation points of the s-th subject at time t the growth parameter of the i-th microbe φ k,i,j the interaction parameter from the j-th microbe to the i-th microbe the time interval between n-th and (n + 1)-th observation points of s-th subject Q the transition rate matrix q k,l the (k, l) element of Q P(t) (p 1 (t), ..., p k (t), ..., p K (t))

S1 Derivation of variational inference algorithm
The notations of the variables are described in Table S1.

S1.1 Decompostion of ELBO
Decompose the ELBO according to the generative model as follows: S1.2 Update of q(Φ) We obtained the partial derivative of the ELBO with respect to q(Φ) using the variational method as follows:
Then, we obtained the following: Finally, q(φ k,i ) can be represented the following distribution by normalization: S1.3 Update of q(Z (s) ) In the same way as q(Φ), we obtained the following formula using the variational method: n,i , η) where The q(φ k,i ) expectation term can be written as ln p(y n,i , η) S1.4 Update of q(ζ(t)) We used the true p(ζ (s) (t)|Z (s) , Q, d) as q(ζ (s) (t)|Z (s) ).q(ζ (s) (t)|Z (s) ) can be decomposed as follows: Here, c(n, s, t, k, u, v) is represented as follows: . Therefore, we can represent the expectation values of τ as follows: The calculation of these integrals was based on the study of Liu et al. (2015) and is given as follows: where Here, 0 is a matrix with 0 in the all elements, and I(i, j) is a matrix with 1 in the (i, j) element and 0 otherwise.

S1.5 Update of maximum likelihood estimated parameters
We derived the updates for the maximum likelihood estimated parameters Q, λ, and η.The prior distribution for ζ(t) and Z can be rewritten by As such, we obtained the update of Q as follows: We obtained the update of λ as follows: We obtained the update of η as follows: .

S3 Synthetic data experiments
We created several datasets of 100 days based on gLVE.The gLVE parameters were randomly generated by the following generative process: NegativeHalfNormal(0, 0.01) (i = j) PositiveHalfNormal(0, 0.01) (j = 0) Normal(0, 0.01) (otherwise) The generative process of the gLVE parameters satisfies the assumptions in MDSINE (Bucci et al., 2016).We randomly obtained the initial microbial quantitative abundance x(0) from Gamma(1, 1) and computed y(0) using gLVE, where Gamma(k, θ) denotes the gamma distribution with shape parameter k and scale parameter θ.From this value, we iterated the gLVE and obtained the following equation: x(t + dt) = exp(ln x(t) + y(t)dt), where dt is a small time (0.001 days).The state switches from state 0 to state 1 in the middle of this period.S = 7, d n = 1, and N s = 100 were used in all datasets.We set different numbers of states K and microbes M for each dataset.Table S3 lists the settings for each dataset.We generated datasets until a non-divergent trajectory was obtained.Finally, we added random noises Normal(0, 0.01) to generate x(t).
Table S3: Settings for each dataset.

S4 Computational time
We measured the computational time of the GPR and CTRHMM estimation on the mouse gut microbiome dataset.We measured the time of one GPR estimation and the average time of 10 CTRHMM trials.It took 7.86 and 24.3 seconds to fit the GPR and the CTRHMM, respectively.We used a ZenBook 3 UX390UA with a 2.7 GHz 2-core Intel Core i7 processor, 16GB of RAM, and Ubuntu 18.04 for measuring the computational time.

Figure S1 :Figure S2 :Figure S3 :Figure S4 :Figure S6 :Figure S7 :
FigureS1: True function and the mean squared error (MSE) of growth rate estimation methods (Gaussian process regression, penalized spline) for synthetic data.The top row represents the shape of the true function, and the bottom row shows the MSE corresponding to the true function.The x-and y-axes in the top row represent x and y, respectively.The lower x-and y-axes represent noise and MSE, respectively."GP" represents the Gaussian process, and "Splinep" represents the penalized spline, where p is the coefficient of the penalty term.Each column is the result of (a) y = sin(x), (b) y = sigmoid(x − 15), and (c) y = 0.5 sin(0.5x)+ cos(x − 1) being a true function.