Coracle—a machine learning framework to identify bacteria associated with continuous variables

Abstract Summary We present Coracle, an artificial intelligence (AI) framework that can identify associations between bacterial communities and continuous variables. Coracle uses an ensemble approach of prominent feature selection methods and machine learning (ML) models to identify features, i.e. bacteria, associated with a continuous variable, e.g. host thermal tolerance. The results are aggregated into a score that incorporates the performances of the different ML models and the respective feature importance, while also considering the robustness of feature selection. Additionally, regression coefficients provide first insights into the direction of the association. We show the utility of Coracle by analyzing associations between bacterial composition data (i.e. 16S rRNA Amplicon Sequence Variants, ASVs) and coral thermal tolerance (i.e. standardized short-term heat stress-derived diagnostics). This analysis identified high-scoring bacterial taxa that were previously found associated with coral thermal tolerance. Coracle scales with feature number and performs well with hundreds to thousands of features, corresponding to the typical size of current datasets. Coracle performs best if run at a higher taxonomic level first (e.g. order or family) to identify groups of interest that can subsequently be run at the ASV level. Availability and implementation Coracle can be accessed via a dedicated web server that allows free and simple access: http://www.micportal.org/coracle/index. The underlying code is open-source and available via GitHub https://github.com/SebastianStaab/coracle.git.


Introduction
The role of prokaryotes becomes ever more important in the consideration of ecosystem-and organismal-level processes (McFall-Ngai et al. 2013;Peixoto et al. 2022).Although the assessment of bacteria that are present in a given environment or organism is now commonplace by means of 16S rRNA marker gene sequencing, the identification of bacterial communities and taxa that are indicative of or associated with a specific condition or setting is of particular interest.This provides a framework for biomarker development to aid environmental monitoring or identification of microbial candidates for prospecting in microbiome restoration/rehabilitation to increase host resilience or organismal well-being (Peixoto and Voolstra 2023).For instance, coral bleaching triggered by rising seawater temperatures is decimating coral cover globally (Hughes et al. 2017;Eakin et al. 2022).This spurs the development and application of active interventions to counter the ongoing decline (Voolstra et al. 2021a;Voolstra et al. 2023).Microbiome stewardship, the management of ecosystem function using the microbiome, e.g. through the provision of beneficial microorganisms that increase stress tolerance and improve recovery, is shown to work in principle (Rosado et al. 2019;Santoro et al. 2021), but the underlying mechanisms and the selection of putative beneficial microbes remain elusive (Peixoto et al. 2021(Peixoto et al. , 2022)).Further, the process of culturing and screening putative beneficial bacteria to be tested as probiotics can be laborious and time-consuming (Schultz et al. 2022;Do ¨rr et al. 2023).To this end, artificial intelligence (AI)-based approaches have the potential to accelerate and optimize this process.This is because AI-based approaches allow pattern recognition in high-dimensional and big datasets (many features, many samples) that typically exceed the capabilities of simple statistics.For instance, machine learning (ML) can be applied to predict host phenotypes by human gut microbiome composition (Marcos-Zambrano et al. 2021) and guide precision medicine (Schork 2019).AI-based quantitative approaches to discovering potentially beneficial bacteria could therefore support the identification of putative coral probiotics by analyzing existing datasets.While research on biomarker identification and feature selection for multi-omics datasets exists, identification of bacterial communities associated with specific continuous variables of interest poses unique challenges.This is (i) because many existing methods are built for binary classification tasks that do not account for continuous data, but more importantly, (ii) because commonly conducted correlational analyses are limited to detecting linear relationships.ML approaches by comparison are superior for revealing nonlinear relationships and making accurate predictions from complex data.Physiological traits, such as thermal tolerance, are not "either-or" but rather are distributed as a continuum in natural populations (Evensen et al. 2022;Humanes et al. 2022).Thus, a framework for selection of bacteria associated with a "higher" or "lower" measurement of a continuous physiological variable is desirable.We have built Coracle, an ML framework that can work with complex features (e.g.microbial assemblage) and continuous target variables (e.g.thermal tolerance, growth, activity, resilience, etc.).Coracle is programmed to allow a robust identification of features associated with a given target variable even within smaller datasets.We show the utility of Coracle by running it on a 16S rRNA gene amplicon-based microbiome dataset with accompanying standardized thermal tolerance thresholds (i.e.ED50 values) from the same coral samples (Voolstra et al. 2020;Evensen et al. 2021Evensen et al. , 2022Evensen et al. , 2023)).Potential other analyses can include identification of microbes associated with growth, disease susceptibility, recovery rate from injury, etc.Running Coracle in a successive manner at a higher (family) and then lower (Amplicon Sequence Variant, ASV or Operational Taxonomic Unit, OTU) taxonomic level identified bacterial taxa previously implicated in coral thermal resilience that can then serve as a starting point for further elucidation.

Implementation
This AI framework can select bacterial taxa associated with an increase or decrease in any continuous target variable (Fig. 1).To achieve this, the framework uses a multilevel ensemble approach that consists of multiple steps: In a first step, it forks through two popular forms of transformation (i.e.l1and centered log ratio-normalization) (Aitchison 1982).After transforming the data, it applies various feature selection methods to extract a subset of features to feed into the ML models.Metrics of interest will then be aggregated via a scoring function to allow a simple yet informative ranking of bacterial taxa associated with thermal tolerance (i.e.ED50 values).
The task of identifying bacterial taxa or communities associated with a trait of interest (e.g.thermal tolerance) has several underlying complexities.Continuous target variables require regression-like models.In addition, bacterial community data are often sparse and high-dimensional, requiring methods to counter overfitting and the "curse of dimensionality" (i.e.arising difficulties when working with high-dimensional data, e.g. the amount of data needed to obtain statistical significance increasing exponentially with Figure 1.Coracle workflow.The Coracle workflow requires two input files (Dataset): one file that is a community data matrix, i.e. either ASVs with associated counts or aggregated counts at a higher taxonomic level, and a second file containing a continuous variable (e.g.physiological data such as thermal tolerance, growth, etc.) for the same set of samples.Data are then processed to assign each feature a score based on its importance with regard to the target variable of interest, weighted by the predictive performance of each model.The framework outputs the ranked list of all features along with their scores, sub-measures, and the coefficient values obtained from regression-like models.These coefficients indicate the direction of association (positive coefficients ¼ an increase/decrease in bacterial abundance associated with an increase/decrease in the continuous variable; negative coefficients ¼ an increase/decrease in bacterial abundance associated with a decrease/increase in the continuous variable).Abbreviations: RFR ¼ Random Forest Regression, CFS ¼ Correlation-based Feature Selection, LASSO ¼ Least Absolute Shrinkage and Selection Operator, ALASSO ¼ Adaptive LASSO.
added feature counts).Therefore, the present framework validates the performance of all traversals via leave-one-out cross-validation, maximizing the training data for each cycle.With the typically small sample sizes (in comparison with the number of features) in multi-omics datasets, the leave-one-out cross-validation allows us to use the maximal number of samples for training (n-1) and consequently one sample for testing, without introducing overfitting.Coracle cycles through the dataset such that each sample will be used for testing once.While this comes at the cost of computational complexity, it can be considered more efficient/economical than the creation of bigger sample sets.As we only have two phases, training and validation, the parameters in the model cannot be optimized separately.Instead, the framework is optimized for an aggregated (collapsed) form of the microbiome.Based on training datasets, the taxonomic levels of "order" and "family" performed well and showed a similar feature number and resolution.Thus, parameters were broadly optimized for these two levels, although any other taxonomic level can be used as input to the framework, but may result in potential losses of robustness, performance, and an increased computational complexity.If the lowest taxonomic level (i.e.ASV or OTU level) is the desired output, prefiltering the data first is recommended to reduce runtime complexity.One promising alternative route for this is a two-tier approach with an identification of promising bacterial families (or orders) first, followed by a subsequent second Coracle run on the lowest level (ASV/OTU), but solely for the ASVs/OTUs of the previously identified high-scoring bacterial families (or orders).
Coracle follows an ensemble approach to support the robustness of feature selection, minimization of overfitting, and maximization of insights gained from available datasets.Two ensemble options can be considered: an ensemble of different models and an ensemble of different data subsets employing the same model.In Coracle, both approaches are implemented.For the implementation of an ensemble approach for the models employed, we chose a set of largely different approaches, including steps of preprocessing, feature selection, and ML to exploit their differing strengths and offset their weaknesses (Fig. 1).The first two transformation methods, l1-normalization and center log ratio-normalization (Aitchison 1982), are applied in parallel on bacterial abundance counts (community data matrix).Both are popular normalization methods for microbial abundance data and allow the examination of mathematically differing associations.
Within the cross-validation loop, three feature selection methods, Lasso (Tibshirani 1996), adaptive Lasso (Zou 2006), and correlation-based feature selection (CFS) (Hall 2000), are applied.For the implementation of an ensemble approach using variable input data (Fig. 1), different subsets are utilized in the random forest regression step (Louppe 2014) to get an ensemble of regression trees.These are used to model subsets of selected features as well as the original feature set.Additionally, Lasso and adaptive Lasso are used to collect a number of feature coefficients to gain first implications on direction of association of selected bacteria taxa.
To aggregate all different steps and enable comparison of the different results, a scoring function combines the metrics of feature importance and model performance and ranks the bacterial taxa according to their score.
The respective importance of input feature k (i.e.bacterial taxa at the family/order or ASV/OTU level, depending on the taxonomic level chosen in the input file) in model t is represented by the importance kt , whereby R 2 t represents the popular R 2 metric for model t and fm k 2 N j m k 2 [0,8]g stands for the number of models that chose feature k.The final output includes all model performances, the feature importances, and coefficients of each feature (i.e.bacterial taxa at the family/order or ASV/OTU level, depending on the taxonomic level chosen in the input file), as well as the overall aggregated score, by which the resulting output is sorted (Fig. 2).
Given that typical biological datasets feature small sample sizes (<500 samples), computational complexity in this regard is of minor importance.However, the computational burden of running Coracle scales significantly with the number of features.Thus, the 2-tier approach consisting of an initial run at the family/order level (with aggregated ASV/OTU counts), followed by a subsequent run at the ASV/OTU level of those families/orders that produced high-scoring results, is advised for datasets featuring >1000 ASVs/OTUs.
It is important to note that while the performance evaluation requires leave-one-out cross-validation to prevent training and validating on the same data, the coefficients and feature importance are obtained via training the respective models on the whole dataset.The intercept for the regression models (orange) is presented in the third row.The intercept is useful to contextualize regression models, but not important to consider for interpretation of results.The following row headers are the input features (here: bacterial families; purple) ranked by their aggregated score (red).This score is the main outcome of Coracle and identifies the most important and robust features (here: bacterial taxa associated with thermal tolerance).The feature importance of the RFRs (blue) and the regression coefficients (yellow) allow to retrace the scoring and further analyze the output.This output is generated based on a previously published dataset that examined 28 coral colonies across 4 sites for their thermal tolerance and microbial assemblage (Voolstra et al. 2021b).Bacterial family assemblage at the baseline temperature (30 C) and colony-based ED50 thermal tolerance estimates were used as input files for Coracle to find bacterial taxa associated with thermal tolerance (Supplementary Data).
Coracle 3 3 Application/case study Our specific goal of this AI framework was to select coral bacterial taxa associated with tolerance/resilience to thermal stress (see also Fig. 1) (Supplementary File S1, S2).For a pilot run, we analyzed the association between bacterial assemblage (16S rRNA gene amplicon data) and thermal tolerance (ED50 standardized thermal tolerance values) of 28 Stylophora pistillata coral colonies across four Red Sea sites (Voolstra et al. 2021b) on the aggregated taxonomic level of bacterial family (Supplementary File S3).A partial view on the family-level output is given in Fig. 2. High scores for the bacterial family output were obtained for Microbacteriaceae (negative association), Alteromonadaceae (positive association), Vibrionaceae (positive association), Rhodobacteraceae (positive association), Flavobacteriaceae (negative association), and Endozoicomonadaceae (positive association) (among others) (Supplementary File S4), many of which were previously found associated with coral thermal tolerance (Rosado et al. 2019;Santoro et al. 2021;Savary et al. 2021;Voolstra et al. 2021b;Buitrago-Lo ´pez et al. 2023).ASVs underlying high-scoring families (only the top5 high-scoring families were considered) were then consecutively run in Coracle (Supplementary File S5) and produced high-scoring bacterial taxa (Supplementary File S6) that were previously deemed functionally important in coral microbiome studies, e.g.taxa of Alteromonas (Ziegler et al. 2017;Doering et al. 2021), Marinobacter (Camp et al. 2020), and Aurantimicrobium in the phylum Actinobacteria (Epstein, Torda and van Oppen 2019).Performance assessments based on the predictive capabilities of single models alone (i.e.paths in the Coracle workflow, Fig. 1) demonstrate the potential of the framework.Importantly, the assessments also showed that there is no single best path that outperformed all other approaches.With regard to the direction of association, the output of regression coefficients obtained from the Lasso models gives a first hint on the direction of association between the respective bacterial taxa and the physiological variable.The robustness of the Coracle output, i.e. whether the same taxa retrieve highscoring outputs in iterative runs, was tested by running Coracle 100 times using the same input dataset.
Considering the top10, top20, top30, top40, top50, and all taxa at the family and at the ASV level (ASVs of top5 families considered), the correlations between the feature's score and rank across repeat runs were very high (>0.98Pearson's correlation coefficient for all instances; >0.93 Spearman rank correlation coefficient for all top scoring taxa; and >0.8 when considering all features) (Supplementary File S8).Sensitivity to sample size, which is a known limitation of random forests, was assessed via a jackknifing approach.To do this, we generated 10 randomly jackknifed datasets for each number of samples (n ¼ 27, 26, 25, . . ., 5) to run Coracle (Supplementary File S7).Subsequently, we compared the average performance (mean R 2 ) and standard deviation between each of the 10 datasets for any given number of samples.Coracle performance was good down to 20 samples and provided acceptable and still meaningful performance down to 10 samples, which one might consider a minimum sample size.
In general, larger sample sizes will always be beneficial for performance and confidence.Based on these results, we recommend sample sizes 20 and encourage the use of Coracle at lower sample sizes with conscientious interpretation of the results.

Conclusion
Coracle is an AI framework that uses an ensemble approach of prominent feature selection methods and ML models to identify associations between bacterial communities and continuous variables.We tested and optimized Coracle using coral microbiome datasets.A typical use-case scenario is that a researcher is studying a continuous phenotypic characteristic (e.g.growth, stress tolerance, etc.) or variable of interest (e.g.salinity, turbidity, nutrient levels, etc.) while characterizing the microbiome to assess whether changes in bacterial community assemblage align with differences in the target variable (host physiology or environment).Coracle can identify bacterial taxa that are predictive of phenotypic trait or environmental condition performance, and thus provide a means to align host biology or the prevailing environment with microbiome assemblage.Notably, the application of Coracle is not restricted to microbial community data matrices but can process other types of high-dimensional data, such as gene expression matrices, in association with a continuous variable.Thus, Coracle can be used for a wide range of datasets with similar properties (a data matrix and a continous response variable for the same set of samples as input files).Importantly, Coracle can only account for association and not for causation.Thus, the obtained results need to be validated in real-world experiments.We set up a convenient web server implementation of the Coracle framework accessible at http://www.micportal.org/coracle/index.The web interface provides a tutorial for the upload of data files and execution of Coracle, including example files that can be downloaded for correct formatting of input data tables.The results are sent to the email address submitted on the webpage.The code of Coracle is accessible at https://github.com/SebastianStaab/coracle.git.

Figure 2 .
Figure2.Coracle output.The first two rows show the R 2 and mean squared error metrics (green) for each Coracle model (column headers).The intercept for the regression models (orange) is presented in the third row.The intercept is useful to contextualize regression models, but not important to consider for interpretation of results.The following row headers are the input features (here: bacterial families; purple) ranked by their aggregated score (red).This score is the main outcome of Coracle and identifies the most important and robust features (here: bacterial taxa associated with thermal tolerance).The feature importance of the RFRs (blue) and the regression coefficients (yellow) allow to retrace the scoring and further analyze the output.This output is generated based on a previously published dataset that examined 28 coral colonies across 4 sites for their thermal tolerance and microbial assemblage(Voolstra et al. 2021b).Bacterial family assemblage at the baseline temperature (30 C) and colony-based ED50 thermal tolerance estimates were used as input files for Coracle to find bacterial taxa associated with thermal tolerance (Supplementary Data).