Gradient boosted regression as a tool to reveal key drivers of temporal dynamics in a synthetic yeast community

Abstract Microbial communities are vital to our lives, yet their ecological functioning and dynamics remain poorly understood. This understanding is crucial for assessing threats to these systems and leveraging their biotechnological applications. Given that temporal dynamics are linked to community functioning, this study investigated the drivers of community succession in the wine yeast community. We experimentally generated population dynamics data and used it to create an interpretable model with a gradient boosted regression tree approach. The model was trained on temporal data of viable species populations in various combinations, including pairs, triplets, and quadruplets, and was evaluated for predictive accuracy and input feature importance. Key findings revealed that the inoculation dosage of non-Saccharomyces species significantly influences their performance in mixed cultures, while Saccharomyces cerevisiae consistently dominates regardless of initial abundance. Additionally, we observed multispecies interactions where the dynamics of Wickerhamomyces anomalus were influenced by Torulaspora delbrueckii in pairwise cultures, but this interaction was altered by the inclusion of S. cerevisiae. This study provides insights into yeast community succession and offers valuable machine learning-based analysis techniques applicable to other microbial communities, opening new avenues for harnessing microbial communities.


Introduction
Micr obial comm unities ar e ubiquitous in natur e and foundational to all living systems.Understanding the underlying ecological interactions and cellular and molecular functioning of microbial communities is one of the major remaining research frontiers in microbiology (Altermann and Hick e y 2020 , Sun and Sanchez 2023 ).The role of time has been highlighted as a critical lens thr ough whic h to unr av el fundamental principles of ecological systems (Ryo et al. 2019 ).Given that micr obial comm unities ar e dynamic entities , in v estigating these comm unities thr ough a tempor al fr ame work is essential for understanding their succession, ada ptation, and r esilience.Furthermor e, in the r a pidl y de v eloping field of le v er a ging pr edictiv e modelling for biological a pplications, integrating principles of temporal dynamics into the development of pr edictiv e models for micr obial comm unities is timel y.
There is a paucity of r esearc h into eukaryotic microbial communities, and as such, comparatively very little is known curr entl y about ho w eukary otes interact and function within micr obial comm unities when compar ed to pr okaryotes (Conac her et al. 2019 ).A eukaryotic micr obial comm unity of industrial significance is the community of yeasts involved in the natural fermentation of gr a pe m ust to wine, r eferr ed to as the wine yeast comm unity.The div ersity and composition of the yeast community involved in wine fermentations impact the final properties of wine; ho w e v er, the causativ e mec hanisms involv ed ar e still lar gel y viewed as a black box, which has driven research into understanding how the wine yeast community functions.Besides the commercial r ele v ance, fr om a fundamental r esearc h perspectiv e, the wine yeast community has several properties that make it an ideal model for studying eukaryotic microbial community functioning (Conacher et al. 2021 ).Fermenting gr a pe m ust is a harsh environment for yeasts, with stressors of a physical and chemical nature (e.g.low pH, high osmolarity, limited nitrogen sources, ethanol) as well as stressors of a biotic nature (several competing populations of yeast species) (Bauer andPretorius 2000 , Conacher et al. 2019 ).Since wine is an exclusiv el y anthr opogenicall y linked pr oduct, humans have influenced these yeasts' evolution (i.e.domestication) as we have selected for adaptations that are advantageous for successful fermentation.Notably, wine yeasts are phylogenetically separ ate fr om envir onmental isolates, as well as beer-and breadassociated yeast isolates; we have thus created an evolutionary isolated niche (Chambers et al. 2015 , Gibbons andRinker 2015 ).This is additionally evidenced phenotypically since wine yeasts ar e inher entl y better ada pted to fermentation-r ele v ant str essors and excel at functions related to fermentation (Guillamón and Barrio 2017 ).Further, there is global uniformity of wine yeast comm unity members, impl ying r elativ el y consistent biotic selection pr essur es for comm unity member coe volution (Belda et al. 2021 ).
Ther efor e, the wine yeast comm unity pr ovides a consistent ecological fr ame work fr om whic h to study eukaryotic microbial community functioning and dynamics (Conacher et al. 2021, Ruiz et al. 2023 ).
Tempor al patterns ar e centr al to the wine fermentation process.During natural or spontaneous wine fermentation, the yeast comm unity is intr oduced to the gr a pe m ust at the beginning of the pr ocess, r eferr ed to her e as 'time zer o'.Time zer o of the fermentation is ther efor e ar guabl y the most determinant biological phase of natural wine fermentations.From an industrial perspective, this is also the point at which interventions are most practical to make to ensure successful completion of the fermentation process .T here is a large body of work regarding description of yeast community succession that follows this initial inoculation, where tempor al abundance c hanges of differ ent yeast species populations have been mapped through time (Comitini et al. 2011, Ciani and Comitini 2015, Bokulich et al. 2016, Bagheri et al. 2017, Bagheri et al. 2018, 2020, Alonso-del-Real et al. 2019 ).The consensus is that the community undergoes a decrease in diversity, until one dominant species r emains: Sacc harom yces cerevisiae (Conacher et al. 2021, Ruiz et al. 2023 ).This is especially interesting from an ecological standpoint since the abundance of S. cerevisiae in the natur al comm unity at the start of fermentation is very lo w, y et it consistentl y outcompetes mor e abundant yeast species populations .T here is a good consensus regarding the fact that temporal succession occurs during wine fermentation, and that the diversity and composition of the wine yeast community at different stages of fermentation cause chemical changes to the final wine product; ho w ever, a predictive understanding of this process has not been ac hie v ed.
Further, since time zero is so pivotal to this process, there is an opportunity to investigate how the environment of these yeasts prior to time zero may impact on this succession.This concept is loosel y ca ptur ed in the definitions of 'ecological memory' and 'priming effects' (Ryo et al. 2019 ).Within the context of this paper, we are able to manipulate the environment of members of the wine yeast community prior to their introduction to an experimental community.Thanks to a wealth of fundamental physiological r esearc h on the model organism S. cerevisiae , w e kno w that the environment that a yeast is cultured in, i.e. growth media, will impact on its metabolism, and similarly, the metabolic state of yeasts is also impacted in each phase of growth in a batch experiment (Bely et al. 2005, Kolkman et al. 2006, Lackner et al. 2012, Hall et al. 2014, Thomas 2015, Kr a gh et al. 2018, K eil et al. 2019 ).By manipulating the growth medium and growth phase of yeast communities prior to their inoculation into fermentation, it can r oughl y shed light on to what extent the history of a particular yeast needs to be considered in the prediction and control of succession in the wine yeast community.
The diversity of the natural wine yeast community is such that disentangling the causativ e mec hanisms behind how each member, as well as ecological interactions between members, ultimatel y r esults in a particular fermentation outcome is an enormously complex theoretical and experimental challenge.In synthetic ecology, thoughtfully designed subsets of natural ecosystems are studied to reduce the technical complexity to a level wher e accur ate deductions can be made r egarding ecosystem establishment and functioning (Ben Said andOr 2017 , Ruiz et al. 2023 ).In particular, bottom-up a ppr oac hes of building simplified wine yeast communities are common to investigate yeast interspecies inter actions.Inter actions ar e defined her e as c hanges to a yeast's physiology that occur exclusiv el y when exposed to a different yeast species population (i.e. are different from what occurs during mono-species culture).Within this context, most commonl y, tr anscriptomic and proteomic approaches combined with population dynamics have been taken to investigate pairwise inter-species interactions (Curiel et al. 2017, Tronchoni et al. 2017, Peng et al. 2019, Shekhawat et al. 2019, Bordet et al. 2020, Tondini et al. 2020, Mencher et al. 2021 ), and these inter-species inter actions hav e a clear impact on temporal dynamics of the synthetic y east communities.Ho w ever, recent evidence suggests that ad diti ve pairwise inter-species interactions alone cannot predict how more complex yeast comm unities, whic h mor e accur atel y r esemble the natural diversity, will perform (Conacher et al. 2022, Chang et al. 2023 ).Further, there is currently a research chasm in the study of synthetic wine yeast comm unities, wher e ther e has been focus on either two species or m uc h lar ger assembla ges (n species > 7), but very limited work on synthetic communities between these sizes and e v en less on e v aluating impacts of stepwise increases in diversity.This is problematic because we know that species-specific interactions occur; therefore, teasing apart the trajectory of complex communities based on their initial composition is not possible at this point since we do not know enough about how assembly impacts future population dynamics .T herefor e, if we ar e to gain a pr edictiv e understanding of comm unity dynamics, synthetic communities of more than two species need to be studied further.This is, ho w e v er, easier said than done.
Ther e ar e significant c hallenges that accompan y inv estigation of complex microbial communities and their associated interactions (Widder et al. 2016 ).Firstl y, exhaustiv el y inv estigating interactions within a microbial community in different environmental settings generates an experimental space that is usually logistically unfeasible to execute.Secondly, in cases where researchers do indeed attempt complex microbial community experiments, the data generated are rich and challenging to derive meaning from in an objective way, especially considering that these data lar gel y do not conform to the assumptions for traditional statistical tests.
Machine learning (ML) has emerged as a promising tool in micr obial comm unity r esearc h (Rubbens and Pr ops 2021 ).Studies a ppl ying ML algorithms to experimental microbial community data are limited, with most examples either using simulated micr obial comm unity data and/or focusing on pairwise inter actions alone .T he generation of multivariate time series models from tabular datasets, such as those commonly generated in microbial ecology studies, is an active research challenge in the ML field, with most benchmarking having been performed for univariate time series (Padhi et al. 2021 ).Ne v ertheless, for m ultiv ariate dynamic tabular data, tree-based methods are the current recommendation in terms of performance, with neural network architecture performance lagging behind (Borisov et al. 2022 ).Treebased algorithms, which involv e v arious conformations and combinations of decision trees , ha ve been shown to perform well on av ailable micr obial comm unity datasets .T hese studies ha ve r anged fr om pr edicting str ength and dir ection in pairwise interactions of bacteria (DiMucci et al. 2018 ), prediction of hydrogen sulphide production (Dutta et al. 2022 ) and dissolved organic carbon le v els (Thompson et al. 2019 ) based on micr obial comm unity structure as an input, prediction of soil microbial community structur e fr om envir onmental par ameters (Peng et al. 2022 ), and ranking of phenotypes important to sugar consumption in wine yeast communities (Ruiz et al. 2023 ).It is ther efor e clear that these algorithms may aid in the quest to predicting temporal dynamics in micr obial comm unities .In this study, we ha ve opted for a Gradient Boosted Regression (GBR) tree-based model.In GBR models, during the training process, a given loss function is optimized by sequentially adding decision trees, in a way that minimizes the r esidual err ors fr om pr e vious iter ations, ultimatel y gener ating an ensemble of decision tr ees that, if pr operl y tuned, accur atel y pr edicts target values (Friedman 2002 ).
Her e, we explor ed whic h factors impact tempor al dynamics and succession of a synthetic wine yeast community, using concepts of past (prior to time zer o), pr esent (at time zero), and future (after time zero) by applying an explainable ML algorithm.A combination of near-exhaustiv e m ulti-species experimental cultures of a 4-species synthetic yeast community with the generation of a tree-based GBR model was applied.Time series of community composition and absolute viable cell numbers were generated in species pairs , triplets , and quadruplets using four different yeast population 'history' settings prior to growth in the community.
We sought to make use of the GBR model to provide insights into three biologically relevant elements that may impact community succession: (i) whether different pre-culture conditions would impact growth of community members in a competitive environment, showing whether the initial physiological state of yeast cells impact on competitive growth phenotypes, (ii) the importance of inoculation dosage as opposed to inoculation alone in impacting community member performance, indicating the impact of initial population size in community dynamics; within this context, we also e v aluated how inoculation dosage of other species impact each community member, and finally, (iii) how congruent important model featur es ar e between pairwise ( n = 2) combinations versus more complex ( n > 2) combinations, to show emerging properties of complex systems such as higherorder m ulti-species inter actions .T hese factors ar e r eadil y implementable in the context of understanding and manipulating temporal dynamics of microbial communities by adjusting procedur es r elated to the point of inoculation.

Yeast strains
Four yeast species r epr esentativ es of wine-related origin were used to construct model yeast communities of various complexities and species combinations as in Table 1 .Three of these species wer e fluor escentl y labelled, eac h with a differ ent fluor escent label, while the fourth species was not labelled.The strains representing the four species were: S. cerevisiae VIN13 (Anchor Yeast, Cape Town, South Africa; biogeographical origin: South Africa) labelled with Ta gRFP657; Lac hancea thermotolerans IWBT Y1240 [CBS: 16 374, biogeogr a phical origin: South Africa] labelled with mTag-BFP2; Torulaspora delbrueckii LO544 [CRBO: LO544, biogeographical origin: France] labelled with eGFP; and unlabelled Wickerhamomyces anomalus IWBT Y934 (CBS: 16 372, biogeographical origin: South Africa) (Conacher et al. 2020 ).All y east strains w ere stored as glycerol stocks (25% w/v glycerol) at −80 • C.

Growth media
Prior to inoculation, gl ycer ol stoc ks wer e str eaked out onto Wallerstein Laboratory (WL) nutrient agar (Sigma-Aldrich, Johannesburg, South Africa) and incubated at 30 • C for three da ys .T he Synthetic Gr a pe Must (SGM) used her e contained: 100 g l −1 glucose, 100 g l −1 fructose, 200 mg l −1 assimilable nitr ogen, tr ace elements and vitamins as described by Henschke and Jiranek ( 1993 ) and 10 mg l −1 er goster ol.

Different pre-culture comparisons
All experiments were completed independently of each other, with a biological repeat defined as a culture originating from a separate single colony.
Four of the most commonl y r eported pr e-cultur e conditions in wine synthetic community experiments were tested, namely, preculture in Yeast Peptone Dextrose (YPD) and harvested at either exponential or stationary phase, or pr e-cultur e in SGM and harvested at either exponential or stationary phase.Single colonies of eac h yeast str ain wer e inoculated into 5 ml of YPD br oth (Sigma-Aldric h, Johannesbur g, South Africa) in a test tube and incubated on a test tube rotator at 30 • C for 18 h.The culture was tr ansferr ed to 50 ml YPD or SGM, at a concentration of 1 × 10 6 cells ml −1 , in a 250 ml Erlenmeyer flask with a cotton plug and foil co vering.T he flask was incubated at 30 • C with agitation (150 RPM) until either exponential (10 h) or stationary phase (24 h) had been r eac hed, after which the culture was harvested and inoculated into multispecies culture .T he time at whic h the pr e-cultur es wer e deemed to be in exponential or stationary phase was based on pr e vious monocultur e gr owth curv es ( Supplementary Fig. S1 ).

Community culture inoculation and growth procedure
Pr e-cultur es wer e centrifuged at 5000 × g for 5 min at room temper atur e and re-suspended in Phosphate Buffered Saline (PBS), pH 7.2, at a volume of 10 × less than the initial culture volume before being enumerated and inoculated into one of 38 multispecies culture options (Table 1 ).While this is not a full-factorial experimental design, it is still a compr ehensiv e set of experiments that were sufficient for the purpose of this study.All multi-species cultures were conducted as previously described (Conacher et al. 2020 ) in SGM, at a final volume of 6 ml in a sterile 6-well tissue culture plate, which was sealed with parafilm and incubated at 30 • C, with a gitation (150 RPM).Eac h r epr esentativ e species was inoculated in v aried cell r atios, for a final total concentration of a ppr oximatel y 3 × 10 6 cells/ml −1 , measured b y v olumetric cell counts using the CytoFLEX (Beckman Coulter) flow cytometer.Monocultures of eac h species wer e gr own in par allel and wer e inoculated at an initial concentration of 3 × 10 6 cells/ml −1 ( Supplementary Fig. S2 ).

Monitoring community population dynamics
Community population sizes, represented by viable cell numbers, were determined by quantitative flow cytometry as pr e viousl y described (Conacher et al. 2022 ).Briefly, fluorescently labelled yeast species' cells are distinguished by fluorescence flow cytometry and volumetric cell concentrations are determined.Samples of 50 μl were taken at time points 0, 6, 12, 24, 48, and 72 h to quantitate viable cell numbers of each species within the community and in monocultures.Samples were diluted in PBS supplemented with EDTA [0.1 M] and propidium iodide [1 μM], prior to flow cytometry analysis.

Data pr e-pr ocessing
The first step in generating the model is to convert the dataset into a format that is computationally readable .Here , the dataset was r epr esented as a dynamic tabular dataframe ( Supplementary Dataset File ).This dataframe was generated by Table 1.Multispecies culture configurations that were conducted to generate the dataset.

Pre-culture in YPD and harvested a t sta tionary phase
Sacc harom yces cerevisiae + L. thermotolerans Lachancea thermotolerans + W. anomalus performing data pr e-pr ocessing.The categorical v ariables of pr eculture conditions were transformed using one-hot encoding, a data pr e-pr ocessing method that converts a categorical variable into a binary numerical format that is readable as an input for a model, as follo ws: tw o m utuall y exclusiv e one-hots r epr esenting the pr e-cultur e medium YPD (gr owth_medium_a) or SGM (growth_medium_b), and two m utuall y exclusiv e onehots r epr esenting the pr e-cultur e gr owth phase exponential (growth_phase_a) or stationary (growth_phase_b).The use of mutuall y exclusiv e one-hots was pr eferr ed since this pr eserv es the non-ordinal nature of the input data, and prevents numerical bias where the model might weigh categories labelled with a larger number higher than those labelled with smaller numbers.Similarl y, the pr esence of a particular y east species w as encoded in four one-hot categories for each species: S. cerevisiae as yeast_1_sc, L. thermotolerans as yeast_2_lt, T. delbrueckii as yeast_3_td, and W. anomalus as y east_4_w a.Time w as r epr esented as a timestamp in hours and percentage abundance of the yeast species in multispecies culture was represented as a value between 0 and 1, with S. cerevisiae as yeast_1_abundance, L. thermotolerans as yeast_2_abundance, T. delbrueckii as yeast_3_abundance, and W. anomalus as yeast_4_adundance.Absolute cell numbers were identicall y r epr esented, with column names indicating 'absolute'.Biological repeat values are given for each data row.

GBR model generation
The next step was to use the pr e-pr ocessed data as the basis for generating a GBR model.For this, it is necessary to classify the variables in the dataframe as features or targets .F eatures refer to the input variables of the model, while targets are the output values that the model is trained to predict based on the input features.
The dataset was read into a Python environment using P andas (v ersion 1.2.4) in a Google Collaboratory notebook ( Supplementary Code S1 ).The dataset contained 13 fea-tur es, namel y time, gr owth medium (growth_medium_a: YPD, gr owth_medium_b: SGM), gr owth phase (gr owth_phase_a: exponential, growth_phase_b: stationary), a binary indicator for presence of each yeast species at inoculation (yeast_1_sc: S .cerevisiae , yeast_2_lt: L. thermotolerans , yeast_3_td: T .delbrueckii , y east_4_w a: W . anomalus ), and initial abundance of each yeast species at inoculation (yeast_1_initial_abundance → yeast_4_initial_abundance).The targets of the model were the r elativ e abundance or absolute abundance of each yeast species (abundance: y east_1_abundance → y east_4_abundance, absolute: y east_1_absolute → y east_4_absolute).The choice to include both r elativ e and absolute abundance as targets was made since each of these parameters provides a unique view of the dynamics of the comm unity.Relativ e abundance is key for showing the composition and temporal shifts within microbial communities, while absolute abundance provides perspective on the actual scale of micr obial pr esence and activity.Since absolute abundance quantifies the total number of microbial cells or the ov er all biomass in a specified environment, it provides a measure of the scale at whic h eac h yeast species population exerts ecological functions.Combining both methods yields a more accurate and comprehensive understanding of the system (Props et al. 2017, Morton et al. 2019 ).
Next, it is necessary to split the total dataframe into rows that will be used during training of the model and rows that will be excluded or hidden during training and used for testing the accuracy of the model predictions.During these tests, for the rows that are excluded during training, the model will provide a prediction for what the target values are, and this prediction is compared to the actual target data within that row, and this comparison is used to calculate r egr ession metrics that give an indication of model performance.
Here, the dataset was divided into training and testing sets using an 80:20 ratio using the 'train_test_split' function from sklearn.model_selection module (sklearn version 0.24.2).A Gra-dient Boosting Regressor model (from sklearn.ensemblemodule) was used for the prediction of target variables .T he model's hyper par ameters wer e fine-tuned using a grid searc h a ppr oac h (GridSearc hCV fr om sklearn.model_selectionmodule) with a 3fold cr oss-v alidation on the tr aining data.The grid searc h explor ed v arious combinations of the following hyper par ameters: 'n_estimators' (50,100,200), 'max_depth' (3, 4, 5), and 'learn-ing_rate' (0.01, 0.1, 0.2).The 'n_estimators' parameter determines the number of boosting stages, balancing accuracy and overfitting risks.'max_depth', indicating the depth of each tree, was varied to optimize the complexity and overfitting trade-off.The 'learn-ing_rate' influences the step size in corrections, guiding the search for a balance between training duration and model efficiency.The grid search aimed to minimize the mean squared error (MSE), calculated using 'mean_squar ed_err or' fr om sklearn.metricsmodule.Optimization for the number of boosting stages to perform, the maximum depth of r egr ession estimators, and the learning rate was performed using a simple grid search since this was sufficient to obtain final models with good performance metrics.
For each target variable in the 'absolute' and 'abundance' sets, the trained model's predictions were evaluated on the held-out test dataset.Negative predictions were set to zero, ensuring meaningful inter pr etation using numpy's maxim um function (numpy version 1.20.1).Model performance was assessed using Concordance Correlation Coefficient (CCC), root mean squared error (RMSE), and mean absolute error (MAE)-computed using respective functions from sklearn.metrics module .T he use of CCC is important to note since this metric is amenable to compositional data and provides an indication of both precision and accuracy of the models, allowing for comparisons to be made across all the models generated here.

Model analysis and visualization
An important aspect of ML-based model analysis is inter pr etability.Her e, featur e importance analysis was carried out, which quantifies the contribution of each input parameter (feature) to the accuracy of the generated model, thereby giving a score of importance between 0 and 1.A higher feature importance score indicates that the particular input parameter plays a bigger role in the model's pr edictions.Importantl y, the ma gnitude of a featur e importance scor e can be corr elated to the influence on the model's predictions, but it does not provide information about the direction of the effect or concrete evidence of any causal relationship.Still, featur e importance scor es ar e highl y useful in impr oving inter pr etation of a model and provide important insights into the underlying processes of the model, which researchers can tak e ad v anta ge of when anal ysing complex datasets with se v er al v ariables, suc h as those commonly found in microbial ecology datasets.
Feature importance values were calculated and plotted to understand whic h featur es contributed most to the model's predictions .T he data visualization was carried out using Matplotlib (version 3.4.2) and Gr a phP ad Prism 9. A thr eshold for featur e importance value significance was calculated based on the point where the first deri vati ve of the slope of the cumulative feature importance curve for a particular model falls below a threshold of 0.01, and increases in cumulative importance diminish significantly ( Supplementary Fig. S5 ).Simply put, when an additional feature incr eased the cum ulativ e featur e importance by < 0.01, indicating that adding more features beyond this point yields diminishing returns, it was disregarded.
Additionall y, as pr ovided in the supplementary materials, learning curv es wer e plotted to illustr ate the effect of the training set size on the model's performance ( Supplementary Fig. S3 ), and pr edicted vs. actual v alue plots wer e cr eated ( Supplementary Fig. S4 ) to visualize the model's prediction accuracy for each target variable.

Subset community datasets as input
Here, the dataset was subset based on community complexity, and for each subset, the processes outlined in sections 2.6.1.1-2.6.1.3wer e r epeated.In the 'pair-based' subset, data exclusiv el y comprising pairs of yeast species ( n = 2) was isolated ( Supplementary Dataset File , Supplementary Table S1 , Supplementary Figs S6 ,  S7 ).The 'community-based' subset encompassed data involving more than tw o y east species ( n > 2) in multi-species cultures ( Supplementary Dataset File , Supplementary Table S2 , Supplementary Fig, S8 , S9 ).Based on each subset as input, respectiv e models wer e tr ained and v alidated ( Supplementary Codes S2 and S3 ).This enabled targeted analyses of various community configurations, to gain insights into the dynamics of yeast interactions under distinct conditions.

Application of the model generation and interpretation approach to a published dataset
The model generation and performance evaluation procedur e wer e r epeated for the dataset r eported in Ba gheri et al. 2020 ( Supplementary Dataset File , Supplementary Code S4 , Supplementary Table S3 , Supplementary Figs S12 , S13 ), where the follo wing w as specific to this dataset.
Since the data reported were a verages , to generate more training data points, the original dataset was supplemented with 4 points of synthesized data that were randomly generated around the mean of the data points ( Supplementary Code S4 ).

Data and code availability
All datasets are provided in the supplementary materials.All code used is provided as Google Collaboratory Notebooks ( Supplementary Code S1-S4 ).

A grid-search tuned gradient boosted regression tree model accur a tel y describes the temporal dynamics of a synthetic yeast community
A GBR model was successfully trained and optimized to describe the r elativ e and absolute abundance dynamics of a four-species synthetic yeast community in various species combinations and pr e-cultur e conditions .T he choice of model tar gets, namel y r elative abundance and absolute abundance of species are all commonly used to describe population dynamics in microbial comm unities, whic h allows for extr a polation of this modelling appr oac h to se v er al differ ent population measur ement methodologies .T he optimized parameters as well as the performance metrics for the optimized models are reported in Table 2 .According to these metrics, the model ca ptur es the population dynamics trends within the dataset fairly well, confirming the applicability of the GBR fr ame work for this datatype, as well as the gridsearc h a ppr oac h for hyper par ameter tuning.As suc h, the model algorithm of GBR was an a ppr opriate c hoice for this dataset, and a high inter pr etation confidence can be attributed to the model parameters.

Pre-culturing methodologies had minimal influence on community dynamics
To e v aluate whether the envir onment of eac h yeast prior to its introduction to the community influenced succession patterns, we made use of differing pre-culturing methodologies (Fig. 1 ).The pr e-cultur e conditions tested were growth medium that the inoculums wer e cultur ed in and gr owth phase at whic h the inoculums wer e harv ested.Ov er all, pr e-cultur e had a minimal influence on the models that describe each yeast species' succession in the comm unity (i.e.featur e importance v alues wer e below the calculated significance threshold), suggesting that community succession did in most cases not depend on the prior state of the inoculated cells ( Supplementary Fig. S10 ).One exception, howe v er, was T. delbruec kii , wher e the featur e importance v alue for pr e-cultur e in YPD was ranked third in the GBR model of absolute abundance through time (Fig. 1 F, Supplementary Fig. S10 E,  F).

Inoculation dosage of non-Saccharomyces species influences their performance during mixed culture growth, while S . cerevisiae performs consistently regardless of inoculation dosage
The GBR models were generated with input features that indicated the presence of a particular species (dosage-independent), as well as input features that indicated the inoculation dosage of each species, allowing for us to determine how pr edictiv e initial abundance is for temporal succession patterns.Inter estingl y, the inoculation dosage was ranked consistently high among all community members (Fig. 1 ), ho w ever, for S. cerevisiae , the dosageindependent feature (i.e .F eature: Presence of S. cerevisiae at inoculation) was higher ranked (Fig. 1 A, B ). T herefore , S. cerevisiae ultimately dominates the wine community through time, regardless of its initial abundance values.Also, the strongest predictor for non-Sacc harom yces succession patterns within this comm unity was the amount at which they were initially inoculated.

In mixed cultures with more than two community members, the initial dosage of S . cerevisiae impacts L . thermotolerans and W . anomalus performance
To e v aluate how succession patterns in pairwise communities may be altered by the presence of additional species, two separate GBR models were trained and validated on the relative and absolute abundance through time for r espectiv e pairwise or complex mixed culture settings .T he results were consistent between the r elativ e and absolute abundance models ( Supplementary Fig. S11 ); ther efor e, for br e vity, the r elativ e abundance models ar e pr esented.The model performances wer e compar able to the original model, with high CCC scores ( Supplementary Tables S1 -S2 ).
For S. cerevisiae and T. delbrueckii , the most important model featur es wer e r elativ el y consistent between pairwise and more complex combinations (Fig. 2 A, B , E , F ), while some interesting differ ences wer e seen for the other non-Sacc harom yces species (Fig. 2 C, D , G , H ). For L. thermotolerans and W. anomalus, in 3-and 4species settings, S. cerevisiae .inoculation dosage is highest ranked (Fig. 2 D, H ), in contrast to pairwise settings (Fig. 2 C, G ). Exploring this further, the time-series plots of T. delbrueckii , L. thermotolerans , and W. anomalus , in various mixed culture settings, echo these features (Fig. 3 B, C , D ).Specifically , T .delbrueckii performs similarl y acr oss all 3-and 4-species settings, regardless of whether S. cerevisiae is present (Fig. 3 C).In contrast, L. thermotolerans and W. anomalus both attain higher r elativ e abundance values in 3species settings where S. cerevisiae is absent (Fig. 3 B, D ).

Wickerhamomyces anomalus abundance is impacted by T . delbrueckii during pairwise culture, while L . thermotolerans abundance is impacted by S . cerevisiae across all mixed cultures
The GBR model features also highlighted species-specific impacts of initial dosage on temporal succession in the synthetic comm unities.Simpl y put, this relates to how the relative amount of a particular species at inoculation may distinctly impact on another yeast species' population dynamics .Here , the presence and dosage of S. cerevisiae were influential in the performance of L. thermotolerans in all settings (Fig. 1 C, D and Fig. 2 C, D ), while for W . anomalus , T .delbrueckii was influential during pairwise growth, but this strong pairwise interaction was not retained when additional species wer e pr esent (Fig. 2 G, H ). For S. cerevisiae , performance was lar gel y consistent r egardless of what cultur e setting it was exposed to (Fig. 2 A, B and Fig. 3 A), and T. delbrueckii was influenced most by its initial inoculation dosage (Fig. 2 C, D ).

The GBR model generation and interpretation approach is effective across different yeast consortia and environments
To further test the applicability of GBR models for highlighting factors that impact succession, we analysed a published dataset of a more complex wine yeast community in a more industrially r ele v ant fermentation environment.
Here, 10 potential community members were grown in three differ ent gr owth media, and under differ ent temper atur e and sulphur concentrations, and community succession was monitor ed thr oughout fermentation (Fig. 4 ).This dataset was generated as a compar ativ e inv estigation into the impact of abiotic and biotic factors on metabolic and community outcomes, and is a prime example of a m ulti-v ariate, m ulti-species dataset, for whic h data inter pr etation in a quantifiable w ay w ould be challenging.
The GBR model successfully described community dynamics, with high CCC scores ( Supplementary Table S3 , Supplementary Fig. S13 ).Here, we compared the insights generated by our models to the results that wer e originall y r eported by Ba gheri et al. as an illustration into what additional insights our approach provides for understanding the community succession within this complex dataset.The feature importance analysis of the resultant GBR models indeed ca ptur ed the br oad r esults that wer e originall y r eported by Ba gheri et al. namel y that the most influential input parameters were biotic in nature, while the abiotic factors: growth medium and sulphur concentration were less influential (Fig. 4 ).
Specificall y, our anal ysis sho w ed lo w or non-existent influence of growth media and sulphur addition parameters, and high influence of inoculation dosage of community members (Fig. 4 A-I ).Inter estingl y, temper atur e was highlighted as an influential parameter within this community, being ranked within the top four most influential model input features for S. bacillarus (Fig. 4  Similar to the results within our synthetic community, this analysis also highlighted specific impacts of initial abundance of certain community members on each other.Namely, the initial abundance of M. pulcherrima ranked highly for impacting temporal dynamics of both S. cerevisiae strains (Fig. 4 H, I ), the initial abundance of W. anomalus a ppear ed to impact C. parapsilosis performance within the community (Fig. 4 D), and finally, there was a r ecipr ocal impact of initial abundance of L. thermotolerans and H. vineae on each other's succession within the community (Fig. 4 F, G ).These results provide clues into how this complex community may be manipulated, and which factors are most important in controlling the wine yeast community.

Discussion
This study aimed to shed light on how factors pertinent to arrival or inoculation impact on community succession within a model yeast ecosystem.By modelling population dynamics with a ML algorithm, we sought to investigate how a yeast's environment prior to inoculation, as well as initial abundance at inoculation impact on the trajectory of each yeast.A comprehensive dataset of tempor al comm unity data was gener ated in v arious numbers and combinations of wine y east species, as w ell as differing pr epar ation methods for each species' inoculum.By leveraging GBR models, we identified the most likely factors related to inoculation that alter temporal dynamics of each community member, and validate the strength of this approach in a previously published dataset that includes more varied environmental parameters.
The first r esearc h question r elated to how the history of a particular yeast species prior to its arrival in the community would impact on its trajectory within a mixed-species culture.In the four-species yeast community investigated here, differing pr e-cultur e conditions impacted far less than the combination of yeast species present, except for T. delbrueckii .T his ma y be linked to pr e viousl y observ ed high str ess r esponse observ ed in the tr anscriptome of T. delbruec kii during ada ptation to a high sugar medium (Tondini et al. 2020 ).This highlights the challenge that e v en for studies using synthetic systems with reduced complexity, a careful evaluation of pre-culture conditions should be carried out, as species-specific impacts may arise.).The r elativ e abundance of L. thermotolerans was consistently impacted by S. cerevisiae , and the inter-species interaction betw een these tw o y easts is indeed w ell studied and is hypothesized to be antagonistic in nature (Shekhawat et al. 2019, Luyt et al. 2021 ).The important influence of inoculation dosage in regulating temporal dynamics has been observed in other wine yeast comm unities (Ba gheri et al. 2018 ), and is an indication that this is important to consider in the quest to r ationall y contr ol tempor al succession in these communities.
The thir d resear ch question related to how the number and identity of species present at arrival may impact on community succession, with a focus on whether we could identify signatures of m ulti-species-inter actions in our dataset.Her e we identify a Most influenƟal parameters for the generated GBRs were determined and filtered according to a threshold strong influence of T. delbrueckii on W. anomalus in pairwise coculture; ho w ever, the presence of S. cerevisiae modulates this influence .T his shows that the presence of an additional species can impact dynamics of this pair in a species-specific manner.This has implications for future studies that wish to elucidate the mechanisms of interspecies interactions within the wine yeast comm unity, as building pr edictiv e models of the system will r equire inclusion of data from complex multispecies cultures, and not pairwise co-cultures alone (Chang et al. 2023 ).Finally, to test whether our approach is effective across datasets and to gain a better understanding of how general the influence of inoculation dosage is for yeast communities in more naturally r ele v ant conditions, we then generated a model for a more complex wine yeast comm unity, wher e abundance dynamics were measur ed in differ ent temper atur es, sulphur concentr ations, and growth media-including real grape must (Bagheri et al. 2020 ).The GBR feature importance values aligned well to the originally r eported anal ysis of the dataset, but also pr ovided ne w insights, namely into the fact that inoculation dosage was also highly influential in temporal population dynamics within the community, and that temper atur e (but not gr owth medium nor sulphur addition) modulated community trends .T his pro vides more robust evidence for our claim that inoculation dosages are highly important in determining yeast community dynamics, and introduces temper atur e as a way to manipulate community succession.We further also highlighted se v er al inter esting impacts betw een y east species in this community that w er e not originall y r eported, namel y between M. pulc herrima initial abundance and both S. cerevisiae strains-this pair has indeed been shown pr e viousl y to interact with one another at the cellular and transcriptional We ther efor e pr ovide further e vidence of the str ength of tree-based GBR algorithms in modelling wine yeast communityr elated datasets, whic h is in line with the success of other treebased algorithms that have been applied in a microbial ecology context (DiMucci et al. 2018, Thompson et al. 2019, Nestor et al. 2023 ), and that this a ppr oac h pr oduced accur ate and interpr etable models acr oss differ ent consortia complexities and environments .T his approach allo w ed us to quantitativ el y discern patterns in each dataset that may not have been readily apparent through manual inspection of the dataset or simple gr a phing tec hniques.Pr omisingl y, the model featur es identified as important aligned with pr e viousl y r eported r esearc h findings, while still pro viding no vel insights from the dataset that had been missed, adding credibility to this approach in the context of wrangling complex temporal datasets to mine for factors important in community succession.Ho w ever, the concept of using an inter pr etable model to assist in identifying driving factors in multi-species systems is important to distinguish from the traditional approach of generating a model for generalizing across different datasets and simulating experiments .T he a ppr oac h pr esented her e is designed as a tool for mining complex time-series datasets as they are, and the models generated are not appropriate for use as a predictive sim ulation tool acr oss nov el experimental conditions or datasets.Pr e vious studies hav e also observ ed that this type of model is not accurate in providing predictions for conditions that the model has not been exposed to (DiMucci et al. 2018, Nestor et al. 2023 ).We also emphasize that the feature importance analysis is not causativ e, and is r eliant on the categorical v ariables pr ovided; we ther efor e str ongl y encour a ge further experimental v alidation of these features.
This study provides a significant contribution to addressing the curr ent c hallenges in understanding the dynamics of microbial comm unities by pr esenting a w ay to quantify ho w species are differ entiall y impacted by biotic and abiotic parameters within synthetic or natural ecological contexts, opening avenues for rational manipulation and mechanistic studies in tar geted envir onments.This study is one of v ery fe w that has incor por ated experimental data generation with ML for the study of a complex microbial comm unity.Further, we pr ovide a r ar e look at how a stepwise increase in complexity of a microbial community changes its succession dynamics.Notably, viable cell numbers are provided, whic h ar e usuall y not possible to r eport due to the molecular nature of most microbial community quantitation techniques.Within this context, the study is limited by the fact that it does not e v aluate molecular inter action mec hanisms and is based at the population-cellular le v el.Also, incor por ation of mor e experimental abiotic parameters ma y ha ve shown that factors besides inoculation dosa ge, suc h as oxygenation or av ailable nitr ogen, may be more impactful in modulating community succession, and we recommend exploring this further with the aid of the approach we pr esent her e.In addition, it will be essential to link the manipulation of temporal succession dynamics to community function- ing, such as sugar consumption, in future studies.Further , exploring more complex ML tools that incorporate time series memory, suc h as r ecurr ent neur al networks , ma y impro ve generalizability (Thompson et al. 2023 ).
We envision that the concepts and data presented here should aid in bridging the gap in understanding complex dynamics in mi-cr obial comm unities by acceler ating data inter pr etation and allowing for targeted experimental design.This should also significantly aid in the challenge of rational design of synthetic consortia, since community succession is an important factor in the functioning of micr obial comm unities.With the r esources pr ovided in this manuscript, particularly for non-experts in coding, r epr oducing model gener ation for an y tar get dataset should be possible within r eason.Giv en the r a pid pace of integr ation of ML algorithms into microbiological research, this work presents an a ppr oac hable and po w erful analysis tool for domain experts to get the most out of their microbial community datasets.

Figur e 1 .
Figur e 1. F eature importance values for GBR models trained on the full multispecies culture dataset.The model targets were either relative abundance (A, C, E, and G) or absolute abundance (B, D, F, and H).Featur e v alues ar e r eported for eac h species within the comm unity, namel y S. cerevisiae (A, B), L. thermotolerans (C, D), T. delbrueckii (E, F), and W. anomalus (G, H).

Figur e 2 .
Figur e 2. F eature importance values for GBR models trained on subsets of the dataset, including either 2 (A, C, E, G) or more than 2 (B, D, F, H) species in the m ultispecies cultur e .T he model tar gets wer e all r elativ e abundance .F eatur e v alues ar e r eported for eac h species within the comm unity, namel y S. cerevisiae (A, B), L. thermotolerans (C, D), T. delbrueckii (E, F), and W. anomalus (G, H).

Figure 3 .
Figure 3. Relative abundance time series of each community member in each possible multispecies configuration tested.Values are averaged across all pr e-cultur e conditions tested.Time series ar e shown for S. cerevisiae (A), L. thermotolerans (B), T. delbrueckii (C), and W. anomalus (D).The legend contains abbr e viations for the particular combination of yeasts within the time series, namely Sc: S. cerevisiae , Lt: L. thermotolerans , Td: T. delbrueckii , and Wa: W. anomalus .

Figur e 4 .
Figur e 4. F eatur e importance v alues for GBR models tr ained on r elativ e abundance of a nine-species wine yeast community in differ ent gr owth media, gr owth temper atur es, and sulphur addition.Featur e importance v alues for eac h species ar e r eported, namel y: M. pulc herrima (A), P .terricola (B), S. bacillaris (C), C. parapsilosis (D), W. anomalus (E), L. thermotolerans (F), H. vineae (G), S. cerevisiae EC1118 (H), and S. cerevisiae Indigenous Strain (I).The legend for A-I is provided, with each bar colour representing a model feature.A summary heatmap of the feature importance values is provided for comparison, with a colour scale where darker blue is a low value and brighter y ello w is a higher value (K).

Table 2 .
Optimal parameters and performance of the GBR models.
1: Number of boosting stages to perform 2: Maximum depth of the individual r egr ession estimators 3: Shrinks the contribution of each tree by the learning rate 4: Concordance Correlation Coefficient GBR model generated using the enƟre mixed culture dataset (Table 1) for training and validaƟon, with species percentage abundance values as the target GBR model generated using the enƟre mixed culture dataset (Table 1) for training and validaƟon, with species cell concentraƟon values as the target Focus: Community composiƟon through Ɵme : Direct quanƟtaƟon of species biomass through Ɵme Most influenƟal parameters for the generated GBRs were determined and filtered according to a threshold ΔS. cerevisiae(t) ΔL. thermotolerans(t) ΔT. delbrueckii(t) ΔW. anomalus(t) ΔS. cerevisiae(t) ΔL. thermotolerans(t) ΔT. delbrueckii(t) ΔW. anomalus(t) Focus

n=2 RelaƟve Abundance Model Data
Generate a GBR model using n=2 training and test dataGenerate a GBR model using n>2 training and test data Data pre-processed to remove data points including n>2 experimental data pre-processed to remove data points including n=2 experimental data