Data integration strategies for whole-cell modeling

Abstract
Data makes the world go round, and high-quality data is a prerequisite for precise models, especially for whole-cell models (WCM). Data for WCM must be reusable, contain information about the exact experimental background, and should, in its entirety, cover all relevant processes in the cell. Here, we review basic requirements for data for WCM and strategies for combining them. As a species-specific resource, we introduce the Yeast Cell Model Data Base (YCMDB) to illustrate requirements and solutions. We discuss recent standards for data as well as for computational models, including the modeling process as data to be reported. We outline strategies for the construction of WCM despite their inherent complexity.


Introduction
Whole-cell models (WCM) have proven to be as insightful as they are challenging to establish. By covering systemic effects and interactions between individual biological processes, they pave the way to a deeper mechanistic understanding of cells, organisms, and diseases.
As for all models, WCMs rely on experimental data that are ideally both high resolution in time and space and high quality in terms of reproducibility and accuracy. However, due to the wide variety of biological processes covered, the data sets on which WCM are based have to fulfill further requirements: They have to be obtained from comparable (ideally the same) experimental conditions for all measurement methods of diverse biomolecules and processes. They have to cover all relevant processes in a quantitative manner. Naturally, they have the highest information content if they are acquired from single cells or at least from cell cycle synchronized colonies.
Due to the technical complexity of modeling all processes in a cell, WCM has so far been carried out for simple model organisms only.
WCM approaches are available, e.g. for Mycoplasma genitalium, a microorganism with a well-characterized small genome. Also, for the eukaryotic model organism Saccharomyces cerevisiae, WCM efforts have been established, harnessing decades' worth of experimental data available from the literature. This data has, however, been obtained from many different strains, culture conditions, and measurement methods, resulting in questionable comparability of the gained results. On the other hand, if all environmental conditions are carefully considered, the data can be compiled and converted into a fully quantitative description of a prototypic yeast cell.
Here, we discuss the types of data needed for a yeast WCM and introduce a database that contains a systematic collection of data intended to parameterize such a model. We review existing models for parts of the cellular life as well as strategies for their integration.

What is a whole-cell model?
A whole-cell model is a computational model that attempts to simulate the behavior of an entire living cell, capturing the intricate interactions between various cellular components such as proteins, metabolites, genes, and regulatory networks. These models aim to provide a comprehensive understanding of cellular processes at the systems level, allowing researchers to study how different molecular components work together to carry out functions such as metabolism, signaling, and gene expression.
WCM typically integrate experimental data from various sources, including genomics, proteomics, metabolomics, and bioinformatics databases, to construct a detailed representation of cellular processes. They often utilize mathematical and computational techniques such as differential equations, stochastic modeling, and constraint-based modeling to simulate the dynamics of cellular systems under different conditions.
The development of WCM holds promise for advancing our understanding of cellular biology, enabling the prediction of cellular behaviors in response to perturbation and providing insights into the underlying mechanisms of diseases. These models can also be used to guide experimental research and facilitate the design of novel therapies and biotechnological applications.
There is now a series of models available that strive for a holistic description of cellular processes. However, they are different in nature and, therefore, require different types of data. Most of these models represent different bacteria, as processes in prokaryotic cells are less complex, especially due to less compartmentalization. We start with a brief introduction to some WCM efforts.
The E-Cell project was launched at Keio University by the Tomita group already in 1996 and studied M. genitalium. This organism is especially interesting for building a WCM, because it has the smallest known genome of all free-living organisms with 580 kb coding for 563 genes (Fraser et al. 1995). The small size of the genome gave rise to the assumption of a minimal functional gene set with only essential genes and a unique assignment of genes and functions. The E-Cell project intended to develop general technologies and theoretical support for computational biology with the grand aim to make precise whole cell simulations at the molecular level possible. Their first virtual hypothetical cell, published in 1997 and 1999, contained 127 genes and was self-sustained by producing energy from glucose with the help of the encoded genes, but comprised neither proliferation, DNA replication, nor a cell cycle (Tomita et al. 1997, 1999). The endeavor continues as the E-Cell Project (e-cell.org). The whole-cell model of Karr, Covert and colleagues intended to predict phenotype from genotype, again for M. genitalium (Karr et al. 2012). It contains 28 submodels of cellular processes and 16 variables that integrate cellular functions. Shuler and colleagues developed a series of minimal models for E. coli, which included growth on glucose, certain synthesis pathways as well as cell size and cell shape (Domach et al. 2000, Castellanos et al. 2004). Morgan and colleagues introduced, in a series of publications, a framework of whole-cell modeling that extended towards growth and division (Morgan et al. 2004, Surovtsev et al. 2008, 2009).

Figure 1. A WCM captures cellular maintenance and the cell division cycle, as well as cellular responses to external stresses. Here, icons within the sketched yeast cell symbolize the major processes, and MET stands for metabolism, SIG for signaling, GEX for gene expression, TRP for transport, CDC for cell division cycle, and VOL for volume changes and growth. Around the cell, G1, S, M, and G2 indicate the cell cycle phases, where the cell starts as a small cell with only one copy of DNA and then grows in G1 phase. In S phase, DNA is duplicated and yeast cells start to form a bud. After another phase of growth in G2, cells organize division into mother and daughter cell during M phase.
Recently, an alternative type of model used the principle of resource balance analysis to quantitatively predict whole cell properties such as growth rate, metabolic fluxes, or abundances of molecular machines in different bacteria (e.g. Goelzer et al. 2015, Weisse et al. 2015, Bulovic et al. 2019). Molenaar et al. (2009) used a very simplified representation of a cell to elaborate the hypothesis that growth strategies are the result of tradeoffs in the economy of the cell, in which growth rate maximization of the entire system is the objective, and where the only limitations are those set by the laws of physics and chemistry. Elsemman et al. (2022) applied comparable principles to yeast to analyze metabolic adaptation to nutrient conditions. The fate of single cells, their cell cycle, shape, or diversity are not considered in this type of modeling approach.
Our understanding of a whole-cell model as relevant for the considerations below comprises three major features (see Fig. 1). First, it is focused on a single, though average, cell. Second, it is dynamic, which means it covers the life of that cell from one division to the next division. Third, it should include all major processes of cellular life along with the machinery necessary for cellular replication, though the level of detail can vary. An additional desired feature is responsiveness of the cell to selected external changes of conditions and all kinds of cues.
In order to give a reader who is not experienced with mechanistic modeling a brief introduction into the steps of model development and requirements, we illustrate the workflow of dynamic modeling of a small biochemical reaction network in Fig. 2.
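
As a minimal sketch of this workflow, the six-reaction network of Fig. 2 can be written as a small ODE system and integrated with a hand-written fourth-order Runge-Kutta step. Several details are assumptions made only for illustration: all parameter values and initial conditions are guessed, v1 is taken as constant, reaction 3 is assumed to convert S4 into S2 with activation by S1, and reaction 5 is assumed to be inhibited by S3 through a Hill term.

```python
from typing import Callable, Dict, List

State = List[float]  # concentrations in the order [S1, S2, S3, S4]

# Guessed parameter values (assumptions, not taken from the article).
PARAMS: Dict[str, float] = {
    "v1": 1.0,                            # constant production of S1
    "Vmax2": 2.0, "KM2": 1.0,             # Michaelis-Menten degradation of S1
    "k3": 1.0, "k4": 0.5,                 # mass action interconversion of S4 and S2
    "Vmax5": 1.0, "K05": 0.5, "n": 2.0,   # Hill-type production of S3
    "k6": 0.8,                            # mass action degradation of S3
}

def rates(s: State, p: Dict[str, float]) -> List[float]:
    """Rate expressions v1..v6 (panel D of the workflow)."""
    s1, s2, s3, s4 = s
    v1 = p["v1"]
    v2 = p["Vmax2"] * s1 / (p["KM2"] + s1)
    v3 = p["k3"] * s1 * s4                 # S4 -> S2, activated by S1 (assumed form)
    v4 = p["k4"] * s2                      # S2 -> S4
    kn = p["K05"] ** p["n"]
    v5 = p["Vmax5"] * kn / (kn + s3 ** p["n"])  # inhibited by S3 (assumed form)
    v6 = p["k6"] * s3
    return [v1, v2, v3, v4, v5, v6]

def rhs(s: State, p: Dict[str, float]) -> State:
    """System equations: dS/dt from rates and stoichiometry (panel C)."""
    v1, v2, v3, v4, v5, v6 = rates(s, p)
    return [v1 - v2, v3 - v4, v5 - v6, v4 - v3]

def rk4_step(f: Callable[[State], State], s: State, dt: float) -> State:
    """One classical Runge-Kutta step of size dt."""
    k1 = f(s)
    k2 = f([x + 0.5 * dt * k for x, k in zip(s, k1)])
    k3 = f([x + 0.5 * dt * k for x, k in zip(s, k2)])
    k4 = f([x + dt * k for x, k in zip(s, k3)])
    return [x + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(s, k1, k2, k3, k4)]

def simulate(s0: State, p: Dict[str, float], t_end: float, dt: float = 0.01) -> List[State]:
    """Time course simulation from the initial conditions s0 (panels F and G)."""
    trajectory = [list(s0)]
    for _ in range(int(round(t_end / dt))):
        trajectory.append(rk4_step(lambda y: rhs(y, p), trajectory[-1], dt))
    return trajectory

trajectory = simulate([0.0, 0.0, 0.0, 1.0], PARAMS, t_end=50.0)
final = trajectory[-1]
```

With these guessed values, the balance v1 = v2 gives a steady state of S1 = v1 * KM2 / (Vmax2 - v1) = 1, and the sum S2 + S4 stays constant because reactions 3 and 4 only interconvert the two compounds.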

Standardization of data generation, storage, and sharing
The FAIR (findable, accessible, interoperable, and reusable) principles have become indispensable for any project that uses shared data or intends its data to be reused (Wilkinson et al. 2016, David et al. 2023). Well-curated databases such as GenBank, PDB, UniProt, ChEBI, Ensembl, or Pfam are standard resources for data in modeling projects (Benson et al. 2012, Hastings et al. 2016, Burley et al. 2017, Mistry et al. 2021, Martin et al. 2023, Uniprot et al. 2023). However, other types of data that are useful or required to parameterize a WCM cannot be found in the standardized databases.

Figure 2. (A) The information about the processes to be covered by the model can be given in graphical representation. The example used here can be representative of both metabolic and signaling processes: compound S1 is produced and degraded by reactions 1 and 2 (with velocities v1 and v2), compounds S2 and S4 are converted into each other by reactions 3 and 4, and compound S3 is produced and degraded by reactions 5 and 6. Compounds S1 and S3 modify (activate or inhibit) the velocities of reactions 3 and 5, respectively, without being consumed or produced themselves by these reactions. (B) The systems equations, in general, represent the temporal changes of the compounds Si (denoted by the time derivative d/dt), which are given by the rates (or velocities) vj combined with the stoichiometric coefficients. The necessary steps such that the system can be simulated are sketched in panels (C)-(F): (C) represents the set of systems equations for the example in (A). (D) illustrates choices for rate expressions: v3, v4, and v6 follow mass action, where the parameters k stand for rate constants, v2 is an example for Michaelis-Menten kinetics (with Vmax the maximal velocity and KM the Michaelis constant), and v5 for Hill kinetics (K0.5 is the concentration giving half-maximal velocity, n is the Hill coefficient). (E) Parameter values can be either obtained from databases, estimated from experimental data (genomics, proteomics, metabolomics, and biophysical measurements) or simply guessed (as done here). Briefly, parameter estimation requires systematic repeated simulation with different parameter values and comparison with experimental data with the aim to minimize the difference between data and simulation. (F) For a simulation to start, one has to determine the initial conditions. (G)-(J) are examples for simulation experiments based on the ODE system in panels (C)-(F). (G) shows a time course simulation. (H) presents the state space for S1 and S2, where vectors indicate the direction of motion from different starting points. (G) and (H) show that the system moves toward a steady state. (I) A typical way to analyze the ODE system is sensitivity analysis, i.e. testing the effect of small parameter variations on the dynamics. Here, parameter Vmax2 has been varied (10 simulations with different values). (J) shows the result of a stochastic simulation of the same system with the Langevin approach, where a noise term is added to each equation, resulting in slightly different dynamics for each of, here, 10 simulations. (K)-(M) Boolean model of a comparable system: component S1 activates S2, S2 and S4 can be converted into each other (thereby annihilating the other component), and S2 activates S3. Here, all compounds can have only two states, ON or OFF; also, time proceeds in discrete steps. (K) Graph of the model. (L) Systems equations denote the state of the compound at the right side at time t + 1 as function of the state of components at the left side at time t. These changes are expressed with Boolean rules. (M) shows two simulation experiments with different initial conditions, where S4 starts ON at t0 in both cases and S1 is either OFF or ON. If S1 is OFF, the system is already at a fixed point and shows no changes in the following time steps. If S1 is ON, the system oscillates, i.e. it has a cyclic attractor. For both ODE and Boolean modeling, it is often necessary to revise the model and repeat the modeling steps, i.e. network creation (components and reactions), assignment of rate expressions or rules, and the parameter values, until the model behavior correctly reflects the experimentally observed behavior of the system.
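
The Boolean variant described for panels (K)-(M) of Fig. 2 can be sketched in a few lines. The update rules below are an assumption: they are one possible rule set that is consistent with the verbal description (S1 activates S2, S2 and S4 are interconverted, S2 activates S3) and that reproduces the reported behavior, namely a fixed point when S1 is OFF and an oscillation when S1 is ON. The rules actually shown in the figure may differ.

```python
from typing import Dict, List

State = Dict[str, bool]

def step(s: State) -> State:
    """Synchronous update: every rule reads the state at time t."""
    return {
        "S1": s["S1"],                               # external input, held fixed
        "S2": s["S1"] and s["S4"],                   # S1-activated conversion S4 -> S2
        "S3": s["S2"],                               # S2 activates S3
        "S4": s["S2"] or (s["S4"] and not s["S1"]),  # back-conversion S2 -> S4, or S4 persists
    }

def run(s0: State, n_steps: int) -> List[State]:
    """Return the sequence of states for n_steps discrete time steps."""
    states = [dict(s0)]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states

def attractor_length(s0: State, max_steps: int = 32) -> int:
    """Length of the attractor reached from s0 (1 means a fixed point)."""
    seen: Dict[tuple, int] = {}
    s = dict(s0)
    for t in range(max_steps):
        key = tuple(sorted(s.items()))
        if key in seen:
            return t - seen[key]
        seen[key] = t
        s = step(s)
    raise RuntimeError("no attractor found within max_steps")

# Two simulation experiments: S4 starts ON, S1 is either OFF or ON.
start_off: State = {"S1": False, "S2": False, "S3": False, "S4": True}
start_on: State = {"S1": True, "S2": False, "S3": False, "S4": True}

period_off = attractor_length(start_off)
period_on = attractor_length(start_on)
```

Because the network has only 16 possible states, every trajectory must revisit a state within 17 steps, so the default `max_steps` is sufficient for this example.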
The modeling process itself establishes a type of data critical for WCM. Any computational model is characterized by a series of choices or decisions: (i) the assignment between real-world modalities (e.g. abundances, volumes, pressure, and temperature) and variables, parameters, or constants of the model, (ii) the treatment of variables as discrete (e.g. molecule numbers) or continuous (concentrations), (iii) the discrimination between what belongs to the model and what to its environment, (iv) the types of interactions between the variables, i.e. whether there is an interaction and which model describes this interaction best (e.g. either mass action or Michaelis-Menten kinetics or another kinetic type for metabolic interconversions), (v) the formalism to update the state of variables, e.g. a set of differential equations, a Boolean model, or a stochastic simulation, including its detailed settings (e.g. which program and which solver to use). Moreover, each computational experiment with the model creates a data set, i.e. which model perturbation (e.g. parameter change, addition of a cue, or set of repetitions for stochastic models) and type of simulation (e.g. steady state or time course) has been performed and what the outcome is. Thus, the model formalism and all relevant decisions have to be reported in sufficient detail to reproduce the model results. This also includes implicit assumptions. In addition, the parameter estimation process or alternative ways to obtain parameters have to be outlined. More and more computational projects are organized in shareable, version-controlled repositories (e.g. with Git), which makes this kind of information accessible to the scientific community.
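
To make the notion of a documented parameter estimation concrete, the following sketch fits a single rate constant by systematically scanning candidate values and comparing model output with data. Everything here is an illustrative assumption: the model is a first-order decay evaluated from its analytical solution instead of a numerical simulation, the "measurements" are synthetic values generated with k = 0.3 plus small fixed offsets, and a plain grid scan stands in for the optimizers used in practice. The final dictionary shows the kind of settings that would need to be reported alongside the result.

```python
import math
from typing import Dict, List, Tuple

def simulate_decay(k: float, s0: float, times: List[float]) -> List[float]:
    """Model prediction for first-order (mass action) decay, dS/dt = -k * S."""
    return [s0 * math.exp(-k * t) for t in times]

def sum_of_squares(sim: List[float], data: List[float]) -> float:
    """Difference between data and simulation that the estimation minimizes."""
    return sum((a - b) ** 2 for a, b in zip(sim, data))

def scan_parameter(times: List[float], data: List[float], s0: float,
                   candidates: List[float]) -> Tuple[float, float]:
    """Evaluate the model for every candidate k and keep the best one."""
    best_k, best_cost = candidates[0], float("inf")
    for k in candidates:
        cost = sum_of_squares(simulate_decay(k, s0, times), data)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost

# Synthetic data (not real measurements): k = 0.3 plus small fixed offsets.
TIMES = [0.0, 1.0, 2.0, 4.0, 8.0]
OFFSETS = [0.02, -0.01, 0.01, -0.02, 0.01]
DATA = [y + e for y, e in zip(simulate_decay(0.3, 10.0, TIMES), OFFSETS)]

CANDIDATES = [i / 100 for i in range(1, 101)]  # k from 0.01 to 1.00
best_k, best_cost = scan_parameter(TIMES, DATA, 10.0, CANDIDATES)

# Settings worth reporting so that the estimation can be reproduced.
report: Dict[str, object] = {
    "model": "dS/dt = -k * S",
    "objective": "sum of squared residuals",
    "method": "grid scan",
    "grid": (CANDIDATES[0], CANDIDATES[-1], len(CANDIDATES)),
    "estimate": best_k,
}
```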
In the last 20 years, the scientific area of mathematical modeling in biology (often called systems biology) has seen an explosion of standards that have been developed for different purposes and modalities, starting with the systems biology markup language (SBML) standard on the modeling side (Hucka et al. 2003) and the minimum information requested in the annotation of biochemical models (MIRIAM) (Novere et al. 2005) on the experimental side. SBML is a standard format used for representing computational models of biological processes in systems biology. It provides a means for describing models in a standardized, machine-readable format, enabling interoperability between different software tools and facilitating model sharing and collaboration among researchers. SBML defines a set of rules and structures for representing biochemical reaction networks, including information such as species, reactions, compartments, and parameters. It is designed to be human-readable as well as machine-readable, making it accessible to both researchers and software applications. SBML allows modelers to describe complex biological systems in a concise and unambiguous way, facilitating the exchange of models between different modeling and simulation tools. This standardization helps promote reproducibility and transparency in systems biology research by enabling researchers to share their models and results more easily. The most recent version of SBML is SBML Level 3, offered as an extensible format for the exchange and reuse of biological models (Keating et al. 2020).
MIRIAM is a set of guidelines and recommendations for annotating computational models in systems biology. It was developed to improve the clarity, consistency, and interoperability of model annotations, facilitating the exchange and reuse of models among researchers and software tools. It provides a framework for annotating models with metadata that describe various aspects of the model, such as its creators, version history, biological context, and experimental conditions. This metadata helps ensure that models are properly documented and can be understood and used by others.
The COMBINE standards (https://co.mbine.org/) comprise a series of modeling formats, i.e. next to SBML also BioPAX (Demir et al. 2010) as a language to integrate, analyze, and exchange pathway data, CellML (Clerx et al. 2020) to store and exchange computer-based mathematical models, and Synthetic Biology Open Language Data (SBOL Data, Mclaughlin et al. 2020) for the description and the exchange of synthetic biological parts, devices, and systems.
The Systems Biology Graphical Notation (SBGN; Novere et al. 2009) is a standardized graphical language used to represent biological processes, pathways, and networks. It provides a set of symbols and rules for creating visual representations of biological systems, enabling researchers to communicate complex biological concepts in a clear and unambiguous way. SBGN consists of three main languages or branches: the Process Description, which focuses on the relationships between biological entities and processes within a network, the Entity Relationship, which emphasizes the relationships between biological entities, such as proteins, genes, and complexes, without specifying the exact nature of the interactions, and the Activity Flow (AF), which describes the flow of biological activities or signals within a network.
Last but not least, the Simulation Experiment Description Markup Language (SED-ML; Bergmann 2011) is a standard format for describing computational simulation experiments. It provides a means for specifying the setup, execution, and postprocessing of simulation experiments in a machine-readable format, facilitating the reproducibility and exchange of simulation studies among researchers and software tools. It includes model description, simulation settings, experimental conditions, as well as specifications of outputs and analyses.
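
To give an impression of what a machine-readable model looks like, the sketch below embeds a small hand-written SBML Level 3 Version 1 fragment and reads its species and reactions with Python's standard XML parser. The fragment is an illustrative assumption: it encodes only the interconversion of two species, omits kinetic laws, units, and MIRIAM annotations, and has not been validated against the SBML schema. In real projects, a dedicated library such as libSBML would normally be used for reading and writing SBML.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

SBML_NS = "http://www.sbml.org/sbml/level3/version1/core"

# Hand-written toy model (hypothetical ids, no kinetic laws).
SBML_TEXT = f"""
<sbml xmlns="{SBML_NS}" level="3" version="1">
  <model id="toy_interconversion">
    <listOfCompartments>
      <compartment id="cell" size="1" constant="true"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="S2" compartment="cell" initialConcentration="0"
               hasOnlySubstanceUnits="false" boundaryCondition="false" constant="false"/>
      <species id="S4" compartment="cell" initialConcentration="1"
               hasOnlySubstanceUnits="false" boundaryCondition="false" constant="false"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="r3" reversible="false" fast="false">
        <listOfReactants>
          <speciesReference species="S4" stoichiometry="1" constant="true"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="S2" stoichiometry="1" constant="true"/>
        </listOfProducts>
      </reaction>
      <reaction id="r4" reversible="false" fast="false">
        <listOfReactants>
          <speciesReference species="S2" stoichiometry="1" constant="true"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="S4" stoichiometry="1" constant="true"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

NS = {"sbml": SBML_NS}

def read_model(text: str) -> Tuple[List[str], Dict[str, Tuple[List[str], List[str]]]]:
    """Return species ids and, per reaction id, its (reactants, products)."""
    root = ET.fromstring(text.strip())
    species = [s.get("id") for s in root.findall(".//sbml:species", NS)]
    reactions = {}
    for r in root.findall(".//sbml:reaction", NS):
        reactants = [x.get("species") for x in
                     r.findall("sbml:listOfReactants/sbml:speciesReference", NS)]
        products = [x.get("species") for x in
                    r.findall("sbml:listOfProducts/sbml:speciesReference", NS)]
        reactions[r.get("id")] = (reactants, products)
    return species, reactions

species, reactions = read_model(SBML_TEXT)
```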
There are open standard model repositories such as BioModels (Malik-Sheriff et al. 2020, Glont et al. 2018), CellCollective (Helikar et al. 2012), JWSonline (Olivier and Snoep 2004), and Physiome (Hunter 2004), where models can be stored, accessed, and in some cases even created for a specific purpose. Specific communities develop not only their own standards, but even their related test tools to assess the quality of models in that field. A recent example is metabolic model tests for genome-scale metabolic models (Lieven et al. 2020).

Data integration entails, but is more than, storing data in the same database
What is data when we talk about whole-cell modeling? First of all, it comprises all results of experimental investigations of the system to be modeled. Second, it entails all so-called meta-data, i.e. extended information about these experiments (such as cultivation context, growth rates, or instruments used) that is helpful or necessary to correctly interpret the results. And finally, it has to cover all the aspects of the modeling exercise itself. We will elucidate the experimental part of these three types of data below.
Within the realm of a WCM, the experimental data should preferentially comprise comprehensive information about all parts and processes of a living cell. It typically includes compound abundance information from genomics, transcriptomics, proteomics, phosphoproteomics, and metabolomics. This yields values of abundances or relative abundance changes of proteins, their complexes and modification states (e.g. phosphorylation and methylation), abundances of mRNAs, and concentrations of metabolites or changes thereof. In the best case, this data is time-resolved over relevant processes such as the cell cycle or the duration of a response to a dedicated perturbation. Further information is required on rates, such as growth rates, rates of nutrient uptake and release of waste or other components, respiration rates, reaction rates, rates for transcription and translation, and rates for decay of cellular compounds. Relevant biophysical information comprises cell mass, volume, shape, surface area, sizes of compartments, external and internal osmolarity as well as turgor pressure.
Meta-data is required to, first, make different data sets comparable and, second, characterize the context of the measurement. It includes all information about the analyzed organism (e.g. species, strain, accession, and genetic modifications). The experiment itself must be sufficiently described: what has been measured under which circumstances, which instruments have been used, and possibly even specific settings of the instrument, if relevant. Next, different types of physical information must be reported, such as temperature, pH, and medium composition. And finally, the changes in the environment during the experiment can be relevant, e.g. which nutrients were depleted and which compounds were released by the investigated organism. This information is crucial to understand whether the measured values are comparable or convertible between different studies. Only if that is possible can they be used in the same mathematical model. As an example, the pH value of the medium has an effect on many cellular processes, including internal metabolite concentrations, membrane potentials, and respiratory rates. Hence, data acquired in media with vastly different pH values cannot be used in the same model without at least precaution or recalculation.

The YCMDB as an example for systematic collection and integration of data
Making data reusable as well as understanding the exact experimental background and, hence, the data usability for a specific modeling project is the idea behind YCMDB, a curated database for quantitative modeling of yeast (https://tbp-klipp.science/ycmdb). An overview of the database is represented in Fig. 3. The database collects published as well as novel data that fulfill the requirements for insightful mathematical models: quantitativeness (including measurement uncertainties), ideally temporal and single-cell resolution, as well as a high level of annotation. Data on compound concentrations and uptake rates, synthesis and degradation rates, as well as general biophysical numbers such as cell size or cell cycle phase lengths are included in the relational DB.
The number of data points contained in the YCMDB is 10 727 for metabolic data, 151 299 for genomic data, and 3488 for biophysical data. They stem from 74 publications. A total of 164 different media have been reported as well as 63 different strains (excluding k.o. libraries). A total of 16.3% of the contained data belongs to BY4742 and 73.5% to CEN.PK.
Each data point is annotated with specific meta-information to appraise how comparable the data is to the modeled scenario (culture conditions, growth phase, and measurement techniques) and is referenceable with a static ID. The high degree of interconnectivity between the datasets makes it possible to obtain additional information for unit conversions and adaptation to, e.g., different cell sizes, as well as to find data points stemming from similar experimental setups. YCMDB is an integrative resource for quantitative yeast modeling projects on any scale and fosters reproducible modeling efforts.

Data collection, annotation, conversion, and interconnection in YCMDB
The majority of data was collected from experimental studies published through the last decades. From each publication, relevant data points were extracted from tables, supplements, figures (WebPlotDigitizer; Rohatgi 2020), and text, followed by careful annotation with relevant meta-information: publication, strain, medium, unit, culture mode (continuous/batch), growth phase, growth rate, temperature, synchronization status, aeration, measurement method, and a comment, usually to ease finding of the data point in the respective publication. If the data is time-resolved, the time point and the unit of time (e.g. s/min/h) are noted. We further note which type of value the data point represents, along with information on the measurement accuracy: the value type can either be an average (with standard deviation), median, minimum/maximum value, or a plain value with no accuracy information. Where available, we track the number of replications.
Data points were considered if they were quantitative (with a convertible unit) and had information about the used strain and culture conditions.
Each data point can be referenced unambiguously by a static YCMDB-ID. This way, model calibration efforts can be documented in a reproducible manner, e.g. in combination with PEtab (Schmiester et al. 2021).
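
The annotation scheme described above can be pictured as a record type with one field per piece of meta-information, plus a helper that lists the context fields in which two data points differ. This is only a sketch: the class, the field names, the ID format, and all values are hypothetical placeholders and do not reflect the actual YCMDB schema or contents.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional

@dataclass(frozen=True)
class DataPoint:
    """One annotated measurement (hypothetical layout, not the YCMDB schema)."""
    ycmdb_id: str                 # static identifier used for referencing
    value: float
    unit: str
    value_type: str               # e.g. "average", "median", "plain"
    publication: str
    strain: str
    medium: str
    culture_mode: str             # "batch" or "continuous"
    growth_phase: str
    temperature_c: Optional[float] = None
    measurement_method: Optional[str] = None
    std_dev: Optional[float] = None
    replicates: Optional[int] = None

# Fields that describe the experimental context of a measurement.
CONTEXT_FIELDS = ("strain", "medium", "culture_mode", "growth_phase",
                  "temperature_c", "measurement_method")

def context_differences(a: DataPoint, b: DataPoint) -> List[str]:
    """Names of the context fields in which two data points differ."""
    da, db = asdict(a), asdict(b)
    return [name for name in CONTEXT_FIELDS if da[name] != db[name]]

# Placeholder records; IDs, values, and references are invented for illustration.
point_a = DataPoint("YCMDB-EXAMPLE-1", 1.2, "mM", "average", "publication_X",
                    "BY4742", "medium_1", "batch", "exponential", 30.0, "method_1")
point_b = DataPoint("YCMDB-EXAMPLE-2", 0.9, "mM", "plain", "publication_Y",
                    "CEN.PK", "medium_1", "continuous", "exponential", 30.0, "method_1")

differences = context_differences(point_a, point_b)
```

A modeler could use such a list of differing fields to decide whether two values can be combined directly or only after conversion.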
Oftentimes, data points with exactly the right conditions will not be available anywhere in the literature. Therefore, YCMDB aims at tightly interconnecting the data such that the search for data points that match a user's conditions closely enough becomes possible, along with an appraisal of the differences to be considered. To this end, the strain ancestry was compiled from the literature and is made available in YCMDB (see Fig. S1, Supporting Information). This makes it possible to search for data from similar strains and, if found, to understand which genetic alterations separate the strains.
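
How an ancestry tree supports such a search can be sketched with a parent map: walking up from two strains yields their closest common ancestor and the alterations that separate them. The strain and alteration names below are hypothetical placeholders, and the functions are an illustration of the idea, not the implementation used in YCMDB.

```python
from typing import Dict, List, Optional

# Hypothetical ancestry: child -> parent.
PARENT: Dict[str, str] = {
    "strain_B": "strain_A",
    "strain_C": "strain_B",
    "strain_D": "strain_A",
}

# Hypothetical alterations introduced when deriving each strain from its parent.
ALTERATIONS: Dict[str, List[str]] = {
    "strain_B": ["alteration_1"],
    "strain_C": ["alteration_2"],
    "strain_D": ["alteration_3"],
}

def lineage(strain: str) -> List[str]:
    """The strain followed by all of its ancestors, up to the root."""
    path = [strain]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def closest_common_ancestor(a: str, b: str) -> Optional[str]:
    """First strain on b's lineage that also occurs on a's lineage."""
    ancestors_a = set(lineage(a))
    for s in lineage(b):
        if s in ancestors_a:
            return s
    return None

def separating_alterations(a: str, b: str) -> List[str]:
    """Alterations accumulated on both branches below the common ancestor."""
    root = closest_common_ancestor(a, b)
    if root is None:
        raise ValueError("strains share no common ancestor")
    found: List[str] = []
    for start in (a, b):
        for s in lineage(start):
            if s == root:
                break
            found.extend(ALTERATIONS.get(s, []))
    return found

ancestor = closest_common_ancestor("strain_C", "strain_D")
separating = separating_alterations("strain_C", "strain_D")
```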
Data points for similar media can be searched as well. The medium composition is compiled in detail: for each medium constituent, concentrations (if available) are listed and externally referenced via PubChem (Kim et al. 2021). External database IDs are used for unambiguous identification of metabolic compounds (Hastings et al. 2016) and genes/proteins (Cherry et al. 2012).
The database was implemented in MySQL and is available via an online user interface based on R Shiny. The code is available via GitLab.

Application areas of YCMDB
Yeast literature is rich in datasets, with early publications measuring the nuts and bolts of the cell with a variety of techniques. The measurements were tailored to the individual question of the study, using different strains, media, growth conditions, and analytic techniques, and displaying results in a wide variety of units. To make the available data usable for mathematical modeling and quantitative analysis, effects of these differences in experimental setups have to be accounted for. YCMDB collects relevant data points along with all available meta-information, to allow researchers to (i) quickly and reproducibly find data points required for the calibration of their models and (ii) assess how suitable and comparable the values are to the modeled scenario.
All data points included in YCMDB are quantitative, ideally also of single-cell and temporal resolution. YCMDB contains data for 22 different data types measuring the most important concentrations and biophysical properties of the cells. This includes (i) values for biophysical properties such as size or volume and information about the cellular state such as cell cycle phase, (ii) abundances and rates of change for mRNAs and proteins, as well as (iii) metabolic data. The data is highly interconnected, and YCMDB allows for cross-functional searching to discover further relevant data points with similar metadata. Data is stored in the original units of the publication; conversion to a desired unit is in many cases possible via the linked information such as cell size, growth rate, etc.
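
A typical conversion of this kind turns a copy number per cell into a molar concentration using the cell volume, c = N / (N_A * V). The sketch below implements this formula and its inverse; the function names are hypothetical, and the volume used in the example call is an arbitrary placeholder, not a value taken from YCMDB.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)

def molecules_to_molar(molecules_per_cell: float, cell_volume_fl: float) -> float:
    """Convert molecules per cell to mol/L, given the cell volume in femtoliters."""
    volume_l = cell_volume_fl * 1e-15  # 1 fL = 1e-15 L
    return molecules_per_cell / (AVOGADRO * volume_l)

def molar_to_molecules(concentration_molar: float, cell_volume_fl: float) -> float:
    """Inverse conversion: mol/L to molecules per cell."""
    volume_l = cell_volume_fl * 1e-15
    return concentration_molar * AVOGADRO * volume_l

# Example with placeholder numbers: 5000 copies in a cell of 42 fL.
concentration = molecules_to_molar(5000.0, 42.0)
```

A single molecule in a volume of 1 fL corresponds to about 1.66 nM, which is a convenient sanity check for such conversions. If the volume stems from a different strain or growth condition than the abundance, the converted value inherits that mismatch, which is why the linked meta-information matters.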

The YCMDB web interface
We provide a browsable web interface based on R Shiny for YCMDB, available at https://tbp-klipp.science/ycmdb.
Each data point that fulfills the requirements is assigned a static identifier, the YCMDB-ID. This ID can be used for reference purposes in the developed models and makes each used data point traceable. The YCMDB-ID is displayed at the beginning of each row in the web interface tables.
The three different data classes (metabolic, genomic, and biophysical) are explorable in separate tabs of YCMDB. Each database entry is linked to an external identifier: ChEBI IDs for metabolites, SGD IDs for genes, proteins, and mRNAs. We aim to provide concentrations or numbers for cellular components as well as change rates (uptake, secretion, synthesis, degradation, and translation) and characteristic features of the entire cell (density, size, weight, volume, organelle numbers, and duration of cell cycle phases in the tab for biophysical data). Each entry has a numerical value taken directly from the publication along with the given unit; where available, a range or standard deviation is included (can be selected in the "select columns to show" drop-down menu of each tab). Users can search for data entities in each tab and download the filtered and ordered datasets. A full download of all data is available as well.
Strain ancestry was compiled, including mutations and the literature trail which led to the strains' development, where possible. The Strain Tab provides a browsable ancestry tree (see Fig. S1, Supporting Information) combining all information along with links to external publications and the data points available from the strain.
Data was only included from experiments that clearly stated the composition of the used media. All compounds are listed and linked to the PubChem database for comparability; the composition of commonly used media bases such as YNB is also resolved to the level of individual constituents. The Medium Tab allows searching for similar media via a graphical clustering representation where medium compounds can be highlighted (similar media appear closer together on the graph, t-SNE clustering). This makes it possible to find useful data points even if no data for the exact same cultivation conditions is available in YCMDB.
Publication information is available in the Publication Tab, allowing users to search for publications that include a specific data type. For each publication, a list of further available data points is shown, which is especially important for whole-cell modeling as it requires data on all cellular processes to be as consistent as possible.
The process of curating data for YCMDB requires a high level of care and accuracy to ensure that the set quality criteria are met for each data point. In the future, we plan to add a data submission feature, which will allow users to upload their own relevant data along with all required meta-information.

Divide and conquer
A WCM consists of the computational description of a large number of different cellular processes that relate to different cellular locations, different orders of magnitude of numbers of involved molecules, and different aspects considered to be of importance to the modeler or model user. There are, at least, two alternative ways of coping with this complexity: one can describe all processes with the same formalism, e.g. ordinary differential equations, or one can use the best suitable computational formalism for each process, see Fig. 4 (Hahn 2020). Both approaches come with their own advantages and disadvantages.
Figure 4 gives an overview of the approach. Its panel (A) introduces the components of the Yeast Cell Model (YCM): the model is structured into distinct modules to represent various cellular processes:
• Metabolic module (MET): this module describes the biochemical reactions involved in cellular metabolism, including pathways for the synthesis and breakdown of metabolites such as sugars, amino acids, and nucleotides. It encompasses central carbon metabolism (CCM), cell wall synthesis (CWS), DNA synthesis (DNA), lipid metabolism (LIP), amino acid metabolism (AAM), and storage (STO).
• Gene expression module (GEX): this module models the regulatory networks that control gene expression, including transcription factors, regulatory proteins, and regulatory sequences in DNA. It covers transcription (TRX), translation (TRL), assembly of protein complexes (APC), and histone activity (HIS).
• Transport module (TRP): this module describes the movement of molecules across the cell membrane, including passive diffusion, active transport, and facilitated diffusion. It includes ion (ION) and nutrient (NUT) transport.
• Signaling module (SIG): this module models cellular signaling pathways, including receptor-ligand interactions, second messenger signaling, and signal transduction cascades. It includes pathways such as the high osmolarity glycerol (HOG), pheromone (MAT), Ca-calcineurin (CAL), and TOR pathways.
• Growth module (VOL): this module captures the requirements for an increase in cell volume as well as a description of the changes of volume and surface for both the mother (MOT) and bud (BUD).

Figure 4 (panels B and C). (B) Schematic of the merging approach: parameters are estimated for each module separately, with other modules reduced to critical information, e.g. just the relevant volume. Then modules are merged into one large ODE model, and parameters are readjusted, now taking the dynamics in other modules into account. Eventually, the whole WCM is simulated as one ODE system. (C) Schematic of the consolidation approach: again, parameters are estimated for each module. In the simulation process, each module is simulated separately with its own algorithm for a given time step Δt. It takes given values of link variables as input, changes them during the simulation, and provides them after Δt as output. All link variables are updated based on outputs of all modules. The process is iteratively repeated until the predefined end time.
Figure 4(B) provides an overview of the merging approach: each module is initially analyzed separately, with the other modules simplified to critical information such as the relevant volume. Subsequently, the modules are merged into a unified ordinary differential equation (ODE) model. Parameters are then recalibrated considering the interplay between the dynamics in different modules. Eventually, the entire WCM is simulated as a single ODE system.
Figure 4(C) outlines the consolidation approach in the context of the YCM, which involves several steps:
(i) Parameter estimation for each module: first, parameters are estimated for each individual module within the model. These parameters define the behavior and characteristics of each module, allowing them to function autonomously.
(ii) Simulation process for each module: each module operates independently during the simulation process. Using its own specific algorithm and the parameters determined in the first step, each module simulates its behavior over a given time step (t).
(iii) Input and output handling: during simulation, each module takes input from specified link variables. These input values may include information from other modules or external factors. The module then processes this input, updates its internal state, and generates output variables.
(iv) Updating link variables: after completing the simulation for a time step (t), the module outputs the revised variables. These updated variables may affect the behavior of other modules or influence the overall system dynamics. All link variables are then updated based on the outputs from all modules.
(v) Iterative process: the simulation process iterates over multiple time steps until reaching a predefined end time. At each iteration, modules continue to interact with each other by exchanging information through link variables. This iterative approach allows the system to evolve dynamically over time.
By following this consolidation approach, the YCM can capture the complex interactions and dynamics between different cellular processes while maintaining modularity and computational efficiency.
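As an illustration only, steps (ii)-(v) can be sketched in a few lines of Python. The two toy modules, their rate constants, and the convention that each module returns the changes it proposes for the link variables are assumptions made for this sketch; they are not taken from the YCM implementation.

```python
# Toy sketch of the consolidation loop; modules and constants are hypothetical.

def metabolism_module(link, dt):
    """ODE-like module: produces ATP at a constant (assumed) rate, one Euler step."""
    atp_production_rate = 2.0  # assumed value, arbitrary units per time
    return {"ATP": atp_production_rate * dt}


def translation_module(link, dt):
    """Consumes ATP in proportion to its current level and forms protein."""
    k_use = 0.5            # assumed ATP consumption rate constant
    protein_per_atp = 0.1  # assumed yield
    used = k_use * link["ATP"] * dt
    return {"ATP": -used, "protein": protein_per_atp * used}


def run_consolidated(modules, link, dt, t_end):
    """Advance all modules step by step, exchanging data via link variables."""
    link = dict(link)
    history = [(0.0, dict(link))]
    for step in range(1, round(t_end / dt) + 1):
        snapshot = dict(link)  # every module reads the same input values
        changes = [module(snapshot, dt) for module in modules]
        for change in changes:  # update all link variables from all outputs
            for name, delta in change.items():
                link[name] += delta
        history.append((step * dt, dict(link)))
    return history


history = run_consolidated(
    [metabolism_module, translation_module],
    {"ATP": 1.0, "protein": 0.0},
    dt=0.1,
    t_end=10.0,
)
final_time, final_link = history[-1]
print(f"t = {final_time:.1f}, ATP = {final_link['ATP']:.3f}, "
      f"protein = {final_link['protein']:.3f}")
```

In this sketch every module could use a different internal algorithm, since the loop only relies on the shared link variables. The length of the time step controls how tightly the modules are coupled.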
The formulation of a WCM as a set of ordinary differential equations is potentially laborious, though achievable. If the model is formulated in SBML, semanticSBML is a tool intended for merging models of individual subprocesses (Krause et al. 2010).
Merging has the advantage that even large systems of ODEs can nowadays be solved computationally in a reliable manner. It also allows smooth integration of different cellular functions into the same concept without overemphasizing the links between potentially defined modules (such as between metabolism, which provides ATP and amino acids, and translation, which uses them). A disadvantage, however, is that, given the complexity of a WCM and the different scales of times and abundances, it is not foreseeable that parameter estimation tools will become available in the near future that allow us to estimate all model parameters from a data collection in one go. Since the WCM has a large number of parameters, most of them unknown, the state space, i.e. the range of values that the different model compounds can assume, is very large, and the required repetitive simulations are computationally demanding.
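A minimal sketch of the merging idea, under strong simplifying assumptions, is given below. Two hypothetical modules (metabolism producing ATP and amino acids, translation consuming them) each contribute rate terms, and the merged right-hand side is simply the sum of all contributions, integrated as one ODE system. The rate laws, constants, stoichiometry, and the hand-written Runge-Kutta integrator are illustrative choices, not part of any published model.

```python
# Toy sketch of merging two modules into one ODE system (all values assumed).

def metabolism_rates(state):
    """Produces ATP and amino acids (AA) at constant, assumed rates."""
    return {"ATP": 2.0, "AA": 1.0}


def translation_rates(state):
    """Uses 2 ATP and 1 AA per protein unit, mass action kinetics (assumed)."""
    v = 0.5 * state["ATP"] * state["AA"]
    return {"ATP": -2.0 * v, "AA": -v, "protein": v}


def merged_rates(state, modules):
    """Right-hand side of the merged model: sum over all module contributions."""
    total = {name: 0.0 for name in state}
    for module in modules:
        for name, rate in module(state).items():
            total[name] += rate
    return total


def rk4_step(state, dt, modules):
    """One classical fourth-order Runge-Kutta step for the merged system."""
    def shifted(rates, h):
        return {name: state[name] + h * rates[name] for name in state}

    k1 = merged_rates(state, modules)
    k2 = merged_rates(shifted(k1, dt / 2), modules)
    k3 = merged_rates(shifted(k2, dt / 2), modules)
    k4 = merged_rates(shifted(k3, dt), modules)
    return {
        name: state[name]
        + dt / 6 * (k1[name] + 2 * k2[name] + 2 * k3[name] + k4[name])
        for name in state
    }


modules = [metabolism_rates, translation_rates]
state = {"ATP": 1.0, "AA": 1.0, "protein": 0.0}
for _ in range(2000):  # integrate from t = 0 to t = 20
    state = rk4_step(state, 0.01, modules)
print(state)
```

In contrast to the consolidation loop, the shared compounds here are ordinary state variables of a single system, so no separate exchange step is needed.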
Thus, we need to develop strategies to cope with this problem. One potential solution is to divide the full set of processes into a number of subsets or modules. In the context of whole-cell modeling, a module refers to a distinct component or subsystem within the overall model that represents a specific aspect of cellular function. Modules can be thought of as building blocks that are combined to form a comprehensive model of the entire cell. Each module typically focuses on modeling a particular set of molecular interactions or biological processes.
Each module may be represented using different mathematical and computational techniques, depending on the specific details of the processes being modeled. These modules are then integrated into a cohesive whole-cell model that captures the interactions and dynamics of the entire cell.
The boundaries and links of the subsets must be clearly defined. Then, one develops a coarse description of each subset, potentially guided by steady-state assumptions or other easy-to-obtain information. The union of all but one subset then serves as context for the remaining subset, providing it with boundary conditions and fluxes over its borders.
An alternative to merging all modules into one comprehensive ODE model is the consolidation approach. Such an approach was used, e.g. for the WCM of M. genitalium (Karr et al. 2012) as well as for our YCM (Hahn 2020). Here again, models have to be developed for each module or cellular function. But these models can come with their own formalism, e.g. an ODE model for metabolism, a stochastic agent-based model for transcription and translation, and, for the cell cycle, a stochastic model employing the Gillespie algorithm, i.e. a computational method used to simulate the time evolution of stochastic biochemical systems.
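To make the Gillespie algorithm more concrete, the following sketch simulates a generic birth-death process (constant production, first-order degradation of one molecular species). It is a textbook toy example with assumed rate constants, not the cell cycle module of the YCM.

```python
import random


def gillespie_birth_death(k_prod, k_deg, n0, t_end, seed=0):
    """Direct-method Gillespie simulation of production and degradation.

    k_prod: production propensity (molecules per time, assumed)
    k_deg:  degradation rate constant per molecule (assumed)
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        a_prod = k_prod
        a_deg = k_deg * n
        a_total = a_prod + a_deg
        if a_total == 0.0:
            break  # no reaction can fire anymore
        t += rng.expovariate(a_total)  # waiting time to the next reaction
        if t > t_end:
            break
        # Choose which reaction fires, weighted by its propensity.
        if rng.random() * a_total < a_prod:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


times, counts = gillespie_birth_death(k_prod=10.0, k_deg=0.1, n0=0, t_end=50.0)
print(f"{len(times) - 1} reaction events, final copy number {counts[-1]}")
```

Each run with a different seed gives a different trajectory. The average over many runs approaches the deterministic solution, whose steady state is k_prod / k_deg.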
The modules share link variables, i.e. those variables that are changed in more than one of these modules and thereby connect them. Typical link variables are energy equivalents such as ATP and ADP or NAD and its derivatives, gene expression levels, signaling molecule activities, or the volumes of the cell and the compartments. Since they are changed in several modules, they serve as a communication channel and allow a flow of information between these modules. Their dynamics can be very complex, because they are influenced by the different modules.
Independent of the approach chosen to link the modules, connecting the experimental data with the models or modules requires estimating the model parameters from those experimental data. This is, in our experience, frequently a challenging task for large and complex systems. Fortunately, a series of tools is now available to ease that task, e.g. parameter estimators in COPASI (Mendes et al. 2009) or Tellurium (Choi et al. 2018), MATLAB-related estimation programs such as D2D (Raue et al. 2015), or estimators for Python programs (Schalte et al.).
For WCM, it is important to take care that parameters used in different modules are not conflicting, which can potentially be achieved by parameter re-estimation for the full model.
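The principle behind such estimators, i.e. repeated simulation with different parameter values and comparison with data, can be illustrated with a deliberately small example. The sketch below fits a single degradation rate constant to noise-free synthetic data generated by the same toy model. The model, the "true" value of 0.3, and the golden-section search are assumptions chosen for brevity; they do not reproduce the algorithms of the tools cited above.

```python
# Toy parameter estimation: fit k in dS/dt = -k * S to synthetic data.

def simulate_decay(k, n_points, s0=10.0, dt=0.01, steps_between=100):
    """Explicit Euler simulation, reporting S at n_points equally spaced times."""
    s = s0
    observed = [s]
    for _ in range(n_points - 1):
        for _ in range(steps_between):
            s += -k * s * dt
        observed.append(s)
    return observed


def sum_of_squares(k, data):
    """Objective: squared difference between simulation and data."""
    simulated = simulate_decay(k, len(data))
    return sum((sim - obs) ** 2 for sim, obs in zip(simulated, data))


def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on the interval [lo, hi]."""
    inv_phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:  # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:        # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2


# Synthetic, noise-free "measurements" from an assumed true value k = 0.3.
data = simulate_decay(0.3, n_points=11)
k_fit = golden_section_minimize(lambda k: sum_of_squares(k, data), 0.0, 2.0)
print(f"estimated k = {k_fit:.4f}")
```

With many parameters, noisy data, and several local minima, this one-dimensional search is no longer adequate, which is why dedicated estimation tools are used in practice.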
In summary, by breaking down the complexity of cellular systems into modular components, researchers can develop more manageable models and gain insights into the individual mechanisms underlying cellular function. Additionally, modular modeling approaches allow for flexibility and scalability, enabling us to add or modify modules as new experimental data become available or as the model is extended to study different organisms or cellular processes.

Toward a comprehensive YCM
The yeast community has created a comprehensive series of models of different cellular processes that can serve as modules for a whole-cell model of yeast (YCM), though they also stand in their own right to integrate data and explain biological observations. These include models of the cell cycle in different levels of detail and formalisms (Chen et al. 2000, 2004, Barberis et al. 2007, Zhang et al. 2011, Spiesser et al. 2012, Palumbo et al. 2016, Münzner et al. 2019, Adler et al. 2022). Other models describe volume changes of mother and bud (Altenburg et al. 2019). DNA replication has also received attention (e.g. Brümmer et al. 2010, Spiesser et al. 2010). Signaling pathways have attracted a multitude of modeling efforts, either in their own right (Yi et al. 2003, Kofahl and Klipp 2004, Sackmann et al. 2006, Schaber et al. 2006, 2012, Waltermann and Klipp 2010, Thomson et al. 2011, Lubitz et al. 2015, Stojanovski et al. 2017, Dunayevich et al. 2018, Pomeroy et al. 2021) or as models combining signaling with complex regulation processes (Klipp et al. 2005, Adrover et al. 2011). Metabolism has experienced a series of comprehensive reconstructions, often community-based (Herrgård et al. 2008, Lu et al. 2019); however, there are also detailed dynamic models for different parts of metabolism, including glycolysis (Rizzi et al. 1997, Hynne et al. 2001, Lao-Martil et al. 2023, Van Heerden et al. 2014, Smallbone et al. 2013) or lipid metabolism (Schützhold et al. 2016). Ion transport has been described for single cells (e.g. Kahm et al. 2012, Gerber et al. 2016). Gene expression is frequently analyzed with statistical models, but there are also approaches for dynamic models of transcription (e.g. Amoussouvi et al. 2018) and translation (e.g. Seeger et al. 2023).
Unfortunately, the list of works mentioned here cannot be complete, since there are so many, which underlines the deep interest in the dynamics of the yeast S. cerevisiae as a model organism.
Yet, the full establishment of a YCM that describes a single cell, is comprehensive and dynamic, and covers a whole cell cycle is, to the best of our knowledge, still a matter for the future. This can have a series of reasons. First, as the above-referenced literature may indicate, the coverage of cellular processes with detailed dynamic mathematical models is not equally dense. Some pathways or networks have attracted much more modeling effort than others, e.g. glycolysis more than amino acid synthesis or cell wall synthesis, or the pheromone pathway more than the starvation response pathway. Second, experimental data with temporal resolution over one cell cycle that can be interpreted for average single cells are still rare. This is in part due to the fact that cell-cycle resolution on a population level requires synchronization, which always interferes with the cell's normal processes. The most commonly used synchronization method, adding and then removing the pheromone alpha-factor, activates the pheromone pathway and leaves behind cells that are much larger than average (because they have grown during pheromone treatment) and, therefore, have shorter G1 phases than an average population. Synchronization by elutriation provides low yields of newborn small G1 cells and also exerts stress on those cells. While time-resolved and quantitative data are invaluable for systems biology modeling, it is important to acknowledge that such data may not always be readily available, not even in the YCMDB, or comprehensive. Systematically measuring time-resolved data under different conditions and perturbations can indeed enhance the reconstruction of a YCM. The systematic reconstruction of a YCM may involve iterative refinement and integration of diverse data sources to capture the complexity of biological systems accurately. Eventually, the creation of a YCM is a complex effort that would certainly profit from a concerted action of interested parties.

Figure 5. Information flow for the creation of a YCM. (A) Literature and experimental results hold information on a multitude of biological processes and interactions. However, these data are highly condition dependent and stored in nonstandardized ways. (B) To reproducibly use the data, understand their connection, and judge their consistency and information content, several nontrivial digitalization steps are required. (C) The curated data can then be consistently and formally analyzed, e.g. in mathematical models, to foster the understanding of underlying biological processes, but also to reveal knowledge gaps. Eventually, models of whole cells could be simulated to understand the complex interwiring of cellular processes.

Figure 2. Overview of typical steps in mechanistic modeling, illustrating both ODE modeling (A)-(J) and Boolean modeling (K)-(M). (A) The information about the processes to be covered by the model can be given in graphical representation. The example used here can be representative of both metabolic and signaling processes: compound S1 is produced and degraded by reactions 1 and 2 (with velocities v1 and v2), compounds S2 and S4 are converted into each other by reactions 3 and 4, and compound S3 is produced and degraded by reactions 5 and 6. Compounds S1 and S3 modify (activate or inhibit) the velocities of reactions 3 and 5, respectively, without being consumed or produced themselves by these reactions. (B) The systems equations, in general, represent the temporal changes of the compounds Si (denoted by the time derivative d/dt), which are given by the rates (or velocities) vj combined with the stoichiometric coefficients. The necessary steps such that the system can be simulated are sketched in panels (C)-(F): (C) represents the set of systems equations for the example in (A). (D) illustrates choices for rate expressions. v3, v4, and v6 follow mass action kinetics, where parameters k stand for rate constants, v2 is an example of Michaelis-Menten kinetics (with Vmax the maximal velocity and KM the Michaelis constant), and v5 of Hill kinetics (K0.5 is the concentration giving half-maximal velocity, n is the Hill coefficient). (E) Parameter values can be either obtained from databases, estimated from experimental data (genomics, proteomics, metabolomics, and biophysical measurements), or simply guessed (as done here). Briefly, parameter estimation requires systematic repeated simulation with different parameter values and comparison with experimental data with the aim to minimize the difference between data and simulation. (F) For a simulation to start, one has to determine the initial conditions. (G)-(J) are examples of simulation experiments based on the ODE system in panels (C)-(F). (G) shows a time course simulation. (H) presents the state space for S1 and S2, where vectors indicate the direction of motion from different starting points. (G) and (H) show that the system moves toward a steady state. (I) A typical way to analyze the ODE system is sensitivity analysis, i.e. testing the effect of small parameter variations on the dynamics. Here, parameter Vmax2 has been varied (10 simulations with different values). (J) shows the result of a stochastic simulation of the same system with the Langevin approach, where a noise term is added to each equation, resulting in slightly different dynamics for each of, here, 10 simulations. (K)-(M) Boolean model of a comparable system: component S1 activates S2; S2 and S4 can be converted into each other (thereby annihilating the other component); and S2 activates S3. Here, all compounds can have only two states, ON or OFF; also, time proceeds in discrete steps. (K) Graph of the model. (L) Systems equations denote the state of the compound at the right side at time t + 1 as function of the state of components at the left side at time t. These changes are expressed with Boolean rules. (M) shows two simulation experiments with different initial conditions, where S4 starts ON at t0 in both cases and S1 is either OFF or ON. If S1 is OFF, the system is already at a fixed point and shows no changes in the following time steps. If S1 is ON, the system oscillates, i.e. it has a cyclic attractor. For both ODE and Boolean modeling, it is often necessary to revise the model and repeat the modeling steps, i.e. network creation (components and reactions), assignment of rate expressions or rules, and the parameter values, until the model behavior correctly reflects the experimentally observed behavior of the system.
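The Boolean part of this caption can be illustrated with a short synchronous-update sketch. The exact rules of panel (L) are not reproduced in the text, so the update rules below are hypothetical: they were chosen only to be consistent with the verbal description (S1 activates the conversion of S4 into S2, S2 converts back into S4 and activates S3) and to show a fixed point and a cyclic attractor.

```python
# Hypothetical Boolean rules, consistent with the verbal description only.

def update(state):
    """Synchronous update: all components switch based on the state at time t."""
    s1, s2, s4 = state["S1"], state["S2"], state["S4"]
    return {
        "S1": s1,                      # external input, kept constant
        "S2": s1 and s4,               # S4 is converted to S2 if S1 is ON
        "S3": s2,                      # S2 activates S3
        "S4": s2 or (s4 and not s1),   # S2 converts back; S4 persists without S1
    }


def simulate(state, n_steps):
    """Return the list of states for time steps 0 .. n_steps."""
    trajectory = [dict(state)]
    for _ in range(n_steps):
        state = update(state)
        trajectory.append(state)
    return trajectory


start = {"S1": False, "S2": False, "S3": False, "S4": True}
traj_off = simulate(start, 4)                   # S1 OFF
traj_on = simulate({**start, "S1": True}, 4)    # S1 ON

for label, traj in (("S1 OFF", traj_off), ("S1 ON", traj_on)):
    print(label)
    for t, s in enumerate(traj):
        print(" ", t, {name: int(value) for name, value in s.items()})
```

With these assumed rules, the run with S1 OFF stays in its initial state, whereas the run with S1 ON alternates between two states after the first step.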

Figure 3. Overview of the contents and search functionalities of the YCMDB.

• Cell division cycle (CDC): this module represents the processes involved in cell growth and division, including DNA replication, chromosome segregation, and cell wall synthesis.

Figure 4. Modular approach to WCM. (A) Modules of the Yeast Cell Model: cell division cycle (CDC), metabolism [MET, containing central carbon metabolism (CCM), cell wall synthesis (CWS), DNA synthesis (DNA), lipid metabolism (LIP), amino acid metabolism (AAM), and storage (STO)], transport [TRP, comprising ion (ION) and nutrient (NUT) transport], gene expression [GEX, with transcription (TRX), translation (TRL), assembly of protein complexes (APC), and histone activity (HIS)], volume changes [VOL, for mother (MOT) and bud (BUD)], and signaling [SIG, including the high osmolarity glycerol (HOG), pheromone (MAT), Ca-calcineurin (CAL), and TOR (TOR) pathways]. (B) Schematic of the merging approach: parameters are estimated for each module separately, with other modules reduced to critical information, e.g. just the relevant volume. Then the modules are merged into one large ODE model, and parameters are readjusted, now taking the dynamics in other modules into account. Eventually, the whole WCM is simulated as one ODE system. (C) Schematic of the consolidation approach: again, parameters are estimated for each module. In the simulation process, each module is simulated separately with its own algorithm for a given time step t. It takes given values of link variables as input, changes them during the simulation, and provides them after t as output. All link variables are updated based on the outputs of all modules. The process is iteratively repeated until the predefined end time.