Incorporating temporal information during feature engineering bolsters emulation of spatio-temporal emergence

Abstract

Motivation: Emergent biological dynamics derive from the evolution of lower-level spatial and temporal processes. A long-standing challenge for scientists and engineers is identifying simple low-level rules that give rise to complex higher-level dynamics. High-resolution biological data acquisition enables this identification and has evolved at a rapid pace for both experimental and computational approaches. Simultaneously harnessing the resolution and managing the expense of emerging technologies (e.g. live cell imaging, scRNAseq, agent-based models) requires a deeper understanding of how spatial and temporal axes impact biological systems. Effective emulation is a promising solution for managing the expense of increasingly complex high-resolution computational models. In this research, we focus on the emulation of a tumor microenvironment agent-based model to examine the relationship between spatial and temporal environment features and emergent tumor properties.

Results: Despite significant feature engineering, we find limited predictive capacity for tumor properties from initial system representations. However, incorporating temporal information derived from intermediate simulation states dramatically improves the predictive performance of machine learning models. We train a deep learning emulator on intermediate simulation states and observe promising enhancements over emulators trained solely on initial conditions. Our results underscore the importance of incorporating temporal information in the evaluation of spatio-temporal emergent behavior. Nevertheless, the emulators exhibit inconsistent performance, suggesting that the underlying model characterizes unique cell population dynamics that are not easily replaced.

Availability and implementation: All source code for the agent-based model, emulation, and analyses is publicly available at the corresponding DOIs: 10.5281/zenodo.10622155, 10.5281/zenodo.10611675, and 10.5281/zenodo.10621244, respectively.

Emergent behavior metrics are defined and calculated as presented in a previous study of heterogeneous vasculatures in ARCADE (Yu and Bagheri, 2021).

Growth rate
Growth rate quantifies the change in colony diameter over time. First, colony diameter is calculated at each time index. Growth rate is then the slope of the simple linear regression between timepoints [0, 0.5, . . ., ti] and the corresponding diameters [D0, D0.5, . . ., Di], where i indicates the timepoint. These calculations were performed using the Python function polyfit from the numpy package with a degree of 1.
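As a minimal sketch, the slope computation can be reproduced with numpy; the diameter values below are hypothetical stand-ins for measurements taken at half-day intervals:

```python
import numpy as np

# Hypothetical colony diameters sampled every 0.5 days; real values come
# from the simulation snapshots at each time index.
timepoints = np.arange(0.0, 3.5, 0.5)   # [0, 0.5, ..., 3.0]
diameters = 10.0 + 2.0 * timepoints     # linear growth, for illustration only

# Growth rate is the slope of the degree-1 polynomial fit,
# i.e. the simple linear regression of diameter against time.
growth_rate, intercept = np.polyfit(timepoints, diameters, deg=1)
```

Here `polyfit` with `deg=1` returns the slope first and the intercept second; only the slope is used as the growth rate metric.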

Symmetry
Symmetry (S) quantifies colony shape at a given timepoint, ranging from 0 (not symmetric) to 1 (perfectly symmetric). For hexagonal coordinates, a colony is perfectly symmetric if, for each location (u, v, w), the corresponding five locations (−w, −u, −v), (v, w, u), (−u, −v, −w), (w, u, v), and (−v, −w, −u) are all occupied. Symmetry is calculated as

S = (1/N) · Σ_i (1 − n_i / 5),

where N is the number of unique locations and n_i is the number of corresponding unoccupied locations for a unique location i.
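The calculation can be sketched in Python; the averaged form below, S = (1/N) · Σ_i (1 − n_i/5), is reconstructed from the definitions above as an assumption, not quoted from the published formula:

```python
def symmetry(occupied):
    """Colony symmetry on hexagonal (u, v, w) coordinates.

    Assumed form: S = (1/N) * sum_i (1 - n_i / 5), where n_i counts the
    unoccupied symmetric counterparts of occupied location i.
    """
    occupied = set(occupied)
    total = 0.0
    for u, v, w in occupied:
        # The five symmetric counterparts of (u, v, w) from the definition above.
        counterparts = [(-w, -u, -v), (v, w, u), (-u, -v, -w),
                        (w, u, v), (-v, -w, -u)]
        n_i = sum(1 for loc in counterparts if loc not in occupied)
        total += 1.0 - n_i / 5.0
    return total / len(occupied)
```

For a fully occupied hexagonal ring the function returns 1.0, and a single off-center location with no occupied counterparts returns 0.0, matching the stated range.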

Activity
Activity (A) quantifies the ratio of active (proliferative and migratory) to inactive (necrotic and apoptotic) cells at a given timepoint. This metric ranges between −1 (all non-quiescent cells are inactive) and +1 (all non-quiescent cells are active). An activity value of 0 indicates an equal number of active and inactive cells. Activity is calculated as

A = (Na − Ni) / (Na + Ni),

where Na is the number of active cells and Ni is the number of inactive cells.
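The stated range and zero point are consistent with the normalized difference A = (Na − Ni)/(Na + Ni); a minimal sketch (the zero-cell convention is an assumption):

```python
def activity(n_active, n_inactive):
    """Activity metric A = (Na - Ni) / (Na + Ni).

    Returns +1 when all non-quiescent cells are active, -1 when all are
    inactive, and 0 when active and inactive counts are equal.
    """
    total = n_active + n_inactive
    if total == 0:
        return 0.0  # no non-quiescent cells; convention assumed here
    return (n_active - n_inactive) / total
```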

The following graph metrics summarize vascular network structure; each is averaged over all nodes v in the graph with node set V:

- AVG ECCENTRICITY: the average longest shortest path separating each node from any other node, where dist(v) denotes the shortest distances between v and all other nodes.
- AVG SHORTEST PATH: the average shortest path between each node and every other node, where dist(v) denotes the shortest distances between v and all other nodes.
- AVG CLUSTERING: the average clustering coefficient of the graph, i.e. the tendency for nodes to form tightly interconnected clusters, where Cl(v) is the ratio between the number of existing connections among the neighbors of v and the maximum possible number of connections among them.
- AVG CLOSENESS: the average closeness of nodes, (|V| − 1) / Σ dist(v) per node, following the Wasserman and Faust formulation.
- AVG BETWEENNESS: the average betweenness of the graph, i.e. the average importance of nodes as connectors between other nodes, where B(v) is the sum of the fraction of shortest paths passing through v over all pairs of nodes, normalized by (|V| − 1)(|V| − 2).
- AVG CORENESS: the average coreness of the graph, i.e. the level of connectedness of nodes within the graph, where C(v) is the coreness of node v.

Supp. Fig. 1: Emulators capture variance across the data during training. (A) Histograms show the range of performances achieved during training for different parameter combinations in the colony context. More complex models, such as MLP, can perfectly fit the data in many cases. The vertical dashed line indicates the training performance of the model with the best validation performance. (B) Similarly, histograms show prediction performance during training across responses and models in the tissue context.

Supp. Fig. 2: Discrepancy in feature importance across models. Histograms along the diagonal show the normalized importance distribution of features for each model. Parity plots show the importance of a feature for one model against another. Perfect parity would indicate that the models agreed on the importance level of each feature.

Supp. Fig. 3: Training data points to diminishing returns. Line plots indicate the predictive performance of models trained on increasingly large training data sets. In most cases, the RMSE shows diminishing returns in model performance across all feature sets (topological, hemodynamic, and spatial), emergent targets (activity, growth, and symmetry), and algorithms (MLR, RF, SVR, MLP).

Supp. Fig. 4: Differential timepoint analysis shows minimal improvement in performance in a colony context. Differential features were calculated by subtracting the features at either one or two weeks from the initial features in order to capture the evolution of the features over time. (A) Parity plots show the predicted values of emergent targets against the true values to demonstrate fine-grained predictive performance. (B) Bar plots show R^2 values for ML models trained on differential features. Growth predictions in the colony context improve for simulations using differential features between week one and the initial timepoint.

Supp. Fig. 5: Differential timepoint analysis does not improve performance in a tissue context. Differential features were calculated by subtracting the features at either one or two weeks from the initial features in order to capture the evolution of the features over time. (A) Parity plots show the predicted values of emergent targets against the true values to demonstrate fine-grained predictive performance. (B) Bar plots show R^2 values for ML models trained on differential features.

Supp. Fig. 6: Shortened prediction horizon has minimal effect on emulator performance. Models predict spatio-temporal dynamics from one simulation week onward. (A) Bar plots indicate very poor predictive performance in all cases. Bar chart values range from −0.1 to 1.0; the horizontal axis is at 0.0. Bonferroni-corrected p-values from a two-way ANOVA highlight significant results (noted with black circles) with an adjusted p-value less than 0.05. (B) Parity plots reveal substantial discrepancies in the variance between the predicted and true responses. (C) Line plots show the predictive performance of the MLP models (the average RMSE) as a function of training data size. Performance improvements from additional data are limited. The data points highlight the RMSE from randomized test sets.

Supp. Fig. 7: Mid-simulation features show improvement in performance in both contexts. Models were trained on vascular features characterizing network structure at one simulation week. (A) Bar plots show limited improvement from training on mid-simulation features. Bar chart values range from −0.1 to 1.0; the horizontal axis is at 0.0. Bonferroni-corrected p-values from a two-way ANOVA highlight significant results (noted with black circles) with an adjusted p-value less than 0.05. (B) Parity plots reveal large amounts of variance in predicted values, with some improvement in growth predictions in the colony context. (C) Line plots show the predictive performance of the MLP models (the average RMSE) as a function of training data size. Performance improvements from additional data are limited. The data points highlight the RMSE from randomized test sets.

Supp. Fig. 8: Temporal information improves ML model predictions in tissue context. Parity plots show the predictive performance of ML models trained on features from later timepoints against emulators trained on features from the initial timepoint. Improvement in activity prediction is limited; growth and symmetry reflect minimal improvements.

Supp. Fig. 9: RNN model captures vascular feature variance. (A) Parity plots highlight the performance of an RNN model at predicting network metric features. Both the simulated features and forecasted features were combined to perform PCA. The dimensionality of the feature set is reduced to the first 10 principal components, which represent 95% of the feature variance. The Pearson correlation coefficient (ρ) is reported for each parity plot. (B) Scatter plots illustrate the overlap between true and forecasted features using the first four principal components.
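The shared PCA step described for Supp. Fig. 9 can be sketched with scikit-learn; the matrix shapes and random values below are hypothetical stand-ins for the simulated and forecasted feature sets:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-ins: rows are simulations, columns are network features.
simulated = rng.normal(size=(60, 20))
forecasted = rng.normal(size=(60, 20))

# Fit PCA on the combined features so both sets share one projection,
# then keep the first 10 principal components.
combined = np.vstack([simulated, forecasted])
pca = PCA(n_components=10).fit(combined)

simulated_pcs = pca.transform(simulated)
forecasted_pcs = pca.transform(forecasted)
```

Fitting on the combined matrix is the key design choice: projecting true and forecasted features through the same components makes their overlap directly comparable in the reduced space.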

Supp. Fig. 10: Architecture and structure of the RNN used for feature prediction. This flowchart describes the layers used while training the feature prediction RNN. All parameter values are defined in Supp.

Table 4: Emulator run times across feature sets on an m5.large EC2 instance.