CONNECTOR, fitting and clustering of longitudinal data to reveal a new risk stratification system

Abstract Motivation The transition from evaluating a single time point to examining the entire dynamic evolution of a system is possible only in the presence of a proper framework. The strong variability of dynamic evolution makes the definition of an explanatory procedure for data fitting and clustering challenging. Results We developed CONNECTOR, a data-driven framework able to analyze and inspect longitudinal data in a straightforward and revealing way. When used to analyze tumor growth kinetics over time in 1599 patient-derived xenograft growth curves from ovarian and colorectal cancers, CONNECTOR allowed the aggregation of time-series data into informative clusters through an unsupervised approach. We provide a new perspective on mechanism interpretation: specifically, we define novel model aggregations and identify unanticipated molecular associations with response to clinically approved therapies. Availability and implementation CONNECTOR is freely available under the GNU GPL license at https://qbioturin.github.io/connector and https://doi.org/10.17504/protocols.io.8epv56e74g1b/v1.


Supplementary Material
Fig. S1: A Cross LogLikelihood plot, fDB, and tightness obtained from the CONNECTOR run on all 1536 individual mice. The optimal set of parameters selected is p=4, h=2, and G=3. B fDB, tightness grid, and stability matrix obtained on cluster A. The optimal set of parameters selected is p=4, h=3, and G=5. C fDB, tightness grid, and stability matrix obtained on cluster B. The optimal set of parameters selected is p=4, h=3, and G=3.

S2 Application of CONNECTOR to PDX curves of high-grade serous epithelial ovarian cancer
To show the basic use of CONNECTOR, we analyzed the spontaneous tumor growth patterns of PDX lines derived from the propagation of one chemotherapy-naïve high-grade serous epithelial ovarian cancer (HGS-EOC), which had been passaged for five generations until the production of 21 xenografts Erriquez et al. (2016). The progression curves of the PDX lines are reported in Figure S5A. The time grid shows that up to 70 days the frequency of the observations is larger than 0.75 and drops for later times, so we truncated the curves at day 70. The model selection suggests as optimal parameters a spline basis dimension equal to 3 and a number of clusters equal to 4 (CTGC-A, CTGC-B, CTGC-C, CTGC-D). This combination of parameters led to an average cluster stability of 0.8. The detailed CONNECTOR analysis is reported in Figure S6. Figure S5A shows the four CTGCs, in which four different progression patterns can be appreciated.
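As an illustration of this truncation criterion, here is a minimal sketch (in Python, not the CONNECTOR implementation; the assumption that "frequency" is the fraction of curves whose follow-up still covers a given grid time is ours, made only for illustration):

```python
# Illustrative sketch: pick the truncation time as the last grid point at
# which at least `threshold` of the curves are still being observed.
def truncation_time(curves, threshold=0.75):
    """curves: list of sorted lists of sampling times, one per curve."""
    grid = sorted({t for c in curves for t in c})
    n = len(curves)
    best = grid[0]
    for t in grid:
        # fraction of curves whose follow-up extends at least to time t
        frac = sum(1 for c in curves if c[-1] >= t) / n
        if frac >= threshold:
            best = t
    return best

# toy dataset of four curves (sampling days)
curves = [[0, 10, 30, 70], [0, 20, 70, 90], [0, 15, 40, 70], [0, 10, 50]]
print(truncation_time(curves))  # 70: last day still covered by >= 75% of curves
```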
Notably, the cluster aggregation reflected a generation separation, likely due to intra-tumor heterogeneity. Indeed, initial engraftment is expected to impose a selection bottleneck, and subsequent propagations exacerbate clonal divergence owing to repeated sampling bias and genomic evolution. During the growth of each PDX line, functional outliers emerged, that is, single PDXs characterized by increased growth rates. For example, the PDX lines in CTGC-D reached a volume of 1000 mm³ in 40 days, whereas the models in CTGC-A and CTGC-C reached the same volume in 60 and 70 days, respectively. These outliers might represent the expansion of clonal subpopulations with a fitness advantage. Moreover, the discriminant function returned by CONNECTOR highlights the strong discriminant power of the first 15 days of observation, which can be ascribed to the intra-tumor heterogeneity, see Figure S6. Comparisons of the CONNECTOR performance with classical fitting methods followed by k-means clustering, and with other functional methods, are reported below.
To establish the versatility of CONNECTOR in analyzing different dynamics, we investigated the tumor growth curves of 15 PDX lines propagated from the same chemotherapy-naïve HGS-EOC sample presented above, which were exposed to carboplatin (n=5), gemcitabine (n=5) or trabectedin (n=5). Carboplatin and gemcitabine were administered twice weekly for 4 weeks via intraperitoneal injection; trabectedin was administered only once, through the tail vein. The optimal model selected indicates a spline basis dimension equal to 4 and a number of clusters equal to 6 (CTGC-A, CTGC-B, CTGC-C, CTGC-D, CTGC-E, CTGC-F); with this setting, the cluster stability is 0.9. The detailed CONNECTOR analysis is reported in Figure S7.
In this cohort of PDXs, the response of individual models to drugs was uneven: some models showed a significant reduction, while other models proved to be resistant (Figure S5B). Notably, the discriminant function reveals that the time with the highest discriminatory power corresponds to day 35.

Fig. S6:
A Line plot of the 21 PDX curves of the HGS-EOC tumor. B Time grid; the suggested time at which to truncate the data is 70 days. C Cross LogLikelihood plot, reporting 4 as the optimal parameter for the spline basis; this value is also supported by the knots distribution in D. E fDB and tightness violin plots for the selection of the number of clusters: four clusters are associated with the lowest fDB value, and the elbow in the tightness plot is appreciable at 4. F The stability matrix obtained by selecting 4 clusters. G Discriminant plot and H discriminant functions. The optimal set of parameters selected is p=3, h=3, and G=4.
Indeed, around that time, the curves start to separate as a consequence of the effects of the chemotherapy treatments. In detail, in CTGC-A, CTGC-B, and CTGC-F the tumors remained stable during the treatment time window, while in CTGC-C and CTGC-D a clear increment, followed by a reduction of the tumor mass, was appreciable. This variable distribution of responses can be explained, again, by a high degree of intra-tumor heterogeneity in the original tumor from which the different PDX lines were propagated; indeed, genetic deviation is expected to influence response to therapy Schmitt et al. (2016).
We illustrated the versatility of CONNECTOR starting with the analysis of the spontaneous growth patterns in 21 PDX lines propagated from a single HGS-EOC tumor sample. We observed uneven growth rates in PDX lines derived from the same original tumor. This is consistent with the notion that ovarian cancers show a high degree of intratumor heterogeneity McPherson et al. (2016) Schwarz et al. (2015) Castellarin et al. (2013), which results in

S3 Application of CONNECTOR to PDX curves of metastatic colorectal cancer: details
Inspection of CONNECTOR results The CONNECTOR results on all PDX growth curves of metastatic colorectal cancer reveal that 3 clusters is the optimal solution. We then investigated the sub-clustering of the CONNECTOR clusters A and B. The fDB and tightness plots of cluster A highlighted that 3 and 5 clusters are both reasonable choices; in cluster B, instead, 3 resulted as the optimal number of clusters. Since clustering solutions cannot, by their very nature, be evaluated in a problem-independent way, we report a comparison between two alternative scenarios. In scenario (i), all curves are divided into 3 clusters, namely A, B, and C; cluster A is divided into 5 groups (Aa, Ab, Ac, Ad, and Ae) and cluster B into 3 groups (Ba, Bb, and Bc), reaching 9 CTGCs in total. In scenario (ii), all curves are divided into 3 clusters, namely A, B, and C; cluster A is divided into 3 sets (Aa, Ab, and Ac) and cluster B into 3 sets (Ba, Bb, and Bc), reaching 7 CTGCs in total.
The results of scenario (i) are reported in sections 3.1 and 3.2 of the main paper.

Fig. S8: A) CONNECTOR tumor growth classes. The seven boxes result from a first run with a number of clusters equal to 3, followed by second runs on CTGC-A (with 3 sub-classes) and on CTGC-B (with 3 sub-classes). B) t-SNE visualization of the CTGCs induced on the parental tumors. Each dot corresponds to a parental tumor. The color of the dots matches the color of the assigned CTGC, see Panel A. The dimension of the dots is inversely proportional to the Shannon index calculated on the distribution of the curves of the same parental tumor across CTGCs (large dots correspond to small entropy). C) Molecular and phenotypic characterization of the CONNECTOR-clustered mCRC xenografts. Each sample was annotated according to CRIS subtype, response to cetuximab, and somatic alterations known to determine cetuximab resistance or sensitivity.
Concerning scenario (ii), the seven plots of the tumor growth clusters are reported in Figure S8 Panel A. In the scatter plot reported in Figure S8 Panel B, focusing on the color and the size of the dots, it is appreciable that the CTGCs are distinctly isolated, suggesting their coherence. In this scenario, the new CTGCs are also characterized by strong biological correlates, see Figure S8 Panel C, with a polarization of non-responder tumors in clusters Ab, Ba, Bb, Bc, and C and a strong enrichment of partial responses (PR) and growth controls (SD) in Aa. CRIS-C is associated with Aa, and well-known resistance markers are over-represented in CTGCs Ba and Bb. Moreover, differential gene expression analyses indicate a strong keratinization signal in Ba and Bb, as expected. Comparing scenario (i) with scenario (ii), it is appreciable that the division of cluster A into five CTGCs (scenario (i)) has a higher granularity, resulting in more specific A subclusters, with one enriched in PR, one in SD, and one in PD, while the division of cluster A into three CTGCs (scenario (ii)) tends to split the SD between Aa and Ab and thus dilute the signal.
mCRC PDX generation and treatment protocol After surgical removal from patients, each metastatic colorectal cancer specimen was fragmented and either frozen or prepared for implantation: it was cut into small pieces, of which 2 fragments were implanted in 2 mice. After engraftment and tumor mass formation, the tumors were passaged and expanded for 2 generations until the production of 2 cohorts, each consisting of 12 mice. Tumor size was evaluated once weekly by calliper measurements, and the approximate volume of the mass was calculated using the formula 4/3 · π · (d/2)² · (D/2), where d is the minor tumor axis and D is the major tumor axis. PDXs derived from each original fragment were then randomized for treatment with placebo (6 mice) or cetuximab (6 mice); animals with established tumors, defined as an average volume of 400 mm³, were then treated with cetuximab (Merck, White House Station, NJ) 20 mg/kg twice weekly i.p.
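The caliper-based volume formula can be written as a one-line helper (illustrative sketch only; the axes in the example are hypothetical):

```python
import math

# Tumor volume from caliper measurements, using the formula quoted above:
# V = 4/3 * pi * (d/2)^2 * (D/2), with d the minor and D the major axis.
def tumor_volume(d, D):
    return 4.0 / 3.0 * math.pi * (d / 2.0) ** 2 * (D / 2.0)

# e.g. a hypothetical mass with axes d = 8 mm and D = 12 mm
print(round(tumor_volume(8, 12), 1))  # 402.1 mm^3
```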
For assessing CTGC response to therapy, we used averaged volume measurements at 3 weeks after treatment, normalized to the tumorgraft volume at the time of cetuximab treatment initiation. Tumors were then classified as follows: 1. partial response (PR): decrease of at least 50% in tumor volume; 2. progressive disease (PD): increase of at least 35% in tumor volume; 3. stable disease (SD): anything between a 50% decrease and a 35% increase. All animal procedures were approved by the Ethical Commission of the Institute for Cancer Research and Treatment and by the Italian Ministry of Health.
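The classification rule above can be sketched as follows (an illustrative helper, not part of the published pipeline):

```python
def classify_response(v_treated, v_baseline):
    """Classify response from tumor volume 3 weeks after treatment start,
    normalized to the volume at treatment initiation (criteria above)."""
    change = (v_treated - v_baseline) / v_baseline  # relative change
    if change <= -0.5:
        return "PR"   # partial response: >= 50% shrinkage
    if change >= 0.35:
        return "PD"   # progressive disease: >= 35% growth
    return "SD"       # stable disease: in between

print(classify_response(180, 400))  # PR
print(classify_response(600, 400))  # PD
print(classify_response(420, 400))  # SD
```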
Transcriptional analyses Differentially expressed genes (DEGs) between different CTGCs, or between keratin-high and keratin-low groups, were obtained using the R package DESeq2 (v1.26.0) Love et al. (2014) with design: batch+cluster. Here batch is used to correct for sequencing batches and cluster to select samples belonging to the CTGC of interest (or to the keratin-high and keratin-low groups). We did not compare CTGC clusters for which fewer than 5 samples were available. Genes with more than 5 reads in only 1 sample were removed before testing for differential expression; DEGs were identified using |LFC| >= 0.5849625 (i.e. log2(1.5)) and adjusted p-value < 0.05. Gene Ontology analyses of the upregulated or downregulated genes were performed with the R library ClusterProfiler (v3.14.3) Wu et al. (2021); Yu et al. (2012).
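DESeq2 itself is an R package; the thresholds above can nonetheless be illustrated with a small Python helper (note that 0.5849625 is log2(1.5), i.e. a 1.5-fold change):

```python
import math

LFC_CUT = math.log2(1.5)   # = 0.5849625..., the |LFC| threshold used above
PADJ_CUT = 0.05

def is_deg(log2fc, padj):
    """A gene is called differentially expressed when |log2 fold change|
    exceeds log2(1.5) and the adjusted p-value is below 0.05."""
    return abs(log2fc) >= LFC_CUT and padj < PADJ_CUT

print(round(LFC_CUT, 7))    # 0.5849625
print(is_deg(0.6, 0.01))    # True
print(is_deg(0.3, 0.001))   # False: fold change too small
```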
The selection of a robust set of markers for the epithelial stratification differentiation phenotype was performed with two criteria:
• being assigned to the GO term keratinization (GO:0031424 Sup (2022)), the most significantly enriched term across all the CTGC-Ac vs other CTGC comparisons, via direct experimental evidence, using filters for the taxon (Homo sapiens) and evidence (manual assertion) on the EBI QuickGO web site; AND
• being significantly upregulated either in Bb versus Ac or in Bc versus Ac.
The resulting genes are: KRT80, KRT7, KRT86, KRT81, KRT83, KRT6B, KRT6A, KRT74, KRT79, KRT16. To classify samples into keratin high- or low-expressing groups, quartiles of the expression levels of the ten selected genes were calculated from progressive disease (PD) samples belonging to CTGC-Ac. Each sample was then assigned ten labels according to its expression of those keratins with respect to the CTGC-Ac first and third quartiles: "high" if the expression is larger than the 3rd quartile and "low" if it is smaller than the 1st. Samples were then classified as "keratin-high" if they were assigned no low labels, and "keratin-low" if they were assigned no high labels. Z-transformed expression values for HOPX in COAD/READ patients were downloaded alongside clinical annotations from cBioPortal (Firehose legacy dataset), and patients with multiple expression values or missing expression/survival data were filtered out. Survival analysis was performed on the remaining 371 patients using the libraries survival and survminer Therneau (2022); Kassambara (2021), and the optimal threshold for HOPX expression was determined using the maximally selected rank statistic approach (maxstat, Hothorn (2017)), correcting the resulting log-rank p-value accordingly. DEG and all the enrichment and survival analyses were performed with R version 3.6.3.
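The keratin labeling rule described above can be sketched as follows (a minimal Python sketch; returning None for samples with neither high nor low labels is our assumption, as the text leaves that case open):

```python
import statistics

# `reference` maps each keratin gene to the expression values of the
# CTGC-Ac PD samples (hypothetical toy numbers below, for illustration).
def keratin_class(sample, reference):
    labels = []
    for gene, x in sample.items():
        q = statistics.quantiles(reference[gene], n=4)  # [Q1, Q2, Q3]
        if x > q[2]:
            labels.append("high")   # above the 3rd quartile
        elif x < q[0]:
            labels.append("low")    # below the 1st quartile
    if "low" not in labels and "high" in labels:
        return "keratin-high"
    if "high" not in labels and "low" in labels:
        return "keratin-low"
    return None  # unclassified (assumption)

ref = {"KRT7": [1, 2, 3, 4, 5], "KRT16": [2, 4, 6, 8, 10]}
print(keratin_class({"KRT7": 5.0, "KRT16": 9.5}, ref))  # keratin-high
print(keratin_class({"KRT7": 1.0, "KRT16": 2.5}, ref))  # keratin-low
```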

Morphological analyses
Haematoxylin and eosin staining was performed on xenografts from mice treated with vehicle (until tumors reached an average volume of 1500 mm³). Tumors were explanted, routinely processed and stained with H&E (Bio-Optica). Images were captured with the Leica LAS EZ software using a Leica DM LB microscope.

S4 Further computational methods
Naive Bayesian Classification The naive Bayesian classification is a probabilistic classifier based on Bayes' theorem. It can be efficiently exploited to assign a cluster membership to each parental tumor (from now on referred to as model) given the cluster memberships of the curves of the PDXs derived from it. Let (k_i1, ..., k_in_i) be the clusters assigned to the n_i PDX curves belonging to the i-th model. The classifier then assigns to the i-th model the cluster k̂_i ∈ {1, ..., G} maximizing the posterior probability of the cluster given the observed curve memberships.

Shannon Index The Shannon index Shannon (1948) is an information statistic index that we used to quantify curve diversity in a specific PDX model. It is computed as

H_i = − Σ_{k=1..G} p_k,i log(p_k,i),

where H_i is the Shannon index for the i-th model, G is the number of clusters, and p_k,i is the proportion of PDX curves from the i-th model belonging to the k-th cluster. Therefore, a larger value of H_i corresponds to a larger diversity of the i-th PDX model, i.e. many different clusters are assigned to the same model. If H_i is equal to zero, all the curves of the i-th PDX model belong to the same cluster.

t-SNE t-distributed Stochastic Neighbor Embedding (t-SNE) van der Maaten and Hinton (2008) is a statistical method used to visualize high-dimensional data in a two- or three-dimensional map, exploiting a nonlinear dimensionality reduction technique. t-SNE is performed on a dataset containing, for each model, the frequency distribution of the PDX curves over the CTGCs considered, that is, a vector sized as the number of CTGCs and reporting the proportion of PDX curves belonging to each CTGC. The generated scatter plot provides a spatial visualization of the agreement of the PDX curves belonging to a given model. Indeed, the size of a dot corresponds to the degree of agreement (in terms of the Shannon index) of all PDX curves belonging to that model, while the color corresponds to a specific CTGC.
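The Shannon index, together with a minimal stand-in for the Bayesian model assignment (assuming a uniform prior over clusters, under which the posterior is maximized by the most frequent label), can be sketched as:

```python
import math
from collections import Counter

def shannon_index(cluster_labels):
    """Shannon index of the curves of one PDX model: 0 when all curves
    fall in the same cluster, larger when they spread across clusters."""
    n = len(cluster_labels)
    counts = Counter(cluster_labels)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

def assign_model_cluster(cluster_labels):
    # Minimal stand-in for the Bayesian assignment: with a uniform prior
    # over clusters, the posterior is maximized by the most frequent label.
    return Counter(cluster_labels).most_common(1)[0][0]

print(shannon_index(["A", "A", "A", "A"]))    # 0.0 -> all curves agree
print(round(shannon_index(["A", "B"]), 4))    # log(2) ~ 0.6931
print(assign_model_cluster(["A", "A", "B"]))  # A
```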

S5 Model Selection Tools: numerical details.
The parametrized family of semi-metrics between curves, denoted as D_q and defined in eq. (3), is the building block for both the total tightness T defined in eq. (4) and the functional DB index (fDB) defined in eq. (5). These indexes are proposed as measures of the quality of the clustering of functional data, useful for a proper comparison between different classifications and as a guide for the user. We illustrate here the numerical details needed for their reliable calculation. Following Ferraty and Vieu (2006b), the numerical evaluation of D_q should exploit the spline approximation of the curves involved. For q = 0 the curves are not differentiated, and hence equations (S4) and (S3) give the values of the curves at any point. The integral in eq. (3) can then be evaluated using an appropriate quadrature formula on the full time grid, made of all the sampled time points of all the sampled curves, as CONNECTOR calculates curve predictions and cluster means over the entire time interval.
For q ≥ 1, the computation of successive derivatives is performed by direct calculation of their analytical form. More precisely, the estimated i-th curve is given by eq. (S4) as a linear combination of the spline basis functions s(t) (with known analytical expressions) and the coefficients η̂_i, and the differentiated estimated i-th curve can be represented as

ĝ_i⁽¹⁾(t) = s⁽¹⁾(t)ᵀ η̂_i,

where the analytical expressions of s⁽¹⁾(t) are known too, being the derivatives of the basis functions s(t). Let us remark that the functional clustering method of James and Sugar (2003) considers the spline basis s(t) obtained by the singular value decomposition of the B-spline basis for a natural cubic spline. Hence the (k, p) matrix A, whose elements are the values of the B-spline basis functions (dimension p) evaluated at the k grid time points, is decomposed as

A = U D Vᵀ,

where U and V are (k, k) and (p, p) orthogonal matrices and D is a (k, p) diagonal matrix with non-zero elements (ρ_1, ..., ρ_p). The spline basis considered in James and Sugar (2003) is given by the first p columns of the matrix U (the only columns which enter the product with the matrix D). Hence, for i = 1, ..., p, the i-th column of such a matrix is given by the linear transformation

u_i = A v_i / ρ_i,

where v_i is the i-th column of V. When q ≥ 1, the same linear transformation is applied to the matrix B whose elements are the q-th derivatives of the natural B-spline basis functions evaluated at the grid time points. Finally, the integral in definition (3) can be evaluated with a suitable quadrature formula.
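The evaluation of D_q can be sketched in a simplified setting, replacing the spline basis with a plain polynomial basis (an assumption made only to keep the example self-contained): derivatives are computed analytically and the integral is evaluated by the trapezoidal rule on the full time grid.

```python
import math

def derive(coeffs, q):
    """Coefficients of the q-th derivative of a polynomial
    f(t) = sum_j coeffs[j] * t**j, computed analytically."""
    for _ in range(q):
        coeffs = [j * c for j, c in enumerate(coeffs)][1:]
    return coeffs or [0.0]

def evaluate(coeffs, t):
    return sum(c * t ** j for j, c in enumerate(coeffs))

def D_q(c1, c2, grid, q=0):
    """Trapezoidal approximation of sqrt( integral (f1^(q) - f2^(q))^2 )."""
    d1, d2 = derive(c1, q), derive(c2, q)
    vals = [(evaluate(d1, t) - evaluate(d2, t)) ** 2 for t in grid]
    integral = sum((vals[i] + vals[i + 1]) / 2 * (grid[i + 1] - grid[i])
                   for i in range(len(grid) - 1))
    return math.sqrt(integral)

grid = [i / 100 for i in range(101)]   # full time grid on [0, 1]
f = [0.0, 1.0]                          # f(t) = t
g = [0.0, 0.0, 1.0]                     # g(t) = t^2
print(round(D_q(f, g, grid, q=0), 3))   # distance between the curves
print(round(D_q(f, g, grid, q=1), 3))   # distance between their derivatives
```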

S6 Functional Clustering Model: details and fitting algorithm
Notice that λ₀, Λ and α_k are confounded if no constraints are imposed. Hence, we require

Σ_{k=1..G} α_k = 0,

meaning that s(t)ᵀλ₀ may be interpreted as the overall mean curve. Moreover, we impose a second identifiability condition on Λ, the reason for which is well explained in James and Sugar (2003). Notice that the k-th cluster mean curve can be retrieved as

ḡ_k(t) = s(t)ᵀ(λ₀ + Λα_k).

Moreover, the functional clustering procedure can accurately predict unobserved portions of the curves g_i(t) by means of the natural estimate

ĝ_i(t) = s(t)ᵀη̂_i,

where η̂_i is a prediction of η_i, which is proven to be optimally computed as E(η_i | Y_i) and is explicitly given in James and Sugar (2003), eq. (17).

S6.1 The fitting algorithm
Fitting the model consists of estimating the parameters λ₀, Λ, α_k, and σ², included in eq. (2). The cluster membership is treated as missing data through a multinomial random variable z_i with parameters π = (π_1, ..., π_G), where π_k is the probability that an observation belongs to cluster k.
As the cluster memberships z_i and the γ_i are assumed to be independent, the complete log-likelihood can be factorized as follows: the z_i's log-likelihood in the first row (S5), the γ_i's log-likelihood in the second row (S6), the γ_i's being ∼ N(0, Γ), and the Y_i's conditional log-likelihood in the last two rows (S7), the Y_i's being ∼ N(S_i(λ₀ + Λα_k), σ²I). The EM algorithm iteratively maximizes the expected values of (S5), (S6) and (S7), given Y_i and the current parameter estimates. Hence the algorithm iterates through the following steps:
1. Compute the membership probabilities. As the z_ik are actually unknown, we calculate their expected values by Bayes' theorem,

π_k|i = π_k f_k(Y_i) / Σ_{h=1..G} π_h f_h(Y_i),

where f_k is the conditional density of the i-th curve given that it belongs to the k-th cluster, that is, multivariate Gaussian with mean vector S_i(λ₀ + Λα_k) and covariance matrix σ²I + S_i Γ S_iᵀ, see eq. (2).
2. E-M steps for (S5). In the E-step the expected log-likelihood is computed by substituting π_k|i for z_ik, and in the M-step this quantity is maximized.
The result can be explicitly calculated as π̂_k = (1/N) Σ_{i=1..N} π_k|i.
3. E-M steps for (S6). The γ_i's are multivariate Gaussian with zero mean vector and covariance matrix Γ, and the Y_i's, conditional on the i-th curve belonging to the k-th cluster, are multivariate Gaussian as well, with mean vector μ_Yik = S_i(λ₀ + Λα_k) and covariance matrix Σ_Yik = σ²I + S_i Γ S_iᵀ. Hence the conditional distribution of γ_i | Y_i, z_ik = 1 is again Gaussian, with mean vector and covariance matrix

μ_ik = Γ S_iᵀ Σ_Yik⁻¹ (Y_i − μ_Yik),    Σ_ik = Γ − Γ S_iᵀ Σ_Yik⁻¹ S_i Γ.

These equations are obtained considering that the covariance matrix between the γ_i's and the Y_i's is Γ S_iᵀ. Hence the expectation of (S6) is maximized by the sample covariance matrix Γ̂, given in eq. (S10). The last part of the complete log-likelihood contains the parameters λ₀, α_k, Λ and σ². Its expectation is repeatedly maximized until convergence, calculating explicitly the points of maximum for λ₀ and α_k (equations (S11) and (S12)) and for each column of Λ (equation (S13)), where α̂_k,c denotes the c-th component of α̂_k. Finally, the expected log-likelihood is maximized with respect to σ² at the point given in eq. (S14). The fitting algorithm is outlined in Box 1. Notice that the initial guess for the cluster memberships of the curves is calculated through a k-means clustering algorithm on the spline coefficients η_i, see eq. (1).

Box 1 /* log-likelihood maximization */
1. Take initial guesses for π_k, Γ, λ₀, α_k, Λ, σ²;
2. Expectation step: compute the membership probabilities π_k|i using equations (S9) and (S8);
3. Maximization step:
3.a compute Γ̂, given in eq. (S10);
3.b iteratively until convergence, compute λ̂₀, α̂_k and Λ̂_c, given in equations (S11), (S12) and (S13);
3.c compute σ̂², given in eq. (S14);
4. Iterate steps (2) and (3) until all the parameters have converged.
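The E/M alternation outlined above can be illustrated in a drastically simplified setting: scalar observations, no spline design matrix and no random effects, i.e. a plain two-component Gaussian mixture. The loop structure (membership probabilities by Bayes' theorem, then closed-form maximization) mirrors the steps of the box, not the full functional model:

```python
import math, random

def em_mixture(y, n_iter=200):
    mu = [min(y), max(y)]            # initial guesses (step 1)
    pi, sigma2 = [0.5, 0.5], [1.0, 1.0]
    for _ in range(n_iter):
        # E-step: membership probabilities via Bayes' theorem (step 2)
        resp = []
        for yi in y:
            f = [pi[k] / math.sqrt(2 * math.pi * sigma2[k])
                 * math.exp(-(yi - mu[k]) ** 2 / (2 * sigma2[k]))
                 for k in range(2)]
            s = sum(f)
            resp.append([fk / s for fk in f])
        # M-step: closed-form maxima of the expected log-likelihood (step 3)
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(y)
            mu[k] = sum(r[k] * yi for r, yi in zip(resp, y)) / nk
            sigma2[k] = sum(r[k] * (yi - mu[k]) ** 2
                            for r, yi in zip(resp, y)) / nk
    return pi, mu

random.seed(0)
y = [random.gauss(0, 1) for _ in range(200)] + \
    [random.gauss(8, 1) for _ in range(200)]
pi, mu = em_mixture(y)
print(sorted(round(m) for m in mu))   # cluster means near 0 and 8
```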

S7.1 Accuracy of the model selection procedure
In this Section, we illustrate a simulation study whose purpose is to assess the performance of CONNECTOR for different choices of the free parameters: p, the spline basis dimension, and G, the number of clusters.
We run CONNECTOR on a simulated dataset, made of points sampled at discrete times from the trajectories of three time-continuous Gaussian processes, (X^A_t)_{t≥0}, (X^B_t)_{t≥0} and (X^C_t)_{t≥0}. The three processes share the same covariance function K, depending on the parameters l, t₀ and c. The mean of each process is given by the function m with three different choices of the parameter µ, namely E(X^A_t) = m(t, µ_A), E(X^B_t) = m(t, µ_B) and E(X^C_t) = m(t, µ_C). In particular, the parameters have been set to the values µ_A = 6, µ_B = 0, and µ_C = −6. The dataset consists of n = 50 curves for each process, hence N = 150 sampled curves altogether; the parameters shared by the three processes are set to 10⁴, l = 1, t₀ = 10 and c = 10². The time points at which the trajectories are sampled are randomly chosen, both in their number and in their position: for each trajectory, the number of sampling time points is given as max(3, r), with r ∼ Binomial(7, 0.5), and the points are uniformly distributed in the time interval [1, 22]. The dataset is illustrated in Fig. S10-panel A and colored by label: A for X^A, B for X^B, and C for X^C. Setting p. As the dataset contains the true labels (A, B, and C), we can evaluate the performance of CONNECTOR as a classifier, by setting G = 3 and by finding a correspondence between the CONNECTOR clusters and the groups of identically labeled curves, that is, the groups of curves sampled from X^A_t, X^B_t and X^C_t. The point of this experiment is the choice of p, the dimension of the spline basis. This parameter defines the flexibility of the functional clustering model (2): larger values of p correspond to higher flexibility, hence smaller bias and larger variance. Good values of this parameter are expected to be central, with a preference for small values in order to reduce the number of parameters to be estimated. The cross-loglikelihood plot, see Fig. S9, suggests choosing p = 4.
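The sampling design of this simulated dataset can be sketched as follows; the mean function used here is a hypothetical linear ramp and the Gaussian-process covariance K is replaced by independent noise, both assumptions made only to keep the example self-contained:

```python
import random

def sample_times(rng):
    # number of time points: max(3, r) with r ~ Binomial(7, 0.5),
    # positions uniform on [1, 22], as described above
    r = sum(rng.random() < 0.5 for _ in range(7))
    return sorted(rng.uniform(1, 22) for _ in range(max(3, r)))

def sample_curve(rng, mu, noise=0.1):
    # hypothetical mean m(t, mu) = mu * t / 22, for illustration only
    times = sample_times(rng)
    return times, [mu * t / 22 + rng.gauss(0, noise) for t in times]

rng = random.Random(1)
dataset = [sample_curve(rng, mu) for mu in (6, 0, -6) for _ in range(50)]
print(len(dataset))                          # N = 150 sampled curves
print(min(len(t) for t, _ in dataset) >= 3)  # True: at least 3 points each
```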
This choice is motivated by the compromise between a large log-likelihood and a small number of parameters to be estimated. It is furthermore supported by the fDB index, see eq. (5), and by the total tightness T, see eq. (4), calculated as p varies, for G = 3 and for the most frequent output clustering, see Fig. S11-panels A and C. The value p = 4 corresponds to the minimum fDB and to an elbow in the total tightness plot. The stability of the clustering is reported in Fig. S11-panel B and is sufficiently high for p = 4. The clustered curves are illustrated in Fig. S10-panel B and colored by the true label. The examination of the plots of the curves in the CONNECTOR clusters leads to a reasonable link between CONNECTOR cluster 1 and label A, CONNECTOR cluster 2 and label C, and CONNECTOR cluster 3 and label B.
The performance of CONNECTOR in clustering together sampled curves from trajectories of processes with the same label is evaluated by plotting the True Positive (TP) rate and the False Positive (FP) rate, for each label, in Fig. S11-panels D, E and F. The value p = 4 is the only value at which, for all three labels, the TP rate is large and the FP rate is small. Notice that label B is the label that is most often confused. This is not surprising, as the process X^B_t has zero mean, hence its trajectories are driven by the variance. With a small number of sampling points, a small positive fluctuation can be wrongly classified in CONNECTOR cluster 1, while a small negative fluctuation can be wrongly classified in CONNECTOR cluster 2. The best clustering reported in Fig. S10-panel B shows that indeed only one curve sampled from C is wrongly assigned to CONNECTOR cluster 3 and some curves sampled from B are wrongly assigned to CONNECTOR clusters 1 and 2. However, it is important to emphasize that all the wrong assignments actually fit very well within the designated CONNECTOR cluster.

[Figure: distribution of the time points over the grid, with knots and boundary knots.]

Setting G. In this second experiment, we show the performance of the proposed method for choosing G, that is, the analysis of the fDB violin plots, the tightness violin plots, and the stability matrices. In Fig. S12 the plots are reported, for the suggested choice p = 4 and for G = 2, 3, 4.
The values of the fDB indices are very good: the distributions of the indices over different runs lie entirely below the constant line at 0.6 for both G = 2 and G = 3 (fDB values smaller than 1 are considered good). The case G = 4 suffers from higher instability, as shown by the dots around the values 0.65 and 1.2 and by the stability matrices. Hence, we suggest G = 3 as the optimal number of clusters.
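A Davies-Bouldin-type index of this kind can be sketched for curves evaluated on a common grid, as a simplified stand-in for the fDB of eq. (5) (which uses the semi-metrics D_q); smaller values indicate compact, well-separated clusters:

```python
import math

def l2(f, g):
    """Discrete L2 distance between two curves on a common grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))

def db_index(clusters):
    """clusters: list of clusters, each a list of equal-length curves.
    S_k is the mean distance of the curves of cluster k from its mean."""
    means = [[sum(col) / len(c) for col in zip(*c)] for c in clusters]
    S = [sum(l2(f, m) for f in c) / len(c) for c, m in zip(clusters, means)]
    G = len(clusters)
    return sum(max((S[k] + S[h]) / l2(means[k], means[h])
                   for h in range(G) if h != k) for k in range(G)) / G

tight = [[[0, 0], [0.1, 0.1]], [[5, 5], [5.1, 5.1]]]   # well separated
loose = [[[0, 0], [2, 2]], [[3, 3], [5, 5]]]           # overlapping spread
print(db_index(tight) < db_index(loose))               # True: smaller is better
```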

S7.2 Robustness for the spline basis dimension choice
In this Section we give some additional details on the performance of CONNECTOR for different choices of the spline basis dimension p, that is, when setting p as suggested by CONNECTOR and when setting p to other values.
To explore this issue, we run CONNECTOR on the PDX growth curves analyzed in Section 3, for different choices of the parameter p. For all the values p = 3, 4, 5, 6, 7, 8, the fDB, tightness, and stability indices suggest choosing G = 4, similarly to the case p = 4 reported in Fig. S2. Hence in Fig. S13 all curves are plotted with G = 4.
The cross-loglikelihood plot in Fig. S13 suggests choosing p = 4: it is the value at which the cross-loglikelihood reaches a level that is then maintained steadily up to p = 8, and it is small enough to be conservative in the number of parameters to estimate, keeping the model flexible but not overfitting. The fDB plot shows that the best clusterings are obtained for p = 3 and p = 4, where the fDB is smallest. The stability and tightness values confirm that the best choices are p = 3 and p = 4.
As the parameter p determines the flexibility of the model, smaller values are associated with models showing larger bias and smaller variance, while larger values are associated with models that fit the points of the curves more closely, at the cost of increased variance. This can be appreciated in Fig. S14, where the cluster centers (the mean curves of each cluster) are plotted for the three values p = 4, 5, 6. As p increases, the estimated mean curves evidently exhibit more fluctuations, the result of a tighter fit to the data points, but the resulting estimation of the curves, and hence of the clustering, deteriorates (as indicated by the resulting indices).
However, we should notice that CONNECTOR is very robust to choices of the parameter p which are less than optimal. Indeed up to p = 5 the indices of the resulting clusterings are all very good.

S7.3 Robustness for decreasing numbers of sampled curves and sampling times.
In this Section, we assess the performance of CONNECTOR when the dataset is downsampled. Two situations lead to a poor dataset in this framework: the first is that the number of sampled curves is small; the second is that the curves are sampled at only a few time points.
The experiment setup is designed as follows: we start from the PDX growth curves dataset analyzed in Section 3 and we perform two degradation steps, each deleting 25% of the sampling time points at random. After these two steps, we have three datasets: the original dataset, a dataset reduced to approximately 75% of the original sampling time points, and a dataset reduced to approximately 50% of the original sampling time points. The characteristics of the three datasets are illustrated in Fig. S15. The number of sampling time points is indeed reduced, see Fig. S15-panel A, and so is the number of curves, see Fig. S15-panel B. Indeed, we chose to run CONNECTOR on the curves with a minimum length equal to three, i.e. with at least three sampling time points; hence a reduction of the number of sampling time points leads to a reduction of the number of curves as well. However, the distribution of the number of sampling time points per curve is similar in the three datasets, see Fig. S15-panel C, being around the value 4, with few exceptions.
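One degradation step can be sketched as follows (an illustrative sketch on a synthetic dataset: random deletion of 25% of the sampling time points, followed by removal of curves left with fewer than three observations):

```python
import random

def degrade(curves, frac=0.25, min_len=3, rng=None):
    rng = rng or random.Random()
    # index every (curve, position) pair and pick `frac` of them to delete
    points = [(i, j) for i, c in enumerate(curves) for j in range(len(c))]
    drop = set(rng.sample(points, int(len(points) * frac)))
    reduced = [[t for j, t in enumerate(c) if (i, j) not in drop]
               for i, c in enumerate(curves)]
    # keep only curves with at least `min_len` sampling time points
    return [c for c in reduced if len(c) >= min_len]

rng = random.Random(42)
curves = [sorted(rng.uniform(0, 100) for _ in range(5)) for _ in range(40)]
step1 = degrade(curves, rng=rng)            # ~75% of the original points
step2 = degrade(step1, rng=rng)             # ~50% of the original points
print(len(curves), len(step1), len(step2))  # the number of curves shrinks too
```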

S8 Performance of different clustering methods on sparse and irregularly sampled curves
In this Section, we review the performances of different clustering procedures that could be implemented on longitudinal data. Clustering with Classical Growth Models. As we address growth curves in this paper, we cannot skip mentioning the classical growth models that have been introduced in the literature. In the following, we review the main growth models (Malthus, Gompertz, and logistic) and the performance of a clustering method based on the direct estimation of the parameters involved. Indeed, each sampled curve could be represented as the array of the estimated values of the parameters of the fitted growth model. In doing so, the full sample of curves becomes a sample of points in R n , where n is the number of parameters of the growth model, and the clustering procedure can be performed using classical clustering methods for data points. Let us briefly review the most popular growth models. These models are usually formulated in terms of ordinary differential equations (ODEs) that relate the growth rate of the tumor to its current state and range from the simple one-parameter exponential growth model to more advanced models that contain more parameters.
The Malthus model was one of the first mathematical models introduced to study exponential population growth. It assumes that every cell continuously passes through the cell cycle, giving birth to two daughter cells at regular intervals. Then the number of cancer cells, and therefore the volume of the tumor, increases exponentially over time. In particular, the Malthus model is described by the following ODE:
$$\frac{dV(t)}{dt} = \alpha\, V(t), \qquad V(t_0) = V_0,$$
where $V(t)$ is the total tumour volume at time $t$, $\alpha$ is a constant growth rate, and $V_0$ is the tumour volume at the initial time $t_0$. Successively, the Gompertz model was derived as a generalisation of the Malthus model, assuming that the relative growth rate is not constant but time-dependent, decreasing at a constant positive rate $\beta$:
$$\frac{dV(t)}{dt} = \alpha\, e^{-\beta t}\, V(t), \qquad V(t_0) = V_0.$$
Finally, in the logistic model the relative growth rate decreases linearly as the population size increases, according to the assumption that resources are limited. The ODE defining the model is:
$$\frac{dV(t)}{dt} = \alpha\, V(t)\left(1 - \frac{V(t)}{K}\right), \qquad V(t_0) = V_0,$$
where $\alpha$ is a coefficient related to the population kinetics and $K$ is the so-called carrying capacity, i.e. the maximum population size a particular environment can support. For each of the models above, a clustering algorithm can be implemented as a two-step procedure:
1. For every $i$, estimate the model parameters by fitting the $i$-th sampled curve. The result is an array $(p_{i1}, \dots, p_{in})$, where $n$ is the number of model parameters; each sampled curve is now represented as a data point in dimension $n$.
2. Cluster the full sample of data points $\{(p_{i1}, \dots, p_{in})\}_{i=1}^{N}$ with a classical clustering algorithm. We chose a k-means algorithm.
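The two-step procedure can be sketched in a few lines. This is an illustrative Python sketch under our own assumptions (Malthus model only, noiseless toy curves, function names of our choosing), not the implementation used for the tables below: each curve is fitted by nonlinear least squares, and the resulting parameter vectors are clustered with k-means.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.cluster import KMeans

def malthus(t, v0, alpha):
    # closed-form solution of dV/dt = alpha * V with V(0) = v0
    return v0 * np.exp(alpha * t)

def cluster_by_growth_params(curves, n_clusters=2, seed=0):
    """Step 1: fit the growth model to each curve.
    Step 2: k-means on the estimated parameter vectors."""
    params = []
    for t, v in curves:
        (v0, alpha), _ = curve_fit(malthus, t, v, p0=(v[0], 0.1), maxfev=10000)
        params.append([v0, alpha])
    X = np.asarray(params)
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(X)

# toy example: two groups of exponential curves with distinct growth rates
t = np.linspace(0.0, 10.0, 8)
slow = [(t, 1.0 * np.exp(0.10 * t)) for _ in range(5)]
fast = [(t, 1.0 * np.exp(0.40 * t)) for _ in range(5)]
labels = cluster_by_growth_params(slow + fast, n_clusters=2)
```

For the Gompertz and logistic models, only the model function and its parameter vector change; the two-step structure is identical.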
We tested this procedure; its performance is reported below.
The comparison setup. We tested the clustering with classical growth models on four sets of growth curves, see Figure S17. The four CONNECTOR analyses are reported at https://qbioturin.github.io/connector/examples/.
For each sample, we first fitted the Malthus, Gompertz and logistic models. The estimated parameters for each curve in each of the four samples are reported in Tables S4, S5, S6 and S7.
The data points are then clustered through a k-means algorithm. The optimal number of clusters is determined by evaluating the total tightness (eq. (4) in the main paper) and the fDB index (eq. (5) in the main paper), computed using the distance between curves (eq. (3) in the main paper). The quality of the final clustering of the corresponding curves is evaluated by the fDB index. The results are reported in Table S8. These values will be compared with the values of the same index obtained by clustering the same samples with functional methods, discussed in the following paragraph. As a first comment, notice that the classical models are not flexible enough to fit the curves in test 4, which exhibit more complex dynamics than a "simple" growth.
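Selecting the number of clusters by minimising a Davies-Bouldin-type index can be sketched as follows. This is a simplified illustration, not CONNECTOR's fDB: it uses scikit-learn's ordinary (finite-dimensional) Davies-Bouldin score on the parameter points, whereas the paper's fDB is computed from distances between curves; the function name `best_k` and the toy data are our own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def best_k(X, k_range=range(2, 6), seed=0):
    """Run k-means for each candidate k and keep the k that minimises
    the Davies-Bouldin index (lower is better)."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X)
        scores[k] = davies_bouldin_score(X, labels)
    return min(scores, key=scores.get), scores

# three well-separated parameter clouds: the index should select k = 3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.05, size=(20, 2)) for c in (0.0, 1.0, 2.0)])
k, scores = best_k(X)
```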
Review and Comparison of Functional Clustering Methods. Clustering functional data is generally a difficult task because of the infinite-dimensional space from which data are sampled. Different approaches have been proposed over the years; see the reviews in Müller (2005) and Jacques and Preda (2014).
The most popular approach consists of reducing the problem to a finite-dimensional setting by approximating the data with elements of some finite-dimensional space, after which clustering algorithms for finite-dimensional data can be run. This dimension-reduction step, often called the filtering step, consists in projecting the curves onto a finite basis of functions. Spline bases are among the most common choices because of their optimal properties. Another dimension-reduction technique is functional principal component analysis, based on the Karhunen-Loève expansion of a square-integrable $L^2$ stochastic process. The implementation of functional principal components also requires some form of regularisation, which can be achieved with smoothing methods, see Ramsay and Silverman (2005). On the other hand, nonparametric methods for clustering have been proposed, see Ferraty and Vieu (2006a). They generally consist in defining specific distances or dissimilarities for functional data and then applying clustering algorithms such as hierarchical clustering or k-means.
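The filtering approach can be sketched as follows: each curve is summarised by its least-squares coefficients on a common cubic B-spline basis, and the coefficient vectors are then clustered with a standard algorithm. This is an illustrative Python sketch with toy data and a knot vector of our own choosing; it assumes all curves share one time grid, which real sparse data would not.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline
from sklearn.cluster import KMeans

def spline_coefficients(t, v, knots, degree=3):
    """Filtering step: least-squares projection of one curve onto a cubic
    B-spline basis; the coefficient vector is the finite-dim. summary."""
    return make_lsq_spline(t, v, knots, k=degree).c

t = np.linspace(0.0, 1.0, 20)
# boundary knots with multiplicity 4 plus one interior knot -> 5 coefficients
knots = np.r_[[t[0]] * 4, [0.5], [t[-1]] * 4]

# toy sample: five increasing and five decreasing curves
curves = [np.exp(2 * t) + 0.01 * i for i in range(5)] + \
         [np.exp(-2 * t) + 0.01 * i for i in range(5)]
X = np.vstack([spline_coefficients(t, v, knots) for v in curves])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```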

Finally, model-based clustering techniques have been developed. In this case the observations are modeled as mixture distributions. Two main directions have been explored: one models the functional principal component scores, Bouveyron and Jacques (2011), and the other directly models the expansion coefficients in a finite basis of functions, James and Sugar (2003).
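The second direction can be illustrated with a simplified sketch: fit each curve's B-spline coefficients and model them with a Gaussian mixture, each component playing the role of one cluster. This is only a caricature of the James and Sugar (2003) model, which additionally handles sparse, curve-specific time grids and measurement error; the toy curves and basis below are our own assumptions.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline
from sklearn.mixture import GaussianMixture

t = np.linspace(0.0, 1.0, 20)
# cubic B-spline basis with one interior knot -> 5 coefficients per curve
knots = np.r_[[t[0]] * 4, [0.5], [t[-1]] * 4]

# toy sample: two latent groups of curves with distinct shapes
curves = [np.sin(np.pi * t) + 0.01 * i for i in range(5)] + \
         [t ** 2 + 0.01 * i for i in range(5)]
X = np.vstack([make_lsq_spline(t, v, knots, k=3).c for v in curves])

# mixture model on the expansion coefficients: one component per cluster
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(X)
labels = gmm.predict(X)
```

Constraining the component covariances (here, to diagonal form) is the same idea that yields the funHDDC submodels discussed below.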
The advantages and disadvantages of each method are summarized in Jacques and Preda (2014); we do not enter the discussion here and refer the reader to the original paper. As the conclusion of the critical analysis presented in Jacques and Preda (2014) is that model-based techniques perform better, we decided to rely on such techniques, in particular on the two methods presented in James and Sugar (2003) and in Bouveyron and Jacques (2011). We tested both techniques on the "general growth curve" type of data that is the object of this paper. The comparison setup. The two selected clustering procedures are both implemented in software: the procedure introduced in Bouveyron and Jacques (2011) is implemented in the R package 'funHDDC', maintained and available for download from CRAN, while the method by James and Sugar (2003) is implemented in an R function 'fclust', available directly from James's webpage http://faculty.marshall.usc.edu/gareth-james/index.html. We tested both methods on the four sets of growth curves from Figure S17, the same datasets on which the clustering with classical growth models was tested, see Table S8. The results are summarised in Table S9. Notice that 'funHDDC' includes a model selection function to help the user set the optimal number of clusters, while 'fclust' has been integrated with the model selection procedures introduced in CONNECTOR. Moreover, the functional latent mixture model presented in Bouveyron and Jacques (2011) can be reduced to different submodels by constraining model parameters within or between groups. Here we tested the submodels denoted as $[a_{kj}, b_k, Q_k, d_k]$, $[a_k, b_k, Q_k, d_k]$, $[a_{kj}, b, Q_k, d_k]$, and $[a_k, b, Q_k, d_k]$.

Notice that both fclust and funHDDC yield very good fDB indexes compared with those reported in Table S8 for the classical models. Both functional clustering methods are based on very flexible models that can adapt to the data; by contrast, the simple classical models have too few degrees of freedom and cannot adjust to the sampled curves.
As shown in Table S9, the method by James and Sugar (2003) leads to far better results. The conclusion of this comparison complements the study performed in Jacques and Preda (2014): here, strongly sparse and irregularly sampled curves are considered, with few points per curve (∼7 to 20 instead of ∼31 to 241).