pyaging: a Python-based compendium of GPU-optimized aging clocks

Abstract Motivation Aging is intricately linked to diseases and mortality. It is reflected in molecular changes across various tissues which can be leveraged for the development of biomarkers of aging using machine learning models, known as aging clocks. Despite advancements in the field, a significant challenge remains: the lack of robust, Python-based software tools for integrating and comparing these diverse models. This gap highlights the need for comprehensive solutions that can handle the complexity and variety of data in aging research. Results To address this gap, I introduce pyaging, a comprehensive open-source Python package designed to facilitate aging research. pyaging harmonizes dozens of aging clocks, covering a range of molecular data types such as DNA methylation, transcriptomics, histone mark ChIP-Seq, and ATAC-Seq. The package is not limited to traditional model types; it features a diverse array, from linear and principal component models to neural networks and automatic relevance determination models. Thanks to a PyTorch-based backend that enables GPU acceleration, pyaging is capable of rapid inference, even when dealing with large datasets and complex models. In addition, the package’s support for multi-species analysis extends its utility across various organisms, including humans, various mammals, and Caenorhabditis elegans. Availability and implementation pyaging is accessible on GitHub, at https://github.com/rsinghlab/pyaging, and the distribution is available on PyPi, at https://pypi.org/project/pyaging/. The software is also archived on Zenodo, at https://zenodo.org/doi/10.5281/zenodo.10335011.


Introduction
As we entered the 21st century, longevity studies became the cornerstone of aging research in various model organisms.The span of these studies ranged from a few days in Caenorhabditis elegans to several weeks in Drosophila melanogaster, extending up to a few years in Mus musculus.This spectrum allowed for manageable daily mortality tracking in the burgeoning field of gerontology.Nonetheless, the Graphical Abstract feasibility of lifespan studies, both in terms of time and cost, remains a significant challenge.The transformative work by Horvath in 2013 marked a pivotal moment, introducing a reliable age predictor and catalyzing a new domain of research focused on the development and refinement of biomarkers of aging, healthspan, and lifespan (Horvath 2013).
Presently, the field boasts over a hundred aging clocksmachine learning models designed to predict various aspects of aging.DNA methylation, undoubtedly the most popular data type for constructing aging biomarkers, is complemented by other molecular signatures like transcriptomics, proteomics, blood chemistry, histone modification, and chromatin accessibility, each offering unique advantages.However, there exists a notable gap in software tools that consolidate these diverse aging clocks for comparative analysis.A few notable initiatives, such as the R packages methylclock (Peleg� ı-Sis� o et al. 2021) and methylCYPHER (Thrush et al. 2022), represent steps toward addressing this need.
Yet, the development of aging biomarkers is not without its challenges, as underscored in a recent perspective (Moqri et al. 2023).Current software tools in this domain face several limitations: (i) while popular in biology, the prevalent use of R, an ad-hoc object-oriented programming system, often lacks the versatility needed for complex models like neural networks, which are more effectively implemented in languages such as Python; (ii) most existing age prediction packages are limited to a handful of clocks, far fewer than the actual breadth of available models; (iii) a focus predominantly on DNA methylation biomarkers narrows the scope for cross-comparison across different molecular layers; (iv) the lack of nonlinear techniques, such as neural networkbased approaches like AltumAge (Galkin et al. 2021, de Lima Camillo et al. 2022); (v) the reliance on CPU processing results in slower inference, especially with larger datasets and more complex models; (vi) a species-specific focus, predominantly on Homo sapiens.

Materials and methods
The development of pyaging commenced with an extensive review of the literature to identify a diverse array of aging clocks, encompassing various data types, computational models, and species, as summarized in Table 1.
Each identified model was subsequently reimplemented with a PyTorch backend to enable GPU-accelerated computations.This approach takes advantage of the fact that aging clocks, at their core, often rely on matrix multiplications, particularly within the domain of linear models, which are prevalent in aging research.
For a linear model, let b represent the vector of coefficients (including the intercept β 0 ), ɛ the vector of error terms, X the matrix of independent variables, and y the vector of dependent variable observations.The linear model can then be expressed algebraically as: This equation can be succinctly represented in matrix form as: where y 2 R m is the vector of dependent variables for m samples, X 2 R m×ðnþ1Þ is the matrix of independent variables (with the first column being a vector of ones to incorporate the intercept term), b 2 R ðnþ1Þ is the vector of coefficients including the intercept, and ɛ 2 R m is the vector of errors.
In the context of PC-based clocks, the model can be extended to incorporate dimensionality reduction via principal component analysis (PCA) before applying the linear model: Here, l 2 R n denotes the mean vector for each independent variable, 1 is a column vector of ones of length m used to broadcast the mean subtraction across all samples, W 2 R n×p represents the rotation matrix derived from PCA, and p is the number of principal components retained.The term ðX − 1l > ÞW thus represents the projection of centered data onto the principal components, upon which the linear model is applied.
This framework demonstrates that a variety of aging clocks fundamentally rely on matrix operations, making them wellsuited for implementations that leverage the computational efficiencies of modern GPU architectures.
The implementation of age prediction in pyaging begins with preprocessing the data matrix.Missing values are imputed using methodologies ranging from simple mean imputation to more sophisticated techniques like KNN imputation.The input matrix is then curated to retain only the features pertinent to the selected clock, with any absent features being substituted with standardized values for the clock of interest if available or with zeros.This approach is adopted to accommodate the diversity of data types handled by pyaging, and users are duly alerted of such substitutions.

Results
To demonstrate the capabilities of the pyaging package, I briefly analyzed 38 methylation aging clocks and biomarkers of aging using data from my previous work on AltumAge (de Lima Camillo et al. 2022).The dataset comprises �13 000 multi-tissue human samples from fetal tissue to centenarians across 142 studies, featuring beta values from overlapping probes of Illumina's 27k, 450k, and EPIC arrays.pyaging facilitates fast and easy comparisons amongst the different models.See Supplementary Data to reproduce the figures and analyses.
First, with a single line of code, the output of 38 different biomarkers can be calculated.Through hierarchical clustering and Spearman correlation, expected patterns emerge (Fig. 1a).For instance, DNAmTL (Lu et al. 2019), and PCDNAmTL (Higgins-Chen et al. 2022), both estimating telomere length, cluster together.Similarly, the three human multi-tissue clocks that predict chronological age, i.e.AltumAge (de Lima Camillo et al. 2022), Horvath2013 (Horvath 2013), and PCHorvath2013 (Higgins-Chen et al. 2022), are grouped.However, there are some interesting observations.For instance, DunedinPACE (Belsky et al. 2022), a measure of the pace of aging, is proximate to PedBE (McEwen et al. 2020), an aging clock for children and adolescents.In addition, PCGrimAge (Higgins-Chen et al. 2022), a predictor of mortality, is near Mammalian1 (Lu et al. 2023), the pan-mammalian clock that predicts chronological age.Overall, pyaging makes it straightforward to contrast the performance of distinct aging clocks.
Second, given that many biomarkers are discordant, samples can be grouped into ageotypes (Ahadi et al. 2020).To visualize such behaviors, I ran PCA on the data matrix with the scaled result of the 38 models, followed by uniform manifold approximation and projection (UMAP) on the top five components for further dimensionality reduction (Fig. 1b-e).A chronological age predictor, a mortality predictor, and a pace of aging predictor not always agree with one another.For instance, there are islands in which AltumAge is low (Fig. 1b) but GrimAge2 (Lu et al. 2022) is high (Fig. 1c).Similarly, some clusters exhibit diverging patterns with DunedinPACE (Fig. 1d).Lastly, given that all samples are from human, the MammalianLifespan (Li et al. 2023) predicts roughly the same number for the entire data.In summary, given that different clocks measure different phenomena, pyaging makes it easy to better understand aging profiles.
Third, a burgeoning field of research within the aging research community is age reversal through epigenetic reprogramming (de Lima Camillo and Quinlan 2021, Simpson et al. 2021, Paine et al. 2023).With the expression of four transcription factors, it has been shown that the predicted age with the methylation clock Horvath2013 (Horvath 2013) is decreased to zero (Olova et al. 2019).To shine more light upon this process, I ran 39 clocks in a reprogramming dataset [GSE54848 (Ohnuki et al. 2014)].To better compare different clocks, I scaled the data with one as the maximum and zero as the minimum of each clock (Fig. 1f).Whilst most clocks indeed show a rejuvenation event, more markedly between days 10 and 20 of reprogramming, a few such as telomere length predictors DNAmTL and PCDNAmTL increase.Others do not change meaningfully, such as the centerian predictor ENCen100 (Dec et al. 2023).Excitingly, some clocks such as AltumAge display a drop in predicted age at Day 3 while others such as some PC clocks only show rejuvenation at Day 11.This type of analysis can guide wet lab experiments as a tentative rejuvenation event might be missed depending on the clock used.
Fourth, one of the main advantages of the package is speed.I compared pyaging with biolearn (Ying et al. 2023), a preliminary CPU-based biomarker package.To compare their performance, I predicted the ages of the AltumAge data with two linear models, Horvath2013 and DNAmPhenoAge (Levine et al. 2018)-more complex clocks that would benefit the most from GPU acceleration, such as AltumAge, were not available on biolearn at the time of writing.I timed the line in which age is predicted for both packages given ten random samples of different sizes (Fig. 1g  and f).At the lower end, pyaging displays a minor advantage with 1024 samples for the average of both clocks (0.233 versus 0.386 s).Nevertheless, the fold difference in time quickly increases with a larger sample size, with a roughly 120-fold difference with 32768 samples (0.642 versus 76.608 s).Moreover, the setting with the highest number of samples, 65536 ran out of memory with biolearn and could not be completed.While the absolute time is not substantial, given increasing data sizes and complexity of models, this will become more significant as the field develops.This becomes increasingly important as age predictors are developed for single cells given the usual large number of observations.Overall, this comparison highlights the power of GPU-acceleration enabled by pyaging.

Conclusion
Despite the abundance of aging clocks developed, a critical gap remains in integrating these diverse models for comprehensive analysis, a need only partially addressed by existing tools like methylclock and methylCYPHER.My contribution, the pyaging package, represents a significant advancement in addressing these challenges.By adopting a Python-centric approach, pyaging overcomes the limitations inherent in the pyaging: a Python-based compendium of GPU-optimized aging clocks R-dominated landscape of current tools, offering greater flexibility for complex models.The incorporation of a wide array of aging clocks, covering various molecular signatures, reflects the commitment to a comprehensive understanding of aging.
In addition, pyaging integrates advanced modeling techniques and leverages GPU processing for enhanced computational efficiency.Its multi-species capability extends its utility across a range of gerontological studies.Overall, pyaging not only marks a substantial progress in the field of biomarkers of aging but also sets a foundation for further scientific inquiries in this rapidly developing domain.

Figure 1 .
Figure 1.Four simple analyses with pyaging.(a) Heatmap showing Spearman's correlation amongst 38 methylation clocks in AltumAge's dataset.Clocks are grouped by hierarchical clustering.(b-e) UMAP plot of the top five principal components from the scaled data matrix of 38 different clocks for AltumAge's data, highlighting AltumAge (b), GrimAge2 (c), DunedinPACE (d), and MammalianLifespan (e).(f) Line plot of 39 different clocks for the reprogramming timecourse dataset GSE54848; 95% confidence intervals are derived from 1000 bootstraps.(g, h) Performance comparison between GPU-enabled age prediction with pyaging versus CPU-only biolearn using Horvath2013 and DNAmPhenoAge.Ten random samples of size n from AltumAge's data were taken to construct the boxplots.Predictions for the 65 536-sample setting for was not computed for biolearn due to memory issues.

Table 1 .
Overview of aging clocks currently available on pyaging.a Detailed tutorials and use-case examples are available on the documentation website: https://readthedocs.org/projects/ pyaging/builds/22654195/.The code is also available on GitHub: https://github.com/rsinghlab/pyaging.The notebook for the example analyses in this manuscript is available in the Supplementary Data.The packages used are pandas v2.1.3(McKinney et al. 2011), numpy v1.26.2