Neural-net-based cell deconvolution from DNA methylation reveals tumor microenvironment associated with cancer prognosis

Abstract DNA methylation is a pivotal epigenetic modification that defines cellular identity. While cell deconvolution utilizing this information is considered useful for clinical practice, current methods for deconvolution are limited in their accuracy and resolution. In this study, we collected DNA methylation data from 945 human samples derived from various tissues and tumor-infiltrating immune cells and trained a neural network model with them. The model, termed MEnet, predicted abundance of cell population together with the detailed immune cell status from bulk DNA methylation data, and showed consistency to those of flow cytometry and histochemistry. MEnet was superior to the existing methods in the accuracy, speed, and detectable cell diversity, and could be applicable for peripheral blood, tumors, cell-free DNA, and formalin-fixed paraffin-embedded sections. Furthermore, by applying MEnet to 72 intrahepatic cholangiocarcinoma samples, we identified immune cell profiles associated with cancer prognosis. We believe that cell deconvolution by MEnet has the potential for use in clinical settings.


Introduction
DNA methylation is recognized as a principal epigenetic modification that dictates cellular identity.When DNA undergoes demethylation, chromatin becomes accessible, leading to the promotion of mRNA transcription of genes in regulatory regions through histone modifications ( 1 ).While gene expression is susceptible to changes in cellular states and environmental conditions, DNA methylation consistently represents the lineage and characteristics of cells ( 2 ,3 ).In recent years, extensive profiling of DNA methylation across various cell types has elucidated their cellular specificity and biological significance ( 3 ).Notably, in immune cells, DNA methylation is used as a robust marker for the suppressive function of regulatory T (Treg) cells ( 4 ).Additionally, DNA methylation has been reported to be useful for identifying the primary tumor sites in cancers of unknown primary origin due to its robustness ( 5 ).
Deconvolution is a computational strategy to infer cell proportion using bulk sample data and reference cell type-specific data.Applying deconvolution to RNA expression or DNA methylation data enables the calculation of cell proportion within tissues.The significance of deconvolution techniques using DNA methylation has been increasingly evident in both clinical and research settings.For example, deconvolution was used in the analysis of tumor-infiltrating immune cells ( 6 ).Additionally, deconvolution of circulating cell-free DNA (cfDNA), which cannot be assessed by RNA-seq, aids in cancer monitoring and in the elucidation of the pathology of diseases such as COVID-19 ( 7 ), myocardial infarction ( 8 ), and neuromyelitis optica ( 9 ).Furthermore, in predicting the efficacy of immune checkpoint inhibitor treatments, profiling tumor-infiltrating immune cells has been shown to be more effective than focusing on the expression of single molecules like PD-L1 ( 10 ).These observations suggest that deconvolution is an attractive approach for analyzing clinical samples compared to flow cytometry and multiplex immunohistochemistry, which are not practical for personalized medicine in a hospital.In addition, DNA would be more suitable for clinical tests compared to RNA due to its stability and ease of handling.Notably, using nanopore sequencers, it is now feasible to directly sequence DNA without bisulfite conver-sion, enabling the profiling of DNA methylation within 3 h in clinics ( 11 ).
Despite the technological advancements, obstacles still remain with cell classification based on DNA methylation, particularly in accurately estimating tumor-infiltrating immune cells.Specifically, existing studies lacked a comprehensive definition for certain subsets of immune cells, including finer distinctions like activated or naïve states within T cells ( 3 , 6 , 12 , 13 ).Moreover, while many existing techniques have been carefully benchmarked, they mainly focused on evaluating peripheral blood mononuclear cell (PBMC) deconvolution in an epigenome-wide association study (EWAS) with assumed cellular frequency corrections (14)(15)(16).For evaluating tumors and cfDNA, it would be crucial to use a large reference including a variety of tissues and immune cells showing different cell states.Methodologically, traditional deconvolution methods often define a signature matrix for each cell type and use linear models or models with limited flexibility, such as SVR ( 3 , 6 , 13 , 17 , 18 ).However, these approaches might not be suitable for inferring multiple cell types due to the low adaptability to large datasets with high variance.
To address these issues, we developed a DNA methylationbased cell deconvolution method using neural-net architecture and named it MEnet (Methylation-based Estimation of cellular NETworks).MEnet is specifically designed to adapt flexibly to complex methylation profiles of large datasets using neural-net.Additionally, we constructed a reference consisting of a variety of cell types and immune cells with different states to overcome the limitations of specified references by integrating data from external datasets such as ENCODE ( 19 ), Roadmap ( 20 ), Blueprint ( 21 ), and the human methylome atlas ( 3 ) and original data.We trained our model using these references to adopt a wide range of cell diversity.
In this study, we also explored the potential application of MEnet to actual diseases by analyzing tumor tissues of intrahepatic cholangiocarcinoma (ICC).ICC is a malignant tumor originating from the bile ducts within the liver.It has a diverse background, with factors ranging from parasites, autoimmune diseases and organic solvents to lifestyle habits, and possesses a tumor microenvironment composed of various cell subtypes ( 22 ,23 ).ICC often has a poor prognosis due to its aggressive nature and resistance to treatment, necessitating the development of new therapeutic strategies and biomarkers ( 22 ).Understanding how the tumor microenvironment relates to the biology and the clinical characteristics of ICC may provide new insights into this intractable cancer.Through the analysis of tumor tissues from 72 ICC patients, we demonstrate that the prognosis of ICC patients, such as life span and recurrence time, can be estimated by the tumor microenvironment inferred by MEnet.

Ethics
Surgically removed tumor tissues and formalin-fixed paraffinembedded (FFPE) samples from patients with kidney, lung, colorectal, ovarian, bladder, gastric cancer, or melanoma were collected.Peripheral blood was obtained from ovarian cancer patients or healthy donors.The study using human samples was reviewed and approved by the Research Ethics Committee of Osaka University (708-10) and Kyoto University (G1023-4) and carried out in accordance with the committee's guidelines and regulations.Written informed consent was obtained from all donors.

DNA extraction from FFPE sections
DNA was extracted from four 10 μm sections using the QI-Aamp DNA FFPE Tissue Kit.During this process, Deparaffinization Solution (QIAGEN) was used for deparaffinization.The obtained DNA was incubated overnight at 65 • C to perform decrosslinking.dard phenol-chloroform extraction method and precipitated with ethanol.
Sorted cells were dissolved in 400 μl of Lysis Buffer (homemade), and Proteinase K (Thermo Fisher Scientific) was added, followed by overnight dissolution at 55 • C. DNA extraction was performed using the phenol-chloroform method, followed by the collection of the aqueous layer from chloroform, and then purified using the Geno Plus Genomic DNA Extraction MiniPrep Kit (VIOGENE).

Tumor tissue
For skin cancer samples, DNA was extracted from fresh tumor slices using the QIAamp DNA Mini Kit (QIAGEN).For ICC samples, frozen specimens preserved in the Department of Surgery of Kyoto University as tumor / non-tumor were used as samples.Genomic DNA was isolated from each frozen tissue using PureLink ® Genomic DNA Kits (Invitrogen).QC (Quality Control) and Whole Genome Bisulfite Sequencing (WGBS) were conducted as described above.cfDNA Serum samples were purified by gradient density centrifugation using Lymphoprep (Axis Shield, Dundee, UK).Serum samples were centrifuged at 20 000 × g for 5 min using a swing rotor to pellet any contaminating cellular debris.The clarified supernatant (1 ml) was then subjected to cfDNA extraction using the QIAamp Circulating Nucleic Acid Kit (QI-AGEN) according to the manufacturer's instructions.

Library preparation and sequence
DNA was fragmented using the S220 Focused-ultrasonicator and ME220 Focused-ultrasonicator (Covaris).The fragmented DNA was then used to prepare libraries with the NEB-Next Enzymatic Methyl-seq Kit (E7120L).The completed libraries were sequenced on the DNBSEQ-G400 (MGI) platform using 100bp paired-end sequencing.
For tumor-infiltrating lymphocytes, DNA processing was outsourced to Takara Bio Inc., where DNA was fragmented using S220 Focused-ultrasonicator and ME220 Focusedultrasonicator (Covaris), followed by library preparation with the NEBNext Enzymatic Methyl-seq kit, and sequencing was performed using the NovaSeq 6000 (Illumina).Some samples of liver tissue and peripheral blood were sequenced using Nanopore, following the protocol described in the following section.

Nanopore sequencing
DNA sequencing was performed using a MinION sequencer (Oxford Nanopore Technologies).The library was prepared using the Rapid Sequencing Kit (SQK-RAD004) following the manufacturer's protocol.R9.4.1 (FLO-MIN106D) flow cells were used for sequencing, which was carried out for 72 h.The raw sequencing data were processed using MinKNOW (version 23.04.3).This procedure was in accordance with the official guidelines provided by Oxford Nanopore Technologies.
For the deconvolution input, to ensure consistency across platforms and to achieve robustness against noise, we performed genome-wide tiling using Bedtools map at 1000 bp intervals.We then calculated the average methylation rate and the total number of methylated and unmethylated CpG reads for downstream analysis.For methylation arrays, we calculated the average methylation rate of the loci contained in each bin and subsequently computed hypothetical methylated and unmethylated read counts to achieve a total read count of 5.

Model comparison
From the projects of Roadmap, ENCODE, AMED-CREST and BLUEPRINT, we collected data from 151 samples spanning 20 categories, ensuring at least five samples per category.We performed genome-wide tiling at 1000 bp intervals and used the average methylation rate of each bin for model comparison.We extracted variably methylated bins for input using the following strategy: For each bin, we conducted a chi-squared test on the total methylated versus unmethylated read counts for each minor group category against all other categories.Referring to the previous study ( 25 ), we sought to identify variable regions.Specifically, to weigh the reliability based on read numbers, we conducted chi-squared tests for each bin.Bins were selected as characteristic regions if they had a false discovery rate within the top 500 and a methylation rate difference of > 60% between the category and all other categories, resulting in 25 486 feature regions.Using the average methylation rate of the extracted regions, we conducted training and testing at the Major group level.For training, we used only samples that included a minimum of three samples, preparing a total of 138 samples.For testing, we used data from cohorts not used in training.Categories with more than 15 samples were limited to 15 samples, resulting in a total of 96 test samples.Missing values were imputed using the median methylation rate of the same bin from the training data.We created simulation data for mixed data by randomly selecting 5 samples from the test samples and summing them, producing 200 simulated samples.

Feature region extractions for MEnet inputs
To ensure the inclusion of regions vital to both Major and Minor Groups, we employed a two-step extraction process.Initially, we calculated the read count for each bin and selected those bins that fell within the 0.5-99.5 percentile range.In the first step, to identify characteristic regions at the Major Group level, we aggregated the methylated and unmethylated read counts for each Major Group category and conducted a chi-squared test between that category and all other categories.In the second step, for each Major Group containing multiple Minor Groups, we conducted tests to pinpoint characteristic regions for each Minor Group.Specifically, we aggregated the methylated and unmethylated read counts for each Minor Group and performed a chi-squared test between that Minor Group and other Minor Groups within the same Major Group.Lastly, bins that exhibited a methylation rate difference of > 40% between the category and all others, and had P -values within the smallest 2000 were selected as feature regions.
For the visualization of the extracted feature regions, we conducted median imputation and then visualized the data using a heatmap.For UMAP-based visualization, after scaling the imputed matrix, we applied PCA and used PC1-PC200 to compute UMAP using umap-learn (v0.5.2).The enrichment analysis of the feature regions was conducted using the Ge-nomicDistributions package in R.

Architecture of MEnet
The main architecture of MEnet is an MLP model built with a Python package PyTorch (v1.6.0).Cross-validation (CV) was conducted using StratifiedShuffleSplit in scikit-learn v0.24.1, where we performed stratified splits for each of the 6-fold based on labels.For handling missing values, we employed median imputation.For hyperparameter tuning, we utilized the CmaEsSampler from Optuna (v2.6.0),selecting from the following search space based on the validation loss calculated by OneHotCrossEntropy: number of layers: [2,10], hidden layers: [30,3000], dropout rate: [0,0.3],activation functions: {relu, tanh, leakyrelu}, optimizers: {Adam, RMSprop, SGD}, and learning rate: [1e-5, 1e-1].We set the maximum number of epochs to 200 000 and implemented EarlyStopping (patience = 1000).For training data augmentation, we randomly selected up to 15 samples, applied random weights, and introduced missing values with a dropout rate of 0.2.These values were then imputed based on the median methylation rate of the training data.The training was conducted using MEnet train on an Intel Xeon Platinum 8180 @ 2.500GHz and NVIDIA Tesla V100 PCIe 32GB.We implemented distributed learning using MySQL.

Subsampling
Sequence data ( Supplementary Table S1 , S2 ) were downloaded from fastq, while fastq files consisting of multiple runs were merged.For subsampling reads, seqkit sample (v2.1.0)was used to extract each proportion of reads.In this case, we used five different seeded reads.For the calculation of methylation rate, trim_galore (RRBS with the -rrbs option), bismark, deduplicate_bismark (WGBS only), and bis-mark_methylation_extractor were used with default parameters, and the reference sequence Gencode v34.For the simulation of mixed samples, we integrated the output files of bismark created by the subsampling step randomly and used the weighted average of sum_len calculated by seqkit stats as the target value.MEnet, ARIC, CIBER SOR T, RPC and CP / QP was used for the prediction on the generated simulation set, and scikit-learn was used to calculate the prediction accuracy.

Calculation of tumor purity and immune cell scores using MEnet
We applied MEnet to the Illumina 450k DNA methylation datasets from various cancer types in The Cancer Genome Atlas (TCGA).For each cancer type, we used the corresponding tissue-specific DNA methylation data to estimate the celltype fractions in the bulk tumor samples.To calculate the tumor purity score, we summed the estimated fractions of all cell types except for blood cells, fibroblasts, adipocytes, endothelial cells, cardiovascular cells, and muscle cells, and then subtracted this sum from 1.This approach assumes that the aforementioned cell types primarily constitute the tumor stroma, while the remaining cell types represent the tumor cells.For the immune cell score, we simply summed the es-timated fractions of all immune cell types present in the reference matrix.We then benchmarked the MEnet-derived tumor purity scores against various established methods, including the gene expression-based ESTIMATE algorithm ( 28 ), the copy number variation-based ABSOLUTE method ( 29 ), immunohistochemistry (IHC), and a consensus approach combining all three methods (consensus purity estimate, CPE) ( 30 ).Similarly, we compared the MEnet-derived immune cell scores with the gene expression-based immune score from the DNA methylation-based leukocyte unmethylation for purity (LUMP) score ( 30 ).

Human tumor dissociation
Tumor samples were minced and enzymatically dissociated with the Tumor Dissociation Kit, human (Miltenyi Biotec), according to the manufacturer's instructions protocol.Cell suspensions were filtered through a 70 μm cell strainer and isolated by Percoll (Sigma-Aldrich) gradient centrifugation.
Flow cytometry analysis was conducted on instruments including the LSR Fortessa (BD Biosciences), NovoCyte 3000 Flow Cytometer, and NovoCyte Quanteon Flow Cytometer (both from Agilent, Santa Clara, CA).Data acquisition and subsequent interpretations were performed utilizing the FlowJo software (BD Biosciences).

Cell quantifications based on histopathological images
The specimens were scanned using a whole-slide imaging scanner (Hamamatsu NanoZoomer 2.0HT; Hamamatsu Photonics K.K., Hamamatsu, Japan) with a 20 × objective lens.Hematoxylin and eosin (HE) stained images captured at a 40 × magnification were subsequently resized to achieve a resolution of 0.25 μm per pixel.Each image was then segmented into 225 equal sections, and cell segmentation was carried out using a self-organizing map (SOM) algorithm.For this purpose, the Seg-SOM algorithm was employed in this study.Lymphocyte counts were then quantified in the segmented images.

Analysis of ICC samples
We investigated 72 patients with ICC who underwent surgery at the Kyoto University Hospital between 2004 and 2016.The characteristics of the participants are summarized in Supplementary Table S3 .We performed cell deconvolution using MEnet.To profile the ICC samples, we decomposed the cell frequency matrix, which was adapted with min-max scaling using NMF.For the computation of NMF, we utilized the Python package scikit-learn (0.24.1).For survival analysis and the calculation and visualization of the concordance index, we utilized the Python package lifelines (0.27.0).For the visualization of Spearman correlation, the R package corrplot was used.

The MEnet algorithm
Firstly, we attempted to select an optimal method for cell deconvolution using DNA methylation.We trained and tested models using DNA methylation datasets from the databases (ENCODE, Roadmap, Blueprint and the human methylome atlas) (Figure 1 A).Differentially methylated regions were selected for each sample and used for the training.In addition to the non-negative least squares, which are widely used in deconvolution using DNA methylation, we prepared models based on common machine learning techniques such as random forest, support vector regression, gradient boosting and neural network (NN) models.NN offers the advantage of making complex non-linear predictions and the flexibility of adding data augmentation.While conventional referencebased deconvolution uses only purified cells as references, real-world datasets often consist of a mixture of multiple cell types, which may not be suitable for the training.To adapt this, the augmentation technique of randomly mixing the training dataset, mixup, is commonly used in deep learning ( 32 ) and is known to outperform existing methods in RNAseq deconvolution ( 33 ).Therefore, we also prepared a model that added mixup to the NN.The NN model recorded the highest accuracy for both purified cells and mixed data (Mean Squared Error 0.0265 in NN + mixup, 0.0263 in NN, 0.0320-0.0557 in other methods for purified cells, and 0.00730 in NN + mixup, 0.0161 in NN, 0.0842-0.0182 in other methods for mixed cells in average), and the NN model was also superior in the computation time for the test dataset ( Supplementary Figure S1 A, B).While the prediction accuracy for purified cells was highest for the NN without mixup, the NN with mixup showed significantly improved accuracy for mixed cells compared to those without mixup.This suggests that adding mixup to NN can provide more accurate estimation via suppressing overfitting, a pronounced problem of the NN model.On the other hand, the training computation time was the longest for the NN + mixup model.When comparing the architecture of NN + mixup with existing deconvolution methods, NN + mixup showed better accuracy than ARIC ( 17 ), CP / QP ( 27 ), CIBER SOR T ( 18 ), RPC ( 16 ) for both purified and mixed cells (Figure 1 B).For instance, in identifying cell populations mixed with B cells, CD4 + T cells, CD8 + T cells, and NK cells at similar proportions, MEnet was able to compute the proportions with less error than other methods (Figure 1 C, Supplementary Figure S2 ).In terms of computation time, while NN + mixup required longer training time, the prediction time was over 781 times faster than that of other existing methods ( Supplementary Figure S1 C).This demonstrated that the NN + mixup possesses high performance in estimating cell population frequencies from complex samples.

Development of MEnet
Based on the optimal method for cell deconvolution determined above, we sought to develop a new deconvolution framework for analyzing tumor microenvironment in cancer patients.NN models have many hyperparameters, such as the number of layers and the number of nodes.Applying crossvalidation and Bayesian optimization to determine the parameters, we developed a cell deconvolution tool based on the DNA methylation by employing NN + mixup, and named it MEnet (Figure 2 A).

Expansion of training data
To achieve effective deconvolution, we expanded the amount and the diversity of the training data by adding original data (samples derived from PBMCs and surgically removed tumors) and curated data ( 4 , 12 , 34-64 ) to the dataset consisting of ENCODE, Roadmap, Blueprint, and the human methylome atlas ( Supplementary Tables S1 and S2 ).Importantly, from the surgically removed tumors, we isolated tumor-infiltrating immune cells with different states, such as exhausted CD8 + T, activated CD8 + T, naive Treg and effector Treg cells and added their whole genome bisulfite sequencing (WGBS) data to the dataset to improve performance in profiling the tumor immune microenvironment.The total number of genome-wide DNA methylation data collected reached 945 (Figure 2 B).These samples are categorized into two layers: Major group and Minor group, with 29 and 39 categories, respectively.

Training and optimization of MEnet
Using this reference dataset, we trained MEnet for the Minor group through 6-fold cross-validation with Bayesian Optimization ( Supplementary Figure S4 A-C).After > 3000 trials, including the epochs where early stopping occurred, we determined an optimal architecture with two hidden layers and a hidden dimension of 1030, along with the model weights.MEnet attained an improved computational speed by GPU in addition to parallel computation through distributed learning, and could efficiently train on large datasets using cloud computing.

Evaluation of sequencing depth requirements
Next, we evaluated the sequencing depth required for MEnet to predict cell proportions.We assessed accuracy using eight  distinct datasets of purified cell data from WGBS or RRBS that were not used for training, as well as mixed data from these datasets (Figure 3 A-D).For the purified data, the root means square error (RMSE) of RRBS saturated at approximately 2 × 10 8 bp, while the RMSE of WGBS did at 2 × 10 9 bp.For the mixed data, the RMSE of RRBS saturated at about 1 × 10 8 bp, and the RMSE of WGBS at 7 × 10 9 bp.We found that RRBS required about ten times less data than WGBS.This is likely because RRBS targets a limited number of methylation sites.In addition, satisfactory accuracy could be achieved with less than 3x genome coverage, even for WGBS data.In this simulation, predictions using RRBS data had inferior accuracy compared to those using WGBS data, especially for the mixed dataset, at both minor and major group resolutions.In addition, MEnet achieved lower RMSE values compared to these other methods, including, ARIC, CIBER SOR T, RPC and CP / QP ( Supplementary Figure S5 A-D).

Benchmarking MEnet
To validate the accuracy of MEnet, we first applied our method to The Cancer Genome Atlas (TCGA) data, where pathological estimates of tumor and immune fractions are available.We compared the tumor and immune cell scores calculated by MEnet with those obtained from different estimation methods, such as ESTIMATE, ABSOLUTE and LUMP, as previously described ( 65 ).The results showed significant positive correlations between MEnet's predictions and the reference values for both tumor and immune cell scores across multiple cancer types (Figure 4 A, B).These findings demonstrate the robustness and reliability of MEnet in estimating tumor purity and immune cell infiltration in real-world cancer datasets.
We further evaluated the performance of MEnet using various real-world datasets (Figure 4 C).A primary concern was the predictive accuracy of MEnet, especially to those related to cellular characteristics such as activation states.Initially, we used 12 types of artificial mixtures consisting of purified cells (over 90% purity by cell sorter) from human PBMCs ( 12 ), to compare MEnet-predicted values with actual values.We found a significant positive correlation between the two values regarding not only broad cell subsets but also finer classifications, including cellular states (Figure 4 D, Supplementary Figure S6 A).Next, using tumor samples (ovary, stomach, skin and colon cancers), we further analyzed the specific states of immune cells, especially the frequencies of activated, naive, or exhausted T cells.By comparing the cell frequency by flow cytometry with those by MEnet calculated from WGBS data, we observed some differences in correlation patterns across different cell subsets.In the lymphoid lineage, a strong correlation was evident, even when considering detailed cell states such as activated CD8 + T cells (Figure 4 E).However, the myeloid lineage showed a weak correlation, potentially due to the decrease in cell numbers caused by cell damage during the cell dissociation process ( Supplementary Figure S6 B).Despite these variations, the overall alignment was consistent when looking at specific classifications, indicating the validity of MEnet-mediated cell subset prediction even in cancer tissues.
To determine the feasibility of using sections after pathological diagnosis, we examined whether DNA methylation data from formalin-fixed paraffin-embedded (FFPE) sections is suitable for MEnet.Using skin cancer samples and performing WGBS on both FFPE and fresh tissues, we observed a strong correlation in cell frequencies predicted by MEnet between them (Figure 4 F).Additionally, to determine the  consistency between MEnet and the standard histopathological method for FFPE sections, we compared the lymphocyte frequencies estimated by MEnet with those determined from HE-stained section images.For the HE-stained sections, we employed Seg-SOM ( 66 ), a computational vision map based on an artificial neural network, to identify and classify cell types.Seg-SOM was trained on a comprehensive dataset of expertly annotated cell types and has been validated for accurate cell type identification in various tissues.The comparison between MEnet and Seg-SOM-based cell type identification in HE-stained sections confirmed the consistency between the two methods (Figure 4 G).
Lastly, we explored the possibility of predicting cell frequency using DNA methylation data from direct DNA sequencing.We sequenced DNA from activated Treg, activated conventional CD4 + T (Tconv), and naive Tconv cells by nanopore sequencers and retrieved DNA methylation data from the sequencing signals.The predicted cell frequencies by MEnet were consistently matched to the original cell types ( Supplementary Figure S6 C).Overall, these results demonstrated the reliability of MEnet and its potential in various clinical applications.

Exploring the heterogeneity of cfDNA
We next evaluated the potential clinical application of MEnet for cfDNA derived from peripheral blood of cancer patients.We performed WGBS on cfDNA from 13 ovarian cancer patients, applied MEnet, and obtained their deconvoluted cell profiles.The cell population corresponding to the origin of cancer (ovary) by MEnet was compared with the tumor fraction inferred from genomic copy number alternation (CNA) ( 31 ).A significant correlation was observed between the tumor fraction percentages determined by CNA and those by MEnet (Figure 5 ).These results suggest the potential of MEnet for a wide range of clinical applications, e.g. for assessing the progression of cancer, selecting individualized treatment methods, and monitoring therapeutic efficacy.

The heterogeneity in intrahepatic cholangiocarcinoma
We next evaluated the scalability and reliability of MEnet in clinical applications using surgically removed samples of intrahepatic cholangiocarcinoma (ICC) (Figure 6 A).First, we compared an ICC tumor tissue with its adjacent normal liver sample in the same patient.By analyzing their DNA methylation data obtained by nanopore direct DNA sequencing, normal liver sites were primarily predicted to consist of liver tissue (Figure 6 B).In contrast, in ICC tumor sites, the proportion of liver was decreased, while the proportions of duct, pancreas, and intestine were increased (Figure 6 B).Since ICC is a cancer originating from the bile duct and not hepatocytes ( 67 ), the increase of these non-liver endoderm tissues is reasonable, and the rise in other digestive tissues might reflect transformations those cancers are undergoing.Interestingly, abundant immune cells were only observed in cancer sites, suggesting the feasibility of MEnet in profiling the tumor microenvironment.
Subsequently, to investigate the correlation between the diversity of ICC and its tumor microenvironment, we collected tumor samples from 72 ICC patients and their medical records spanning a decade.We determined the DNA methylation states of the tumors using WGBS and explored the association between the deconvoluted cellular profiles and clinical outcomes, such as survival and recurrence.Initially, the tumor proportion estimated by CNA and the total proportion of non-tumor cells by MEnet (i.e. the sum of fibroblast, liver, muscle, etc.) showed a significant negative correlation (Figure 6 C, r = −0.61,P = 1.5 × 10 −6 ).We then examined the correlation between clinical information and the predicted cell frequencies ( Supplementary Figure S7 A).The cell frequencies by deconvolution showed only limited correlations with cancer stage, tumor diameter, blood markers for liver function, and tumor markers, such as carcinoembryonic antigen (CEA) and CA19-9.It suggests that the cellular profiles within the tumor are independent of these indicators.
Next, to investigate the correlation between predicted cell profiles and clinical progressions, we decomposed the profiles of the 72 ICC cases and 39 cell types into several patterns (Figure 6 D).Using a non-negative feature matrix (NMF), we decomposed the cell frequency profiles for each patient into ten components (Figure 6 D).This approach not only simplifies the analysis by reducing the number of variables but also effectively overcomes the challenges of multicollinearity, a common issue when multiple cell types exhibit correlated behaviors.Examining the association between these ten components and prognosis, we found that higher NMF4 or NMF7 indicated better outcomes, while higher NMF6 indicated worse outcomes (Figure 6 E).NMF4 had high weights for activated Tconv, exhausted cytotoxic T lymphocyte (CTL), and memory Treg cells, suggesting that abundant immune cells are infiltrating into tumors with high immunogenicity (Figure 6 F).These immunological features may be related to better overall survival.NMF6 was dominated by cardiovascular, hinting at an inverse correlation between vascular invasion and prognosis (Figure 6 F).While the association of NMF with recurrence was less pronounced than with overall survival, it was evident that higher NMF4 led to a lower likelihood of recurrence.Furthermore, when we examined the potential for prognosis prediction based on intratumoral cell profiles using bootstrap simulation, we could predict the overall survival based on the NMF calculated from the cell population.The prediction accuracy surpassed predictions based on the currently used cancer stage classification (Figure 6 G, P = 5.75 × 10 −6 ).The NMF-based prediction was also able to predict the time to recurrence, yet the stage classification was better ( P = 0.00401, Supplementary Figure S8 A, B).Altogether, these results demonstrate the feasibility of predicting cellular profiles by MEnet and its application in clinical settings.

Discussion
In this study, we developed a deconvolution tool based on DNA methylation information, termed MEnet, and applied it to cfDNA from blood and ICC tumor tissues, verifying its clinical significance.Compared to the previous methods, neural networks enhanced robustness through abundant parameters and data augmentation in combination with large datasets, achieving high accuracy of the tool.The utility of MEnet was evaluated not only using commonly utilized peripheral blood samples but also with samples derived from tissues, cfDNAs, FFPE sections and tumors, demonstrating its robustness across diverse scenarios.MEnet showed prominent potential in clinical applications, potentially leading to personalized treatments.In addition, to the best of our knowledge, our model is the only one constructed across various platforms, including nanopore DNA sequencing, WGBS, RRBS and methylation arrays, suggesting its potential for flexible clinical applications.
The complexity of the parameter settings of neural networks for cell deconvolution would be a major obstacle to using them.MEnet addressed this issue by Bayesian optimization for architecture optimization.In addition, with respect to the enormous calculation time and computational resources needed for the training, we have resolved these issues by GPU computing and distributed computing.Meanwhile, with the emergence of data from single-cell WGBS (68)(69)(70)(71) or methylation data inferred from single-cell RNAseq ( 65 ), this model is also designed to be flexible enough to accommodate new data.This flexibility allows for the creation of customized references for specific tissues and cancer types, enhancing its applicability across various research and clinical settings.Considering the difficulty for individual users to prepare and calculate a large volume of datasets, we expect most users to operate with the pre-trained datasets we established.
One potential application of MEnet is the analysis of cfDNA.By applying MEnet to cfDNA, real-time assessment of treatment response and disease progression may be achieved through the detection of tumor-derived cfDNA in the blood ( 72 ).Additionally, with the observed changes in peripheral immune cells associated with immune-related adverse events during immune checkpoint inhibitor treatment ( 73 ,74 ), the ability of MEnet to simultaneously detect immune cells in the blood may be useful for real-time detection of immune-related adverse events.The application of MEnet offers promise for more accurate treatment strategies and enhances the efficacy and safety of cancer immunotherapies.
In this study, we also demonstrated the efficacy of cell profiling in a large dataset of ICC.Historically, ICC has been cat-egorized into two subclasses: inflammation and proliferation subclasses, with the inflammation subclass being shown to have a better prognosis ( 75 ).The high correlation of a favorable outcome of ICC with the extracted feature NMF4, representing expanded tumor-infiltrating lymphocytes, was consistent with this classification.Beyond this, the identification of several tumor microenvironment programs brought us closer to a comprehensive understanding of ICC.Given the absence of tumor markers specific to ICC, specific cell monitoring using DNA methylation might also hold promise.
However, it is important to note that the absolute values calculated by MEnet may deviate for certain subsets, such as naïve Tconv and memory Tconv.Therefore, caution should be exercised when considering the direct application of MEnet for clinical purposes.Despite these discrepancies, the relative abundances within each cell type are well-correlated, suggesting that MEnet remains suitable for comparative analyses within a given cell type.
In summary, cellular deconvolution through DNA methylation using MEnet has the potential to emerge as a new analytical strategy for clinical samples.

Figure 1 .
Figure 1.Model optimization for DNA methylation-based cell deconvolution.( A ) Schematic representation of the deconvolution method utilizing DNA methylation.Cell proportions of the test samples were predicted by a deconvolution algorithm with the DNA methylation signature matrix of cells (Created using Biorender.com).( B ) The panels represent the mean squared error (MSE) and calculation time of the prediction by each method for purified cells or an artificially mixed dataset.Refer to Supplementary Figure S1 for additional details. ( C ) Violin plots represent lymphocyte counts, comparing ground truth with predictions made by neural network (NN) + mixup and other existing tools.(Refer to Supplementary Figure S2 for additional details.)

Figure 2 .
Figure 2. MEnet architecture and cell feature regions for its prediction.( A ) Schematic representation of the training dataset for MEnet (Created using Biorender.com).( B ) Bar plots show the composition of reference samples: methods for DNA methylation analysis (left), major category (middle), and minor category (right).( C ) Heatmap showing mean DNA methylation rate of the feature regions for all reference samples.X-, Y-axes and color indicate samples, regions, and the DNA methylation rate, respectively.The top row indicates the methods for DNA methylation analysis.( D ) UMAP visualization of the reference samples based on DNA methylation rates in the feature regions.The plot displays the key cell populations identified in the dataset, with cell labels based on their cell surface markers or origins.( E ) Bar plot shows the enrichment of the feature regions in the genome.

Figure 3 .
Figure 3. Subsampling analysis for determining the required read depth for the prediction.Root mean squared error (RMSE) of the prediction for subsampled reads from whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) samples not used in model training.( A ) and ( B ) represent the results of subsamples using purified cell data, while ( C ) and ( D ) depict the results of simulated mixture data.(A) and (C) show the results at the Major Group le v el, and (B) and (D) at the Minor Group le v el.Various proportions of sequence reads are sampled with five independent seeds and e v aluated f or accuracy.T he dotted line across the panels represents the a v erage RMSE f or 500 random predictions.Mark ers represent independent samples (A and B).

Figure 4 .
Figure 4. Evaluation of the performance of MEnet.( A ) Systematic validation of the MEnet-derived tumor purity scores in the corresponding Illumina 450kDNA methylation datasets from TCGA (ACC: adrenocortical carcinoma, BLCA: bladder urothelial carcinoma, BRCA: breast invasive carcinoma, CESC: cervical squamous cell carcinoma and endocervical adenocarcinoma, GBM: glioblastoma multiforme, HNSC: head and neck squamous cell carcinoma, KICH: kidney chromophobe, KIRC: kidney renal clear cell carcinoma, KIRP: kidney renal papillary cell carcinoma, LGG: brain lower grade glioma, LIHC: liver hepatocellular carcinoma, LUAD: lung adenocarcinoma, LUSC: lung squamous cell carcinoma, PRAD: prostate adenocarcinoma, READ: rectum adenocarcinoma, SKCM: skin cutaneous melanoma, THCA: thyroid carcinoma).The heatmap depicts the Pearson correlation coefficients (PCC) between the tumor purity estimated using MEnet and the tumor purity estimated with different methods, including ESTIMATE, ABSOLUTE (CNV-based), immunohistochemistry (IHC), and consensus purity estimate (CPE).Asterisks indicate statistical significance le v el ( P value thresholds) as shown in the figure.P v alues w ere deriv ed from a one-tailed correlation test.( B ) T he heatmap depicts the PCC betw een the total immune cell fraction estimated using MEnet and the corresponding total immune cell fraction obtained by other methods (LUMP: the DNA methylation-based leukocyte unmethylation for purity).Asterisks indicate statistical significance le v el ( P value thresholds) as shown in panel B. P values were derived from a one-tailed correlation test.( C ) Schematic view for evaluating the performance of MEnet (Created using Biorender.com).( D ) Scatter plot comparing the proportion of mixed immune cells (y-axis) and the proportion of immune cells predicted by MEnet (x-axis).The prediction was performed against artificially mixed cell DNA of sorted purified peripheral blood mononuclear cells (PBMCs) (GSE167998) (Created using Biorender.com).( E ) Scatter plots comparing immune cell proportion in tumor tissues measured by flow cytometry (y-axis) and those predicted by MEnet (x-axis).n = 11 (Created using Biorender.com).( F ) Scatter plot comparing the proportions of various cell subsets estimated from formalin-fixed paraffin-embedded (FFPE) sample-derived DNA (y-axis) versus those estimated from fresh sample-derived DNA (x-axis).DNA was extracted and analyzed from both FFPE and fresh preparations of the same skin cancer samples (six donors).Each point represents a cell subset from an individual sample.Colors indicate the individuals (Red, Blue, Orange, Purple and Green are melanoma.Brown is squamous cell carcinoma).( G ) Comparison between results by MEnet and cell-type inference from Hematoxylin and eosin (HE) sections.HE staining of melanoma cancer tissue section (left), and the cell classes inferred by Seg-SOM pipeline (center).As demonstrated in the Seg-SOM study, the three epithelial cell types are distinguished based on their nuclear morphology.The right panel shows a scatter plot comparing the proportion of lymphocyte class nuclei estimated by the Seg-SOM pipeline (y-axis) and the proportion of lymphocytes estimated by MEnet (x-axis).n = 8.

Figure 5 .
Figure 5. Cell-free DNA analysis for detecting tumors by MEnet.Scatter plot showing a correlation between ovary ratio predicted by MEnet (x-axis) and tumor ratio inferred by copy number alternation (CNA) (y-axis) in cell-free DNA from o v arian cancer patients ( n = 13).Cell-free DNAs were isolated from the peripheral blood of ovarian cancer patients and subjected to DNA methylation analysis.

Figure 6 .
Figure 6.Cell profiling of 72 ICC samples predicted by MEnet.( A ) An overview of the intrahepatic cholangiocarcinoma (ICC) sample analysis.( B ) Bar plots showing the deconvolution results for the non-tumor sites (upper) and tumor sites (lo w er) in same ICC patients ( n = 3) (Created using Biorender.com).( C ) Scatter plot showing the tumor proportion predicted by CNA (y-axis) and the proportion of non-tumor cell fractions estimated by MEnet (x-axis).Non-tumor cells were calculated by the total values of blood cells, liver, fibroblast, adipocyte, endothelial cells, cardiovascular, and muscle.R = -0.61,P = 1.5 × 10 −6 .( D ) Heatmap showing the cell frequency profile and clinical prognosis of 72 patients.It displays the cell frequency normaliz ed b y cell type (center), cell features decomposed b y NMF (upper), sample features (right), and prognosis (left).P rognosis in v olv es surviv al (status), recurrence (status recurrence), survival duration, and time to recurrence.( E ) Forest plot of the Cox-Hazard analysis for survival duration.The analysis incorporates NMFs and cancer stages (quantified as integers from 1 to 4) as covariates.( F ) NMF4 or NMF6 is associated with good or poor survival duration, respectively.Kaplan-Meier plots for varying covariates of NMF4 (left) and NMF6 (right) are indicated.( G ) Survival duration can be estimated by the cell proportion predicted by MEnet.Shown are distributions of the Concordance-index (C-index) for survival duration prediction by MEnet (MEnet) or cancer stage classification (Stage), calculated by bootstrap.Bootstrap involved performing Cox hazard regression on 62 randomly selected samples within 72 samples and calculating the C -index on the remaining 10 samples, which are repeated 100 times.The distribution of the C -index calculated from MEnet results and cancer stage classification was tested using the Mann-Whitney U -test.