Cell type deconvolution of methylated cell-free DNA at the resolution of individual reads

Abstract Cell-free DNA (cfDNA) are DNA fragments originating from dying cells that are detectable in bodily fluids, such as the plasma. Accelerated cell death, for example caused by disease, induces an elevated concentration of cfDNA. As a result, determining the cell type origins of cfDNA molecules can provide information about an individual’s health. In this work, we aim to increase the sensitivity of methylation-based cell type deconvolution by adapting an existing method, CelFiE, which uses the methylation beta values of individual CpG sites to estimate cell type proportions. Our new method, CelFEER, instead differentiates cell types by the average methylation values within individual reads. We additionally improved the originally reported performance of CelFiE by using a new approach for finding marker regions that are differentially methylated between cell types. We show that CelFEER estimates cell type proportions with a higher correlation (r = 0.94 ± 0.04) than CelFiE (r = 0.86 ± 0.09) on simulated mixtures of cell types. Moreover, we show that the cell type proportion estimated by CelFEER can differentiate between ALS patients and healthy controls, between pregnant women in their first and third trimester, and between pregnant women with and without gestational diabetes.


INTRODUCTION
As cells die, short DNA fragments are released into the bloodstr eam, which ar e collecti v el y called cell-free DN A (cfDNA).The cfDNA in plasma is mostly composed of DNA molecules originating from blood cells ( 1 ).Howe v er, cells in diseased tissues die more ra pidl y, causing diseased tissues to release cfDNA at a faster rate.The discovery of traces of such disease-deri v ed cell types in cfDNA provides a minimally invasi v e alternati v e for tissue biopsies, and is thus fr equently r eferr ed to as a liquid biopsy ( 2 ).Commonly r esear ched applications of liquid biopsies ar e pr enatal testing, organ transplant monitoring, and tumor discovery and monitoring ( 3 ).In all of these applications, howe v er, we know the cell type of interest in advance.Cell type deconvolution, on the other hand, aims to gi v e the full composition of the cell types of circulating cfDNA.Example use cases in which this type of analysis is especially desirable is finding tumor locations in patients with a cancer of unknown primary ( 4 ) and detecting trea tment side-ef fects.Additionally, characterizing changes in cell type proportions is helpful in understanding disease de v elopment and progression ( 5 ).
One method for characterizing the cell type origins of cfDNA is the analysis of methylation signatures.Methylation occurs when a methyl-group is added to the fifth carbon of cytosines (5mC), often with the purpose of silencing gene transcription ( 6 ).This process ha ppens mostl y in the context of CpG sites, and usually over regions spanning multiple CpG sites ( 2 ).Adjacent CpG sites have been found to correlate highly in methylation status ( 7 ).Because the silencing of gene transcription often happens in a cell type-specific manner, these methyla tion signa tures have been found to re v eal the cell type origins of cfDNA ( 3 ).
T raditionally , cell type deconvolution methods calculate the average methylation of all sequencing reads per CpG site, and use these averages as model input (8)(9)(10).These averages are often referred to as ␤ values.Although these methods usually do take the correlation between nearby CpG sites into account by averaging over the ␤ values in a region, the value at each individual CpG site is still assumed to be independent.
In a similar problem setting, namely tumor fraction estimation, Li et al. devised an approach to better incorporate the correlation between sites ( 11 ).Their method, named CancerDetector, calculates the average methylation per individual sequencing read instead of the average methylation 2 NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 2 per CpG site.They showed that this method outperforms a similar previous method that uses ␤ values ( 9 ), having higher sensitivity and specificity ( 11 ). Figure 1 B illustrates how rare cell types can be more sensiti v ely detected using read averages than using ␤ values ( 11 ).In this figure, the tumor-deri v ed read makes up 25% of the cfDNA, whereas such rare cell types are far less prevalent in biological data.Since it is essential that our method can deconvolve lowly abundant cell types, ␤ values might not be appropriate.
A read-based approach has been adopted in multiple other tumor fraction estimation methods, such as DISMIR ( 12 ) and EpiClass ( 13 ).Even though the effecti v eness of this approach has been shown for tumor fraction estimation, it has not yet been used in the related task of cell type deconvolution.The only study that has performed cell type deconvolution by assigning each individual read to the most probable cell type did not use read averages nor compared to methods using beta values ( 14 ).
We hypothesize that read averages can increase the sensitivity of methylation-based cell type deconvolution methods.In order to evaluate the effects of using read averages without being affected by other model decisions, we decided to adapt an existing method, CelFiE (CELl Free DNA Estimation via expectation-maximization) by Caggiano et al. ( 8 ).CelFiE has the advantages that it is able to estimate missing cell types and that it can estimate cell type proportions of cfDNA with a low read covera ge.Ca ggiano et al. demonstrated possible clinical applications of CelFiE by showing its ability to dif ferentia te between pregnant and non-pregnant women by their proportion of placenta deri v ed cfDNA, as well as between ALS patients and healthy individuals by their proportion of skeletal muscle cfDNA.In their work, Caggiano et al. used whole genome bisulfite sequencing (WGBS) of r efer ence cell type DNA and input cfDNA.Since WGBS data covers the entire genome, it has the benefit that it can be used for cell type-specific biomark er disco very by comparing the methylation in all CpG sites ( 15 ).
Next to the use of read averages, we find that the selection of appropriate cell type markers is of crucial importance for the model performance.Using the entire genome as model input is not onl y computationall y infeasible, but it will also likely have a negati v e impact on performance when CpG sites that are not informati v e of the cell type origin are included.By redefining the cell type informati v e mar kers, we improved CelFiE and were able to achie v e better results than those reported in the original publication.The new set of markers is found using regions of 500 bp instead of single CpG sites, and only includes h ypometh ylated markers.
In this r esear ch, we adapted CelFiE to work at the resolution of single reads by changing the input to the average methylation value of single reads and by changing the underlying distributional assumptions accordingly.The complete workflow of the resulting method, named CelFEER (CELl Free DNA Estimation via Expectationmaximization on a Read resolution), is depicted in Figure 1 A. We compared CelFEER to CelFiE on generated data and on simulated cell type mixtures composed of real WGBS data.We further applied CelFEER on the cfDNA of four ALS patients and four controls, and found that CelFEER detects a significant difference in the propor-tion of skeletal m uscle cfDN A. In addition, we compared CelFEER's decomposition of the cfDNA of eight pregnant women in their first trimester to that of eight pregnant women in their third trimester, and found a significant increase in placenta cfDNA.Comparing the cfDNA of eight healthy pregnant women to the cfDNA of se v en pregnant women with gestational diabetes, we found that the latter group has a significantly higher proportion of placenta.Taken together, our experiments indicate that read averages can indeed more sensiti v ely detect rare cell types.The source code for CelFEER is available at https://github.com/pi-zza/CelFEER

CelFiE ov ervie w
As CelFEER is an adaptation of CelFiE, understanding this original method is essential for understanding CelFEER.CelFiE uses an expecta tion-maximiza tion (EM) algorithm to learn the parameters of a Bayesian model of the cell type proportions of cfDNA mixtures.It does this by learning these proportions sim ultaneousl y with the avera ge methylation percenta ge of each cell type at each CpG site.The methylation percentages correspond to the fraction of reads that are methyla ted a t a specific CpG site, and are initialized by transforming the r efer ence data into fractions.The methylation percentages are estimated because the reference cell type data is assumed to be imperfect and incomplete.Ther efor e, CelFiE aims to learn the true methylation percentages from both the cfDNA input and the reference cell type data.The r efer ence data consists of the methylation counts of T cell types, indexed by t , at M CpG sites, index ed by m .Mor e pr ecisely, it tak es the form of tw o T × M matrices: Y and D Y , where Y tm and D Y tm are the number of methylated and total r eads, r especti v ely, at CpG site m in r efer ence cell type t .The r efer ence data ar e assumed to be drawn from a binomial distribution for each CpG site, where the number of trials equals the r efer ence r ead depth and the probabilities the true methylation percentage in the cell type of origin.
CelFiE learns the cell type proportions of multiple individuals sim ultaneousl y, allowing the method to infer information from the methylation values of other individuals.The cfDNA data from N indi viduals inde xed by n is gi v en in two N × M matrices, X and D X .X nm and D X nm are the number of methylated and total r eads, r especti v ely, at CpG site m for individual n .An example of how these matrices are formatted is gi v en in Figure 1 C.Each x nmc refers to the methylation value of a specific read c in individual n and can thus be either 0 or 1.These methylation values are assumed to be drawn from a Bernoulli distribution governed by the methylation percentage in the cell type of origin.D X nm consists of the sum of all x nmc (across reads) while X nm is the sum of all x nmc (across reads) that are equal to 1.
CelFiE estimates two parameters: ␣ and ␤, where ␣ is the final output of the model.␣ nt is the fraction of cfDNA in person n tha t origina ted from cell type t , and ␤ tm is the true unknown methylation percentage of cell type t at position m .
CelFiE models the input cfDNA as a mixture of different cell types.Whether this input originates from a gi v en (or 4 NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 2 unknown) cell type is modeled using a latent variable z , where z tmc = 1 when x mc originates from cell type t , and 0 otherwise.The objecti v e is thus to describe the joint distribution P ( X , z , Y | ␣, ␤).For the complete ma thema tical description of the model and its underlying assumptions, we refer to the supplementary and the original publication ( 8 ).
The model iterati v ely relates the input to probable cell types in the expectation step, and calculates the parameters that make the input and r efer ence data most likely in the maximization step.More mathematically, in the expectation step the posterior distribution ˜ p of z gi v en the input data x and parameters ␣ and ␤ is calculated.These parameters are then updated by the ␣ and ␤ values that maximize the expectation of the joint likelihood under the calculated posterior.
Note that this formulation allows CelFie to estimate the proportion of an unknown cell type by adding an additional cell type in the r efer ence set whose r efer ence methylation values are all initially set to 0. Based on the input cfDNA data, the model can then estimate the proportion of the added cell type, and use this proportion information to update the corresponding reference methylation values during the maximization step.

Read-based approach
CelFEER uses essentially the same model as CelFiE but with read averages as input.This changes the underlying distributions of the model, while the overall structure of the algorithm remains the same.The algorithm is visualized in Figure 1 D. In CelFEER, the single counts per CpG site ar e r eplaced by fiv e counts per 500 bp region.Each count ˆ x nmi ∈ ˆ X nm for individual n mapping to region m equals the number of reads with a discretized read average i , where i ∈ { 0, 0.25, 0.5, 0.75, 1 } .A read average is calculated by dividing the number of methylated CpG sites by the total number of CpG sites on a r ead, wher e only r eads with thr ee or more CpG sites are used.This heuristic is adopted from previous methods ( 11 , 12 ).The read average is then rounded to the closest value i .E.g. a read c from individual n mapping to region m with one out of three CpG sites methylated (and ther efor e a r ead average of 1 / 3) would be r epr esented as ˆ x nmc = { 0 , 1 , 0 , 0 , 0 } .Hence each read is effecti v ely onehot encoded.Summing all one-hot encoded reads that fall into the same 500 bp r egion r esults in the fiv e counts which are used as input to the model.This process is depicted in Figure 1 C.
The r efer ence data has the same composition as the input data, but instead of a set of counts per individual per site, the r efer ence data contains a set of counts per cell type per site.Since the r efer ence data has a different format in CelFEER compared to CelFiE, the ␤ values take on a different form as well.ˆ β tmi is now the proportion of reads originating from cell type t and mapping to region m that have a read average i .Adapting CelFie's unknown cell type estima tion stra tegy to read averages, the methyla tion values for an unknown cell type are set to { 0, 0, 0, 0, 0 } in the r efer ence da ta tha t is passed to CelFEER.This way, the estima ted methyla tion percentages for an unknown cell type are initialized to { 0.2, 0.2, 0.2, 0.2, 0.2 } .
As in CelFiE, the model aims to describe the joint distribution of the input ˆ X , the r efer ence ˆ Y , and the latent variable z, which are all assumed to be independent.In order to describe the full data likelihood, we first split it into three parts: , describes the likelihood of observing the read av erages gi v en that we know what cell type each read originates from and how the read averages of each cell type are distributed across the 500 bp windows.In this likelihood we look at each read c individually, and not yet at the total counts of all reads in a region.The probability for a read c at region m to have the value ˆ x mc can be described as a categorical distribution where each category corresponds to a possible read average and ˆ β tmi is the probability of originating from cell type t and belonging to category i .This holds for e v ery indi vidual n : The different cell types , individuals , reads and regions are all assumed to be independent.The log-likelihood of the first part can hence be calculated as follows: The second part of the full likelihood describes how likely it is that a read c originates from each cell type t .The probability of observing a specific cell type in the cfDNA is governed by the cell type proportions.This probability can be described using a Bernoulli distribution: which makes the second part of the log-likelihood: The final term is the only term that does not depend on the latent variables z .The r efer ence data is assumed to be m ultinomiall y sampled with probabilities ˆ β tmi and a number of trials equal to the r efer ence r ead depth, which can be obtained by summing over all read average counts: which makes the third part of the full data likelihood equal to: Because of the presence of the latent variables z , there is no closed form solution for maximizing the log-likelihood ( 16 ) .Instead, we maximize the expected value of the loglikelihood under the posterior distribution of these latent variables using the EM algorithm.The posterior distribution of the latent variable z ntmc is calculated by a ppl ying the NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 2 5 Bayes rule as follows: where the distribution of P ( ˆ x nmc | ˆ β) follows from the fact that each read originates from only one cell type t , thus summing over all cell types gives the full data distribution of the reads.
Since the read averages are one-hot encoded, there will be fiv e possib le values for the posterior ˜ p ntmc .Following from this fact, we can remove the read index c and can start looking at the total sum of reads that have either of the five possib le read av erages.For each read c where ˆ x nmci = 1 , ˜ p ntmc will be equal to: For the expectation step in the EM algorithm, we need to define the expectation of the latent variable z over the full data likelihood at iteration j .Let ␣ ( j ) and ␤ ( j ) equal the parameters estimated at iteration j , and p ( j ) : = p ( ␣ ( j ) , ␤ ( j ) ).The expectation, also called the Q function, is deri v ed in the supplementary and is defined as follows: Finally ␣ and ˆ β are updated by maximizing Q j ( α, ˆ β) , resulting in the following update equations.For the full derivation, see the supplementary.
Each run of CelFEER performs the optimization 10 times independently, because EM is not guaranteed to converge to a global optimum.The log-likelihood is compared for each restart and CelFEER returns the output from the restart with the highest log-likelihood.In all simulations, we run CelFEER 50 times to capture the variance of the model output.

Marker selection
The markers define which CpG sites will be used as input to the model.The methylation values of CpG sites at marker locations should be consistently different for different cell types, such that the methylation values at these sites can be used to distinguish between cell types.The markers are found using an adaptation of the method used by Caggiano et al.The complete process of adapting the markers is described in the supplementary.The original method by Caggiano et al. ( 8 ) works as follows: All CpG sites are compared by measuring the distance between the methylation percentage of one cell type to the median methylation percentage of all cell types.The 100 markers with the largest distance are then selected as markers.Consequently, the total amount of markers found equals (maximally) 100 times the number of cell types.The markers have to satisfy three requirements in the original method; the first is that a marker is only allowed to be a marker of one cell type.If the same CpG site is in the top 100 of two or more cell types, that site is not used as a marker.The second requirement is that each cell type should have at least one read at a marker location.The last r equir ement enfor ces that the median read depth of all cell types at a marker position equals at least 15.
This last r equir ement, howe v er, still allows the cell type for which the CpG site is a marker to have a read depth less than 15, as long as the median read depth of all cell types is sufficient.A CpG site could be a marker for a cell type as long as it is covered by at least one read in that cell type.To remove the possibility of getting this type of marker, we introduced an extra check to ensure this cell type has a read depth at least as large as the median read depth threshold.Besides, we included one more requirement to ensure marker uniqueness.Instead of comparing only the top 100 markers of each cell type, we compared the top 150 markers of each cell type.After this comparison, again only the top 100 markers are used.This extra step prevents the situation where a marker is in the top 101 of one cell type and in the top 99 of another, which could lead to the inclusion of less differential markers.
The original method should, in theory, be able to find both hypo-and hypermethylated markers.In practice, it finds almost solely h ypometh ylated markers.Comparing each cell type's methylation percentage to the median methylation percentage can mak e mark ers less distinct, as is shown in Supplementary Figure S1c.Therefore, we adapted Caggiano et al. 's method ( 8 ) to compare the methylation percentage of each cell type to the minimum methylation percentage of all other cell types (illustrated in Supplementary Figure S1d).We found that h ypometh ylated markers are best a t dif ferentia ting between cell types (Supplementary Supplementary Figure S10c-e, Supplementary Figure S11).
Originally, CelFiE uses as input, and as r efer ence data, the methylation values at the marker CpG sites summed with the methylation values of CpG sites in the ±250 bp surrounding the marker sites.We improved CelFiE by first summing the methylation values at CpG sites into 500 bp windows which are subsequently used to find marker regions (Supplementary Figure S10a-c).Otherwise, markers on r egions ar e found using the exact same approach as markers found on single CpG sites.The difference between finding markers on single CpG sites and on regions is shown in Supplementary Figures S1a and b.As there is no Figure 2. Illustration of the method for determining the distance between two cell types using read averages.First, the average of all read averages is determined for each cell type.These are then compared to find the distance between cell types.r equir ement for the amount of CpG sites in a region and only for the minimum read coverage of a region, the amount of CpG sites per marker can differ.Because summing the CpG sites into 500 bp windows substantially increased the read coverage at potential marker regions, we increased the read depth threshold to 150.To find the value for this threshold, we tried a range of increasing values and compar ed the r esulting markers by their distance between cell types.
Finding the markers using the read average data largely follows the same approach.First, each chromosome is split into 500 bp windows into which the r eads ar e mapped.For each cell type, the r ead averages ar e av eraged ov er all reads that map to the same window.The CelFEER markers are found by comparing these averages.This process is illustrated in Figure 2 .For the read avera ges, we a gain optimized the read depth threshold and observed that the best markers were found using a read depth threshold of 20.The large difference with the read depth threshold for the CelFiE input (after summing in 500 bp windows) can be explained by the large difference in the scale of the input of CelFiE and CelFEER.This difference in scale is due to the fact that all CpG sites on a read contribute to a single value in CelFEER, and to multiple values in CelFiE.
Since the approach to summarize read averages into bins is slightly different from the approach used to bin the CpG count data, we bin the CpG count data in the same manner as the read averages when comparing CelFiE and CelFEER.

Simulations with artificial data
In order to validate if CelFEER works under the model assumptions, simulations with artificial data were set up as follows, such that he input and r efer ence data ar e generated according to the distributions assumed by the model.The simulations use the same parameters as originally used by Caggiano et al. in their artificial simulations.In each random restart, ␣ is randomly initialized by drawing from a uniform distribution and normalizing to ensure the values sum to one.␤ i is initialized by taking Y i j Y j .
We also ran simulations where one or two cell types are considered unknown, i.e. they are not included in the reference.In these cases, we created the true cell type proportions and true cell type methylation as before, but masked the un-known cell type by setting its r efer ence methylation values to 0.

Simulations on WGBS data
To further evaluate the method, we sim ulated cfDN A data by mixing WGBS data of different cell types.The cell type data was obtained from ENCODE ( 17 ) and Blueprint ( 18 ), and is composed of T-cell CD4, monocyte , macrophage , memory B cell, neutrophil, adipose , pancreas , small intestine, stomach and tibial nerve data.The sample identifiers of the used data can be found in Supplementary Table S1.The data is a mixture of paired-end and single-end reads, and consists of the same datasets used by Caggiano et al.For each cell type, one sample was used to compose the r efer ence matrix and one to simulate a cfDNA mixture.Both sex chromosomes were removed, to make the reference matrix applicable to both sexes and to ensure that random methylation due to X chromosome inactivation is not seen as r elevant.Furthermor e, all SNPs in dbSNP ( 19 ) were removed.
To ensure that each dataset contained an equal amount of r eads befor e cr eating a mixtur e, the total r ead coverage of each cell type was normalized by dividing by the total amount of reads of all cell types and m ultipl ying with the average amount of reads.Next, the methylation values of each cell type were multiplied with the desired proportion for that cell type.These proportions were always ensured to add up to one by dividing each cell type's proportion by the sum of all cell types' proportions.In the original publication ( 8 ), WGBS mixtur es wer e cr eated in a similar manner, with the difference that we corrected for differences in read depth among the different cell types before downsampling the r eads.The mixtur es of read averages were created similarly.First, all read counts were normalized such that each cell type occurred in equal quantities before m ultipl ying the input with the desired proportions.For both methods the r efer ence data was not normalized.During parameter convergence, the only equation where the reference data is used is Equation ( 11 ), where it is transformed to a proportion.The absolute counts of the r efer ence data only matter in their proportion to the input data in Equation ( 11 ).It does, howe v er, make sense to not normalize the r efer ence data here since it is logical that r efer ence data with a higher coverage is more reliable and should therefore weigh more in the calculation of ␤.
To test the behavior of CelFEER on data partially corrupted by noise, we performed an additional simulation using se v en different tissues downloaded from ENCODE (Supplementary Table S1, all tissues except skeletal muscle are used).On the test data, we simulated noise by independently flipping the status of each CpG site of each read with probability p , where p takes the values 0 (no noise added), 1  1000 , 1 100 , 1 50 , 1 40 , 1 30 , 1  20 , and 1 10 (10% of CpG sites are flipped).For each value of p , we repeated this 50 times and calculated the mean correlation across these 50 runs.

Comparison to other methods
We benchmarked CelFEER against two other cell type decomposition methods on simula ted da ta: (i) a simple non-negati v e least squares regression, similar to the approach by Moss et al. ( 10 ), and (ii) the method used by Houseman et al. ( 20 ) that achie v ed the lowest tissue decomposition error in the systematic evaluation by Jeong et al. ( 21 ).Houseman's method (or Houseman for short) uses array data, so we converted the methylated and unmethylated counts used for the CelFiE input to beta-values.Since the least squares regression and Houseman essentially both use beta-values , like CelFiE does , we used CelFiE's marker regions as marker regions for all three methods.Houseman was designed to work on single CpG sites, so we used all CpG sites overlapping with these marker regions as input for Houseman.

Simulations using generated data
To test whether CelFEER works as expected, we followed Caggiano et al. in genera ting da ta to simula te cfDNA input and cell type DNA r efer ence data.Using generated data, they showed that CelFiE (i) estimates proportions correlated to the true cell type proportions, (ii) is able to detect small differences between two groups of individuals and (iii) is able to estimate the proportions of unknown cell types (i.e.cell types that ar e pr esent in the input data, but not in the r efer ence).
The results of these simulations are not an accurate reflection of the model performance, as the simulations for neither CelFiE nor CelFEER model any correlation between sites.As a result, the input of adjacent sites is not summed together as is done for WGBS data, e v en though Caggiano et al. have shown that the original method does not return sensible results on WGBS data without summing adjacent sites.The simulations do serve as a way of investigating whether CelFEER has the same properties as CelFiE.

CelFEER estimates of g ener ated data corr elate to true pr opor tions .
As a first evaluation of the read-based method, the performance of CelFEER is compared to the performance of CelFiE on genera ted da ta.The simula tions follow the approach of ( 8 ), meaning that 50 replicates were run, each with 25 cell types, 6000 CpG sites and 1 individual.The read depth at each CpG site was drawn from a Poisson distribution centred around 10.
To compare the performance, we measured the Pearson's correlation between the estimated and true cell type proportions.CelFEER performed slightly worse than CelFiE, with a mean Pearson's correlation r = 0.84 ± 0.05 compared to r = 0.87 ± 0.07 for CelFiE (Figure 3 ).The result of CelFiE found by us is, howe v er, not as good as the r esult r eported in ( 8 ), where equivalent simulations result in r = 0.96 ± 0.01.

CelFEER and CelFiE do not detect a significant difference betw een tw o gr oups.
Ev en in indi viduals with cfDNA originating from aberrant cell types, most of the cfDNA is usually deri v ed from hematopoietic origins ( 10 ).In other words, the actual amount of cfDNA from an aberrant cell type can be very small.Ther efor e, it is important to be able to dif ferentia te between a group tha t does not have this cell type and a group that has only a very small amount of it.To this end, we simulated a cell type that appeared in 1% (a proportion of 0.01) of the cfDNA of fiv e indi viduals (group A) and 0% of the cfDNA of fiv e other indi viduals (group B).Ten cell types were used in total on an input of 1000 CpG sites.The remaining nine cell types had a proportion drawn from a uniform distribution between 0.5 and 1, which were then normalized such that all proportions summed to one.
Supplementary Figure S2 shows the estimated proportion of the rare cell type for both groups, using both CelFiE and CelFEER.Averaged over 50 replicates, CelFiE estimated a proportion of 0.03 ± 0.01 in group A and 0.025 ± 0.007 in group B, while CelFEER estimated proportions of 0.031 ± 0.01 and 0.026 ± 0.008 for the two groups respecti v ely.A two-sample t-test done for each individual showed no significant difference between the average proportions estimated by both methods in neither groups ( P > 0.1 for all indi viduals).Moreov er, the proportion of the rare cell type is highly overestimated in both groups.
CelFEER estimates pr opor tions of unkno wn cell types.One of the advantages of CelFiE ov er pre vious deconvolution methods is its ability to infer cell type information from the methyla tion sta tes of other individuals.This way it can estimate the cell type proportions of cell types that are not present in the reference data.As in the original paper, we genera ted simula ted cfDNA for 1000 CpG sites, 10 cell 8 NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 2 types and 10 individuals at a read depth of 10.In the reference data, we set the methylation states of the last cell type to 0 at each CpG site.The true proportion of this unknown cell type was drawn from a normal distribution centred around 0.2 with a standard deviation of 0.1, and clipped if smaller than 0 or larger than 1.All other cell type proportions were drawn from a uniform distribution between 0 and 1, and together with the unknown cell type the proportions were made to sum to 1.This was done for each individual separately.
We measured the root mean squared error (RMSE) of the estimated proportion of the missing cell type.Averaged over all individuals, CelFEER resulted in an RMSE of 0.0009, and CelFiE in an RMSE of 0.0010.This shows that CelFEER is also capable of estimating proportions of unknown cell types in simulated data.

Results of simulations using WGBS data
Since there are no ground truth cell type proportions available for real cfDN A data, we sim ulated mixtures of cfDNA by mixing WGBS data of different cell types at known proportions.To this end, we used the same data used by Caggiano et al.Howe v er, we were limited to using se v en different cell types because of the availability of read data at the time of testing.

Comparison between CelFiE and CelFEER.
To compare the performance of CelFEER to the performance of CelFiE, we again simulated cfDNA mixtures by artificially mixing WGBS data of different cell types.Again we followed the same approach as CelFiE to create the true cell type proportions for 100 individuals.The marker regions of both models were found using their reference data and were ther efor e differ ent for the two models, since one set of regions was found by comparing CpG site averages (CelFiE) and the other by comparing read averages of different cell types (CelFEER).
Supplementary Figure S3 shows the results of 50 replicate runs for a randomly selected individual.Without unknown cell types in the r efer ence data, CelFEER r esults in a corr elation of r = 0.94 ± 0.04 while CelFiE results in a correlation of r = 0.86 ± 0.09, which is higher than the performance reported in the original CelFiE publication ( 8 ).This increase in performance arises from our improved method for finding marker regions (Supplementary Figure S6).In short, we only use h ypometh ylated regions and differentiate between regions by using the total methylation over a 500bp window (Supplementary Section S4).Next to CelFiE, the deconvolution method of Houseman et al. ( 20 ) also benefited from using our improved set of markers (Supplementary Table S6).
We find that the difference in correlation between CelFEER and CelFiE is significant ( t -test, t (9998) = 58.11,P < 0.001).To examine whether this would go at the expense of runtime, we measured the time it takes each method to run one replicate.On our system, CelFEER requires ∼1.1 times the time needed by CelFiE.
Since one of the assets of CelFiE is its ability to infer the proportions of unknown cell types, we expected CelFEER to outperform CelFiE on this aspect as well.Similar to the  1 and Supplementary Figure S3).When small intestine cells are masked as well, the correlation between the estimated and true cell type proportions decreases e v en more.
In addition to comparing the estimated cell type proportions and their correlation to the true proportions, we investiga ted the estima ted cell type methyla tion values.We measur ed the corr ela tion between the estima ted cell type methylation percentages and the methylation percentages obtained by normalizing the methylation values of the reference data to sum to one.It is remar kab le how this correlation is consistently higher for CelFiE (Table 1 ).This implies that the methylation percentages estimated by CelFiE di v erge only very little from the r efer ence methylation.This probably means that CelFEER takes the input of other individuals more into account when estimating the methylation values, and ther efor e indir ectl y w hen estimating the cell type proportions.
Another advantage of CelFiE over previous methods is that it works with low coverage input data.A higher read coverage means higher sequencing costs, and it is ther efor e desirable that CelFEER performs sufficiently on low coverage data as well.To test this, we normalized the read coverage of each cell type to equal the total amount of input regions multiplied with a constant, n , before mixing the cell types.This way, each cell type covered each region with n reads on average.For each n ∈ { 2, 5, 10, 50 } the average correlation over 50 replicates and 100 individuals was measured.The cell type proportions were generated in the same manner as before, and no unknowns were estimated.The relation between the correlation and the coverage is shown in Figure 4 .We can conclude that for a stable performance, the coverage should be 10 or higher.Interestingly, the correlation between the estimated and true cell type proportions increases a little for CelFiE when n = 5.It is possible that lowering the coverage acts as a noise reduction on the CelFiE input.Even on the lowest coverage, CelFEER outperforms CelFiE, showing that CelFEER is a suitable method for low coverage data.
CelFEER is robust to noisy input.Since WGBS of cfDNA is inherently noisy, we evaluated the behavior of CelFEER on noisy data by creating sim ulated cfDN A cell type mixtur es wher e the methyla tion sta tus of CpG sites in reads was randomly flipped with probability p .We used CelFEER to estimate cell type proportions of these mixtures and calculated the correlation between the true and estimated proportions.The results, shown in Supplementary Table S2, show that CelFEER can still accurately recover cell type proportions for values of p below 1 / 30.Its robustness makes CelFEER suitable for noisy data, such as data from cfDNA profiling.
CelFEER outperforms other beta-value-based methods on simulated data.After showing that the read-average-based method CelFEER outperforms its beta-value-based alternati v e CelFiE, we wer e inter ested in comparing CelFEER to other beta-value-based methods.To this end, we compared CelFEER to Houseman ( 20 ) and to a non-negati v e least squares regression implemented by Caggiano et al. ( 8 ) to resemble the approach by Moss et al. ( 10 ).Using the same conditions as for comparing CelFEER to CelFiE (i.e. the same se v en tissues and the same randomly drawn tissue proportions, Table 1 ), we find that CelFEER outperformed all methods in terms of correlation between the true and predicted proportions.Specifically, CelFEER achie v ed a correlation of 0.94 ± 0.04, Houseman 0.69 ± 0.19 and the least squares method 0.73 ± 0.16 (Supplementary Table S5).Supplementary Figure S9 depicts the proportions for a randomly chosen individual, also showing deteriorated performances of the least squares regression and Houseman approaches.
Mar k ers f ound on r ead aver ag es ar e differ ent fr om mar k ers found on count input.Finally we were interested in whether the markers found using read averages differ from the markers found using CpG site averages.We hypothesised that CelFEER works better with markers found on the read averages of the r efer ence data, on the grounds that CelFEER dif ferentia tes cell types by their read averages.Additionally, as reasoned in the introduction, read averages should be more sensiti v e to dif ferences in methyla tion sta tus between cell types.We again performed the same experiments, using a simulated mixture of se v en different cell types.
We firstly checked the overlap in markers found using both methods.Of all 700 detected markers found by each method, 130 markers were found by both.Each of the seven cell types has markers that are found by both methods.Ther e ar e no r egions that ar e a marker for one cell type in one method and a marker for another cell type in the other method.
Using the markers found by CelFiE, CelFEER performed similarly with a correlation of r = 0.94 ± 0.04 (Supplementary Figure S4).The correlation between the cell type proportions estimated by CelFiE using CelFEER's markers is r = 0.69 ± 0.21, indica ting tha t the markers found by CelFEER are not suitable for the input of CelFiE.Av eraged ov er all cell types, the difference in methylation percentage between cell types at CelFiE's marker locations is 0.65 for both the r efer ence and input data, where the reference data showed slightly less variation with a standard deviation of 0.19 compared to 0.20 for the input data.For CelFEER, this difference is 0.66 ± 0.20 for the input and 0.64 ± 0.22 for the r efer ence.Supplementary Figur e S5 does show that for some cell types the variation in the distance from the median is substantially larger for the CelFEER markers.In addition, restricting the marker selection to only CpG islands or methylation haplotype blocks ( 22 ) lowered the performance (Supplementary Material S4, Supplementary Tables S3 and S4)

CelFEER is applicable on real datasets
Caggiano et al. showed that CelFiE is able to differentiate between Amyotrophic Lateral Sclerosis (ALS) patients and a control group by the estimated proportion of skeletal muscle deri v ed cfDNA.Although it is interesting to see if CelFEER is also able to distinguish between the ALS and the control group, it is hard to evaluate the method based on its cell type proportion estimates since there are no ground truth cell type proportions availab le.Moreov er, while Caggiano et al. used 28 case and 25 control samples, we only used four case and four control samples.The reference data consists of all 19 cell types gi v en in Supplementary Table S1.
We firstly fully decomposed the cfDNA, thus estimating the proportions of each of the 19 cell types present in the r efer ence.The fiv e cell types with the highest proportions estimated by CelFEER were, in both groups, the following: neutr ophil, monocyte, erythr oblast, spleen and eosinophil.CelFiE estimated similar proportions, but instead of spleen it estimated adipose to be the fourth highest in proportion.In their own work ( 8 ), howe v er, neither spleen nor adipose, b ut macropha ge cells are in this top fiv e.Still, these results mostly correspond to the findings of Moss et al. ( 10 ).The full decomposition can be seen in Supplementary Figure S7.
Ne xt, we specifically e xamined the skeletal muscle cell proportions in both groups.CelFiE estimated an average proportion of 5.5e −3 ± 3.1e −3 in the ALS case group, and 1.5e −3 ± 1.1e −3 in the control group.A two-sample t -test did not indicate a significant difference between the two groups (t(6) = 2.09, P = 0.08).CelFEER, on the contrary, did find a significant difference, with an average proportion of 1.2e −3 ± 5.4e −4 for the ALS case group and 7.7e −5 ± 1e −4 for the contr ol gr oup (  P = 0.01).Clearly, CelFEER is able to detect small fractions of rare cell types in cfDNA.
In addition, we applied CelFEER on a dataset of pregnant women along different stages of pregnancy, and with or without gestational diabetes (GD) ( 14 ).We used the same r efer ence data as for the ALS dataset, but with the inclusion of placenta tissue DNA (as listed in Supplementary Table S1).We then decomposed the cfDNA of eight pregnant women in their first trimester, and eight pregnant women in their third trimester.The full cell type decomposition can be seen in Supplementary Figure S8.We found a significant difference between the proportions of placenta cfDNA (Figure 5 B) between the first and the third trimester ( t -test, t (14) = 4.7, P = 0.0003).
Next, we looked a t the dif ference in placenta cfDNA between healthy individuals and individuals with GD.Del Vecchio et al. ( 14 ) showed that cfDNA in individuals with GD contains a higher proportion of placenta compared to healthy individuals.When comparing the cfDNA of eight healthy individuals in the first trimester to se v en indi viduals with GD in the first trimester (Figure 5 C), we find that women with GD have a significantly higher proportion of placenta cfDNA ( t -test, t (13) = 3.4,P = 0.004).Our findings agree with the decomposition of del Vecchio et al. ( 14 ).In addition, in accordance with the results of ( 14), we found that the single obese pregnant individual in our dataset had a low proportion of placenta cfDNA (0.2%).
Finally, we measured the computational needs of CelFEER on this real-life cfDNA dataset of pregnant women.For estimating the proportions of 20 cell types in 16 individuals, CelFEER runs on a single-core machine using 160MB of RAM in less than 5 min.

Figure 1 .
Figure 1. ( A ) Workflow of cell type deconvolution with CelFEER.Sequenced and aligned cfDNA fragments are intersected with cell type marker regions in the genome that are found using r efer ence cell type data.The r efer ence cell data and the cfDNA input data are used as model input for CelFEER, w hich subsequentl y outputs the estimated cell type proportions in the cfDNA.( B ) Toy example illustrating how a tumor-deri v ed read (in orange) can be distinguished from other reads more easily by comparing read averages ( r ) instead of CpG site averages ( ␤).Values in red are differential between the cancer and nor mal sample.( C ) For matting of the input for CelFiE (bottom left) and CelFEER (top right).On the top left, three partially methylated reads aligning to a 500 bp marker are depicted.For CelFiE, the input is gi v en in two numbers, one equalling the sum of methylated reads at each CpG site and the other equalling the sum of the total amount of reads at each CpG site.For CelFEER, the read averages ( r ) are first rounded to the closest value in { 0, 0.25, 0.5, 0.75, 1 } , then one-hot encoded and summed to obtain the input.The r efer ence data is formatted in the same way.( D ) Underlying mechanism of CelFEER for three individuals and three cell types.On the left side of the figure, the reference and input data are depicted.On the right side, the estimated methylation percentages (top) and estimated cell type proportions (bottom) are depicted.The center part illustrates the posterior distribution of the latent variable z , which indicates what cell type each separate read is deri v ed from.

Figur e 3 .
Figur e 3. Sim ula tions on genera ted da ta for one individual.Each boxplot displays the estimated proportion of a cell type for replicate model runs.The red dots indicate the true cell type proportions for 25 cell types.

Figure 4 .
Figure 4. Relation between the input coverage and the correlation between the estimated and true cell type proportions.The full range of the correlations of 100 individuals and 50 replicates is highlighted.

Figure 5 .
Figure 5. Results of the decomposition of real cfDNA da ta. ( A ) Estima ted proportions of skeletal muscle cfDNA in ALS patients ( n = 4) and a control group ( n = 4).( B ) Estimated proportions of placenta cfDNA in healthy pregnant women ( n = 8) in their first and third trimester.( C ) Estimated proportions of placenta cfDNA in healthy pregnant women ( n = 8) and pregnant women with GD ( n = 7) in the first trimester of their pregnancies.

Table 1 .
Pearson's correlation ( r ) between true and estimated cell type proportions ( ␣ estimates) and between r efer ence methylation and estima ted methyla tion values ( ␤ estima tes) of a simula ted mixture of se v en different cell types