iCpG-Pos: an accurate computational approach for identification of CpG sites using positional features on single-cell whole genome sequence data

Abstract Motivation The investigation of DNA methylation can shed light on the processes underlying human well-being and help determine overall human health. However, insufficient coverage makes it challenging to implement single-stranded DNA methylation sequencing technologies, highlighting the need for an efficient prediction model. Models are required to create an understanding of the underlying biological systems and to project single-cell (methylated) data accurately. Results In this study, we developed positional features for predicting CpG sites. Positional characteristics of the sequence are derived using data from CpG regions and the separation between nearby CpG sites. Multiple optimized classifiers and different ensemble learning approaches are evaluated. The OPTUNA framework is used to optimize the algorithms. The CatBoost algorithm followed by the stacking algorithm outperformed existing DNA methylation identifiers. Availability and implementation The data and methodologies used in this study are openly accessible to the research community. Researchers can access the positional features and algorithms used for predicting CpG site methylation patterns. To achieve superior performance, we employed the CatBoost algorithm followed by the stacking algorithm, which outperformed existing DNA methylation identifiers. The proposed iCpG-Pos approach utilizes only positional features, resulting in a substantial reduction in computational complexity compared to other known approaches for detecting CpG site methylation patterns. In conclusion, our study introduces a novel approach, iCpG-Pos, for predicting CpG site methylation patterns. By focusing on positional features, our model offers both accuracy and efficiency, making it a promising tool for advancing DNA methylation research and its applications in human health and well-being.


Introduction
DNA methylation is a crucial epigenetic marker and performs an important role in a wide range of biological processes, such as chromosomal instability, cell differentiation, X-chromosome inactivation, cancer progression, and gene regulation (Robertson 2005, Suzuki and Bird 2008, Laird 2010, Jones 2012. DNA methylation is related to the functional state of regulatory regions and affects DNA replication and gene transcription. These features are closely connected to various human diseases such as immune diseases, malignant tumors, and Alzheimer's (Gao et al. 2017, Wan et al. 2015, Stieglitz et al. 2017, Yan et al. 2017. Recent research has shown that methylation levels are closely linked to age and life expectancy (Smallwood et al. 2014). In particular, previous research has indicated that DNA methylation levels vary with age (Horvath 2013). Therefore, it is crucial to study DNA methylation.
Until now, well-established approaches have been developed to quantify average levels of DNA methylation in cell populations. Recent advances in technologies have made it possible to profile DNA methylation at single-cell resolution via reduced representation protocols (scRRBS-seq) (Smallwood et al. 2014) or whole genome bisulphite sequencing (scBS-seq) (Clark et al. 2017). These methods have disclosed the dynamics of epigenetic patterns in heterogeneous cell populations and enabled explaining epigenetic networks with unprecedented resolution and scale (Hou et al. 2016).
Since genetic material per cell is limited, analyzing singlecell methylation is inherently limited by moderate CpG coverage. Smallwood et al. (2014) found that the scBS-seq method covers 20%-40% of CpG sites, while the scRBS-seq method only covers 1%-10% of CpG sites (Farlik et al. 2015, Guo et al. 2013, Hou et al. 2016. A decrease in CpG coverage could result in the loss of critical information. Therefore, the first critical step is to identify unknown methylation states to enable whole-genome analyses. Various approaches have been proposed to detect the average DNA methylation profiles in cell populations (Zhang et al. 2015, Stevens et al. 2013, Ernst and Kellis 2015, Liu et al. 2015, Whitaker et al. 2015. However, these methods fail to explain cell-specific variability, and they require predefined features and genomic annotation, which are often limited to a narrow range of cell conditions and types. Therefore, it is crucial to develop more efficient computational approaches to predict DNA methylation and make methylation identifications more reliable (Zhang et al. 2015).
Identifying methylation states in tissue samples is a wellknown challenge in bioinformatics, and one approach is to leverage dependencies between CpG sites. For instance, some studies used variational autoencoders to reduce the dimensionality of methylation data (Levy et al. 2020). Other methods employ machine learning techniques such as random forests (RFs) (Zhang et al. 2015), autoencoders (Qiu et al. 2018), gradient boosting (Zou et al. 2018), linear regression (Di Lena et al. 2019), or mixture models (Yu et al. 2020) to impute individual CpG sites in tissue samples. Some of these approaches also consider intrasample dependencies between adjacent CpG sites and leverage information from multiple tissue samples to identify CpG sites (Zou et al. 2018, Yu et al. 2020. Most recent studies on DNA methylation imputation in single cells have leveraged both intercellular and intracellular correlations between methylation states. One such method is Melissa, which first identifies a genomic region of interest (e.g. a specific promoter region) and then performs generalized linear model regression on that region to identify CpG sites (Kapourani and Sanguinetti 2019). This model effectively clusters cells by leveraging information from other cells through a shared prior distribution determined by the Bayesian mixture model.
The nature of methylation sites must be vectorized so that computational models can identify them. A lot of previous studies (Zhou et al. 2012, Bhasin et al. 2005, Pavlovic et al. 2017 have shown that the sequence of adjacent nucleotides of a methylation site is specific and that the methylation states are closely associated with the sequence composition. Bhasin et al. (2005) proposed the Methylator method which used conventional binary sparse encoding to convert sequences directly into feature vectors. Das et al. (2006) developed a method that extracts sequences with a window size of 800 bp, counts the methylation propensity, and uses principal component analysis with recursive feature removal for feature selection. Recently, Pan et al. (2018) used an k-gram, multivariate reciprocal information (Ding et al. 2016), Discrete Wavelet Transform (Shensa et al. 1992), and Pseudo Amino Acid Composition (Chou 2001) to extract DNA sequence features for training sparse Bayesian learning model. All of these techniques have established that the model performance is directly dependent on the selection of the classifier and the feature extractor.
Zhang's method proposes a RF based model to handle differences in local CpG profiles between different cells (Zhang et al. 2015). Zhang's technique trained the RF classifier using four factors: chromosomal location features, DNA sequence features, cis-regulatory elements, and the levels and lengths between adjacent CpG sites. Further, another model LightCpG was proposed to identify CpG sites (Jiang et al. 2019). The LightCpG model uses novel positional features, structural features and n-gram features to train the model. We are highly inspired by the positional features used in the LightCpG architecture, however, we believe the potential of these positional features is still unexplored. Therefore, in this research, we propose an ensemble learningbased method using positional features to predict the methylation sites. Achieved results have demonstrated that the proposed model outperformed existing techniques and a significant improvement is observed.

Materials and methods
As can be seen in Fig. 1, there are three major milestones in the creation and evaluation of iCpG-Pos. The first step is to collect the dataset which is obtained from the literature (Angermueller et al. 2017). The second step is model construction which is the biggest milestone. The model construction consists of feature extraction, classifier selection, and ensemble learning strategy selection. Selection of the optimized end-to-end model is a critical step towards the development. The last milestone is to evaluate the different models and propose the best model.

Dataset
Similar to LightCpG, we have used two benchmark datasets. The first dataset (GSE65364) consists of 25 human hepatocellular carcinoma cells (HCCs). Though there are 26 HCCs but Ca26 is eliminated since the pattern of transcriptional activity is substantially anomalous, according to the work by Hou et al. (2016). The second dataset contains six human hepatoblastomaderived (HepG2) cells attained from (GSE65364). Both datasets were profiled using single-cell reduced-representation-bisulphitesequencing (scRRBS-seq). Using the liftOver tool (http://www.ge nome.ucsc.edu/cgi-bin/hgLiftOver), each location of a CpG site is aligned to hg19 (Raney et al. 2014). In our study, we looked at these sites as research items that had at least four reads on them. Figure 1 visualizes the amount of data in each cell of HCCs and HepG2 datasets used for training and testing. We applied the same validation strategy across all datasets, drawing inspiration from the LightCpG. Table 1 shows the dataset distribution for training, testing, and validation.

Feature extraction
We have used three feature encoding/extraction techniques to decide the best among them. These techniques are one-hot encoding, n-gram and positional features. Herein, these features are individually used towards the development of iCpG-Pos. These feature encoding techniques are discussed below.

One-hot encoding
The one-hot encoding strategy is the most basic and widely used encoding technique utilized in computational biology (Rehman et al. 2022b,c). In this encoding system, each nucleotide is translated to a numerical value, which is then allocated a specific binary vector that contains all "0" values except the integer's index, which is maintained as "1." The one-hot encoding for four nucleotides in a DNA sequence looks like this:

n-gram features
To better know the importance of the positional features, we included the sequence feature. To get the sequence features we have used n-gram features. N-gram feature extraction entails extracting continuous sequences of n elements from a given sequence, such as words or letters (Ganapathiraju et al. 2002). These sequences, known as n-grams, can give useful information about the sequence data's local context and sequential relationships. The sequence is tokenized into distinct parts, in order to extract n-gram characteristics. The set of ngrams is then formed by generating all feasible n-length sequences. The frequency of each occurrence of an n-gram is frequently estimated to illustrate its value or relevance within the sequence. Every n-gram feature is represented by a pair of values ðo i ; t i Þ, where o i is a feature that may be retained as combination of "n" number of nucleotides while t i is the number of occurrences of o i being repeated in the given sequence. t i is calculated by the following equation, where Nðo i Þ is the number of times o i is repeated in the sequence while M represents the length of the input sequence.
In this research work, we have taken n to be 1, 2, and 3 to represent a single sequence. The final attained feature vector is of length 84.

Positional features
Applying knowledge regarding the CpG regions and the separation between nearby CpG sites, the authors of LightCpG proposed a method to derive positional characteristics of CpG sites (Jiang et al. 2019). To extract positional features, the states of both the target CpG site and its adjacent CpG sites, as well as the distance between them, are taken into consideration. Meaningful information about the states along with finding the distance among adjacent sites is required to extract positional features.
A new method termed as skip K method is utilized to determine the correlation between two adjacent CpG sites. The skip-K method comprises of two subparts "skip-K 1 " and "skip-K 2 ," they refer to different steps in this analysis. "skip-K 1 " is the preliminary step that checks for the existence of similarity between adjacent sites. Once similarity has been established between adjacent CpG sites, the "skip-K 2 " analyzes the extent of similarity between CpG sites within a certain window size.
The skip-K 1 method is a technique that extracts features from all CpG sites at specific distances from the target CpG site. It selects one CpG site at the upstream end which is located at the K th 1 distance from the reference point (i.e. the target CpG site) and another CpG site at the downstream end which is also at the K th 1 distance from the reference point. Once these CpG sites are extracted, the states are determined using the skip-K 1 method. The method then calculates the distance between the extracted states and the target CpG site. As the distance between the CpG sites increases, the correlation between adjacent sites becomes weaker.
The skip-K 2 method extracts features from CpG sites located at a specific distance from the target CpG site. This involves identifying the CpG sites closest to the target CpG site at a distance of K 2 from both the upstream and downstream ends. The method then determines the states of these extracted CpG sites and calculates the distance between the states and the target CpG site. Using this information, the method predicts the state of the target CpG site. It is important to note that as the window size increases, the prediction accuracy of the skip-K 2 method tends to decrease. Therefore, in order to ensure optimal prediction accuracy, the size of both K 1 and K 2 is set to 1.
In addition to that, the skip-K 1 and skip-K 2 methods, one CpG site state is also individually extracted from both the upstream and downstream ends near the target site. The states of these two CpG sites are then determined and the distance between these extracted states and the target site is calculated. This information is used to predict the state of the target site.
To handle the challenge of extracting features from CpG sites with similar methylation states that exist in different cells, we adopt the following approach. Let us assume there are w cells in the dataset. For each chromosome in the t th cell, Figure 1. Cell-wise methylated and non-methylated, training, and testing data size visualization of HCCs and HepG2 datasets.
With this approach, we can extract the features of CpG sites that are similar to one another across different cells. This enables us to capture the patterns of methylation states across multiple cells, which can improve the accuracy of our predictions.
To maximize the number of features extracted from sites that are similar but exist in different cells, we use Y t to devise methods for feature extraction. If a CpG site exists in the r th cell and satisfies the condition p c t ¼ p l r , then we denote F c t;r as the distance between the nearest CpG sites with methylation states on both ends of the CpG site in question, where the nearest CpG site is the one in the l th position in the j th cell. F c t;r ¼ fP lÀ1 t;r ; S lÀ1 t;r ; P lþ1 t;r ; S lþ1 t;r g: Where; P lÀ1 t;r ¼ p lÀ1 If a single CpG site exists in the r-th cell and has an unknown methylation status, then we select the two adjacent sites which satisfy the condition p l r < p c t < p lþ1 r . In this case, F c t;r is denoted as follows: F c t;r ¼ fP l t;r ; S l t;r ; P lþ1 t;r ; S lþ1 t;r g: Where; P l t;r ¼ p l r À p c t ; P lþ1 t;r ¼ p lþ1 r À p c t ; S l t;r ¼ s l r ; S lþ1 t;r ¼ s lþ1 r : The features which are extracted for the CpG sites are similar in nature but they exist in different cells, these features are also represented and the following equation is used to extract feature vector, Here, D c t;r represents the feature vector for the CpG site c in cell t, where c is similar to a CpG site in cell r with position p l r . If p c t ¼ p l r and t 6 ¼ r, then the feature vector is given by the tuple ðP l t;r ; P lþ1 t;r ; S l t;r Þ, which contains the distance to the nearest CpG sites on either side of site c in cell r, as well as the methylation state of the site l in cell r. If the condition is not satisfied, meaning the sites are not similar, then the feature is set to "unknown".
The aim is to extract features of CpG sites that are similar in nature but exist in different cells. However, some of these CpG sites have unknown methylation states. To solve this, Q and H features are used for feature extraction. F represents two factors, namely the distance and methylation states of the two nearest CpG sites to a specific CpG site that satisfies the condition p c t ¼p l r and is present in the rth cell. Following is the equation used in this case, D features are obtained when a single CpG site in the rth cell satisfies the condition p c t ¼p l r and t is not equal to r. In this case, D represents two factors, distance, and methylation state, of the CpG sites on both sides of the target CpG site. These CpG sites are (lÀ1) and (l þ 1) sites. Figure 2 illustrates these details.
The features obtained are a combination of the distance and methylation states of the CpG sites. F and D features are represented by F c t;r and D c t;r , respectively. Figure 3 shows the pictorial description of these details. Where F c t;r represents the F4 features and D c t;r represents the D8 features.

iCpG-Pos development
At the initial stage, we used weka software to select the baseline classifiers. Several classifiers were run on different features to get the list of best classifiers among all. The four selected classifiers are commonly used in Machine Learning techniques (AdaBoost, Random Forest, XGB, and CatBoost classifier). At first, each classifier is optimized on every feature vector individually. The optimization is performed using the OPTUNA algorithm. OPTUNA utilizes trial histories to identify which hyper-parameter variables to test subsequently (Akiba et al. 2019). It predicts a potential region and attempts parameters in that region while using the given range. According to the current finding, Optuna anticipates an even more competitive region. This technique repeatedly utilizes historical data from previous experiments. It implements a Bayesian optimization approach known as Tree-structured Parzen Estimator. The optimum hyper-parameter values for which the optimization algorithm Fit returns the best score. As a result, the fitness metric is calculated as follows: where MCC i represents the achieved Matthews correlation coefficient (MCC) on j th fold with the selected parameters. N FCV represents the number of fold cross-validation which in our case is 5. The higher the value of Fit shows better the performance by the selected parameters. Intending to enhance CpG site prediction, we used a stacked and voting ensemble learning architecture to create the iCpG-Pos. Out of both the ensemble learning frameworks, the stacking algorithm exhibited to be better. Stacking, unlike the voting ensemble procedure, successfully examines and learns how to combine diverse baseline models to generate a more precise model. In the stacking algorithm, an extra-tree classifier (ETC) is used as a meta-predictor. To incorporate the distinct strengths of several baseline models, the ETC predictor is developed and optimized using the new feature vector. But it is being observed that the stacking algorithm showed almost similar results to the CatBoost classifier using positional features.

Evaluation metrics
We used five assessment indicators to assess the suggested tool. As discussed in the dataset section, we have used individual chromosomes for testing which do not overlap with the training dataset. Therefore independent testing is used to compute these assessment indicators. Sensitivity (Sen), Specificity (Spe), Accuracy (ACC), F1 Score, and MCC are the figure of merits. These criteria are extensively used in the literature (Rehman et al. 2022a(Rehman et al. , 2021, and we have incorporated them to allow for a fair comparison with current tools. These measures can be stated numerically as, where, These assessment indicators are frequently used to assess the classification performance, subsequently quantifying various aspects of the model. MCC is the most revealing and reliable parameter if both positive and negative classes of the dataset are equally important in the analysis, and if accurately categorizing the present ground truth data samples is equally important in the study (Chicco et al. 2021). F1 Score is the harmonic mean of Precision and Recall, and it provides a more accurate assessment of erroneously categorized instances.

Evaluation of different features and classifiers
In order to show the supremacy of positional features with respect to encoded features, we have carried out a comparative analysis between them. Moreover, analysis is also carried out between different classifiers. Figure 4 illustrates these analyses. According to both benchmark datasets, characteristics from the positional features have outperformed other feature encoding schemes. Positional features have shown the best performance followed by n-gram and then One-Hot encoding. Figure 5 shows the t-distributed stochastic neighbor embedding plot of positional features where the discrimination between CpG and Non-CpG sites can be observed. Findings imply that positional features have the potential to greatly improve the performance of CpG site identification. Therefore, we further explored the best classifiers using positional features. Figure 4 illuminates the results achieved by different classifiers being trained using positional features. Among all Catboost classifier is able to achieve better performance. The CatBoost algorithm achieved the maximum average F1-score of 0.8632 for the HCCs dataset and 0.8241 for the HepG2 dataset while the maximum average MCC achieved is 0.8102 and 0.6884 for HCCs and HepG2 datasets respectively. XGB classifier has also showcased good results but during quantitative analysis, it is slightly lower in performance when compared to CatBoost. Figure 4 shows the boxplots of different features with different classifiers, while the cell-wise detailed performance analysis is shown in Supplementary File S1.
In our research, the stacking strategy, which mixes many models to improve predictive performance, did not outperform CatBoost. This result might be related to the unavailability of an adequate and diverse database for training the stacking algorithm. Even though CatBoost is one of the fundamental models in the used stacking strategy, it is vital to remember that stacking adds complexity and interdependence among the models. This intricacy may make it difficult to successfully leverage CatBoost's strengths inside the stacking architecture.

Evaluation of different ensemble learning techniques
As previously stated, iCpG-Pos is a stacked ensemble model that is created by combining multiple machine learning-based models to construct a more accurate CpG site predictor. To demonstrate the importance of the stacking approach, we compared the effectiveness of the stacking technique used in iCpG-Pos with another prominent ensemble procedure, voting. Results in Supplementary File S1 clearly suggest that the stacking approach has illustrated better results than the voting technique. In some cells, stacking algorithm exhibited improved results than the CatBoost algorithm but in maximum average comparison, CatBoost illustrated better results in the majority of metrics.

Comparison of iCpG-Pos with existing techniques
Similar to existing techniques, training and testing dataset chromosomes are used to thoroughly assess and evaluate iCpG-Pos performance to that of existing state-of-the-art approaches. In general, assessing the quality of the model using the same learning and testing dataset is a more objective method for eliminating biasness. As a result, the efficiency of the proposed model is compared to two existing machine learning techniques, including LightCpG and Zhang's method. Figure 6 represents the box-plot illustrating the distribution of evaluation metrics for different proposed techniques.
As can be seen in Fig. 6a, the proposed CatBoost-based algorithm has shown the best result for all metrics in the case of HCCs dataset. The second best algorithm is the proposed stacking-based model which holds a slight difference from CatBoost-based algorithm. Notably, the F1 score and MCC of the Catboost based algorithm have demonstrated high improvement when compared to both existing models which are LightCpG and Zhang's algorithm.
Further in Fig. 6b, it can be comprehended that the highest sensitivity, accuracy, MCC, and F1 score is attained by iCpG-Pos architecture using the CatBoost algorithm for the HepG2 algorithm. However, in the case of specificity, the best performance is showcased by the LightCpG algorithm. Moreover, iCpG-Pos architecture with a stacking algorithm has also shown slightly better results than the CatBoost algorithm. But it is important to look into the quantitative difference between the sensitivity and specificity of LightCpG, which is huge. This shows that for the HepG2 dataset, the LightCpG algorithm is comparatively biased towards the Non-CpG class. As represented in Fig. 1 the ratio of unmethylated sites is much larger in the dataset when compared with methylated sites; therefore, the biasness problem is expected in such condition. However, such biasness is not replicated in the case of iCpG-Pos algorithm, interpreting that the proposed algorithm is able to learn the insight features of the CpG sites.

Conclusion
In this work, we presented the iCpG-Pos technique for enhanced detection of CpG sites. The proposed architecture only uses positional features extracted from the single-cell whole-genome sequencing data. iCpG-Pos presents two techniques which are CatBoost-based and stacking-based. Overall CatBoost-based technique has showcased better results, however, in the case of a few cells stacking-based algorithm illustrated improved results than the CatBoost-based algorithm. All the classification algorithms used in this study are optimized using the OPTUNA framework. An extensive set of experiments have demonstrated that the proposed algorithm serves as a more reliable and  iCpG-Pos 7 efficient predictor than the current technique. This work can be used to uncover the direct linkage between methylation and diseases by comprehending the complicated biological mechanisms that enable methylation.