CpG Transformer for imputation of single-cell methylomes

Abstract

Motivation: The adoption of current single-cell DNA methylation sequencing protocols is hindered by incomplete coverage, underlining the need for effective imputation techniques. The task of imputing single-cell (methylation) data requires models to build an understanding of underlying biological processes.

Results: We adapt the transformer neural network architecture to operate on methylation matrices by combining axial attention with sliding-window self-attention. The resulting CpG Transformer displays state-of-the-art performance on a wide range of scBS-seq and scRRBS-seq datasets. Furthermore, we demonstrate the interpretability of CpG Transformer and illustrate its rapid transfer learning properties, allowing practitioners to train models on new datasets with a limited computational and time budget.

Availability and implementation: CpG Transformer is freely available at https://github.com/gdewael/cpg-transformer.

Supplementary information: Supplementary data are available at Bioinformatics online.

DeepCpG
We:
• Encoded nucleotides with an embedding layer instead of one-hot encoding (see the sketch after this list).
• Used different chromosomes in holdout validation and testing.
• Used all cells for the Ser dataset.
• Improved the efficiency of preprocessing: batches are constructed on the fly without having to precompute the neighbors and write them to disk.
• Used linear learning rate warm-up over the first 1000 steps, which yields more stable learning trajectories (also sketched below).
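To make the first and last of these changes concrete, the following is a minimal PyTorch sketch (our illustration, not the released DeepCpG or CpG Transformer code); the nucleotide vocabulary, embedding size and optimizer are assumptions.

import torch
import torch.nn as nn

# Hypothetical nucleotide vocabulary: A, C, G, T plus an unknown token.
NUC_TO_IDX = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

# Trainable embedding instead of a fixed one-hot encoding (embedding size assumed).
embed = nn.Embedding(num_embeddings=5, embedding_dim=16)
seq = torch.tensor([[NUC_TO_IDX[n] for n in "ACGTN"]])  # (batch=1, seq_len=5)
x = embed(seq)                                          # (1, 5, 16), learned end-to-end

# Linear learning rate warm-up over the first 1000 optimizer steps.
optimizer = torch.optim.Adam(embed.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 1000)
)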
CaMelia
We:
• Encoded genomic positions of CpG sites as 64-bit integers instead of 32-bit floats.
• Excluded bulk cells from the analysis.
• Did not discard samples with an even number of methylated and unmethylated reads. Instead, a label is encoded as positive (methylated) when #(reads positive)/#(reads total) ≥ 0.5 (see the sketch after this list).
• Did not rule out samples originating from CpG sites that do not have a label in any other cell.
• Did not discard samples for which no other cell has a local similarity larger than 0.8. CaMelia excludes these samples because the locally paired similarity feature is undefined in these cases. In order to obtain a prediction for every methylation state in every cell (as is possible with CpG Transformer and DeepCpG), the locally paired similarity feature for these problematic samples is instead assigned NaN (also sketched below). By default, CatBoost processes NaN values as the minimum value for that feature.
• Used different chromosomes in holdout validation and testing, instead of using cross-validation.
• Improved the efficiency of preprocessing: locally paired similarity features are computed in a vectorized way wherever possible. Feature vectors are computed and encoded in a memory-efficient way, so that they do not have to be written to disk.
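The label binarization and NaN handling described above can be sketched as follows (a minimal NumPy illustration with made-up read counts, not CaMelia's actual code):

import numpy as np

reads_positive = np.array([3, 2, 0, 5])
reads_total = np.array([6, 4, 3, 5])
# Ties (3/6 and 2/4 here) are kept and labeled methylated rather than discarded.
labels = (reads_positive / reads_total >= 0.5).astype(int)  # -> [1, 1, 0, 1]

# Samples with an undefined locally paired similarity are assigned NaN instead
# of being dropped; CatBoost by default treats NaN as the minimum value of the
# feature, so every methylation state still receives a prediction.
local_similarity = np.array([0.9, np.nan, 0.85, np.nan])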

Pseudocode for self-attention operations
Algorithm 1: Column-wise full self-attention
# b = batch size, n = number of rows/cells, m = number of columns/CpG sites, d_model = hidden dimensionality
# To perform attention separately for each column: batch and column dimensions are collapsed.
# In addition, attention heads are split into a separate dimension.
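A minimal PyTorch sketch of this operation (our illustration of the pseudocode, not the released implementation; layer names are assumptions):

import torch
import torch.nn as nn

class ColumnAttention(nn.Module):
    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head, self.d_head = n_head, d_model // n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, m, d = x.shape                       # (batch, cells, CpG sites, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Collapse batch and column dims; split heads into a separate dim:
        # (b, n, m, d) -> (b*m, n_head, n, d_head)
        def split(t):
            return (t.permute(0, 2, 1, 3)          # (b, m, n, d)
                     .reshape(b * m, n, self.n_head, self.d_head)
                     .transpose(1, 2))
        q, k, v = split(q), split(k), split(v)
        # Full attention over the cell axis, independently per column.
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (att @ v).transpose(1, 2).reshape(b, m, n, d).permute(0, 2, 1, 3)
        return self.out(y)                         # (b, n, m, d_model)

Folding the column axis into the batch axis is what makes this an axial operation: attention cost scales with m·n² rather than with (nm)² for full 2D attention over the matrix.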
Algorithm 2: Row-wise sliding-window self-attention, with positional encodings as in Transformer-XL.
# b = batch size, n = number of rows/cells, m = number of columns/CpG sites, d_model = hidden dimensionality
Data: X ∈ R^(b×n×m×d_model), positional encoding P ∈ R^((bn)×m×n_head×d_head), trainable bias vectors
Q, K, V ← linear projections of X
# To perform attention separately for each row: batch and row dimensions are collapsed.
# In addition, attention heads are split into a separate dimension.
Q ← Reshape(Q) ∈ R^((bn)×m×n_head×d_head)
K ← Reshape(K) ∈ R^((bn)×m×n_head×d_head)
V ← Reshape(V) ∈ R^((bn)×m×n_head×d_head)
# To perform sliding-window self-attention: m sliding windows of size w are taken from the column dimension.
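The row-wise counterpart can be sketched in the same way (again our illustration, not the released code). For brevity, the Transformer-XL relative positional terms and trainable bias vectors are omitted, as is the masking of the zero-padded window edges; only the sliding-window content attention is shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RowSlidingWindowAttention(nn.Module):
    def __init__(self, d_model: int, n_head: int, w: int):
        super().__init__()
        assert d_model % n_head == 0 and w % 2 == 1
        self.n_head, self.d_head, self.w = n_head, d_model // n_head, w
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, m, d = x.shape                        # (batch, cells, CpG sites, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Collapse batch and row dims; split heads: (b, n, m, d) -> (b*n, n_head, m, d_head)
        def split(t):
            return t.reshape(b * n, m, self.n_head, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # m sliding windows of size w over the column axis, zero-padded at the edges:
        pad = self.w // 2
        k = F.pad(k, (0, 0, pad, pad)).unfold(2, self.w, 1)  # (b*n, n_head, m, d_head, w)
        v = F.pad(v, (0, 0, pad, pad)).unfold(2, self.w, 1)
        # Each query position attends only to its own window of w keys.
        att = torch.softmax(
            torch.einsum("bhmd,bhmdw->bhmw", q, k) / self.d_head ** 0.5, dim=-1)
        y = torch.einsum("bhmw,bhmdw->bhmd", att, v)         # (b*n, n_head, m, d_head)
        y = y.transpose(1, 2).reshape(b, n, m, d)
        return self.out(y)

Restricting each CpG site to a window of w neighbors keeps this step linear in m (cost scales with n·m·w), which is what allows training on long stretches of CpG sites.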

Tables supporting the results section

c: Attained when training on 1024 CpG sites in parallel (bin size), as opposed to the 128 CpG sites in parallel (batch size) used in DeepCpG. When CpG Transformer is trained with a bin size of 128 CpG sites, 1909 MB of GPU memory is used.

Table 3: Ablation study of self-attention mechanisms on the Ser dataset. The final proposed CpG Transformer consists of an axial self-attention mechanism in which sliding-window row attention is performed first, followed by column-wise full self-attention. Full 2D sliding-window self-attention refers to a self-attention mechanism in which row-wise sliding-window and column-wise self-attention are combined into one operation, rather than following each other as in axial attention. Best performers are indicated in bold.
[Table 3 columns: Model, Sliding window size, ROC AUC, PR AUC; table body not recoverable from the extracted text.]

Table 4: Ablation study of hyperparameters on the Ser dataset. The model is compared to smaller, bigger, deeper and shallower versions of itself, as well as a version in which the order of row-wise and column-wise self-attention is switched. Reported test performances are the mean ± standard deviation over three independent seeds. No considerable performance gains can be made from further scaling up CpG Transformer's weights. The final model (in bold) uses a model size that is as small as possible without considerably hurting performance.

[Figure caption] Rows indicate the different datasets, from top to bottom: Ser, 2i, Hemato, HCC and MBL. The biggest gradient in performance is observed along the local sparsity direction. No coverage dependency could be obtained for the MBL dataset, as no read data is available for this dataset. Generally, local sparsity is more indicative of low performance than coverage of the label.

[Figure caption] (1) CpG contribution is higher when the model is more confident (small prediction error), (2) higher sparsity translates into a lower contribution of CpG embeddings, (3) differences in relative contributions between cells are negligible, and (4) the importance of other CpG sites for prediction decreases with absolute distance from the site. A notable exception to statement (3) is found for the Hemato dataset: there, cell embeddings show a considerably higher contribution for the hematopoietic stem cell type than for the other (progenitor) cell types.