scAMACE: model-based approach to the joint analysis of single-cell data on chromatin accessibility, gene expression and methylation

Wangwu, Jiaxuan; Sun, Zexuan; Lin, Zhixiang

doi:10.1093/bioinformatics/btab426

Abstract

Motivation

The advancement in technologies and the growth of available single-cell datasets motivate integrative analysis of multiple single-cell genomic datasets. Integrative analysis of multimodal single-cell datasets combines complementary information offered by single-omic datasets and can offer deeper insights on complex biological process. Clustering methods that identify the unknown cell types are among the first few steps in the analysis of single-cell datasets, and they are important for downstream analysis built upon the identified cell types.

Results

We propose scAMACE for the integrative analysis and clustering of single-cell data on chromatin accessibility, gene expression and methylation. We demonstrate that cell types are better identified and characterized through analyzing the three data types jointly. We develop an efficient Expectation–Maximization algorithm to perform statistical inference, and evaluate our methods on both simulation study and real data applications. We also provide the GPU implementation of scAMACE, making it scalable to large datasets.

Availability and implementation

The software and datasets are available at https://github.com/cuhklinlab/scAMACE_py (python implementation) and https://github.com/cuhklinlab/scAMACE (R implementation).

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Recent developments in single-cell technologies enable multiple measurements of different genomic features (Lahnemann et al., 2020). Sequencing technologies include single-cell RNA sequencing (scRNA-seq) which measures transcription, single-cell ATAC sequencing (scATAC-seq) and the assay based on combinatorial indexing (sci-ATAC-seq) (Cusanovich et al., 2018b) that measure chromatin accessibility, and single-nucleus methylcytosine sequencing (snmC-seq) (Luo et al., 2017) which measures methylome at the single-cell resolution. High technical variation is presented in single-cell datasets due to the limited amount of genomic materials and the experimental procedures to amplify the signals (Lahnemann et al., 2020).

Because cell types are usually unknown beforehand, clustering methods are needed to identify the cell types. Majority of existing clustering algorithms only take one single dataset as input. Beside the widely used K-Means clustering algorithm, hierarchical clustering (Ward, 1963) forms hierarchical groups of mutually exclusive subsets on the basis of their similarity with respect to specified characteristics by considering the union of all possible $\frac{k (k - 1)}{2}$ pairs and accepting the union with which an optimal value of the objective function is associated. Spectral Clustering (Ng et al., 2001) uses the top k eigenvectors of a matrix derived from the distance between points simultaneously for clustering. Several algorithms are developed specifically for scRNA-seq data. SC3 (Kiselev et al., 2017) combines multiple clustering outcomes through a consensus approach. SIMLR (Wang et al., 2017) learns a distance metric by multiple kernels and clusters with affinity propagation. CIDR (Lin et al., 2017) imputes the gene expression profiles, calculates the dissimilarity based on the imputed gene expression profiles for every pair of single cells, performs principal coordinate analysis using the dissimilarity matrix, and finally performs clustering using the first few principal coordinates. SOUP (Zhu et al., 2019) semi-softly classifies both pure and intermediate cell types: it first identifies the set of pure cells by special block structure and estimates a membership matrix, then estimates soft membership for the other cells. For the analysis of single-cell chromatin accessibility data, scABC (Zamanighomi et al., 2018) first weights cells and applies weighted K-medoids clustering, then calculate landmarks for each cluster, and finally clusters the cells by assignment to the closest landmark based on Spearman correlation. Cusanovich (Cusanovich et al., 2018a) makes use of singular value decomposition on TF-IDF transformed matrix and density peak clustering algorithm. cisTopic (Bravo González-Blas et al., 2019) uses latent Dirichlet allocation with a collapsed Gibbs sampler to iteratively optimize the region-topic distribution and the topic-cell distribution. SCALE (Xiong et al., 2019) combines the variational autoencoder framework with the Gaussian Mixture Model which extracts latent features that characterize the distributions of input scATAC-seq data, and then uses the latent features to cluster cell mixtures into subpopulations. Clustering methods are also developed for single-cell methylation data. BPRMeth (Kapourani and Sanguinetti, 2016) uses probabilistic machine learning to extract higher order features across a defined region and to cluster promoter-proximal regions by Binomial distributed probit regression (BPR) and mixture modeling. PDclust (Hui et al., 2018) leverages the methylation state of individual CpGs to obtain pairwise dissimilarity (PD) values, and calculates Euclidean distances between each pair of cells using their PD values and performed hierarchical clustering. Melissa (Kapourani and Sanguinetti, 2019) implements a Bayesian hierarchical model that jointly learns the methylation profiles of genomic regions of interest and clusters cells based on their genome-wide methylation patterns. pCSM (Yin et al., 2019) implements a semi-reference-free procedure to perform virtual methylome dissection using the non-negative matrix factorization algorithm. It first determines putative cell-type-specific methylated loci and then clusters the loci into groups based on their correlations in methylation profiles.

Studies based on single-omic data provide only a partial landscape of the entire cellular heterogeneity (Ma et al., 2020). High technical noise and the growth of available datasets measuring different genomic features encourage integrative analysis (Lahnemann et al., 2020). By combining complementary information from multiple datasets, the cell types may be better separated and characterized (Corces et al., 2016; Duren et al., 2017). The integrative analysis of gene expression and chromatin activity may better define cell types and lineages, especially in complex tissues (Duren et al., 2018). Seurat V3 (Stuart et al., 2019) uses Canonical Correlation Analysis (CCA) to reduce the dimension of the datasets. It identifies the pairwise correspondences of single cells across datasets, termed ‘anchors’, and then transfers labels from a reference dataset onto a query dataset. coupleNMF (Duren et al., 2018) is based on the coupling of two non-negative matrix factorizations, where a ‘soft’ clustering can be obtained following the matrix factorizations. It enables integrative analysis of scRNA-seq and scATAC-seq data. LIGER (Welch et al., 2019) integrates multimodal datasets via integrative non-negative matrix factorization (iNMF) to learn a low-dimensional space defined by dataset-specific factors and shared factors across datasets, and then build a neighborhood graph based on the shared factors to identify joint clusters by performing community detection on this graph. scACE (Lin et al., 2020) is a model-based approach that jointly analyzes single-cell chromatin accessibility and scRNA-Seq data, and it quantifies the uncertainty of cluster assignments. MAESTRO (Wang et al., 2020) integrates scRNA-seq and scATAC-seq data from multiple platforms. It also provides comprehensive functions for pre-processing, alignment, quality control and quantification of expression and accessibility. coupleCoC (Zeng et al., 2020) performs co-clustering of the cells and the features simultaneously in the source data and the target data, and it also matches the cell clusters between the source data and the target data through minimizing the distribution divergence. scMC (Zhang and Nie, 2021) integrates multiple scRNA-Seq datasets or multiple scATAC-Seq datasets, where it learns biological variation via variance analysis to subtract technical variation inferred in an unsupervised manner. The three data types, including gene expression, chromatin accessibility and methylation, have distinct characteristics and complex relationships with each other. The aforementioned methods for integrative analysis are not designed to integrate all three data types. Moreover, these methods (except scACE) do not provide statistical inference on the cluster assignments, which may be important when there are cells at the intermediate stages during development.

In this work, we extend scACE (Lin et al., 2020) to scAMACE (integrative Analysis of single-cell Methylation, chromatin ACcessibility and gene Expression). scAMACE considers the biological and technical variabilities when integrating multiple data types, and it can provide statistical inference on the assignment of clusters. We reason that by combining complementary biological information from multiple data types, better cell type separation can be achieved. We present our model in Section 2, and statistical inference using the Expectation–Maximization (EM) algorithm in Section 3. Simulation study and real data applications are presented in Sections 4 and 5, respectively. The conclusion is presented in Section 6.

2 Materials and methods

An overview of scAMACE is presented in Figure 1.

Fig. 1.

Open in new tab Download slide

Graphical representation of scAMACE

2.1 Model for scRNA-Seq data

The model specification for scRNA-Seq data is as the following.

\begin{matrix} ω_{k g}^{rna} \to \overset{z_{l k}}{\to} u_{l g} \to v_{l g} \to y_{l g} \forall g, \\ P (z_{l k} = 1) = ψ_{k}^{rna}, \\ u_{l g} | z_{l k} = 1 \sim Bernoulli (ω_{k g}^{rna}), \\ v_{l g} | u_{l g} \sim u_{l g} B ernoulli (π_{l 1}) + (1 - u_{l g}) Bernoulli (π_{l 0}), \\ π_{l 0} \sim Beta (α = 1, β = 1), π_{l 1} \sim 1 (π_{l 1} \geq π_{l 0}) Beta (α = 1, β = 1), \\ p (y_{l g} | v_{l g}) = v_{l g} g_{1} (y_{l g}) + (1 - v_{l g}) g_{0} (y_{l g}) . \end{matrix}

We assume that there are K cell clusters in total, the random variable z_lk denotes whether cell l belongs to cluster $k \in {1, \dots, K}$ and $z_{l \cdot}$ follows categorical distribution with probability $ψ_{k}^{rna}$ for cluster k.

$ω_{k g}^{rna}$ denotes the probability that gene g is active in cluster k. u_lg is a binary latent variable representing whether gene g is active in cell l and u_lg = 1 represents that it is active. v_lg denotes whether gene g is expressed in cell l and v_lg = 1 represents that it is expressed.

When gene g is active in cell l (u_lg = 1), the probability that gene g is expressed in cell l (v_lg = 1) is $π_{l 1}$ ⁠, while the probability that gene g is expressed is $π_{l 0}$ if the gene is not active (u_lg = 0). Since genes are more likely to be expressed when they are active, we assume that $π_{l 1} \geq π_{l 0}$ and the prior distributions of $π_{l 1}$ and $π_{l 0}$ are assumed to be flat.

Let y_lg denote the observed gene expression for gene g in cell l (after normalization to account for sequencing depth and gene length), and we assume that $y_{l g} | v_{l g}$ follows a mixture distribution, where $g_{1} (.)$ and $g_{0} (.)$ are density functions of the expression level conditional on v_lg.

2.2 Model for single-cell chromatin accessibility (scCAS) data

The model specification for scCAS data is as the following.

\begin{matrix} ω_{k g}^{acc} \to \overset{z_{l k}}{\to} u_{i g} \to o_{i g} \to x_{i g} \forall g, \\ P (z_{i k} = 1) = ψ_{k}^{acc}, \\ u_{i g} | z_{i k} = 1 \sim Bernoulli (ω_{k g}^{acc}), \\ o_{i g} | u_{i g} \sim u_{i g} B ernoulli (π_{i 1}) + (1 - u_{i g}) Bernoulli (π_{i 0}), \\ π_{i 1} \sim Beta (α_{acc} = 1, β_{acc} = 1), set π_{i 0} = 0, \\ p (x_{i g} | o_{i g}) = o_{i g} f_{1} (x_{i g}) + (1 - o_{i g}) f_{0} (x_{i g}), \\ ω_{k g}^{acc} | ω_{k g}^{rna} \sim Beta (μ_{k g}^{acc}, ϕ^{acc}), logit (μ_{k g}^{acc}) = f (ω_{k g}^{rna}) . \end{matrix}

The random variables $ω_{k g}^{acc}$ ⁠, z_ik, $ψ_{k}^{acc}$ and u_ig have similar interpretations to their corresponding variables in the model for scRNA-Seq data. We use a different notation i to represent that the cells in the scCAS data are different from the cells in the scRNA-Seq data.

x_ig denotes the observed gene score for gene g in cell i. The gene score summarizes the accessibility of the regions around the gene body (Cusanovich et al., 2018a). We model it by a mixture distribution with density functions $f_{1} (.), f_{0} (.)$ and binary latent variable o_ig. o_ig = 1, and 0 represent the mixture components with high (f₁) and low (f₀) gene scores, respectively. Accessibility tends to be positively associated with activity of the gene. We model this positive relationship by the distribution $o_{i g} | u_{i g}$ ⁠. When gene g is active in cell i (u_ig = 1), the probability that it has high gene score (o_ig = 1) is $π_{i 1}$ ⁠; When gene g is inactive in cell i (u_ig = 0), the probability that it has high gene score (o_ig = 1) is $π_{i 0}$ ⁠. We assume that $π_{i 1} \geq π_{i 0}$ to represent the positive relationship. In practice, we found that fixing $π_{i 0} = 0$ leads to good real data performance, and we set $π_{i 0} = 0$ by default. The prior distribution $π_{i 1} \sim Beta (α = 1, β = 1)$ ⁠. In real data example 1, the observed data is promoter accessibility and we use the same model as that for gene score.

We assume that $ω_{k g}^{acc}$ follows Beta distribution with mean $μ_{k g}^{acc}$ and precision $ϕ^{acc}$ ⁠. The variable $μ_{k g}^{acc}$ is connected with $ω_{k g}^{rna}$ in scRNA-Seq data through the logit function: $logit (μ_{k g}^{acc}) = f (ω_{k g}^{rna})$ ⁠. Details on the specification of $f (\cdot)$ are presented in Section 2.6.

2.3 Model for single-cell methylation data

The model specification for sc-methylation data is as the following.

\begin{matrix} ω_{k g}^{met} \to \overset{z_{d k}}{\to} u_{d g} \to m_{d g} \to t_{d g} \forall g, \\ P (z_{d k} = 1) = ψ_{k}^{met}, \\ u_{d g} | z_{d k} = 1 \sim Bernoulli (ω_{k g}^{met}), \\ m_{d g} | u_{d g} \sim u_{d g} B ernoulli (π_{d 1}) + (1 - u_{d g}) Bernoulli (π_{d 0}), \\ π_{d 0} \sim Beta (α = 1, β = 1), π_{d 1} \sim 1 (π_{d 1} \leq π_{d 0}) Beta (α = 1, β = 1), \\ p (t_{d g} | m_{d g}) = m_{d g} h_{1} (t_{d g}) + (1 - m_{d g}) h_{0} (t_{d g}), \\ ω_{k g}^{met} | ω_{k g}^{rna} \sim Beta (μ_{k g}^{met}, ϕ^{met}), logit (μ_{k g}^{met}) = g (ω_{k g}^{rna}) . \end{matrix}

The random variables $ω_{k g}^{met}$ ⁠, z_dk, $ψ_{k}^{met}$ and u_dg have similar interpretations to their corresponding variables in the model for scRNA-Seq data. We use a different notation d to represent that the cells in the sc-methylation data are different from the cells in the scRNA-Seq data.

The binary random variable m_dg denotes whether gene g is methylated in cell d, and m_dg = 1 represents that it is methylated. Methylation of a gene (promoter methylation/gene body methylation) tends to be negatively associated with activity of the gene, and we model this negative relationship with the model $m_{d g} | u_{d g}$ ⁠: when the gene g is active in cell d (u_dg = 1), it is less likely to be methylated (m_dg = 1), as we assume that $π_{d 1} \leq π_{d 0}$ ⁠.

t_dg denotes the observed methylation level for gene g in cell d, and we assume that $t_{d g} | m_{d g}$ follows a mixture distribution, where $h_{1} (.)$ and $h_{0} (.)$ are density functions conditional on m_dg. The technologies/features differ for the two real data applications to be presented: promoter methylation for the gene (Pott, 2017), and gene body methylation at non-CG sites (Luo et al., 2017).

Similar to scCAS data, we connect $μ_{k g}^{met}$ ⁠, which is the mean of $ω_{k g}^{met}$ ⁠, and $ω_{k g}^{rna}$ through the logit function: $logit (μ_{k g}^{met}) = g (ω_{k g}^{rna})$ ⁠. Details on specification of $g (\cdot)$ are presented in Section 2.6.

2.4 More on model specification

Methylation and chromatin accessibility regulate gene expression biologically. Our model is specified in the reverse order, so gene expression plays a central role. This is because scRNA-Seq data is usually less noisy compared with scCAS data and sc-methylation data, the model specified this way will improve the clustering performance of scCAS data and sc-methylation data, without sacrificing much the clustering performance of scRNA-Seq data.

2.5 Prior specifications

We assume the following priors for

ψ^{acc}, ψ^{rna}, ψ^{met}, ω_{k g}^{rna}

⁠.

ψ^{acc} \sim Dir (2, \dots, 2), ψ^{rna} \sim Dir (2, \dots, 2), ψ^{met} \sim Dir (2, \dots, 2), ω_{k g}^{rna} \sim Beta (α_{1} = 2, β_{1} = 2)

The prior specification $Beta (α = 2, β = 2)$ improves the stability of the EM algorithm in Section 3 over uniform distritbution.

2.6 Determination of $f (ω_{k g}^{rna}), g (ω_{k g}^{rna}), ϕ^{acc}$ and $ϕ^{met}$

We assume that $f (ω_{k g}^{rna}) = η + γ ω_{k g}^{rna} + τ {(ω_{k g}^{rna})}^{2}$ and $g (ω_{k g}^{rna}) = δ + θ ω_{k g}^{rna}$ ⁠. The parameters { $η, γ, τ, δ, θ, ϕ^{acc}, ϕ^{met}$ } are estimated empirically from the datasets. We first set the number of clusters K = 1 and use the model to estimate $ω_{k g}^{rna}, ω_{k g}^{acc}$ and $ω_{k g}^{met}$ separately without considering the links on $ω$ across the three datasets, and then fix ${\hat{ω}}_{k g}^{rna}, {\hat{ω}}_{k g}^{acc}$ and ${\hat{ω}}_{k g}^{met}$ to estimate { $η, γ, τ, δ, θ, ϕ^{acc}, ϕ^{met}$ } by beta regression (Silvia and Francisco, 2004). The rationale for fixing K = 1 to estimating the parameters in the functions $f (.)$ and $g (.)$ that link the three modalities is that the majority of the features may not change much across the cell types. We fix { $\hat{η}, \hat{γ}, \hat{τ}, \hat{δ}, \hat{θ}, {\hat{ϕ}}^{acc}, {\hat{ϕ}}^{met}$ } when implementing the EM algorithm in Section 3. Estimating { $η, γ, τ, δ, θ, ϕ^{acc}, ϕ^{met}$ } separately from the EM algorithm improves computational efficiency and avoids problematic local modes. Distributions of ${\hat{ω}}_{k g}^{rna}$ v.s. ${\hat{ω}}_{k g}^{acc}$ and ${\hat{ω}}_{k g}^{rna}$ v.s. ${\hat{ω}}_{k g}^{met}$ for the two real data applications are presented in Supplementary Figures S4 and S5, we can see from Supplementary Figures S4 and S5 that the linear and quadratic models capture the trends on how ${\hat{ω}}_{k g}^{acc}$ and ${\hat{ω}}_{k g}^{met}$ changes with ${\hat{ω}}_{k g}^{rna}$ ⁠.

2.7 The mixture components

For scCAS data, we apply $f_{1} (x) = 0, f_{0} (x) = 1$ if x = 0 and $f_{1} (x) = 1, f_{0} (x) = 0$ if x > 0, due to the sparsity of the data matrix.

For scRNA-Seq data, we first normalize read counts to TPM (transcripts per million) or FPKM (fragments per kilobase of exon model per million reads mapped) to account for sequencing depth and gene length, then fit a two-component gamma mixture model for the non-zero entries, through pooling ln(TPM + 1) or ln(FPKM + 1) over all the samples, and then the remaining zero entries are merged with the mixture component that has a smaller mean. The log transformation takes into account the very large values in the data matrix.

sc-methylation data represents the proportion of methylated sites within a given genomic interval, where the entries in the data matrix take values between 0 and 1. In the two real data applications, majority of entries in the data matrix take small values, for the methylation data in each cell, we first divide the entries by $(1 - entries)$ to map them into $[0, \infty)$ ⁠. We then normalize the entries by dividing the median of non-zero entries in each cell, and then take square of the entries to boost the signals. This transformation helps to align the three modalities and it improves the clustering results significantly (Supplementary Tables S5 and S6). Because the transformed entries represent the relative evidence of the methylation status, we input the transformed entries directly as the ratio $\frac{h_{1} (\cdot)}{h_{0} (\cdot)}$ in the EM algorithm. Histograms for the distributions of the sc-methylation data are presented in Supplementary Figure S1.

2.8 Feature selection

scRNA-Seq data is usually the least noisy data type, compared with scCAS and sc-methylation data. We use scRNA-Seq data for feature selection before implementing scAMACE. We first cluster scRNA-Seq data with SC3 and then use the cluster assignments to select top 1000 features with large mean shift across different clusters. More specifically, denote the data matrix as $X_{n \times p}$ (x_ij denotes the observation for the ith cell and jth feature), the cluster assignments as $L_{n \times 1}$ (l_i = k denotes that the ith cell belongs to the kth cluster) and total number of clusters as K. For feature j, we first calculate the difference between the mean of the cells within one cell type and the mean of cells in other cell types; the differences are represented as $D (j) = (d_{1 j}, \dots, d_{K j})$ ⁠, where $d_{k j} = \underset{i : l_{i} = k}{mean} (x_{i j}) - \underset{i : l_{i} \neq k}{mean} (x_{i j})$ ⁠. We take the maximum entry in D(j): $m (j) = {max}_{k} D (j)$ ⁠. When m(j) is large, it represents that feature j has high expression in one cluster, compared with all other clusters. Finally, we select the top 1000 features with highest values in m(j).

2.9 Determination of the number of clusters K

We determine the number of clusters K for the three single-cell datasets separately before we apply scAMACE. We first run K-Means for each K and calculate the average silhouette width of observations (Kaufman and Rousseeuw, 1990). Silhouette width measures how well an observation has been classified. For each observation i, the silhouette value s(i) is calculated as follows. First denote by A the cluster to which observation i has been assigned and then calculate

a (i) = average Euclidean distance of i to all other objects of A .

Now consider any cluster C different from A and define

\begin{matrix} d (i, C) = average Euclidean distance of i to all objects of C . \\ b (i) = \min_{C \neq A} d (i, C) . \end{matrix}

Then $s (i) = \frac{b (i) - a (i)}{\max (b (i), a (i))}$ ⁠. When cluster A contains only a single observation, we simply set s(i) = 0. The average of s(i) for $i = 1, 2, \dots, n$ is denoted by $\bar{s} (k)$ ⁠, and it is called the average silhouette width for the entire dataset. $\bar{s} (k)$ is used for the selection of K. Higher value in $\bar{s} (k)$ indicates better clustering outcome. We select K that has the maximum average silhouette width. Details for selecting K in the two real data applications are presented in Supplementary Figures S2 and S3. When the similarity of the cell types is high, the Silhoutte method may choose a smaller K than the number of cell types (Supplementary Fig. S3), and we may choose a larger K instead.

3 Statistical inference: EM algorithm

Given the observed scCAS data $X$ ⁠, scRNA-Seq data $Y$ ⁠, and sc-methylation data $T$ ⁠, we treat the latent variables $Γ = {Z, U, O, V, M}$ as missing data, and use the EM algorithm to estimate the parameters $Φ = {ψ^{acc}, ω_{}^{acc}, π_{i}, ψ^{rna}, ω_{}^{rna}, π_{l}, ψ^{met}, ω_{}^{met}, π_{d}}$ ⁠. The Q-function is $Q (Φ | Φ_{old}) = E_{old} (l n (P (Φ, Γ | obs .)))$ ⁠, where the expectation is over $Γ$ under distribution $P (Γ | Φ^{old}, obs .)$ ⁠.

In the M-step, we maximize

Q (Φ | Φ_{old})

with respect to

Φ

and update parameters as follows.

\begin{matrix} ψ_{k}^{acc} = \frac{1 + \sum_{i} E_{old} (z_{i k})}{K + n_{acc}}, \\ ψ_{k}^{rna} = \frac{1 + \sum_{l} E_{old} (z_{l k})}{K + n_{rna}}, \\ ψ_{k}^{met} = \frac{1 + \sum_{d} E_{old} (z_{d k})}{K + n_{met}}, \\ π_{i 1} = \frac{\sum_{k} \sum_{g} E_{old} (z_{i k} u_{i g} o_{i g}) + α_{acc} - 1}{\sum_{k} \sum_{g} E_{old} (z_{i k} u_{i g}) + α_{acc} + β_{acc} - 2}, \\ π_{l 1} = \frac{\sum_{k} \sum_{g} E_{old} (z_{l k} u_{l g} v_{l g})}{\sum_{k} \sum_{g} E_{old} (z_{l k} u_{l g})}, \\ π_{l 0} = \frac{\sum_{k} \sum_{g} E_{old} [z_{l k} (1 - u_{l g}) v_{l g}]}{\sum_{k} \sum_{g} E_{old} [z_{l k} (1 - u_{l g})] - 1}, \\ π_{d 1} = \frac{\sum_{k} \sum_{g} E_{old} (z_{d k} u_{d g} m_{d g})}{\sum_{k} \sum_{g} E_{old} (z_{d k} u_{d g})}, \\ π_{d 0} = \frac{\sum_{k} \sum_{g} E_{old} [z_{d k} (1 - u_{d g}) m_{d g}] - 1}{\sum_{k} \sum_{g} E_{old} [z_{d k} (1 - u_{d g})] - 1}, \\ ω_{k g}^{acc} = \frac{\sum_{i} E_{old} (z_{i k} u_{i g}) + μ_{k g}^{acc} ϕ^{acc} - 1}{\sum_{i} E_{old} (z_{i k}) + ϕ^{acc} - 2}, \\ ω_{k g}^{met} = \frac{\sum_{d} E_{old} (z_{d k} u_{d g}) + μ_{k g}^{met} ϕ^{met} - 1}{\sum_{d} E_{old} (z_{d k}) + ϕ^{met} - 2} . \end{matrix}

We use grid search to update $ω_{k g}^{rna}$ because its optimal value does not have an explicit form.

We iterate between E-step and M-step until converge. $E (Z_{i .}), E (Z_{l .})$ and $E (Z_{d .})$ in the last iteration are used for clustering. Details for the derivations are presented in Supplementary Materials.

4 Simulation studies

To validate scAMACE, we generated three different types of simulated data $x, y$ and $t$ following the model assumption. In the simulated data, the sample sizes n_x = 900, n_y = 1100 and n_t = 1000. The number of features p = 1000. The number of clusters $K_{x} = K_{y} = K_{t} = 3$ ⁠, and $ψ_{x} = ψ_{y} = ψ_{t} = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ ⁠. $f (ω_{k g}^{y}) = η + γ ω_{k g}^{y} + τ {(ω_{k g}^{y})}^{2} = - 1 + 7 ω_{k g}^{y} - 2 {(ω_{k g}^{y})}^{2}, g (ω_{k g}^{y}) = δ + θ ω_{k g}^{y} = - 2 + 5 ω_{k g}^{y}, ϕ^{x} = 10$ and $ϕ^{t} = 10$ ⁠. The detailed simulation scheme is presented in Supplementary Materials.

For the first data type, x, we set $f_{1} (x) = 0$ if x = 0, and $f_{0} (x) = 0$ if x = 1. We fit a two-component gamma mixture model for $y$ using ‘gammamixEM’ in R (Young et al., 2019) and beta mixture model for $t$ using ‘betamix’ in R (Cribari-Neto and Zeileis, 2010; Grun et al., 2012) to estimate the mixture densities. We apply the method in Section 2.6 to estimate parameters in $f (ω_{k g}^{y})$ and $g (ω_{k g}^{y})$ ⁠. We then implement scAMACE using the estimated densities and $\hat{η}, \hat{γ}, \hat{τ}, {\hat{ϕ}}^{x}, \hat{δ}, \hat{θ}, {\hat{ϕ}}^{t}$ ⁠.

We use purity, rand index, adjusted rand index and normalized mutual information to evaluate the clustering results. We implement scAMACE either on the three data types separately [‘scAMACE (separate)’] without borrowing information or jointly [‘scAMACE (joint)’]. Table 1 presents the simulation results. We also compared scAMACE with other existing methods under four additional simulation schemes: imbalanced datasets where the number of cells varies across the three datasets (Supplementary Table S1), unequal numbers of clusters in the three datasets (Supplementary Table S2), imbalanced cluster sizes (Supplementary Table S3) and smaller number of features (Supplementary Table S4). scAMACE performs the best compared with the other methods in the above simulation settings. This is likely due to integration of information from all three datasets.

Table 1.

Mean and SD (in parentheses) of purity, rand index, adjusted rand index (ARI) and normalized mutual information (NMI) for 50 independent runs are shown

	Data type	Purity	Rand index	ARI	NMI
	$x$	0.690(0.025)	0.683(0.018)	0.288(0.041)	0.245(0.035)
scAMACE (joint)	$y$	0.897(0.009)	0.874(0.010)	0.716(0.022)	0.637(0.022)
	$t$	0.704(0.021)	0.693(0.016)	0.310(0.034)	0.265(0.030)
	$x$	0.659(0.028)	0.662(0.018)	0.241(0.041)	0.205(0.034)
scAMACE (separate)	$y$	0.838(0.012)	0.810(0.012)	0.573(0.028)	0.498(0.026)
	$t$	0.643(0.020)	0.651(0.012)	0.216(0.028)	0.185(0.024)
	$x$	0.383(0.020)	0.558(0.004)	0.007(0.008)	0.008(0.007)
K-means	$y$	0.714(0.036)	0.702(0.026)	0.331(0.058)	0.283(0.049)
	$t$	0.388(0.021)	0.560(0.004)	0.010(0.008)	0.106(0.008)
	$x$	0.360(0.011)	0.488(0.047)	0.001(0.001)	0.003(0.002)
Hierarchical clustering	$y$	0.366(0.011)	0.520(0.026)	0.002(0.002)	0.003(0.002)
	$t$	0.360(0.010)	0.532(0.022)	0.001(0.001)	0.002(0.001)
	$x$	0.395(0.020)	0.561(0.004)	0.012(0.009)	0.013(0.008)
Spectral clustering	$y$	0.722(0.018)	0.708(0.018)	0.344(0.041)	0.295(0.034)
	$t$	0.400(0.025)	0.562(0.005)	0.014(0.012)	0.015(0.011)

	Data type	Purity	Rand index	ARI	NMI
	$x$	0.690(0.025)	0.683(0.018)	0.288(0.041)	0.245(0.035)
scAMACE (joint)	$y$	0.897(0.009)	0.874(0.010)	0.716(0.022)	0.637(0.022)
	$t$	0.704(0.021)	0.693(0.016)	0.310(0.034)	0.265(0.030)
	$x$	0.659(0.028)	0.662(0.018)	0.241(0.041)	0.205(0.034)
scAMACE (separate)	$y$	0.838(0.012)	0.810(0.012)	0.573(0.028)	0.498(0.026)
	$t$	0.643(0.020)	0.651(0.012)	0.216(0.028)	0.185(0.024)
	$x$	0.383(0.020)	0.558(0.004)	0.007(0.008)	0.008(0.007)
K-means	$y$	0.714(0.036)	0.702(0.026)	0.331(0.058)	0.283(0.049)
	$t$	0.388(0.021)	0.560(0.004)	0.010(0.008)	0.106(0.008)
	$x$	0.360(0.011)	0.488(0.047)	0.001(0.001)	0.003(0.002)
Hierarchical clustering	$y$	0.366(0.011)	0.520(0.026)	0.002(0.002)	0.003(0.002)
	$t$	0.360(0.010)	0.532(0.022)	0.001(0.001)	0.002(0.001)
	$x$	0.395(0.020)	0.561(0.004)	0.012(0.009)	0.013(0.008)
Spectral clustering	$y$	0.722(0.018)	0.708(0.018)	0.344(0.041)	0.295(0.034)
	$t$	0.400(0.025)	0.562(0.005)	0.014(0.012)	0.015(0.011)

Open in new tab

Table 1.

Mean and SD (in parentheses) of purity, rand index, adjusted rand index (ARI) and normalized mutual information (NMI) for 50 independent runs are shown

	Data type	Purity	Rand index	ARI	NMI
	$x$	0.690(0.025)	0.683(0.018)	0.288(0.041)	0.245(0.035)
scAMACE (joint)	$y$	0.897(0.009)	0.874(0.010)	0.716(0.022)	0.637(0.022)
	$t$	0.704(0.021)	0.693(0.016)	0.310(0.034)	0.265(0.030)
	$x$	0.659(0.028)	0.662(0.018)	0.241(0.041)	0.205(0.034)
scAMACE (separate)	$y$	0.838(0.012)	0.810(0.012)	0.573(0.028)	0.498(0.026)
	$t$	0.643(0.020)	0.651(0.012)	0.216(0.028)	0.185(0.024)
	$x$	0.383(0.020)	0.558(0.004)	0.007(0.008)	0.008(0.007)
K-means	$y$	0.714(0.036)	0.702(0.026)	0.331(0.058)	0.283(0.049)
	$t$	0.388(0.021)	0.560(0.004)	0.010(0.008)	0.106(0.008)
	$x$	0.360(0.011)	0.488(0.047)	0.001(0.001)	0.003(0.002)
Hierarchical clustering	$y$	0.366(0.011)	0.520(0.026)	0.002(0.002)	0.003(0.002)
	$t$	0.360(0.010)	0.532(0.022)	0.001(0.001)	0.002(0.001)
	$x$	0.395(0.020)	0.561(0.004)	0.012(0.009)	0.013(0.008)
Spectral clustering	$y$	0.722(0.018)	0.708(0.018)	0.344(0.041)	0.295(0.034)
	$t$	0.400(0.025)	0.562(0.005)	0.014(0.012)	0.015(0.011)

	Data type	Purity	Rand index	ARI	NMI
	$x$	0.690(0.025)	0.683(0.018)	0.288(0.041)	0.245(0.035)
scAMACE (joint)	$y$	0.897(0.009)	0.874(0.010)	0.716(0.022)	0.637(0.022)
	$t$	0.704(0.021)	0.693(0.016)	0.310(0.034)	0.265(0.030)
	$x$	0.659(0.028)	0.662(0.018)	0.241(0.041)	0.205(0.034)
scAMACE (separate)	$y$	0.838(0.012)	0.810(0.012)	0.573(0.028)	0.498(0.026)
	$t$	0.643(0.020)	0.651(0.012)	0.216(0.028)	0.185(0.024)
	$x$	0.383(0.020)	0.558(0.004)	0.007(0.008)	0.008(0.007)
K-means	$y$	0.714(0.036)	0.702(0.026)	0.331(0.058)	0.283(0.049)
	$t$	0.388(0.021)	0.560(0.004)	0.010(0.008)	0.106(0.008)
	$x$	0.360(0.011)	0.488(0.047)	0.001(0.001)	0.003(0.002)
Hierarchical clustering	$y$	0.366(0.011)	0.520(0.026)	0.002(0.002)	0.003(0.002)
	$t$	0.360(0.010)	0.532(0.022)	0.001(0.001)	0.002(0.001)
	$x$	0.395(0.020)	0.561(0.004)	0.012(0.009)	0.013(0.008)
Spectral clustering	$y$	0.722(0.018)	0.708(0.018)	0.344(0.041)	0.295(0.034)
	$t$	0.400(0.025)	0.562(0.005)	0.014(0.012)	0.015(0.011)

Open in new tab

In the following two real data applications, we apply methods mentioned in Section 2.7 instead of fitting a beta mixture model to sc-methylation data.

5 Application to real data

5.1 Application 1: K562 and GM12878 scRNA-Seq, scATAC-Seq and sc-methylation data

We evaluate scAMACE by jointly clustering scRNA-Seq, scATAC-Seq and sc-methylation data generated from two cell types, K562 and GM12878 (Buenrostro et al., 2015; Li et al., 2017; Pott, 2017). We set K = 2, and use the true cell labels as a benchmark to evaluate the performance of the clustering methods. Tables 2 and 3 presents the clustering results. scAMACE performs well in seperating the cell types. scRNA-Seq is perfectly separated, while there are only three cells that are not classified correctly in the sc-methylation dataset and eleven misclassifications in the scATAC-Seq dataset. In addition, the two cell types are correctly matched across the three datasets. Compared with the clustering results given by implementing scAMACE separately on the three datasets, jointly clustering the three datasets improves the overall clustering performance, especially for scATAC-Seq data, which is likely due to the integration of information across the three datasets.

Table 2.

Clustering tables for K562, GM12878 scRNA-Seq, scATAC-Seq and sc-methylation data

scAMACE (joint)

scAMACE (separate)

1

2

1

2

scATAC-Seq

GM12878

368

5

254

119

K562

6

660

171

495

scRNA-Seq

GM12878

128

0

128

0

K562

0

73

0

73

sc-methyl

GM12878

16

3

7

12

K562

0

11

0

Seurat V3

LIGER

1

2

3

1

2

3

scATAC-Seq

GM12878

346

27

K562

499

167

scRNA-Seq

GM12878

101

2

25

127

0

1

K562

0

73

0

10

63

0

sc-methyl

GM12878

19

K562

11

Open in new tab

We compared scAMACE with Seurat V3 (Stuart et al., 2019), LIGER (Welch et al., 2019) and scMC (Zhang and Nie, 2021), which are methods for integrative analysis of single-cell data. Examples were presented in Seurat V3 (Stuart et al., 2019) where scRNA-Seq and scATAC-Seq data were integrated. So we implemented Seurat V3 to integrate these two data types. Seurat V3 did not perform well for scATAC-Seq data (Table 2). Seurat V3 is not applicable to integrate sc-methylation data with the other two datasets. Examples were presented in LIGER (Welch et al., 2019) where scRNA-Seq data and sc-methylation data were integrated. So we implemented LIGER to integrate these two data types. LIGER did not perform well on sc-methylation data (Table 2). We also implemented LIGER to integrate all three datasets, and LIGER still did not perform well on sc-methylation data (Supplementary Tables S7 and S8), this may be due to the small sample size in sc-methylation data. scMC (Zhang and Nie, 2021) was developed for the integrative analysis of multiple single-cell datasets with the same data type. Since the features in scATAC-Seq data, scRNA-Seq data and sc-methylation data are linked, scMC can be implemented in principle. scMC did not perform well on scATAC-Seq data and sc-methylation data (Supplementary Tables S7 and S8). This may be due to the fact that the characteristics of different data types are very different, and ignoring the difference leads to suboptimal performance.

Table 3.

Comparison of the performance of different methods on the K562, GM12878 dataset by adjusted rand index

	scAMACE (joint)	scAMACE (separate)	Seurat V3	LIGER	scMC
scATAC-Seq	0.958	0.192	0.033		0.000
scRNA-Seq	1.000	1.000	0.713	0.800	0.771
sc-methyl	0.628	0.260		0.000	0.000

	scAMACE (joint)	scAMACE (separate)	Seurat V3	LIGER	scMC
scATAC-Seq	0.958	0.192	0.033		0.000
scRNA-Seq	1.000	1.000	0.713	0.800	0.771
sc-methyl	0.628	0.260		0.000	0.000

Open in new tab

Table 3.

Comparison of the performance of different methods on the K562, GM12878 dataset by adjusted rand index

	scAMACE (joint)	scAMACE (separate)	Seurat V3	LIGER	scMC
scATAC-Seq	0.958	0.192	0.033		0.000
scRNA-Seq	1.000	1.000	0.713	0.800	0.771
sc-methyl	0.628	0.260		0.000	0.000

	scAMACE (joint)	scAMACE (separate)	Seurat V3	LIGER	scMC
scATAC-Seq	0.958	0.192	0.033		0.000
scRNA-Seq	1.000	1.000	0.713	0.800	0.771
sc-methyl	0.628	0.260		0.000	0.000

Open in new tab

5.2 Application 2: mouse neocortex scRNA-Seq, sci-ATAC-Seq and sc-methylation data

In this example, we evaluate scAMACE for the joint analysis of single-cell datasets where the cell types are different across the datasets.

We collected single-cell datasets generated from mouse neocortex. There are five cell types in scRNA-Seq data (Tasic et al., 2018), including astrocytes (Astro), glutamatergic neurons in layer 4 (L4), corticothalamic glutamatergic neurons in layer 6 (L6 CT), oligodendrocytes (Oligo) and Pvalb+ GABAergic neurons (Pvalb). There are three cell types in sci-ATAC-Seq data (Cusanovich et al., 2018b), including astrocytes (Astro), excitatory neurons CPN (Ex. neurons CPN) and oligodendrocytes (Oligo). There are three cell types in sc-methylation dataset (Luo et al., 2017), including excitatory neurons in layer 4 (L4), excitatory neurons in layer 6 [labeled as L6-2 in (Luo et al., 2017)] and Pvalb+ GABAergic neurons (Pvalb). In the three datasets, the optimal numbers of clusters chosen by the Silhoutte method, $\hat{K} = 2$ ⁠, tend to be smaller than the numbers of cell types, which is likely due to the similarity of the neuronal subtypes. We set K = 5 when we implement scAMACE, instead of the value given by the Silhoutte method. The true cell labels are used as a benchmark for evaluating the performance of the clustering methods.

The clustering results are presented in Tables 4 and 5. Even though K is larger than the number of cell types in sci-ATAC-Seq data and sc-methylation data, scAMACE still determines the correct number of cell types in sci-ATAC-Seq data. Although the cells in sc-methylation data fall into four clusters, there are only seven cells in cluster 4. Cell types in all three datasets are well separated. Astrocytes and oligodendrocytes are matched across scRNA-Seq data and sci-ATAC-Seq data. Excitatory neurons CPN in sci-ATAC-Seq data are matched with glutamatergic neurons in layer 4 in the scRNA-Seq data. We note that most excitatory neurons are glutamatergic neurons. Excitatory neurons in layers 4 and 6, and Pvalb+ GABAergic neurons are matched between scRNA-Seq data and sc-methylation data.

Table 4.

Clustering tables for the mouse neocortex scRNA-Seq, sci-ATAC-Seq and sc-methylation data

scAMACE (joint)

scAMACE (separate)

1

2

3

4

5

1

2

3

4

5

sci-ATAC-Seq

Astro

550

0

1

550

0

1

Ex. neurons CPN

0

1391

0

1

1390

0

Oligo

0

1

457

0

458

scRNA-Seq

Astro

368

0

368

0

L4

0

1401

0

1401

0

L6 CT

0

960

0

960

0

Oligo

25

0

66

0

27

0

64

0

Pvalb

0

1337

0

1337

sc-methyl

L4

411

1

0

412

L6-2

20

703

6

0

729

Pvalb

0

1

153

154

Seurat V3

LIGER

1

2

3

4

5

6

7

8

9

1

2

3

4

5

6

7

sci-ATAC-Seq

Astro

296

29

1

2

223

Ex. neurons CPN

110

461

243

0

577

Oligo

64

90

1

0

303

scRNA-Seq

Astro

0

368

0

6

0

362

0

L4

1028

0

373

0

1401

0

L6 CT

0

960

0

2

0

958

0

Oligo

0

60

31

0

60

Pvalb

0

647

498

0

192

0

1337

0

sc-methyl

L4

0

11

679

0

L6-2

0

13

399

0

Pvalb

68

0

86

Open in new tab

Compared with implementing scAMACE on the three datasets separately, the joint analysis leads to improvement in clustering, especially for sc-methylation dataset. This is likely because the joint model borrows information across the three datasets. Similar to application 1, we implemented Seurat V3 to integrate scRNA-Seq and sci-ATAC-Seq data. Seurat V3 (Stuart et al., 2019) does not perform well on sci-ATAC-Seq data (Table 4). We implemented LIGER (Welch et al., 2019) to integrate scRNA-Seq and sc-methylation data. LIGER does not separate excitatory neurons in layer 4 and layer 6 in sc-methylation data (Table 4). We also integrated all three datasets by LIGER (Welch et al., 2019) and scMC (Zhang and Nie, 2021). LIGER and scMC did not perform well (Supplementary Tables S9 and S10). Overall, scAMACE performed the best compared with the other methods.

Table 5.

Comparison of the performance of different methods on the mouse neocortex dataset by adjusted rand index

	scAMACE (joint)	scAMACE (separate)	Seurat V3	LIGER	scMC
sci-ATAC-Seq	0.998	0.998	0.058		0.019
scRNA-Seq	0.997	0.997	0.697	0.983	0.145
sc-methyl	0.932	0.000		0.316	0.001

	scAMACE (joint)	scAMACE (separate)	Seurat V3	LIGER	scMC
sci-ATAC-Seq	0.998	0.998	0.058		0.019
scRNA-Seq	0.997	0.997	0.697	0.983	0.145
sc-methyl	0.932	0.000		0.316	0.001

Open in new tab

Table 5.

Comparison of the performance of different methods on the mouse neocortex dataset by adjusted rand index

	scAMACE (joint)	scAMACE (separate)	Seurat V3	LIGER	scMC
sci-ATAC-Seq	0.998	0.998	0.058		0.019
scRNA-Seq	0.997	0.997	0.697	0.983	0.145
sc-methyl	0.932	0.000		0.316	0.001

	scAMACE (joint)	scAMACE (separate)	Seurat V3	LIGER	scMC
sci-ATAC-Seq	0.998	0.998	0.058		0.019
scRNA-Seq	0.997	0.997	0.697	0.983	0.145
sc-methyl	0.932	0.000		0.316	0.001

Open in new tab

5.3 Computational cost

LIGER, Seurat V3 and scMC only provide the versions that are implemented on CPU, while scAMACE can be implemented on both CPU and GPU. We summarized the computational time for scAMACE (CPU version and GPU version in python), LIGER (Welch et al., 2019), Seurat V3 (Stuart et al., 2019) and scMC (Zhang and Nie, 2021) (Supplementary Tables S11–S13). We implemented scAMACE, LIGER and scMC to cluster the three types of data simultaneously, and we implemented Seurat V3 to cluster scCAS data and scRNA-Seq data.

On real data application 2 (∼8000 cells), the computational time for scAMACE are 418.858 s on one 3.4 GHz Intel Xeon Gold CPU and 69.652 s on one 3.1 GHz Dual Intel Xeon Gold GPU. Compared with LIGER (80.389 s on one 3.4 GHz Intel Xeon Gold CPU), scMC (372.323 s on one 3.4 GHz Intel Xeon Gold CPU) and Seurat V3 (116.688 s for scRNA-Seq and sci-ATAC-Seq data on one 3.4 GHz Intel Xeon Gold CPU), scAMACE has competitive computational speed, especially the GPU version.

Next, we generated a dataset with sample size = 30,000 (⁠ $n_{acc} = n_{rna} = n_{met} = 10 000$ ⁠) by sampling the cells with replacement from real data application 2. The computational time for scAMACE are 1534.631 s on one 3.4 GHz Intel Xeon Gold CPU and 250.089 s on one 3.1 GHz Dual Intel Xeon Gold GPU. Compared with LIGER (555.574 s on one 3.4 GHz Intel Xeon Gold CPU), scMC (3667.878 s on one 3.4 GHz Intel Xeon Gold CPU) and Seurat V3 (290.640 s for scRNA-Seq and sci-ATAC-Seq data on one 3.4 GHz Intel Xeon Gold CPU), scAMACE has competitive computational speed on datasets with larger scale.

6 Conclusion

Unsupervised methods including dimension reduction and clustering are essential to the analysis of single-cell genomic data as the cell types are usually unknown. We have developed scAMACE, a model-based approach for integratively clustering single-cell data on chromatin accessibility, gene expression and methylation. scAMACE provides statistical inference of cluster assignments and achieves better cell type separation combining biological information across different types of genomic features. In the two real data applications, the scRNA-Seq data are generated from the SMART-Seq platform (Li et al., 2017; Tasic et al., 2018). To implement scAMACE on UMI-based scRNA-Seq data (10x data), we may need to modify the distributions of the mixture components $g_{0} (\cdot)$ and $g_{1} (\cdot)$ ⁠. The cells in our real data examples are differentiated and mature cells. In the future, we will investigate the performance of scAMACE on immature cells undergoing differentiation.

Data availability

Real data application 1 human K562-GM12878 scRNA-Seq data was retrieved from NCBI Gene Expression Omnibus (GEO) with the accession number GSE81861, scATAC-Seq data was collected from GEO with the accession number GSE65360, and sc-methylation data is available at GEO with the accession number GSE83882. Real data application 2 mouse neocortex scRNA-Seq data are available at GEO with the accession number GSE115746, scATAC-Seq datasets were downloaded from https://atlas.gs.washington.edu/mouse-atac/data/, and sc-methylation data are available at GEO with the accession number GSE97179.

Acknowledgements

The authors thank the reviewers’ constructive comments and suggestions, and Jinwen Yang, Wenyu Zhang, Pengcheng Zeng for the helpful discussions.

Funding

This work was supported by the Chinese University of Hong Kong direct grants [4053360, 4053423], the Chinese University of Hong Kong startup grant [4930181], the Chinese University of Hong Kong’s Project Impact Enhancement Fund (PIEF) and Science Faculty’s Collaborative Research Impact Matching Scheme (CRIMS) and Hong Kong Research Grant Council [ECS 24301419, GRF 14301120].

Conflict of interest: The authors declare that they have no conflict of interest.

References

Bravo González-Blas

C.

et al. (

2019

)

cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data

.

Nat. Methods

,

16

,

397

–

400

.

Buenrostro

J.D.

et al. (

2015

)

Single-cell chromatin accessibility reveals principles of regulatory variation

.

Nature

,

523

,

486

–

490

.

Corces

M.R.

et al. (

2016

)

Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution

.

Nat. Genet

.,

48

,

1193

–

1203

.

Cribari-Neto

F.

,

Zeileis

A.

(

2010

)

Beta regression in r

.

J. Stat. Softw

.,

34

,

1

–

24

.

Google Scholar

Crossref

WorldCat

Cusanovich

D.A.

et al. (

2018a

)

The cis-regulatory dynamics of embryonic development at single-cell resolution

.

Nature

,

555

,

538

–

542

.

Cusanovich

D.A.

et al. (

2018b

)

A single-cell atlas of in vivo mammalian chromatin accessibility

.

Cell

,

174

,

1309

–

1324.e18

.

Duren

Z.

et al. (

2017

)

Modeling gene regulation from paired expression and chromatin accessibility data

.

Proc. Natl. Acad. Sci. USA

,

114

,

E4914

–

E4923

.

Google Scholar

Crossref

WorldCat

Duren

Z.

et al. (

2018

)

Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations

.

Proc. Natl. Acad. Sci. USA

,

115

,

7723

–

7728

.

Google Scholar

Crossref

WorldCat

Grun

B.

et al. (

2012

)

Extended beta regression in r: shaken, stirred, mixed, and partitioned

.

J. Stat. Softw

.,

48

,

1

–

25

.

Google Scholar

Crossref

WorldCat

Hui

T.

et al. (

2018

)

High-resolution single-cell DNA methylation measurements reveal epigenetically distinct hematopoietic stem cell subpopulations

.

Stem Cell Rep

.,

11

,

578

–

592

.

Google Scholar

Crossref

WorldCat

Kapourani

C.-A.

,

Sanguinetti

G.

(

2016

)

Higher order methylation features for clustering and prediction in epigenomic studies

.

Bioinformatics

,

32

,

i405

–

i412

.

Kapourani

C.-A.

,

Sanguinetti

G.

(

2019

)

Melissa: Bayesian clustering and imputation of single-cell methylomes

.

Genome Biol

.,

20

,

61

.

Kaufman

L.

,

Rousseeuw

P.J.

(

1990

)

Finding Groups in Data: An Introduction to Cluster Analysis

.

John Wiley, New York

.

Kiselev

V.Y.

et al. (

2017

)

Sc3: consensus clustering of single-cell RNA-seq data

.

Nat. Methods

,

14

,

483

–

486

.

Lahnemann

D.

et al. (

2020

)

Eleven grand challenges in single-cell data science

.

Genome Biol

.,

21

,

31

.

Li

H.

et al. (

2017

)

Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors

.

Nat. Genet

.,

49

,

708

–

718

.

Lin

P.

et al. (

2017

)

CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data

.

Genome Biol

.,

18

,

59

.

Lin

Z.

et al. (

2020

)

Model-based approach to the joint analysis of single-cell data on chromatin accessibility and gene expression

.

Statist. Sci

.,

35

,

2

–

13

.

Google Scholar

Crossref

WorldCat

Luo

C.

et al. (

2017

)

Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex

.

Science

,

357

,

600

–

604

.

Ma

A.

et al. (

2020

)

Integrative methods and practical challenges for single-cell multi-omics

.

Trends Biotechnol

.,

38

,

1007

–

1022

.

Ng

A.Y.

et al. (

2001

). On spectral clustering: analysis and an algorithm. In

Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01

.

MIT Press

,

Cambridge, MA, USA

, pp.

849

–

856

.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Pott

S.

(

2017

)

Simultaneous measurement of chromatin accessibility, DNA methylation, and nucleosome phasing in single cells

.

eLife

,

6

,

e23203

.

Silvia

F.

,

Francisco

C.-N.

(

2004

)

Beta regression for modelling rates and proportions

.

J. Appl. Stat

.,

31

,

799

–

815

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Stuart

T.

et al. (

2019

)

Comprehensive integration of single-cell data

.

Cell

,

177

,

1888

–

1902.e21

.

Tasic

B.

et al. (

2018

)

Shared and distinct transcriptomic cell types across neocortical areas

.

Nature

,

563

,

72

–

78

.

Wang

B.

et al. (

2017

)

Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning

.

Nat. Methods

,

14

,

414

–

416

.

Wang

C.

et al. (

2020

)

Integrative analyses of single-cell transcriptome and regulome using maestro

.

Genome Biol

.,

21

,

198

.

Ward

J.H.

(

1963

)

Hierarchical grouping to optimize an objective function

.

J. Am. Stat. Assoc

.,

58

,

236

–

244

.

Google Scholar

Crossref

WorldCat

Welch

J.D.

et al. (

2019

)

Single-cell multi-omic integration compares and contrasts features of brain cell identity

.

Cell

,

177

,

1873

–

1887.e17

.

Xiong

L.

et al. (

2019

)

Scale method for single-cell atac-seq analysis via latent feature extraction

.

Nat. Commun

.,

10

,

4576

.

Yin

L.

et al. (

2019

)

Virtual methylome dissection facilitated by single-cell analyses

.

Epigenet. Chromatin

,

12

,

66

.

Google Scholar

Crossref

WorldCat

Young

D.S.

et al. (

2019

)

Finite mixture-of-gamma distributions: estimation, inference, and model-based clustering

.

Adv. Data Anal. Classif

.,

13

,

1053

–

1082

.

Google Scholar

Crossref

WorldCat

Zamanighomi

M.

et al. (

2018

)

Unsupervised clustering and epigenetic classification of single cells

.

Nat. Commun

.,

9

,

2410

.

Zeng

P.

et al. (

2020

)

Coupled co-clustering-based unsupervised transfer learning for the integrative analysis of single-cell genomic data

.

Brief. Bioinf

.,

bbaa347

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Zhang

L.

,

Nie

Q.

(

2021

)

SCMC learns biological variation through the alignment of multiple single-cell genomics datasets

.

Genome Biol

.,

22

,

10

.

Zhu

L.

et al. (

2019

)

Semisoft clustering of single-cell data

.

Proc. Natl. Acad. Sci. USA

,

116

,

466

–

471

.

Google Scholar

Crossref

WorldCat

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Associate Editor:

Download all slides

Month:	Total Views:
June 2021	54
July 2021	21
August 2021	75
September 2021	74
October 2021	43
November 2021	81
December 2021	59
January 2022	38
February 2022	38
March 2022	41
April 2022	36
May 2022	43
June 2022	25
July 2022	16
August 2022	13
September 2022	18
October 2022	11
November 2022	38
December 2022	36
January 2023	17
February 2023	16
March 2023	21
April 2023	46
May 2023	26
June 2023	29
July 2023	34
August 2023	26
September 2023	29
October 2023	34
November 2023	24
December 2023	43
January 2024	35
February 2024	30
March 2024	21
April 2024	12

Article Contents

scAMACE: model-based approach to the joint analysis of single-cell data on chromatin accessibility, gene expression and methylation

Abstract

1 Introduction

2 Materials and methods

2.1 Model for scRNA-Seq data

2.2 Model for single-cell chromatin accessibility (scCAS) data

2.3 Model for single-cell methylation data

2.4 More on model specification

2.5 Prior specifications

2.6 Determination of $f (ω_{k g}^{rna}), g (ω_{k g}^{rna}), ϕ^{acc}$ and $ϕ^{met}$

2.7 The mixture components

2.8 Feature selection

2.9 Determination of the number of clusters K

3 Statistical inference: EM algorithm

4 Simulation studies

5 Application to real data

5.1 Application 1: K562 and GM12878 scRNA-Seq, scATAC-Seq and sc-methylation data

5.2 Application 2: mouse neocortex scRNA-Seq, sci-ATAC-Seq and sc-methylation data

5.3 Computational cost

6 Conclusion

Data availability

Acknowledgements

Funding

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

Article Contents

scAMACE: model-based approach to the joint analysis of single-cell data on chromatin accessibility, gene expression and methylation

Abstract

1 Introduction

2 Materials and methods

2.1 Model for scRNA-Seq data

2.2 Model for single-cell chromatin accessibility (scCAS) data

2.3 Model for single-cell methylation data

2.4 More on model specification

2.5 Prior specifications

2.6 Determination of f(ωkgrna), g(ωkgrna), ϕacc and ϕmet

2.7 The mixture components

2.8 Feature selection

2.9 Determination of the number of clusters K

3 Statistical inference: EM algorithm

4 Simulation studies

5 Application to real data

5.1 Application 1: K562 and GM12878 scRNA-Seq, scATAC-Seq and sc-methylation data

5.2 Application 2: mouse neocortex scRNA-Seq, sci-ATAC-Seq and sc-methylation data

5.3 Computational cost

6 Conclusion

Data availability

Acknowledgements

Funding

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Looking for your next opportunity?

This Feature Is Available To Subscribers Only

2.6 Determination of $f (ω_{k g}^{rna}), g (ω_{k g}^{rna}), ϕ^{acc}$ and $ϕ^{met}$