Airpart: interpretable statistical models for analyzing allelic imbalance in single-cell datasets

Abstract Motivation Allelic expression analysis aids in detection of cis-regulatory mechanisms of genetic variation, which produce allelic imbalance (AI) in heterozygotes. Measuring AI in bulk data lacking time or spatial resolution has the limitation that cell-type-specific (CTS), spatial- or time-dependent AI signals may be dampened or not detected. Results We introduce a statistical method airpart for identifying differential CTS AI from single-cell RNA-sequencing data, or dynamics AI from other spatially or time-resolved datasets. airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms of cis-genetic regulation. In order to account for low counts in single-cell data, our method uses a Generalized Fused Lasso with Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model for AI statistical inference. In simulation, airpart accurately detected partitions of cell types by their AI and had lower Root Mean Square Error (RMSE) of allelic ratio estimates than existing methods. In real data, airpart identified differential allelic imbalance patterns across cell states and could be used to define trends of AI signal over spatial or time axes. Availability and implementation The airpart package is available as an R/Bioconductor package at https://bioconductor.org/packages/airpart. Supplementary information Supplementary data are available at Bioinformatics online.


Gene clustering
The Gaussian Mixture Model (GMM) method implemented in mclust (Scrucca et al., 2016) and the hierarchical clustering method implemented in dynamicTreeCut (Langfelder et al., 2007) (both methods included as alternatives in airpart) were used in order to detect gene clusters C u for u ∈ {1, ..., U } according to the allelic ratio where U is the number of gene clusters. For both methods, instead of directly providing the observed allelic ratios, first dimensionality reduction was performed on the observed allelic ratios via principal component analysis (PCA), for speed concerns and noise reduction. The data were clustered in a reduced dimensional space of twice the number of cell types, which is set in advance, either via prior knowledge or clustering cells using total counts. The transformed data in the reduced PC space were provided to the gene clustering methods. Briefly, the model-based clustering determines the most likely number and assignment of genes to clusters according to geometric properties of the data (distribution, volume, and shape). Within the GMM, an Expectation-Maximization algorithm is used for maximum likelihood estimation. The best model was selected according to Bayes Information Criteria, where mclust was run with modelNames="EII" both in the main model and initialization, which assumes the clusters all have a spherical distribution with equal volume. This model type was found to provide the most coherent clusters in terms of DAI in exploratory analysis. As an alternative, the hierarchical clustering method was used. The dynamicTreeCut algorithm can automatically choose the best tree by analyzing the shape of the branches that iteratively decomposing and combining clusters until find a stable solution. dynamicTreeCut was run with Manhattan distance and Ward agglomeration method. In the main Methods, for partition and allelic ratio inference, a single cluster is considered at a time. For notational simplicity, we omit index u when referring to data from a gene cluster in the partitioning or allelic ratio inference sections.

Data preprocessing
The following steps were applied to all four non-simulated datasets (Larsson et al. (2019), Deng et al. (2014), Combs and Fraser (2018), Gutierrez-Arcelus et al. (2020)), regardless of whether bulk or single cell. Cells/samples in which > 5% of the total counts (UMI counts for scRNA-seq) mapped to the mitochondrially encoded genes as well as those who exceed 3 median absolute deviations from the median detected genes were discarded. Genes that had 20 total reads in at least 25% cells/samples (Khansefid et al., 2018) within each cell type or condition were kept for the Combs dataset, while five reads were used as a threshold for the Gutierrez-Arcelus dataset, and one read for the two single-cell datasets. As we were primarily interested in detecting cis genetic regulation (consistent imbalance relative to the genotypes), and not imprinting or genome activation, the genes on the chromosome X were removed. Finally, only the gene set with largest ordered weighted mean allelic ratio difference ≥ 0.05 that potentially exists differential allelic imbalance were modeled. For the Combs dataset, five reciprocal crosses (i.e. with either a D. melanogaster mother or a D. simulans mother) had different number of slices varying among 25-27, so we manually grouped the slices into 19 bins according to the correlation of consecutive slices across replicates. Three samples (a slice from a particular embryo) had relatively low correlation with other nearby slices and were dropped, and three samples were filtered out because of low total counts. Then, we performed a likelihood ratio test to detect genes with a significant parent-of-origin effect on allelic imbalance, and these were removed in order to focus on genetic effects. For data from Gutierrez-Arcelus, as signals from the same gene are tightly correlated and it would disrupt the inference if we dealt with each SNP independently, so we chose the SNP with largest total counts per gene. For polymorphic genes, such as genes in the HLA region, we note that we ignore the allelic complexity of these regions, by grouping alleles into two arbitrary groups (based on the SNP with the largest total count). This approach is not recommended for analysis of polymorphic genes. airpart only supports modeling of bi-allelic data (only two alleles present across all samples), and so this simplification was used for demonstration only. After exploratory data analysis of the allelic ratio, we determined there were individual differences in the temporal dynamics of allelic expression, and so we included a per-individual coefficient in the GFL and a per-individual offset in the allelic ratio estimation models. Additionally, from visualization of PCA plots, and allelic ratio heatmaps of prospective clusters, we determined that genes often had distinctive patterns of allelic imbalance after T-cell stimulation; we recommend such exploratory analysis before the modeling step and provide visualization functions in the associated Bioconductor package. We therefore decided to perform partitioning and allelic ratio estimation gene-by-gene. As there were too few observations to robustly select via cross-validated deviance per gene, we manually set the maximum regularization parameter λ and derived the optimal λ controlling the amount of fusing of time points (??) according to BIC within specific range. Following the convention used in Gutierrez-Arcelus et al. (2020), we define the allelic ratio as the proportion of reference over total read counts for this experiment. Finally, in order to focus on the most interesting patterns of allelic imbalance after T-cell stimulation, we calculate the weighted average allelic ratio for each time point (weighting by the cell's total count) and each gene, then compute the variance over time points. We prioritize the top 43 most-varying genes for analysis for this dataset.

Spatially varying allelic imbalance
We evaluated airpart's performance on a spatial bulk dataset, where Drosophila F1 cross embryos had been sliced along the anterior-posterior axis and the RNA sequenced (Combs and Fraser, 2018). The analysis of this dataset had revealed a number of genes with spatially varying allelic imbalance, indicating that the two strains have diverged in the positioning of cis genetic regulation during development. We created a graph Γ with edges only between consecutive groups of slices, reflecting the positional nature of the data. After removing the genes having clear parent-of-origin expression patterns that we interpreted as due to maternal deposition and quality control, we had 222 genes remaining. After exploratory data analysis (gene clustering followed by examination of allelic ratio heatmaps), we chose to set a large number of gene clusters (60) because there were a lot of divergent across genes and 4 gene clusters were especially interested shown in Figure S10 and others in Figure S11.
Raw observed allelic ratio with loess regression was used as a reference for comparison of fitted values from different models ( Figure S10A). We saw the overall trend as well as the small up-turn at the anterior position in gene tutl and CG13868 were both successfully captured by airpart. All genes in this cluster have allelic imbalance in anterior and posterior regions, e.g CG30015 rejected the null hypothesis and there exists allelic imbalance among grouped slices 1-2,3-4 and 18-19 with credible interval (0.316, 0.385), (0.408, 0.471), (0.216,0.293), respectively.
scDALI was used with the radial basis function (RBF) kernel to allow for cells with similar x value to have similar fitted allelic ratio. We also performed airpart without the grouping step and it accepted basis matrix with 5 degrees of freedom to generate natural cubic splines with hierarchical modeling on dispersion parameter. The fitted allelic ratios ( Figure S10C,D) were more similar to the loess fitted line in Figure S10A. Given that the spatially varying allelic imbalance for these embryos are likely a result of continuous gradients in concentration of regulatory proteins along the anterior-posterior axis (Combs and Fraser, 2018), the radial basis function (RBF) kernel and splines are therefore a good modeling choice for this dataset.
For some genes, scDALI was sensitive to fluctuations in the count ratios (e.g. tutl middle positions) compared to airpart. One of the genes aos, with large allelic ratio variation, failed during modeling with scDALI. More examples of airpart's fusion of consecutive grouped slices are provided in Figure S11.       Figure S1: Statistical estimates for real and simulated single-cell datasets. A) Dispersion parameter estimates in Larsson's data. B) Dispersion parameter estimates in Deng's data. c) Simulated total count distribution with the higher mean count of NB = 10 and φ=20. D) Gene clustering was performed to aggregate signal across genes with similar allelic ratio pattern. The true allelic ratios across 4 cell type of first 50 genes were {0.2,0.2,0.8,0.8}, the middle 50 genes were {0.5,0.5,0.5,0.5} and the last 50 genes were {0.7,0.7,0.9,0.9}. Figure S2: t-SNE plot of true cell types(left) and k -means estimated clusters with average misclassfication rate 10% (right). Red circles indicated misclassified cells. Figure S3: Boxplot of partition accuracy among 3 variants of airpart for θ = 20 under fixed misclassification rates across true cell types. The y axis is adjusted Rand index (which takes discrete values, as can be seen in some of the repeated values for median and box limits) for 200 iterations. cnt: the higher mean count, g: number of genes within a gene cluster. Figure S4: Boxplot of partition accuracy among 3 variants of airpart for θ = 20 under random and heterogeneous misclassification rates per true cell types. The y axis is adjusted Rand index for 200 iterations. cnt: the higher mean count, g: number of genes within a gene cluster. Figure S5: Boxplot of computation time (in seconds) for 1 core across 200 iterations for θ = 20 and n = 40 under random and heterogeneous misclassification rates per true cell type. cnt: the higher mean count, g: number of genes within a gene cluster. Figure S6: Boxplot of partition accuracy among 3 variants of airpart for θ = 3 under fixed misclassification rates across true cell types. The y axis is adjusted Rand index for 200 iterations. cnt: the higher mean count, g: number of genes within a gene cluster. Figure S7: Boxplot of RMSE of estimating allelic ratio on n = 40 among 400 iterations under fixed misclassification rates across true cell types. Each gene has a U-shape pattern described in the Section 2.5 Figure S8: Performance comparison of airpart and scDALI on additional simulations. A) Boxplot shows RMSE of estimating allelic ratio on n = 100 among 400 iterations. Each gene has a U-shape pattern described in the Section 2.5 B) Barplot of 95% credible interval coverage over those observations in which the algorithms converged with n=40 setting on each cell type and each methods. C) Barplot of 95% credible interval coverage over those observations in which the algorithms converged with n = 100 setting on each cell type and each methods.  Figure 3C using the Deng's dataset. Color represents different partition groups. Figure S10: Evaluation of genes with differential allelic imbalance at different locations for Combs' data.
x axis is group slices. For each gene, anterior is left and posterior is right with higher allelic ratio indicating D. simulans biased expression, and lower indicating D. melanogaster biased expression. A) Raw ratio with loess as smoothing blue line. Color represents different replicates. B) Estimates using airpart. C) Estimates using scDALI. D)Estimates using airpart.no group