-
PDF
- Split View
-
Views
-
Cite
Cite
Mohammadamin Edrisi, Monica V Valecha, Sunkara B V Chowdary, Sergio Robledo, Huw A Ogilvie, David Posada, Hamim Zafar, Luay Nakhleh, Phylovar: toward scalable phylogeny-aware inference of single-nucleotide variations from single-cell DNA sequencing data, Bioinformatics, Volume 38, Issue Supplement_1, July 2022, Pages i195–i202, https://doi.org/10.1093/bioinformatics/btac254
- Share Icon Share
Abstract
Single-nucleotide variants (SNVs) are the most common variations in the human genome. Recently developed methods for SNV detection from single-cell DNA sequencing data, such as SCI and scVILP, leverage the evolutionary history of the cells to overcome the technical errors associated with single-cell sequencing protocols. Despite being accurate, these methods are not scalable to the extensive genomic breadth of single-cell whole-genome (scWGS) and whole-exome sequencing (scWES) data.
Here, we report on a new scalable method, Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Through benchmarking on simulated datasets under different settings, we show that, Phylovar outperforms SCI in terms of running time while being more accurate than Monovar (which is not phylogeny-aware) in terms of SNV detection. Furthermore, we applied Phylovar to two real biological datasets: an scWES triple-negative breast cancer data consisting of 32 cells and 3375 loci as well as an scWGS data of neuron cells from a normal human brain containing 16 cells and approximately 2.5 million loci. For the cancer data, Phylovar detected somatic SNVs with high or moderate functional impact that were also supported by bulk sequencing dataset and for the neuron dataset, Phylovar identified 5745 SNVs with non-synonymous effects some of which were associated with neurodegenerative diseases.
Phylovar is implemented in Python and is publicly available at https://github.com/NakhlehLab/Phylovar.
1 Introduction
With the advent of the first single-cell sequencing (SCS) techniques (Navin et al., 2011; Tang et al., 2009), the fields of single-cell genomics, transcriptomics, proteomics and epigenetics have witnessed remarkable growth over the last decade. Single-cell sequencing technologies have impacted our understanding in different fields of biology including developmental biology, immunology, microbiology and cancer biology (Kashima et al., 2020; Lim et al., 2020; Navin, 2014; Tang et al., 2019; Wang and Navin, 2015). Single-cell DNA sequencing (scDNAseq), as one of the SCS technologies, provides insights into the somatic evolutionary process by sequencing the genomic contents of a complex tissue at a single-cell resolution (Navin, 2014; Navin et al., 2011). Preparing scDNAseq data requires a whole-genome amplification (WGA) process to amplify the DNA material of a single cell to suffice the amount of DNA needed for sequencing. (Kashima et al., 2020; Zafar et al., 2018). WGA technologies, such as multiple displacement amplification (MDA) (Dean et al., 2002; Spits et al., 2006) and multiple annealing and looping-based amplification cycles (MALBAC) (Zong et al., 2012) can elevate the noise level in scDNAseq data. The scDNAseq technical errors include allelic dropout (ADO), false-positive (FP) errors, false-negative (FN) errors and non-uniform coverage (Navin, 2014; Zafar et al., 2018). ADO refers to cases where only one of the two alleles in a heterozygous mutation is amplified, resulting in the loss of the mutated allele. FP artifacts can appear due to uneven amplification or at the early stages of the amplification when the original nucleotide is substituted randomly. The non-uniform coverage over different genomic loci may result in missing data due to zero or insufficient coverage. The scDNAseq-specific technical errors fueled the development of tools such as Monovar (Zafar et al., 2016) and SCcaller (Dong et al., 2017) for detecting single-nucleotide variations (SNVs) from scDNAseq data. Although Monovar and SCcaller account for uneven coverage and scDNAseq-specific errors, more recent methods, SCI (Singer et al., 2018) and scVILP (Edrisi et al., 2019), showed further improvement in overcoming the scDNAseq-specific technical errors by simultaneously inferring the cells’ phylogeny and SNVs. SCI uses a Markov chain Monte Carlo (MCMC) algorithm to sample the joint posterior distribution of SNVs and the phylogenetic tree of the single cells and reports the tree(s) with the best posterior probability and the corresponding genotypes. scVILP is formulated as an instance of Mixed Integer Linear Programming (MILP) and it aims to find maximum likelihood estimation (MLE) of the observed read counts given the underlying genotype matrix. Here, the MILP solver is restricted to proposing only the genotype matrices that satisfy three-gamete condition in order to maximize the likelihood function (see Estabrook et al., 1976; Fernández-Baca, 2001; Gusfield, 1991, 1997; Meacham, 1983; Semple and Steel, 2003 for more details on work related to inference under the three-gamete condition).
Although ‘regularizing’ the mutation detection by using a tree as a guide is a promising direction (Edrisi et al., 2019; Kuipers et al., 2020; Markowska et al., 2021; Singer et al., 2018), applying SCI and scVILP to datasets with large number of loci such as in Evrony et al. (2015) and Wang et al. (2014) is challenged by either very long running time or large memory consumption of the methods—the major issues in SCI and scVILP, respectively. Indeed, scVILP runs out of memory on all of the datasets considered in our study here, except for the smallest ones, which is why we do not report on the performance of scVILP. To address this challenge, we developed Phylovar, a likelihood-based method for phylogeny-aware inference of SNVs from scDNAseq datasets consisting of a large number of loci. To simplify likelihood calculations for large-scale data, we assume that mutations occur following an infinite-sites assumption (ISA) (Deshwar et al., 2015; Jahn et al., 2016; Singer et al., 2018). Using this model, Phylovar finds the tree topology and the placement of mutations on ancestral single cells that maximize the likelihood of the erroneous observed read counts given the genotypes. Utilizing a vectorized formulation for likelihood calculations, Phylovar benefits from the vectorized operations in matrix manipulation packages such as NumPy (Harris et al., 2020) to scale up to many loci. We compared the SNV calling accuracy, memory consumption and running time of Phylovar against those of the existing methods, Monovar and SCI, through a simulation study. We found that Phylovar outperforms SCI in terms of running time with the same accuracy, while being more accurate than Monovar. Furthermore, we applied our method to two biological datasets: a triple-negative breast cancer (TNBC) dataset (Wang et al., 2014) consisting of 32 single cells and 3375 candidate loci, as well as the dataset from Evrony et al. (2015) containing 16 normal human neuron cells and 2 489 545 candidate loci. For the TNBC data, Phylovar inferred 652 SNVs with ‘high’ or ‘moderate’ functional impact, out of which 550 (84%) were also supported by bulk sequencing. For the neuron cells, Phylovar identified 5745 SNVs with non-synonymous effects some of which were related to neurodegenerative diseases. To the best of our knowledge, Phylovar is the first scDNAseq SNV caller that can utilize the underlying tree structure even when the dataset contains millions of genomic loci.
2 Materials and methods
The input to Phylovar consists of the reference and variant count matrices, denoted by and , where N and M represent the number of single cells and candidate loci, respectively. Each entry in R and V represents the number of reference and variant counts, respectively, at cell i and site j. These count matrices are obtained from an input file in mpileup format. Here, candidate loci are defined as the genomic loci with a significant number of variant reads. This significance is measured by a statistic test. Note that these loci may not necessarily contain SNVs since the variant reads might be artifacts of scDNAseq technical errors. In all experiments reported below, we used SCI’s likelihood ratio test described in Singer et al. (2018) to identify candidate loci for the analyses. If the total read coverage at a cell and a candidate site is less than λ, the corresponding entry is treated as missing data. We used λ = 1 in practice.
2.1 Single-cell genotype error model
2.2 Single-cell read count model
The variables μ0 and μ1 are the success probabilities associated with reference and alternate alleles, respectively. In practice, we set μ0 to 0.001, which is at the same order of magnitude for the error rate in different Illumina sequencing platforms (Stoler and Nekrutenko, 2021). We used 0.5 for the value of μ1, which is the mean of variant read counts for a heterozygous mutation.
2.3 Tree model
Our tree model consists of two components: a binary tree topology —where V denotes the set of nodes, and E is the set of edges—and a mutation placement for each genomic locus j, . The latter is a binary vector of length containing binary elements for each leaf or internal node in V. We take to denote that a mutation occurred at node q during the evolutionary history of locus j. In our model, we assume mutations evolve following the ISA. According to this model, at most one element in is allowed to be 1. This vector requires each node to have an index from . For simplicity, we map indices to the leaves/single cells and use the same mapping for the single cells in all tree topologies.
2.4 Log-likelihood function
2.5 Hill-climbing search algorithm
The search procedure terminates when the log-likelihood does not improve after a user-specified number of iterations or when it reaches the maximum number of iterations.
2.6 Proposing new error rates
Here, the number of 0 entries in that were ‘corrected’ to 1 in provides a measure of what a more realistic α would be through the hill-climbing trajectory. The same rationale applies to proposing a new value of β. In Equations (8) and (9), and indicate the new values of α and β at iteration t, respectively. Here, and denote the entries of the initial genotype matrix and the genotype matrix at iteration t–1, respectively. The term indicates the entries with non-zero read counts. The symbol represents the logical AND operator.
2.7 Finding the best mutation placement
3 Results and discussion
3.1 Simulation study
We first compared the computational efficiency and SNV calling accuracy of Phylovar, SCI and Monovar using synthetic datasets simulated under five scenarios: (i) varying the number of mutations, (ii) varying the number of cells, (iii) varying the ADO rates, (iv) copy number effect and (v) violation of ISA. We simulated the datasets using the simulator introduced in Singer et al. (2018). In the first scenario, we investigated how Phylovar performs compared to the other methods when increasing the number of mutations dramatically to a large extent.
We simulated datasets containing 16 single cells with 1000, 104 and 105 mutations. For each mutation value, ten datasets were generated. Phylovar’s accuracy was comparable to that of SCI in terms of F1 measure, while both SCI and Phylovar were more accurate than Monovar because of accounting for evolutionary history (Fig. 1a). To measure the running time of each method, we used the CPU clock time (in seconds and log scale) during its execution. Figure 1f shows that the running time of each method increased with the number of mutations. For the largest dataset with 105 mutations, Phylovar was approximately two orders of magnitude faster than SCI.

Summary statistics of different benchmarking experiments. (a–e) F1 accuracy of the methods from simulated data with different number of mutations (a), number of cells (b), ADO rate (c), copy number rate (d) and fraction of ISA violations (e). (f, g) Runtime of the methods on simulated data with varying number of mutations (f) and varying number of cells (g). (h) Linear regression between estimated false-negative error rates (β’s) and actual ADO rates used for simulated data
In the second scenario, we sought to answer how the methods’ performances depend on the number of cells. Here, we fixed the number of mutations at 105 and varied the number of cells, . For each setting, we generated ten datasets. The F1 accuracy scores of all methods improved as the number of cells increased (Fig. 1b). Again, Phylovar’s accuracy was comparable to that of SCI while it outperformed Monovar. We observed that increasing the number of single cells improved the accuracy of all methods more than increasing the number of mutations. As demonstrated in Figure 1g, similar to the first scenario, the running time of Phylovar was almost two orders of magnitude less than that of SCI.
In the third scenario, the ADO rate was varied while cells and mutations were fixed at 16 and 1000. We selected ADO rates from the values , and generated ten datasets for each ADO value. Figure 1c shows that both SCI and Phylovar were more robust to high ADO rates than Monovar due to the utilization of the underlying single-cell phylogeny.
Since Phylovar can estimate the false-negative error rates (denoted by β), we measured the correlation between the ADO rates used for generating the simulated datasets and the estimated β’s. As demonstrated in Figure 1h, these two values were highly correlated (the Pearson correlation coefficient was 0.991). It is worth noting that based on the linear regression line, the estimated β was almost half of the true ADO, pointing to the difference between the dropout mechanism in the simulator and our definition of β (see Section 2). Given an ADO rate μ, the simulator chooses μ fraction of the mutations. It changes of them into reference genotype, and of them into homozygous mutations; the β in our model indicates the probability of a mutation becoming reference genotype implying , which we can observe in Figure 1h.
Since Phylovar assumes the read counts are originated from diploid strands, in the fourth scenario, additional wild-type alleles were introduced to the read counts to imitate the effect of copy number changes. The simulator randomly selects a fraction of mutated loci (named copy number rate), and chooses c extra copies for each loci with probability (Singer et al., 2018). We increased the copy number rate from 0 to 0.5 with step size 0.125. For each value, we generated ten datasets containing 16 cells and 1000 mutations. Figure 1d shows that the SNV calling accuracy of the methods decreased as more mutated loci were subject to copy number changes.
In the fifth scenario, we were interested in observing how violations of ISA affect the SNV calling accuracy of the methods. Given a fraction of mutations, the simulator randomly selects half of them to recur in different branches, and the rest of them to be lost in the same subtree. We increased the fraction of mutations subject to ISA violations from 0 to 0.15 with 0.05 step size. For each value we generated ten datasets with 16 cells and 1000 mutations. Figure 1e shows that all three methods had a stable performance as the fraction of ISA violations increased. This observation suggests that even though the phylogenies inferred by SCI and Phylovar might be inaccurate due to the presence of violations of their evolutionary model, the effect of such violations on mutation inference is negligible.
3.2 Application to real data
We applied Phylovar on two human scDNAseq datasets. The first dataset consists of single-cell whole-exome sequencing (scWES) samples from a triple-negative breast cancer (TNBC) patient (Wang et al., 2014). Since the population sequencing data from bulk tumor and matched normal tissue are available for the TNBC dataset, the number of mutations shared by scWES and bulk data provides us a metric for measuring the accuracy of our approach. The TNBC dataset consists of 16 diploid cells, eight hyperdiploid/aneuploid cells and eight hypodiploid cells (Wang et al., 2014). Given the control normal cells, SCI’s likelihood ratio test identifies the loci likely to contain somatic mutations. Applying this statistic test on the input mpileup file resulted in 3375 candidate loci on which we applied Phylovar. Phylovar was run with ten parallel hill-climbing chains, each for 100 000 iterations on a pool of five CPU’s, each with 48 cores (AMD EPYC 7642) on a node with 192 GB RAM. The total runtime was 91 min. Phylovar inferred an 18.21% false-negative error rate and a 1.03% false-positive error rate from TNBC data. We ran SCI and Monovar with default parameters; SCI and Monovar terminated after 10 h and 144 min, respectively. Figure 2 shows the three methods’ mutation calls on TNBC data from the overlapping sites as well as the initial genotype matrix at the first iteration of our hill-climbing search. We performed hierarchical clustering with Ward’s minimum variance method implemented in Python’s SciPy package (Virtanen et al., 2020) on the genotype matrix for better visualization. We observed concordance between the calls from Phylovar and SCI while Monovar’s calls are noisy and resemble Phylovar’s initial genotypes.

Clustered heatmaps of mutation calls by different methods performed on the TNBC dataset. Here, rows and columns represent the genomic loci and the single cells, respectively. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). The initial genotypes are the initial estimates of genotypes considering no error rates and no underlying phylogeny at the starting step of Phylovar’s search algorithm (A color version of this figure appears in the online version of this article)
To annotate the mutations, we applied snpEff (Cingolani et al., 2012) on the SNVs detected by Phylovar. Out of 3375 candidate loci, 652 loci contained SNVs with ‘high’ or ‘moderate’ functional effects (see https://pcingola.github.io/SnpEff/se_inputoutput/ for details on the types of variants’ effects and their descriptions). Then, we ran HaplotypeCaller (GATK version 4.2.0.0) for mutation calling on the bulk tumor and normal samples. Among 652 SNVs in single cells, 550 (84%) mutations were found in bulk data (Fig. 3). Performing this analysis on the results of SCI yielded the same results as for Phylovar: 652 loci contained SNVs with high or moderate functional effect, out of which 550 mutations overlapped with the calls from bulk data. Monovar detected 18 187 high or moderate non-synonymous mutations in the single-cell data from which 10 981 mutations were found in the bulk data as well which is 60% of the single-cell calls (Figs. 4 and 5).

Clustered heatmap of the mutations detected by Phylovar from TNBC data and population sequencing data of tumor and matched normal tissues. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). Columns (cells) are colored according to the ploidy of the cells. The colors of the rows (genes) indicate whether the SNV was found in bulk data or not. Here, popT and popN are the tumor and normal population sequencing samples, respectively. Out of 3375 candidate loci, 652 loci contained SNVs with high or moderate functional effects in the single-cell data among which 550 mutations were found in bulk data as well, which is 84% of the single-cell calls (A color version of this figure appears in the online version of this article)

Clustered heatmap of the mutations detected by SCI from TNBC data and population sequencing data of tumor and matched normal tissues. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). Columns (cells) are colored according to the ploidy of the cells. Since SCI discarded the diploid cells from the final output, only hypodiploid and hyperdiploid cells are demonstrated here. The colors of the rows (genes) indicate whether the SNV was found in bulk data or not. Here, popT and popN are the tumor and normal population sequencing samples, respectively. SCI detected 652 SNVs with high or moderate functional effects in the single-cell data among which 550 mutations were found in bulk data as well, which is 84% of the single-cell calls (A color version of this figure appears in the online version of this article)

Clustered heatmap of the mutations detected by Monovar from TNBC data and population sequencing data of tumor and matched normal tissues. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). Columns (cells) are colored according to the ploidy of the cells. The colors of the rows (genes) indicate whether the SNV was found in bulk data or not. Here, popT and popN are the tumor and normal population sequencing samples, respectively. Monovar detected 18 187 high or moderate non-synonymous mutations in the single-cell data from which 10 981 mutations were found in the bulk data as well which is 60% of the single-cell calls (A color version of this figure appears in the online version of this article)
The second biological data consists of 16 neuron cells on which scWGS was performed to study somatic mutations in human brain development (Evrony et al., 2015). Applying SCI’s statistic test on the input mpileup identified 2 489 545 candidate loci. We ran Phylovar with five parallel hill-climbing chains, each for 50 000 iterations on 5 CPUs with 192 GB RAM. Phylovar finished the process after 17 h and 45 min. To compare our results with other methods, we ran Monovar and SCI with default parameters. Monovar processed the data in 10 h and 26 min, while SCI was still running after ten days. Phylovar’s inferred false-positive and false-negative error rates were 75.22% and 1.17%, respectively. snpEff identified 5745 non-synonymous SNVs among Phylovar’s mutation calls. Figure 6 shows hierarchical clustering on the genotypes of Phylovar, Monovar and the initial genotypes at sites with non-synonymous SNVs. We observed similarities between Monovar’s calls and the initial genotypes. By comparing the panel of Phylovar in Figure 6 with the other panels, one can see the sparse regions of mutations in panels of Monovar and the initial genotype matrix that are inferred as reference alleles by Phylovar.

Clustered heatmaps of mutation calls by different methods performed on neuron cells. Rows and columns represent the genomic loci and the single cells, respectively. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). The initial genotypes are the initial estimates of genotypes considering no error rates and no underlying phylogeny at the starting step of Phylovar’s search algorithm (A color version of this figure appears in the online version of this article)
Furthermore, we investigated the genes likely related to neurodegenerative diseases by comparing our findings with the genes reported in Wei et al. (2019), where somatic mutations were studied in 1461 control and diseased human brains with different neurodegenerative disorders. Among the genes inferred by Phylovar to harbor non-synonymous mutations, 12 genes were reported in Wei et al. (2019). We observed that the genes MUC16 and MLIP were frequently mutated in different regions; also, non-synonymous SNVs were observed within KRT33A and SEMA5B in Wei et al. (2019) from patients with Creutzfeldt-Jakob and Alzheimer diseases, respectively. The presence of these non-synonymous mutations in both diseased and normal samples implies the high mutability of these genes even in a healthy individual.
4 Conclusions
The rapid growth of SCS technologies poses computational challenges due to the increasing number of cells and sites sequenced per genome (Lähnemann et al., 2020). In this work, we focused on addressing the computational challenge associated with the breadth of genomic sites in scDNAseq data. Here, we introduced Phylovar, a scalable MLE method for phylogeny-guided inference of SNVs from single-cell DNA sequencing data suitable for scWGS and scWES data with an extensive number of loci. We introduced a novel vectorized formula for likelihood calculation, making Phylovar scalable to hundreds of thousands, even millions of loci.
We assessed Phylovar’s performance against state-of-the-art variant callers SCI (Singer et al., 2018) and Monovar (Zafar et al., 2016), through simulated benchmarks. Phylovar outperforms SCI in terms of running time while being more accurate than Monovar in different simulation scenarios. We also applied Phylovar to two real biological datasets. For a TNBC dataset with 32 single cells and 3375 candidate loci, Phylovar identified SNVs with functional impact among which 84% were supported by bulk sequencing data. Phylovar was also more accurate than Monovar and 6.5x faster than that of SCI. For a larger dataset containing 16 normal human neuron cells and approximately 2.5 million candidate loci, Phylovar identified 5745 non-synonymous SNVs some of which were related to neurodegenerative diseases. Interestingly, Phylovar detected 75.22% false-positive, and 1.17% false-negative error rate for this dataset. The neuron cells data was particularly challenging due to large number of sites. For this data, SCI failed to converge even after ten days of running while Phylovar terminated after less than 18 h.
Phylovar makes it possible to analyze datasets with large number of loci within reasonable time and memory requirements, thus adding to the growing toolbox for analyzing scDNAseq data. As a direction for future research, we will explore deviations from the simplified ISA model and investigate the feasibility of applying more general finite-sites models (FSM) (Zafar et al., 2017, 2019) to datasets with many loci. As scDNAseq technologies advance, the sequencing cost per cell decreases (Lim et al., 2020; Tang et al., 2019). Consequently, we expect more scWGS and scWES datasets to emerge in the future, requiring methods such as Phylovar that can perform scalable variant calling on datasets with millions of loci.
Acknowledgements
The authors thank Dr Nicholas Navin and Min Hu for providing access to the TNBC population sequencing data. They also thank Prof. Dan Gusfield and Dr Robert Gysel for their insight on the concept of Perfect Phylogeny and helping them with testing their tools.
Funding
This work was supported in part by the National Science Foundation [IIS-1812822 and IIS-2106837 to L.N.].
Conflict of Interest: none declared.