BBKNN: fast batch alignment of single cell transcriptomes

Polański, Krzysztof; Young, Matthew D; Miao, Zhichao; Meyer, Kerstin B; Teichmann, Sarah A; Park, Jong-Eun

doi:10.1093/bioinformatics/btz625

Abstract

Motivation

Increasing numbers of large scale single cell RNA-Seq projects are leading to a data explosion, which can only be fully exploited through data integration. A number of methods have been developed to combine diverse datasets by removing technical batch effects, but most are computationally intensive. To overcome the challenge of enormous datasets, we have developed BBKNN, an extremely fast graph-based data integration algorithm. We illustrate the power of BBKNN on large scale mouse atlasing data, and favourably benchmark its run time against a number of competing methods.

Availability and implementation

BBKNN is available at https://github.com/Teichlab/bbknn, along with documentation and multiple example notebooks, and can be installed from pip.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The past few years have seen a rapid development of single cell RNA-Seq, with its increased throughput allowing large scale atlas projects to release data for hundreds of thousands of cells (Tabula Muris Consortium et al., 2018; Han et al., 2018). As with any technology, variation in experimental procedures and conditions between labs creates batch effects that need to be corrected, especially if the potential of collaborative large scale atlasing efforts is to be realized (Kiselev et al., 2018). A number of algorithms have been proposed to tackle this problem (Barkas et al., 2018; Butler et al., 2018; Haghverdi et al., 2018; Hie et al., 2019; Korsunsky et al., 2018; Stuart et al., 2019), but most of them struggle with excessive run time or resource requirements. This is likely to be further exacerbated as the size of scRNA-Seq collections continues to grow. The need for effective scaling into huge datasets is leading to scRNA-Seq analysis becoming established in Python, with SCANPY (Wolf et al., 2018) offering a comprehensive set of analysis and visualization tools covering the entirety of a typical workflow. The only batch correction method currently operating in Python is Scanorama (Hie et al., 2019), which has massive resource requirements that make it challenging to analyze large data collections.

Here, we present BBKNN (batch balanced k nearest neighbours), a simple, fast and lightweight batch alignment method. Performing batch correction at the neighbourhood graph inference step allows for the creation of an algorithm one to two orders of magnitude faster than existing methods, including those implemented with efficient performance in mind. BBKNN is written in Python and compatible with SCANPY, and its output can be immediately used for dimensionality reduction (McInnes and Healy, 2018), clustering (Traag et al., 2019) and pseudotime inference (Haghverdi et al., 2016). We illustrate the method’s utility using a large collection of mouse atlasing data (Tabula Muris Consortium et al., 2018; Dahlin et al., 2018; Deng et al., 2014; Han et al., 2018; Kernfeld et al., 2018; Mohammed et al., 2017; Park et al., 2018; Zeisel et al., 2015), and benchmark its run time against established methods on datasets of up to 2¹⁹ cells.

2 Materials and methods

A common step in scRNA-Seq analysis is the construction of a neighbourhood graph, often done by identifying each cell’s k nearest neighbours in principal component space. This graph is a good approximation of cell population structure and provides a basis for diverse downstream analyses, including clustering (Traag et al., 2019), visualization by dimensionality reduction (McInnes and Healy, 2018) and pseudotime trajectory inference (Haghverdi et al., 2016). However, experimental variation added by batch effects often leaves cells unable to connect to cells of the same type or state in other batches, distorting and fracturing the graph structure. This causes significant problems in all of the downstream analyses outlined above.

BBKNN modifies the neighbourhood construction step to produce a graph that is balanced across all batches of the data. This approach treats the neighbour network as the primary representation of the data. The BBKNN graph is constructed by finding, for each cell, its k nearest neighbours in each user-defined batch independently, so that each cell has a separate pool of neighbours in every batch. The neighbour sets are subsequently merged and processed via the UMAP algorithm (McInnes and Healy, 2018), which is the standard adopted by SCANPY (Wolf et al., 2018). BBKNN’s speed stems from the simplicity of the algorithm combined with the default use of approximate neighbour detection (annoy, https://github.com/spotify/annoy). This allows the algorithm’s run time to scale linearly with the total number of cells. An exact neighbour detection algorithm (Johnson et al., 2017) is also supported, at a cost in performance.
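The per-batch neighbour search described above can be illustrated with a minimal sketch. It is not the BBKNN implementation: it uses exact brute-force distances from the Python standard library rather than annoy, it excludes each cell from its own neighbour list as a simplifying assumption, and it omits the subsequent UMAP processing of the merged sets. The function name `batch_balanced_neighbours` and the toy coordinates are hypothetical.

```python
import math
from collections import defaultdict

def batch_balanced_neighbours(points, batches, k=3):
    """Find, for every cell, its k nearest neighbours in each batch separately.

    points  : list of coordinate tuples (e.g. principal components)
    batches : list of batch labels, one per cell
    Returns a dict: cell index -> sorted list of (distance, neighbour index).
    """
    # Group cell indices by their batch label.
    by_batch = defaultdict(list)
    for idx, label in enumerate(batches):
        by_batch[label].append(idx)

    graph = {}
    for i, p in enumerate(points):
        merged = []
        for members in by_batch.values():
            # Exact distances to every other cell of this batch (self excluded).
            candidates = sorted(
                (math.dist(p, points[j]), j) for j in members if j != i
            )
            merged.extend(candidates[:k])  # keep the k closest from this batch
        graph[i] = sorted(merged)          # merged pool across all batches
    return graph

# Toy data: two cell types (around x=0 and x=5); batch B is shifted by 1 unit.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0),   # batch A
          (1.0, 0.0), (1.1, 0.0), (6.0, 0.0), (6.1, 0.0)]   # batch B
batches = ["A"] * 4 + ["B"] * 4

graph = batch_balanced_neighbours(points, batches, k=1)
neighbours_of_first_cell = [j for _, j in graph[0]]
```

With k=1 and two batches, every cell receives exactly two neighbours, one from each batch, regardless of how far apart the batches sit in the embedding. This is the balancing property that a single pooled k nearest neighbour search does not guarantee.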

BBKNN’s main assumption is that at least some cells of the same type exist across batches, and that the differences between the same cell type across batches caused by batch effects are less than the differences between cells of different types within a batch. This is the core assumption of mnnCorrect (Haghverdi et al., 2018) and other methods inspired by it. In this case, the graph construction will group together similar cell types across batches while leaving unrelated cell types well separated. Further details of the method, along with a demonstration on simulated (Zappia et al., 2017) and real (Kiselev et al., 2018) data, are discussed in the Supplementary Methods and Supplementary Figures S1–S4.
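The assumption can be made concrete with a small toy check, sketched below under invented one-dimensional data. The helper names `cross_batch_agreement` and `make_toy` are hypothetical and not part of BBKNN. The check measures the fraction of cells whose nearest neighbour in the other batch shares their cell type, once for a batch shift smaller than the separation between types and once for a shift comparable to it.

```python
import math

def cross_batch_agreement(points, batches, cell_types):
    """Fraction of cells whose nearest neighbour in another batch shares their type."""
    hits = 0
    for i, p in enumerate(points):
        # Candidate neighbours are restricted to cells from other batches.
        others = [j for j in range(len(points)) if batches[j] != batches[i]]
        nearest = min(others, key=lambda j: math.dist(p, points[j]))
        hits += cell_types[nearest] == cell_types[i]
    return hits / len(points)

def make_toy(shift):
    """Two cell types 5 units apart; batch B is batch A displaced by `shift`."""
    base = [0.0, 0.2, 5.0, 5.2]
    points = [(x,) for x in base] + [(x + shift,) for x in base]
    batches = ["A"] * 4 + ["B"] * 4
    cell_types = ["t1", "t1", "t2", "t2"] * 2
    return points, batches, cell_types

# Batch effect smaller than the difference between cell types: assumption holds.
small = cross_batch_agreement(*make_toy(shift=1.0))
# Batch effect comparable to the difference between cell types: assumption fails.
large = cross_batch_agreement(*make_toy(shift=4.0))
```

In this toy setting `small` evaluates to 1.0 and `large` to 0.5: once the batch shift approaches the distance between cell types, half of the cross-batch links join unrelated populations, and a batch balanced graph would merge them incorrectly.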

3 Results

Recent times have seen a veritable flood of murine scRNA-Seq data, with multiple labs across the world collecting diverse datasets ranging from early embryo development to fully matured adult organs. We have collated eight of those, covering cells from at least 26 different mouse organs (Tabula Muris Consortium et al., 2018; Dahlin et al., 2018; Deng et al., 2014; Han et al., 2018; Kernfeld et al., 2018; Mohammed et al., 2017; Park et al., 2018; Zeisel et al., 2015). After down-sampling the data to ensure balanced population sizes (Supplementary Methods, Supplementary Fig. S5), we ended up with a collection of 114 600 cells that were clearly split based on dataset of origin (Supplementary Fig. S6A). Applying BBKNN to the data overcomes this technical effect. Annotating the cells based on atlas of origin along with canonical marker genes (Supplementary Fig. S7) reveals an intuitive biological trajectory (Supplementary Fig. S6B). It starts in the centre of the manifold with embryonic stem cells, which branch into T cell, B cell, myeloid, megakaryocyte and erythrocyte populations in the top of the manifold and epithelial, mesenchymal, endothelial, muscular and neuronal cells in the other path. As such, not only does BBKNN successfully correct the batch effect, it manages to propose a biologically sound structure to the neighbour graph that translates to a cohesive trajectory in UMAP space. When correcting the same data with Harmony (Korsunsky et al., 2018), the leading method in the field, cell populations are successfully merged but the final manifold is more fragmented, with no way to reconstruct the developmental trajectory (Supplementary Fig. S8). The quality of batch mixing in the corrected manifolds was assessed with kBET (Büttner et al., 2019), with BBKNN mildly outperforming Harmony on average score (Supplementary Fig. S9).

In order to comprehensively evaluate BBKNN’s efficiency with relation to established methods (Barkas et al., 2018; Butler et al., 2018; Haghverdi et al., 2018; Hie et al., 2019; Korsunsky et al., 2018; Stuart et al., 2019), we used simulated data (Zappia et al., 2017) to benchmark the algorithms on variably sized datasets (Supplementary Fig. S10). The total cell count was scaled in powers of two, from 2¹¹ to 2¹⁹, with each dataset featuring two equally sized batches of two matching cell types. BBKNN’s default approximate neighbour mode scales linearly with dataset size and remains consistently one to two orders of magnitude faster than the other methods. The supported exact nearest neighbour algorithm does not scale linearly with dataset size, but remains faster than Harmony across the benchmark. The other R-based approaches were left out at the 2¹⁵ mark, and Scanorama was unable to complete processing the 2¹⁶ cell dataset due to resource constraints. The benchmarking was carried out on a personal MacBook Pro with 16 GB RAM and a four-core i7 processor.
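As a rough illustration of the benchmark design, the sketch below lists the dataset sizes described above and shows one way to check for linear scaling: fitting the slope of log run time against log cell count, where a slope near 1 indicates linear growth. The helper `loglog_slope` is hypothetical, and the timings are synthetic values invented for illustration, not measurements from this study.

```python
import math

# Total cell counts in the benchmark: powers of two from 2^11 to 2^19.
sizes = [2 ** e for e in range(11, 20)]
# Each dataset is split into two equally sized batches.
batch_sizes = [(n // 2, n // 2) for n in sizes]

def loglog_slope(ns, times):
    """Least-squares slope of log(time) against log(n); ~1 means linear scaling."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic run times (seconds), made up purely to exercise the function.
linear_times = [1e-4 * n for n in sizes]          # time proportional to n
quadratic_times = [1e-8 * n ** 2 for n in sizes]  # time proportional to n^2

linear_slope = loglog_slope(sizes, linear_times)
quadratic_slope = loglog_slope(sizes, quadratic_times)
```

Applied to measured run times, a fitted slope close to 1 would be consistent with the linear scaling reported for the approximate neighbour mode, while a clearly larger slope would indicate super-linear growth.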

Funding

This project was supported by Wellcome Sanger core funding (no. WT206194) and Wellcome grant (211276/Z/18/Z). J.-E.P. is supported by an EMBO Long-Term Fellowship. Z.M. is supported by a Single Cell Gene Expression Atlas grant from the Wellcome Trust (nr. 108437/Z/15/Z).

Conflict of Interest: none declared.

References

Barkas N. et al. (2018) Wiring together large single-cell RNA-seq sample collections. bioRxiv, 460246.

Butler A. et al. (2018) Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol., 36, 411.

Büttner M. et al. (2019) A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods, 16, 43.

Dahlin J.S. et al. (2018) A single cell hematopoietic landscape resolves eight lineage trajectories and defects in kit mutant mice. Blood, 131, e1–e11.

Deng Q. et al. (2014) Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science, 343, 193–196.

Haghverdi L. et al. (2016) Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods, 13, 845.

Haghverdi L. et al. (2018) Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol., 36, 421.

Han X. et al. (2018) Mapping the mouse cell atlas by microwell-seq. Cell, 172, 1091–1107.

Hie B. et al. (2019) Efficient integration of heterogeneous single-cell transcriptomes using scanorama. Nat. Biotechnol., 37, 685–691.

Johnson J. et al. (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv: 1702.08734.

Kernfeld E.M. et al. (2018) A single-cell transcriptomic atlas of thymus organogenesis resolves cell types and developmental maturation. Immunity, 48, 1258–1270.

Kiselev V.Y. et al. (2018) scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods, 15, 359.

Korsunsky I. et al. (2018) Fast, sensitive, and accurate integration of single cell data with harmony. bioRxiv, 461954.

McInnes L., Healy J. (2018) Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv: 1802.03426.

Mohammed H. et al. (2017) Single-cell landscape of transcriptional heterogeneity and cell fate decisions during mouse early gastrulation. Cell Rep., 20, 1215–1228.

Park J. et al. (2018) Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science, 360, 758–763.

Stuart T. et al. (2019) Comprehensive integration of single cell data. Cell, 177, 1888–1902.

Tabula Muris Consortium et al. (2018) Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature, 562, 367.

Traag V. et al. (2019) From louvain to leiden: guaranteeing well-connected communities. Sci. Rep., 9, 5233.

Wolf F.A. et al. (2018) Scanpy: large-scale single-cell gene expression data analysis. Genome Biol., 19, 15.

Zappia L. et al. (2017) Splatter: simulation of single-cell RNA sequencing data. Genome Biol., 18, 174.

Zeisel A. et al. (2015) Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347, 1138–1142.

Author notes

Krzysztof Polański and Matthew D. Young wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
