Fast Batch Alignment of Single Cell Transcriptomes Unifies Multiple Mouse Cell Atlases into an Integrated Landscape

Increasing numbers of large-scale single cell RNA-Seq projects are leading to a data explosion, which can only be fully exploited through data integration. Therefore, efficient computational tools for combining diverse datasets are crucial for biology in the single cell genomics era. A number of methods have been developed to assist data integration by removing technical batch effects, but most are computationally intensive. To overcome the challenge of enormous datasets, we have developed BBKNN, an extremely fast graph-based data integration method. We illustrate the power of BBKNN for dimensionality-reduced visualisation and clustering in multiple biological scenarios, including a massive integrative study over several murine atlases. BBKNN successfully connects cell populations across experimentally heterogeneous mouse scRNA-Seq datasets, which reveals global markers of cell type and organ-specificity and provides the foundation for inferring the underlying transcription factor network. BBKNN is available at https://github.com/Teichlab/bbknn.

lends itself well to a very simple alteration that enables batch alignment. The BBKNN graph is constructed by defining the k nearest neighbours for each cell within each of the user-defined batches. This abstracted neighbour collection is then processed into graph connectivities in the same manner as when constructing a KNN graph, aligning the batches. In order to avoid aligning unrelated cells across batches when no equivalent cells exist in another batch, we limit the total number of edges for each cell. This prioritises mutual neighbour pairs and connections to similar cells across batches (Figure 1b). The resulting graph structure is immediately useable in the same broad range of downstream analysis options as the KNN graph. BBKNN also offers the option to compute approximate nearest neighbours, with run time scaling linearly and offering superior performance for datasets with hundreds of thousands of cells.
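The graph construction described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the BBKNN implementation: the function name `batch_balanced_neighbours`, the toy coordinates and the choice to exclude each cell from its own neighbour list are hypothetical, and exact Euclidean distances stand in for the approximate neighbour search mentioned above.

```python
import math


def batch_balanced_neighbours(points, batches, k=3):
    """Find, for every cell, its k nearest neighbours within each batch separately.

    points  : list of coordinate tuples (for example PCA coordinates), one per cell
    batches : list of batch labels, one per cell
    Returns one dict per cell, mapping neighbour index -> Euclidean distance.
    """
    # Group cell indices by batch label.
    members = {}
    for idx, label in enumerate(batches):
        members.setdefault(label, []).append(idx)

    neighbours = []
    for i, p in enumerate(points):
        found = {}
        for idxs in members.values():
            # Rank the cells of this batch by distance to cell i.
            # Assumption: a cell is not counted as its own neighbour.
            ranked = sorted((math.dist(p, points[j]), j) for j in idxs if j != i)
            for d, j in ranked[:k]:
                found[j] = d
        neighbours.append(found)
    return neighbours


# Toy data: two populations, profiled in batch A and again in batch B,
# where batch B is shifted by +1 along the first axis (a mock batch effect).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0),
          (1.0, 0.0), (1.1, 0.0), (6.0, 5.0), (6.1, 5.0)]
batches = ["A"] * 4 + ["B"] * 4

nb = batch_balanced_neighbours(points, batches, k=1)
print(sorted(nb[0]))  # neighbour indices of cell 0, one per batch
```

With k = 1 and two batches, every cell receives one neighbour from each batch, so a cell in batch A is linked both to its closest batch-mate and to the closest cell of the shifted batch B, even though a pooled nearest-neighbour search would have stayed within batch A.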
We illustrate this concept on simulated data as described in Supplementary Methods. The simulation confirms that BBKNN connects cells from a known shared population across experimentally different batches. It also demonstrates the use of the BBKNN trimming parameter to ensure that unrelated cells remain distinct, as shown in Supplementary Figure 1. Below, we focus on applying the algorithm to various biological datasets.

Merging PBMC Data from Droplet Protocol Variants
We first evaluate BBKNN in a biological context on two publicly available peripheral blood mononuclear cell (PBMC) samples, obtained using the 5' and 3' droplet protocols of 10X Chromium 20. The 5' version was developed in response to community demand for accurate T and B cell receptor capture, allowing explicit VDJ reconstruction. Given that these methods capture different regions of mRNA, we expect to see differences in gene expression quantification between cells profiled with the 5' and 3' protocols. This is exactly what we observe: there is complete separation by experimental method when inferring a standard KNN neighbourhood graph (Figure 2a). However, upon BBKNN merging, this separation is supplanted by the various cell types integrating into unified clusters representing data from both experimental protocols (Figure 2b and 2c).
Notably, a closer examination of the distribution of cells profiled by each protocol within the T and B cell clusters reveals genes expected to be technology-specific. For instance, the TRBV, TRAV and IGHV genes are captured by the 5' kit, while TRAC instead mainly appears in the cells from the 3' kit (Figure 2d). This is concordant with the fact that the detection efficiency of different regions of T and B cell receptors is the main difference between the two methods.
As such, BBKNN performs excellent batch alignment for this data. BBKNN succeeds in allowing inference of a cluster structure that correctly captures the primary populations present in the data, while simultaneously retaining the protocol-driven biological differences in the relevant cell types.

Aligning Four Pancreatic Datasets from Diverse Technologies
Having demonstrated BBKNN's ability to successfully merge cell populations from different droplet protocols, we applied the algorithm to a more diverse collection of publicly available pancreatic single cell data [21][22][23][24]. The experiments came from four independent studies, and were performed using a combination of both droplet 32 and plate 33,34 based methods. With around 15,000 cells in total between the four datasets, and a known shared biology captured in a set of standardised annotations 25, the data provides a perfect testing ground to evaluate BBKNN. We compare its performance to the established batch correction methods mnnCorrect 12, CCA 13 and Scanorama 14.
These differences will become more marked as datasets continue to increase in size and heterogeneity, and users want to interact with datasets in a flexible and rapid manner. Therefore, we anticipate BBKNN's fast and lightweight graph alignment method to become a popular tool for both individual users and in the context of databases and web servers.

Mouse Single-Cell Atlases
With BBKNN's utility illustrated on two different well-studied biological scenarios, and its output falling in line with that of established batch correction methods, we set our sights on a larger dataset that would be computationally tax-
The resulting graph can be directly used as the input for downstream analyses such as clustering 16 and diffusion pseudotime 17, with compatible dimensionality reduction visualisation approaches including UMAP 18 and force-directed graphs 19.
We demonstrate BBKNN's utility by applying it to a number of biological scenarios, and in each it was able to reconstruct the underlying shared cell populations. Examples are the four very disparate experimental setups of pancreatic data, and two biologically distinct protocols for PBMCs. Finally, we use BBKNN to propose an intuitive development trajectory in a landscape of hundreds of thousands of cells from murine atlases. The method was benchmarked against the established batch correction methods mnnCorrect 12, CCA 13 and Scanorama 14 on the pancreatic data, yielding results of comparable quality, with run times one to two orders of magnitude faster on a personal computer.
A neighbourhood graph has a variety of downstream applications, and the choice of this format for batch alignment allows for easy deployment of a very fast and successful algorithm. At present, not all tools are equipped to work with neighbourhood graphs as input, with a notable example being SCANPY's implementation of t-SNE 45 . However, our algorithm is perfectly compatible with UMAP, which is quickly gaining traction. Seurat 13 has UMAP support, and the current development version of Monocle 46 features trajectory inference within a UMAP-reduced space.
We demonstrate that BBKNN is able to integrate large and disparate datasets into a single structure by applying it to multiple large mouse single cell atlases, providing the biological community with a valuable resource to gain insights into diverse fields ranging from developmental biology to tissue adaptation. The utility of the results is supported by the murine cell atlas integration serving as a baseline for the inference of the underlying regulatory network, which captures the branching of embryonic transcription factors into various cell type and organ specific modules.

Seurat-Inspired SCANPY Workflow
The three biological scenarios were evaluated using a common analysis core, which shall be henceforth referred to as the Seurat-inspired SCANPY workflow.
The steps of the analysis are: normalising the data to 10,000 counts per cell, identifying highly variable genes, limiting the datasets to those genes only, log transforming the data, scaling it to zero mean and unit variance, and performing PCA. At this stage, the established analysis identifies a regular KNN graph, but we also apply BBKNN in parallel. Both resulting AnnData objects are subsequently dimensionality-reduced with UMAP and are subjected to graph-based clustering.
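The preprocessing steps listed above, up to and including PCA, can be sketched with NumPy as follows. This is an illustration of the order of operations and not the SCANPY code used in the analysis. The function name `seurat_like_preprocess`, its parameters and the random toy matrix are hypothetical, and ranking genes by a plain variance-to-mean ratio is a simplifying assumption standing in for the highly variable gene selection of the actual workflow.

```python
import numpy as np


def seurat_like_preprocess(counts, n_top_genes=50, n_pcs=10, target_sum=1e4):
    """counts: cells x genes array of raw counts. Returns (pcs, kept_gene_indices)."""
    counts = np.asarray(counts, dtype=float)

    # 1. Normalise every cell to the same total count (10,000 in the text).
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against empty cells
    norm = counts / totals * target_sum

    # 2. Rank genes by dispersion (variance / mean).
    #    Assumption: a simplified stand-in for highly variable gene selection.
    mean = norm.mean(axis=0)
    var = norm.var(axis=0)
    dispersion = np.where(mean > 0, var / np.maximum(mean, 1e-12), 0.0)
    keep = np.sort(np.argsort(dispersion)[::-1][:n_top_genes])

    # 3. Restrict the data to those genes, then log-transform.
    logged = np.log1p(norm[:, keep])

    # 4. Scale each gene to zero mean and unit variance.
    centred = logged - logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0  # leave constant genes at zero
    scaled = centred / std

    # 5. PCA via singular value decomposition of the scaled matrix.
    u, s, _ = np.linalg.svd(scaled, full_matrices=False)
    n_pcs = min(n_pcs, s.size)
    pcs = u[:, :n_pcs] * s[:n_pcs]
    return pcs, keep


# Toy input: 30 cells x 100 genes of Poisson-distributed counts.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(30, 100))
pcs, kept = seurat_like_preprocess(toy_counts, n_top_genes=20, n_pcs=5)
print(pcs.shape, kept.shape)
```

The principal component coordinates returned here are the point at which the two branches of the workflow diverge: a regular KNN graph pools all cells when searching for neighbours, whereas BBKNN searches within each batch separately.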

Droplet PBMCs
The input data was downloaded from the 10X Genomics website. The exact 5' dataset was 'PBMCs of a healthy donor - 5' gene expression', under Cell

Pancreatic Data
The data for the four different pancreatic experiments was downloaded in the form of homogeneously prepared SingleCellExperiment R objects featuring standardised annotations 25 . It was then processed with the Seurat-inspired SCANPY workflow. The standard neighbourhood graph analysis was compared to BBKNN, along with CCA and both the R and Python versions of mnnCorrect, with each being applied independently at their desired points in the analysis (as replacements to neighbourhood graph computation, PCA and prior to data scaling respectively). Scanorama was performed on raw data filtered to feature cells with a minimum of 600 unique genes, and its output was

Murine Atlases
Both the droplet and plate data from Tabula Muris were downloaded from figshare, while all the other atlases were obtained from GEO. The dataset was then analysed with the Seurat-inspired SCANPY workflow (Supplementary Figure 4).
To avoid cell type over-representation biases, the dataset was downsampled
The neighbour distance collections are then converted to exponentially related connectivities. BBKNN has an optional graph trimming step to weed out any erroneous connections between independent cell populations. The resulting connectivity graph can be used in downstream analyses such as clustering or UMAP visualisation.
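The conversion of distances into connectivities and the optional trimming step can be illustrated with a short sketch. The kernel used here, exp(-(d - d_min) / scale) with the mean neighbour distance as scale, is an assumption chosen only to show an exponential relationship between distance and connectivity, and is not claimed to be the computation BBKNN performs. The function names `connectivities` and `trim`, the `max_edges` parameter and the toy distances are likewise hypothetical.

```python
import math


def connectivities(neighbours):
    """Turn per-cell neighbour distances into weights in (0, 1].

    neighbours: list of dicts {neighbour index: distance}, one per cell.
    Assumption: weight = exp(-(d - d_min) / scale), an illustrative kernel.
    """
    weights = []
    for found in neighbours:
        if not found:
            weights.append({})
            continue
        d_min = min(found.values())
        scale = (sum(found.values()) / len(found)) or 1.0
        weights.append(
            {j: math.exp(-(d - d_min) / scale) for j, d in found.items()}
        )
    return weights


def trim(weights, max_edges):
    """Keep only the max_edges strongest connections of every cell."""
    trimmed = []
    for w in weights:
        ranked = sorted(w.items(), key=lambda item: item[1], reverse=True)
        trimmed.append(dict(ranked[:max_edges]))
    return trimmed


# Toy neighbour collections for two cells. Cell 0 has one distant neighbour
# (index 7), mimicking a forced link to an unrelated population in another batch.
toy_neighbours = [
    {1: 0.1, 2: 0.2, 7: 4.0},
    {0: 0.1, 2: 0.15, 5: 0.3},
]

weights = connectivities(toy_neighbours)
trimmed = trim(weights, max_edges=2)
print(sorted(trimmed[0]), sorted(trimmed[1]))
```

In this toy example the distant neighbour of cell 0 receives a much lower connectivity than the two nearby cells, so limiting each cell to its two strongest edges removes it while the close connections are retained.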