Evaluation of STAR and Kallisto on Single Cell RNA-Seq Data Alignment

Alignment of scRNA-Seq data are the first and one of the most critical steps of the scRNA-Seq analysis workflow, and thus the choice of proper aligners is of paramount importance. Recently, STAR an alignment method and Kallisto a pseudoalignment method have both gained a vast amount of popularity in the single cell sequencing field. However, an unbiased third-party comparison of these two methods in scRNA-Seq is lacking. Here we conduct a systematic comparison of them on a variety of Drop-seq, Fluidigm and 10x genomics data, from the aspects of gene abundance, alignment accuracy, as well as computational speed and memory use. We observe that STAR globally produces more genes and higher gene-expression values, compared to Kallisto, as well as Bowtie2, another popular alignment method for bulk RNA-Seq. STAR also yields higher correlations of the Gini index for the genes with RNA-FISH validation results. Using 10x genomics PBMC 3K scRNA-Seq and mouse cortex single nuclei RNA-Seq data, STAR shows similar or better cell-type annotation results, by detecting a larger subset of known gene markers. However, the gain of accuracy and gene abundance of STAR alignment comes with the price of significantly slower computation time (4 folds) and more memory (7.7 folds), compared to Kallisto.

reference to help researchers to decide how to perform the alignment for single cell analysis.

Datasets
Drop-seq and Fluidigm datasets: Drop-seq and demultiplexed Fluidigm RNA-seq datasets were downloaded from GSE99330 ). This dataset is composed of 8640 single cells generated by Drop-seq platform and 800 single cells by Fluidigm (C1 mRNA Seq HT IFC) platforms, using WM989-A6-G3 cell line. The availability of RNA-FISH validation data on 26 genes (housekeeping genes and drug resistance markers) makes this an ideal dataset for comparing alignment methods. This dataset also is accompanied by publicly downloadable results on the same 26 genes measured by single-molecule RNA fluorescence in situ hybridization (smRNA FISH), a method to detect RNA within a cell using the fluorescence probe.

Alignment of Drop-seq scRNA-Seq data
Pre-processing: The raw Drop-seq fastq files were processed using Drop-seq (Macosko et al. 2015) tools version 1.13 (https://github.com/ broadinstitute/Drop-seq/). Following the Drop-seq Core Computational Protocol version 1.0.1, reads with low-quality bases in the cell or molecular barcode were filtered with a default threshold setting. The adapter sequences at the 59 and poly-A at 39 of the reads were trimmed.
STAR alignment: STAR alignment was performed for Drop-seq data using STAR version 2.5.2a provided by University of Michigan HPC. Data were aligned to reference genome GRCh38 with the corresponding annotation file from Ensemble. Default parameters were used, unless notified otherwise.
Kallisto pseudoalignment: Kallisto quantification was performed for Drop-seq data using Kallisto version 0.45.1 obtained from github. Kallisto index was built with reference transcriptome GRCh38 with kmer length of 31. The suggested k-mer length of 31 can avoid ambiguities in de Bruijn graph in similar regions when k-mer length is small. Kallisto genomebam feature was used for obtaining the BAM output. Mapped reads with unknown reference genome locations were removed from BAM using Samtools version 1.9. The filtered BAM files were passed down into the Drop-seq pipeline for digital expression generation.

Alignment of fluidigm scRNA-Seq data
Pre-processing: PolyA was trimmed from the 39 end using Trim Galore version 0.4.2.
STAR alignment: STAR alignment was performed for trimmed data using STAR version 2.5.2a, and aligned to reference genome GRCh38.
Kallisto pseudoalignment: Kallisto quantification was performed using Kallisto version 0.45.1. Kallisto index was built with reference transcriptome GRCh38 only with the kmer length of 31. Kallisto -genomebam feature was used, and the output BAM files were filtered in the same manner as Drop-seq data.
Bowtie2 alignment: the same version and index files were used, as in Drop-Seq data above.
Gini coefficient calculation and smRNA FISH validation RNA FISH data, Drop-seq data and Fluidigm data were normalized against GAPDH, following the method described previously (Arisdakessian et al. 2018). Gini coefficients were calculated for 25 drug resistance markers and housekeeping genes in the same way as in SAVER . For gene i, the level of expression across cells is sorted. Then for the sorted array of expression values, Gini index is calculated as following:

Expression ij
Where j is the index for the sorted array of expression ranging from 1 to n (total number of the cell).
Data processing and analysis on 10x Genomics PBMC 3K data Cell Ranger filtered genes and cell matrix(available on 10x website) were processed using Seurat version 3.2.0 (Butler et al. 2018;Stuart et al. 2019), following the configuration from https://satijalab.org/ seurat/v3.0/pbmc3k_tutorial.html. Clusters were annotated using provided labels by the Seurat group based on the pre-defined gene markers. Fastq files were preprocessed to fit the input scheme for both STAR and Kallisto.
STARsolo: STARsolo for V1 chemistry requires two files: cell barcodes with UMIS, and cDNA read. STAR version 2.7.1a with -solo command was used. The STAR index was built with a read length of 98. STARsolo was configured for 14bp GemCode barcode, 10bp UMI, and 98bp transcript.
Kallisto With Human cDNA and intron index: Kallisto version 0.46.1 and bustools version 0.39.3 were used. The cDNA + intron index and relevant files were downloaded from the github page: https://www.kallistobus.tools/velocity_tutorial.html. The pseudoalignment and sequential correction and counting processes were done following the instruction from https://www.kallistobus.tools/ velocity_tutorial.html. The spliced and unspliced matrices were processed following the instruction from https://github.com/BUStools/ getting_started/blob/master/velocity_tutorial.ipynb. The cells with less than 3 expressed genes, and genes expressed in less than 200 cells were removed. Data processing and analysis on 10x mouse cortex 1 single nuclei RNA-seq data STARsolo: STAR version 2.7.3a with -solo command was used. For single nuclei RNA-seq data, command "-soloFeatures Gene SJ GeneFull" was used for generating counts for both exonic RNA and pre-mRNA. The STAR index was built with a read length of 50. STARsolo was configured for 16bp GemCode barcode, 10bp UMI, and 50bp transcript.
Kallisto with mouse cDNA and intron index: Kallisto version 0.46.1 and bustools version 0.39.3 was used. The mouse ensembl 86 cDNA + intron index and relevant files were downloaded from the github page: https://github.com/pachterlab/MBGBLHGP_2019/releases. The pseudoalignment and sequential correction and counting processes were done similarly as the PBMC 3k data. The spliced and unspliced matrices were processed in Python 3.7. The cells with less than 3 expressed genes, and genes expressed in less than 200 cells were removed.
Downstream clustering analysis: Seurat version 3.2.0 was used for downstream analysis. The filtration criteria include min.cells = 3, min features = 200. Data were then log normalized with a scale factor of 10000 in Seurat. The cell types were annotated manually based on the FeaturePlot of each marker gene.

Time and memory measurement
Time and memory usages on Fluidigm dataset were measured on a dedicated group computer server cluster (consisted of 4 nodes (Dell PowerEdge C6420) of 2 X Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, 192GB RAM, one node (Dell Poweredge R740) with 2 X Xeon(R) Gold 6148 CPU @ 2.40GHz, 192 GB RAM, and two 16GB Nvidia V100 GPUs.) with Slurm job scheduler. One processor and 60GB memory were reserved for each job. "Resources_used.cput" and "resources_used.walltime" were collected from the job log for analysis.

Code availability
All code for the tools configurations and analysis could be found on GitHub page: https://github.com/yhdu36/aligner  (Table S1). To generate the count matrix, we used STAR and Kallisto genombam command (Yi et al. 2018 preprint) followed by featureCount. Kallisto genombam command projects the pseudoalignments to genomic space using a model of transcriptome consisting of genes, transcripts and exon coordinates, which allows the interchange between pseudoalignment and genome alignment possible. We then evaluated the aligners on the count matrix output (Figure 1).
Specifically, we first checked the overall correlation of alignments from STAR and Kallisto workflows. We added a pseudo-count of 1 to all gene counts before log transformation, then calculated the Pearson's correlation of all genes across all cells between STAR and Kallisto. As shown in Figure 1A-B, the correlation coefficient between Kallisto and STAR aligned gene counts is 0.836 and 0.862 for Drop-seq and Fluidigm data, respectively, demonstrating a strong concordance between them. However, further examination shows that STAR yields more uniquely expressed genes for both Drop-seq and Fluidigm platforms ( Figure 1C-D). For Drop-seq data, STAR and Kallisto detect 16892 common genes, but 7116 and 1906 unique genes, respectively. For Fluidigm data, STAR and Kallisto detect 23193 common genes, but 13710 and 645 unique genes, respectively. The modes of the density distribution of gene numbers in each cell ( Figure 1E-F) shift to higher values for STAR alignment, confirming that indeed STAR systematically detects more genes in each cell. Interestingly, while the Drop-seq platform yields a single peak of density distribution for gene counts, Fluidigm yields two peaks, both of which have higher gene counts than the peak of Drop-seq. Overall Kallisto pseudoaligned to more genes (proportionwise) with shorter length (,3000bp), whereas STAR can handle longer gene alignment better, as shown in Figure 1G-H. Since dropout is a significant issue in single cell RNA-Seq data, we also compared dropout rates among commonly detected genes using cumulative density plots ( Figure S1A-B). Overall, genes detected using STAR have cumulatively significantly lower dropout rates for both Drop-seq (K-S test p-value, 2.2 e-16) and Fluidigm (K-S test p-value , 2.2e-16) platforms.
In conclusion, despite high correlations between STAR and Kallisto, STAR detected more genes and also yielded more abundant gene expression counts across cells compared to Kallisto. However, since the absolute truth of gene expression is not used, the observations herein are relative measures.
Validation of STAR and Kallisto results using RNA FISH data To address the issue of lack of absolute truth of gene expression in the comparisons above, we next assessed STAR and Kallisto performance using smRNA-FISH data as the ground truth measurement. Singlemolecule RNA fluorescence in situ hybridization (smRNA FISH) is a method to detect RNA within a cell using the fluorescence probe. Torre et al. originally used smRNA-FISH to measure 26 genes and represented their distribution across cells using the Gini coefficient . Gini coefficient is a metric for inequality. A Gini coefficient of 0 represents that gene expression is perfectly even across all cells, whereas a Gini coefficient of 1 means that the gene is expressed in only one cell. In the context of gene expression, it can measure the variation of gene expression for a gene across all cells, and it is indeed influenced by the level of expression (see Equation 1). By comparing the Gini coefficient calculated from two alignment methods with that of the RNA Fish experiment, we can evaluate whether the alignment methods preserve the variability similar to the RNA Fish method. We compared the Gini coefficients between aligned scRNA-Seq data and smRNA-FISH data (Figure 2), and observed significant zero inflation in Gini coefficients from scRNA-Seq platforms, comparing to smRNA-FISH measurement ). For Drop-seq data, as shown in Figure 2A-B, both STAR and Kallisto missed detecting some of the 26 genes. STAR missed three genes: VEGFC, AXL and WNT5A, whereas Kallisto missed four genes: VEGFC, WNT5A, NGFR, PDGFRB. After removing these outliers, the correlation between STAR Gini and smRNA-FISH Gini was 0.50, and the correlation between Kallisto Gini and smRNA-FISH Gini was 0.53. For Fluidigm data, as shown in Figure 2C-D, STAR detected all 26 genes, whereas Kallisto missed to detect 3 genes: VEGFC, WNT5A, and NGFR. The correlation between STAR Gini and smRNA-FISH Gini is 0.55, whereas the correlation between Kallisto Gini and smRNA-FISH was 0.47 (after removing undetected genes). In summary, the comparisons with smRNA-FISH data show that STAR tends to miss fewer genes. Moreover, STAR has on par with (Drop-seq) or slightly better (Fluidigm) correlations with smRNA-FISH based truth measure, despite the fact that the Gini coefficients between STAR and Kallisto are highly correlated for commonly detected genes ( Figure S2A-B).

Comparison of Bowtie2 vs. STAR and Kallisto alignment
We further compared Bowtie2, a popular alignment method on bulk RNA-Seq data, to STAR and Kallisto, on the above mentioned single cell datasets ( Figure S3). Among all three methods, STAR still has the most abundant genes detected (24008). Bowtie2 is the second (21572), and Kallisto is the third (18789). However, Bowtie2 has slightly lower detected gene count per cell compared to both STAR and Kallisto ( Figure S3 G and H). This indicates that even though Bowtie2 detected more unique gene counts than Kallisto, the average gene expression level across cells was lower than Kallisto. For Gini coefficient comparison, Bowtie2 has closer coordination with STAR ( Figure S3   genes ( Figure S3 D-F). For efficiency, with premade reference, STAR took 15 min and 28G memory to finish the alignment, whereas Bowtie2 used 56 min and 14.2G memory. Therefore, by comparing with another traditional aligner Bowtie2, we have reached the consistent conclusion with other aligner benchmarking studies on the real and simulated datasets (Baruzzo et al. 2017); (Teissandier et al. 2019) that STAR is currently one of the top performing methods on alignment rates and speed comparing to other traditional aligners.

Running time and memory comparison on Fluidigm data
We performed time and memory usage comparison on a computer cluster (detailed configuration could be found in the Methods section) with Slurm job scheduler using Fluidigm data (800 cells). For Fluidigm data, the experiments were conducted using the same pipelines as in Figures 1 and 2, with the notation that STAR was aligned to reference genome GRCh38 whereas Kallisto used GRCh38 cDNA + intron index. Both Kallisto and STAR were assigned with one processor and 60GB memory and ran in one thread mode to minimize the effect of parallel processing. Overall, Kallisto pseudoalignment takes 1/4 amount of time as STAR ( Figure 3A). Moreover, the maximum memory usage of Kallisto was $3.6GB, which is about 1/7 of the memory usage ($28GB) of STAR ( Figure 3B). Thus, Kallisto with cDNA-only index demonstrates significant advantages of pseudoalignment both on speed and memory.

STARsolo and Kallisto bus comparison on human 10x
Genomics PBMC data In order to see the effect of align/pseudoalignment method on downstream clustering analysis, we compared STARsolo (Dobin et al. 2013) and Kallisto bustools (Melsted et al. 2019 preprint) on 10x genomics' publically available PBMC 3K scRNA-Seq datasets. STARsolo (available in STAR after version 2.7.0) and Kallisto bustools are pipelines developed based on each method to analyze the UMI-based data, such as 10x genomics data. For fairness, cDNAs plus introns sequences were used as the reference to generate index files for both methods. Both pipelines produced the same output format FISH data with STAR and Kallisto protocols for selected genes. The x-axis represents the Gini coefficient obtained from RNA FISH study, and the y-axis represents the Gini coefficient from the Drop-seq experiment using STAR and Kallisto respectively. (C-D) Scatterplots of Gini coefficients for Fluidigm vs. FISH data with STAR and Kallisto protocols. Pearson correlation coefficients are shown as Pearson (after removing the non-detected genes). from 10x's Cell Ranger tool, so we next used the Seurat package (Butler et al. 2018;Stuart et al. 2019) to find cell clusters using unsupervised clustering method.
For the PBMC 3K dataset, STARsolo showed 89.5% alignment rate, whereas Kallisto bustools has 61.89% pseudoalignment rate (Table S1). STARsolo output resulted in nine clusters ( Figure 4A), each assigned to a cell type using the predefined markers in the Seurat vignette. 90.73% of cell labels from STARsolo matched with those from Cell Ranger (based on STAR), confirming the similarity between the two workflows (Table S2). Given the same reference genome (GRCh38), the difference between Cell Ranger and STARsolo could be caused by different correction and filtering criteria. In comparison, 1 out of 9 cell types, FCGR4A+Mono cells, could not be identified after Kallisto alignment ( Figure 4C). This is largely due to the fact that Kallisto could not detect high enough FCER1A expression, one of the marker genes to identify the FCGR4A+Mono cell type ( Figure 4D). For the proportions of cell types, STARsolo and Kallisto generate similar proportions for different cell types ( Figure S4A), with STARsolo identifying more memory CD4+ T cells but fewer CD8+ T cells.
Worth noticing, the mapping rate (61.89%) using cDNA and intron index in Kallisto is much better than the mapping rate (51.30%) when only cDNA index was used. However, this improvement comes with the price of memory usage. With only cDNA index, Kallisto used 3.7 Gigabytes of memory, and the runtime was 46 min. With cDNA + intron index, Kallisto used 67.1 Gigabytes of memory, and the runtime was 1.5 hr. The clustering result from the Kallisto pipeline (with cDNA and intron index) is also much improved from that built on cDNA index. In the latter case, seven clusters are identified ( Figure S5A), and the cell-type specific marker genes are either missing or do not co-locate well with the clusters ( Figure S5B). Therefore, including introns in the reference genome can largely increase the accuracy for Kallisto.
STARsolo and Kallisto bus comparison on single nuclei RNA-seq mouse cortex data Single nuclei RNA-Seq is one solution for human studies where fresh materials are lacking (Lake et al. 2017). We downloaded the single nuclei Mouse Cortex 1 data published earlier (Ding et al. 2019 preprint). As single nuclei RNA-Seq data contain high content of intronic fragments, we evaluated the performance between STARsolo and Kallisto bus using cDNA + intron as the reference.
For the mouse cortex single nuclei RNA-seq data, Kallisto bus required 58.9 Gigabytes of memory, whereas STARsolo used 31.4 Gigabytes. The run time was similar. Both STARsolo and Kallisto bus took 3 hr to complete (pseudo)alignment without the indexing step. STARsolo recovered 81.6% of reads, compared to 74.9% mapping rate for Kallisto (Table S1). This resulted in more genes detected by STARsolo. Both alignment methods detected seven cell types: excitatory neurons, inhibitory neurons, astrocytes, oligodendrocytes, oligodendrocyte precursors and microglia ( Figure 5A and 5C). The proportions of detected cell types were similar using both STARsolo and Kallisto ( Figure S4B). However, closer investigation showed that STARsolo detects more marker genes than Kallisto. For example, mbp, a marker gene for oligodendrocytes, is missing in Kallisto aligned result ( Figure 5D vs. 5B). In fact, among the 47 cell-type specific markers, STARsolo only failed to identify one marker (Des), whereas Kallisto could not identify 7 markers (Table S3). In all, comparison on the single nuclei RNA-seq data further shows the significance to include intronic information for Kallisto alignment.

DISCUSSION
As single cell RNA-Seq (scRNA-Seq) technologies continue to gain popularity, the amount of generated data are increasing faster than ever, which subsequently imposes increasing demand to process these datasets. Alignment is one of the most time consuming yet most critical steps to process scRNA-Seq data. Currently, an unbiased third-party comparison on the alignment tools of scRNA-Seq data are lacking, although during the manuscript preparation time the authors of Kallisto had made a comparison preprint available (Melsted et al. 2019 preprint).
In this report, we compared the performance of the two most popular alignment/pseudoalignment methods: STAR and Kallisto. We aimed to provide readers the details on gene expression characteristics, beyond simple correlations between the two methods. Toward this goal, we purposefully selected datasets that have the "reference measures". We first used a dataset with both scRNA-Seq (in both Drop-seq and Fluidigm platforms) and smRNA FISH validation results. Through the comparison, it appeared that in this dataset, despite high correlations between STAR and Kallisto, STAR tends to yield more genes overall, as well as more abundance of the genes, compared to Kallisto; whereas Kallisto detects more shorter genes compared to STAR. Based on the 26 genes that have smRNA FISH results, STAR appeared to detect more of them, compared to Kallisto; also STAR had on-par (Drop-seq) or slightly better (Fluidigm) correlations with the "reference measure" of smRNA-FISH.
We subsequently compared the alignment results using PBMC 3K cells, by examining the clustering results based on the count matrix. We chose this PBMC 3K dataset because the Seurat group provided marker genes and cell types, so we could use such knowledge as the "reference". The result showed that STAR alignment harvested cell clusters that could be well identified by predefined cell-specific markers; the cell clusters generated from Kallisto with cDNA plus intron (not just cDNA) index information were very similar, but still missed one cell type. This dataset was not one of the 20 datasets reported by Melsted et al. (Melsted et al. 2019 preprint). We additionally applied these two methods on a single nuclei RNA-seq dataset from mouse cortex, in order to further investigate the necessity of including intronic reference for Kallisto. With cDNA and intron as the reference, Kallisto identifies the same cell types as STARsolo, despite missing slightly more marker genes. Thus, adding introns to cDNA sequences is vital to the alignment/ pseudoalignment step for Kallisto. Our results are consistent with Vieth et al. in their single cell pipeline evaluation studie: Kallisto with cDNA index had a low fraction of assigned reads, and for UMI-based methods STAR performs better (Vieth et al. 2019).
Consistent with the observation of Melsted et al., we also found that Kallisto (with only cDNAs as the reference) is 4 times faster than STARsolo, the recent version of STAR aligner adapter for scRNA-Seq, and the memory usage of Kallisto is 7.7 times less than STARsolo. Including intron sequences in reference significantly improves the accuracy of Kallisto, however, this improvement is at the cost of computer memory. In summary, based on the datasets used in this study, we conclude that Kallisto's use of computing resources is much less demanding than STAR when only cDNA sequences are used as the reference; however, such efficiency gain is at the cost of loss of information. The users should make decisions based on their preferences in accuracy vs. computing resources.