Quantifying orthogonal barcodes for sequence census assays

Abstract Summary Barcode-based sequence census assays utilize custom or random oligonucloetide sequences to label various biological features, such as cell-surface proteins or CRISPR perturbations. These assays all rely on barcode quantification, a task that is complicated by barcode design and technical noise. We introduce a modular approach to quantifying barcodes that achieves speed and memory improvements over existing tools. We also introduce a set of quality control metrics, and accompanying tool, for validating barcode designs. Availability and implementation https://github.com/pachterlab/kb_python, https://github.com/pachterlab/qcbc.


Introduction
Single-cell RNA sequencing assays rely on unique sequences of nucleotides ("barcodes") to associate sequences from mRNA molecules with individual cells.This strategy has been extended to quantify orthogonal biological or technical information in what is known as "multimodal" genomics.For example, the 10� Genomics Feature Barcoding assay quantifies the expression of cell-surface proteins using targeted antibody barcodes ("Feature Barcode Overview-Software-Single Cell Gene Expression-Official 10� Genomics Support" n.d.).The ClickTag (Gehring et al. 2020), Multiseq (McGinnis et al. 2019), and Cell Hashing (Stoeckius et al. 2018) assays, on the other hand, use custom sample-specific barcodes to group cells enabling multiplexing and reducing batch effects.Moreover, similar approaches have been used for CRISPR screens (Dixit et al. 2016, Datlinger et al. 2017, Schraivogel et al. 2020), targeted perturbation assays (Gehring et al. 2020, Srivatsan et al. 2020), and recently for spatial genomics assays (Srivatsan et al. 2021).
The processing of data from these assays has required the development of custom tools, resulting in a proliferation of application-and assay-specific software (Roelli et al. 2019, Johnson et al. 2023).We introduce a broadly applicable framework, which we term "kITE" for kallisto Indexing and Tag Extraction, for uniformly quantifying orthogonal barcode-based sequence census assays that has several advantages over assay-specific tools.First, by virtue of providing a single solution to seemingly distinct problems, it serves as a unifying framework for biology x barcode assays and will simplify the development and deployment of novel assays.Second, the software implementation we share is portable, engineered to work across platforms, and is modular, allowing for easy adaptation and customization.It makes use of the kallisto and bustools programs that have been previously extensively validated and benchmarked for quantification of the biological sequences in barcode x biology assays such as single-cell and bulk RNA sequencing (Bray et al. 2016, Melsted et al. 2021).Finally, our method is faster than existing tools, and we identify corner cases in processing that have been missed in previous analyses.The examination of such cases led us to develop a tool, called qcbc, for assessing the quality of barcode sets, and that should prove to be a useful standalone package.It is applicable both during assay development, and during assay quality control.

Methods
Feature Barcoding assays use short barcodes and are susceptible to ambiguous or incorrect assignment if errors are introduced during the experimental procedure.Inspired by the CITE-seq-count approach, kITE begins by generating a list of all single-base mismatches of the Feature Barcodes used in an experiment (Fig. 1).The resulting "mismatched FASTA" file is used as input for the "kallisto index" program with k set appropriately (Section 2).Finally, the kallisto j bustools pipeline is used to quantify the dataset using the "mismatch index" generated after running "kallisto index" on the mismatched FASTA file.In this way, kallisto will effectively scan the entire sequencing read for barcodes present in the mismatch index allowing for barcode pseudoalignment despite variability in barcode position.Feature barcode "error correction" occurs when barcode counts are summed across different mismatches of a given barcode (Section 2).

Results
To evaluate the accuracy and efficiency of kITE across a range of assays, we gathered data from six different assays, and compared kITE quantifications to those produced with assay-specific tools.We first quantified cell surface protein abundance for the 10� Genomics 5266 Peripheral blood mononuclear cells dataset, which targeted 32 cell surface proteins.We used kallisto bustools and a mismatch index to pseudoalign the target sequences (Appendix A.1.1)and found high concordance between our method and Cell Ranger across varying read depths, with an average Pearson correlation of 0.99 (Fig. 2a-f).kITE was five times faster than Cell Ranger and required 3.5 times less RAM (Appendix A.1.2).
Next, we tested how accurately we could quantify and demultiplex pooled samples tagged with sample-specific barcodes.We processed data from 3795 methanol-fixed mouse neural stem cells from four multiplexed samples generated with the ClickTag assay (Supplementary Fig. S1), 9551 primary patient-derived xenograft (PDX) samples from nine multiplexed samples generated with the Multiseq assay (Supplementary Fig. S2) and 10 807 PBMCs from eight multiplexed samples generated with the Cell Hashing assay (Supplementary Fig. S3).We found high concordance with an average Pearson correlation of 0.99, for all three datasets, between kITE and CITE-seq-Count (Roelli et al. 2019), a tool for counting antibody tags from a CiteSeq or Cell Hashing experiments (Fig. 2g).kITE was 48 times faster and required only a quarter to a fifth of the RAM (Fig. 2h and i).We observed slight discordance with two feature barcodes, BC49 and BC74, when processing the PDX samples assayed with Multiseq.These two Multiseq barcodes are short (8 bp in length), yet share long subsequences up to 7 bp in length (Supplementary Fig. S4a).The introduction of errors during sequencing can result in possible incorrect barcode assignment and loss of barcodes (Appendix A.1.3).We addressed this issue by trimming the excess sequence surrounding the 8bp feature barcode (Appendix A.1.6).We also observed a discordance on a subset of cells in the 10�CRISPR assay between kITE quantifications and Cell Ranger.Investigation of the unmapped Cell Ranger barcodes indicates that Cell Ranger failed to map reads with a clear guide-of-origin.For example, Cell Ranger failed to map 43 098 reads (8312 UMIs) for the cell CTGAGCGCATAAGATG with a guide RNA that exactly matches that for DPP4-1 (Section 2).
To demonstrate the flexibility of our approach we also processed 2799 A549 lung carcinoma cells expressing nuclease-dead Cas9-Kr€ uppel-associated box (dCas9-KRAB) assayed with the 10� Genomics CRISPR screen kit, and transduced with 90 sgRNAs targeting 45 different genes.We achieve high concordance against Cell Ranger with an average Pearson correlation of 0.99 across all cells for nearly sgRNA's, with two sgRNAs in the 10� CRISPR screen dataset, EZR-1 and PPIB-2, differing substantially (Supplementary Fig. S5).Both of these two sgRNA's share a subsequence of 16 bp with EZR-2 and PPIB-1 respectively.The latter two sequences begin with GCG and GGA, both of which are a single hamming distance away from the three bases that precede the target sgRNA in the protospacer sequence (Supplementary Fig. S4b).Therefore, a single mutation two bases away from the start of the EZR-1 sgRNA sequence could result in an ambiguous read.To resolve this, we trimmed the protospacer sequence off of the reads (Appendix A.1.8).
We also processed 10 860 K562 cells expressing dCas9-KRAB from TAP-Seq transduced with 86 sgRNAs targeting 14 promoters or enhancers and 30 control sgRNAs to compare assignment of sgRNAs to individual cells.We found 97.8% of all 10 860 barcodes are fully concordant with TAP-Seq assignments, 0.53% of cells are partially concordant, and 1.6% of cells are fully discordant.(Appendix A.1.10).The discrepancies are likely the result of differences between mapping procedures producing differences in counts near the threshold (8 UMI counts) resulting in differential cell assignment for cells with low UMI counts (Supplementary Fig. S6).

Read trimming
Sequencing libraries often contain adapter or spurious sequences that flank the barcode and the presence of these shared sequences can make mapping challenging.Sequencing reads are often trimmed, with tools such as trimmomatic (Bolger et al. 2014), to avoid mapping noninformative regions that flank a barcode, a task that is crucial when barcodes share subsequences at their ends.Trimming helps to avoid longer sequence clashes that contain both the shared barcode subsequence and the shared nonbarcode sequence.The 10�CRISPR assay, e.g.contains two barcodes which share a 16-bp subsequence (the ends of which are one hamming distance from a subsequence of the protospacer sequence that flank the barcode) (Supplementary Fig. S4).Similarly, the Multiseq assay produces sequencing reads where the barcodes are flanked by poly-A stretches and two barcodes share a 7-bp subsequence that contains the end of the barcode.

Barcode design validation
Orthogonal barcode sequences designed for quantifying multimodal data, demultiplexing samples, or performing large-scale perturbation experiments can clash if the sequences are not sufficiently unique.To allow researchers to assess the extent of these issues in datasets they are analyzing, or for assays they are developing, we developed a quality control tool, qcbc, and a simulation framework to assess the quality of a set of barcodes.Our tool is complementary to the simulation framework proposed in (Johnson et al. 2023) and expands on the functionality of DNABarcodes (Buschmann 2017) which produces pairwise barcode distance comparisons.qcbc computes multiple metrics for barcode quality control: 1) diversity of barcodes (subject to distance constraints) (Fig. 3a), 2) number of barcodes that share a subsequence of a given length (Fig. 3b), 3) distribution of Hamming distances between pairwise barcodes (Fig. 3c and d), 4) distribution of homopolymers of a given length (Fig. 3e), and 5) per position and barcode nucleotide distribution (Fig. 3f, Supplementary Fig. S7).
With qcbc we are able to identify the 16-bp shared subsequence with EZR-2 and PPIB-1 sgRNAs in the 10�CRISPR that result in improper quantification.qcbc is also able to identify the 7-bp shared subsequence between and BC74 in the Multiseq assay.
These metrics are important for understanding the impact that barcode error correction has on rescuing sequencing reads.A large diversity of unique barcodes with shorter shared subsequences and larger pairwise distances between barcodes make it easier to disambiguate sequencing reads with pseudoalignment.Additionally, long homopolymer runs and low nucleotide content diversity could contribute to sequencing errors ("Illumina: What Is Nucleotide Diversity and Why Is It Important?"2016, Sina Booeshaghi and Pachter 2022).
To evaluate the impact that base errors have on error correction, we developed a simulation framework that reports the fraction of lost barcodes for a given error rate in the observed sequences (Supplementary Fig. S8) (Appendix A.1.13).Short barcodes, such as those in the Multiseq assay, can yield insufficient barcode diversity that results in a greater fraction of barcodes that cannot be unambiguously assigned (Supplementary Fig. S9).Barcodes with small minimum-pairwise/edit distance (Fig. 3d and e) and long shared subsequences (Fig. 3b) such as those in the Mulitseq and 10�CRISPR assay can result in more clashes (Appendix A.1.12).

Discussion
We have demonstrated a broadly applicable approach to align, error-correct, group, and count sequences from orthogonal barcodes to quantify multimodal data such as cell surface protein abundance, to demultiplex samples, and to perform largescale perturbation experiments.Our approach is fast, memory efficient, and extensible: orthogonal barcode quantifications can be obtained for numerous assays with variable sequence The l 1 distance between a kallisto bustools cell and its Cell Ranger equivalent and the same cell and its nearest kallisto bustools neighbor across the total UMI counts for that cell.(f) The l 1 distance of a kallisto bustools cell to its Cell Ranger equivalent (y-axis) and to its nearest neighbor (x-axis) and the l 1 distance of a Cell Ranger cell to it's kallisto bustools equivalent (x-axis) and its nearest neighbor (y-axis).The marginal distributions show that each kallisto bustools cell is closest to its corresponding Cell Ranger cell and that each Cell Ranger cell is nearest to its corresponding kallisto bustools cell.(g) Pearson correlation for all the cells between the kallisto bustools qualifications and in CITE-seq-Count or Cell Ranger quantifications for each of the five datasets.(h) Runtime and (i) memory improvements for each of the five datasets across the various read depths.
Quantifying orthogonal barcodes for sequence census assays structure such the recently developed spatial RNA-seq assays (Srivatsan et al. 2021) and massively parallel reporter assays (Gordon et al. 2020).This allows future assay developers to avoid producing, validating, and maintaining software implementing ad hoc solutions.Moreover, some current methods are slow, have high memory requirements, and are specifically tailored for specific assays, which can result in batch effects when performing integrative analysis.Methods such as CITE-seq-Count offer extensibility features for different barcode strategies, but are slow and memory-intensive.The speed and efficiency of our approach enables reproducible single-cell analysis in the cloud with tools such as Google Colab.We anticipate that our framework for quantifying orthogonal barcoding assays will be useful as multimodal assays proliferate (Packer and Trapnell 2018), and that our quality control tool will help assay developers avoid pitfalls in barcode design and help practitioners quality control orthogonal barcode assays before generating data.Finally, while our approach is currently engineered to correct mismatch errors, robustness to other types of sequencing errors are possible via other methods that could readily be incorporated in our modular framework (Zorita et al. 2015, Sina Booeshaghi et al. 2020, Johnson et al. 2023).aggregated per cell and quantified per feature barcode using bustools to generate a target count matrix.

A.1.2 Speed and memory comparison
Preprocessing steps were benchmarked with GNU time -v option, and the elapsed (wall clock) time and maximum resident set size (kbytes) were used to perform the time and memory comparisons.All code was benchmarked on Ubuntu 18.04.5 LTS (GNU/Linux 4.15.0-136-genericx86_64) with Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70 GHz processors.

A.1.3 Barcode design evaluation
qcbc allows users to evaluate potential orthogonal barcode designs.The tool reports multiple metrics that may help users understand potential error modes.Each of the metrics is explained below along what various values mean: 1) diversity of unique barcodes (subject to distance constraints) (Fig. 3a), a) If the diversity of barcodes is high, relative to the maximum possible number of barcodes (constrained by a given pairwise distance), then the design space of barcodes was used more effectively.b) Opting for a shorter barcode length at a given pairwise Hamming distance constraint may save reagents and assay complexity.2) number of barcodes that share a subsequence of a given length (Fig. 3b), a) If the number of barcodes that share a subsequence of a given length is high, then k-mer based mapping may yield multimapping reads.For example, if the barcode length is 16 bp and multiple barcodes share a 15-bp subsequence, then ambiguity in k-mer based mapping will arise.b) Opting for barcodes with shorter ambiguous subsequences enable effective barcode discrimination using k-mer based mapping.3) distribution of Hamming distances between pairwise barcodes (Fig. 3c and d) a) If the distribution of pairwise Hamming distances between barcodes is low, then the mismatched index may yield barcodes that share a common subsequence giving rise to multimapping reads.b) Opting for larger pairwise distances in barcode design will enable effective discrimination of barcodes at a given Hamming distance, rescuing more reads.4) distribution of homopolymers of a given length (Fig. 3e) a) If the distribution of homopolymers of a given length is high, then sequencing with technologies such as Ultima Genomics (Sina Booeshaghi and Pachter 2022) which uses mostly natural synthesis for sequencing, may yield errors. b) Opting for fewer homopolymers may help alleviate this issue.5) per position and barcode nucleotide distribution (Fig. 3f, Supplementary Fig. S7) a) If the per position and barcode nucleotide distribution is not uniform, then sequencing errors may arise during basecalling.b) Opting for a more uniform distribution of nucleotides may help improve base calling during sequencing with technologies such as Illumina.
A.1.4103 feature barcode FASTQ files for 5266 Peripheral blood mononuclear cells dataset targeting 32 cell surface proteins were downloaded from https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_protein_v3.A mismatch index was created using the Feature Reference file and kb ref with the kite workflow.The reads were pseudoaligned and a feature count matrix was made using kb count with the kite: 10�FB technology option.Cell Ranger, the Feature Reference, and refdata-gex-GRCh38-2020-A was used to process the FASTQ files into a feature count matrix.
All preprocessing steps are runnable via a Google Colab notebook: https://github.com/pachterlab/BMGP_2020/blob/main/analysis/notebooks/10xFB/10xFB_preprocess.ipynb.All matrix comparisons are runnable via a Google Colab notebook: https://github.com/pachterlab/BMGP_2020/blob/main/analysis/notebooks/10xFB/10xFB_preprocess.ipynbA.1.5 ClickTag FASTQ files for 3795 methanol-fixed mouse neural stem cells from four multiplexed samples were downloaded from https://caltech.box.com/shared/static/zqaom7yuul7ujetqyhnd4lvf8vhzqsyg.gz.A mismatch index was created using the list of targeted barcodes file and kb ref with the kite workflow.The list of barcodes can be found here: https://github.com/pachterlab/BMGP_2020/tree/main/references/clicktag.The reads were pseudoaligned and a feature count matrix was obtained using kb count with the kite technology option.CITE-seq-Count and the targeted barcodes were used to process the FASTQ files into a feature count matrix.
All preprocessing steps are runnable via a Google Colab notebook: https://github.com/pachterlab/BMGP_2020/blob/main/analysis/notebooks/clicktag/clicktag_preprocess.ipynb.All matrix comparisons are runnable via a Google Colab notebook: https://github.com/sbooeshaghi/BMGP_2020/blob/main/analysis/notebooks/clicktag/clicktag_preprocess.ipynb We generated a kallisto index with no mismatches by running kb ref with the-no-mismatches option and compared the resultant count matrix generated with kb count to the count matrix generated with mismatches.

A.1.6 Multiseq
FASTQ files for 9551 primary patient-derived xenograft (PDX) samples from nine multiplexed samples were downloaded from https://caltech.box.com/shared/static/0scoe38xvcoxfd62xnm848pimh2zr1ip.gz.A mismatch index was created using the list of targeted barcodes file and kb ref with the kite workflow.The list of barcodes can be found here: https://github.com/pachterlab/BMGP_2020/tree/main/references/multiseq.The reads were pseudoaligned and a feature count matrix was obtained using kb count with the kite technology option.CITE-seq-Count and the targeted barcodes were used to process the FASTQ files into a feature count matrix.
Quantifying orthogonal barcodes for sequence census assays A.1.7 Cell hashing FASTQ files for 10807 PBMCs from eight multiplexed samples were downloaded from https://caltech.box.com/shared/static/0scoe38xvcoxfd62xnm848pimh2zr1ip.gz.A mismatch index was created using the list of targeted barcodes file and kb ref with the kite workflow.The list of barcodes can be found here: https://github.com/pachterlab/BMGP_2020/tree/main/references/multiseq.The reads were pseudoaligned and a feature count matrix was obtained using kb count with the kite technology option.CITE-seq-Count and the targeted barcodes were used to process the FASTQ files into a feature count matrix.

Figure 1 .
Figure 1.kITE workflow.The mismatch map is generated from the feature barcode whitelist and indexed with kallisto.Reads are pseudoaligned and error-collapsed with kallisto and bustools to form a cells x features matrix.

Figure 2 .
Figure 2. Comparison between kallisto bustools (kb) quantifications and Cell Ranger (cr) quantifications for the 10� Feature Barcoding assay on 5266 Peripheral blood mononuclear cells dataset targeting 32 cell surface proteins.(a) Knee plot comparing cumulative UMI counts per cell.(b) Pseudobulk comparison of cumulative UMI counts per feature barcode.(c) Cumulative UMI counts per cell.(d) Pearson correlation of the same cell between the two quantifications.(e)The l 1 distance between a kallisto bustools cell and its Cell Ranger equivalent and the same cell and its nearest kallisto bustools neighbor across the total UMI counts for that cell.(f) The l 1 distance of a kallisto bustools cell to its Cell Ranger equivalent (y-axis) and to its nearest neighbor (x-axis) and the l 1 distance of a Cell Ranger cell to it's kallisto bustools equivalent (x-axis) and its nearest neighbor (y-axis).The marginal distributions show that each kallisto bustools cell is closest to its corresponding Cell Ranger cell and that each Cell Ranger cell is nearest to its corresponding kallisto bustools cell.(g) Pearson correlation for all the cells between the kallisto bustools qualifications and in CITE-seq-Count or Cell Ranger quantifications for each of the five datasets.(h) Runtime and (i) memory improvements for each of the five datasets across the various read depths.