Scirpy: a Scanpy extension for analyzing single-cell T-cell receptor-sequencing data

Abstract

Summary: Advances in single-cell technologies have enabled the investigation of T-cell phenotypes and repertoires at unprecedented resolution and scale. Bioinformatic methods for the efficient analysis of these large-scale datasets are instrumental for advancing our understanding of adaptive immune responses. However, while well-established solutions are accessible for the processing of single-cell transcriptomes, no streamlined pipelines are available for the comprehensive characterization of T-cell receptors. Here, we propose single-cell immune repertoires in Python (Scirpy), a scalable Python toolkit that provides simplified access to the analysis and visualization of immune repertoires from single cells and seamless integration with transcriptomic data.

Availability and implementation: Scirpy source code and documentation are available at https://github.com/icbi-lab/scirpy.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figure: Runtime analysis on simulated datasets. (a) Elapsed time for calculating the clonotype network based on nucleic acid sequence identity (scirpy.tl.tcr_neighbors function with metric="identity") with an increasing number of cells. The runtime is limited by the number of edges in the resulting cell × cell connectivity network. We simulated 500 000 α and β TCR sequences using the immuneSIM package [3], assuming a power-law distribution of clonal frequencies [3,4]. To obtain clonal frequencies representative of single-cell TCR sequencing data, we fitted a power-law distribution to the empirical distribution of clonotype frequencies in the Wu et al. [1] dataset using the powerlaw Python package [5]. From this distribution, we randomly sampled datasets with 5000 to 300 000 unique clonotypes, resulting in datasets with 16 553 to 1 085 999 cells. The analysis was performed on a single core of an Intel E5-2699A v4, 2.4 GHz CPU. (b) Elapsed time for calculating the sequence × sequence alignment-distance matrix on up to 300 000 simulated α TCR sequences. This distance matrix is internally computed for α and β CDR3 amino acid sequences independently when executing the scirpy.tl.tcr_neighbors function with metric="alignment". The runtime is quadratic in the number of unique sequences, and the metric is more computationally expensive than the sequence-identity metric because pairwise sequence alignments must be computed. The analysis was performed on 16 cores of an Intel E5-2699A v4, 2.4 GHz CPU. The source code to reproduce the runtime analysis is available from: https://github.com/icbi-lab/scirpy-paper/tree/master/runtime-analysis.

Table 1: Comparison of Scirpy with currently available tools supporting immune repertoire analysis at the single-cell level, or offering immune repertoire visualization. Basic immune repertoire (IR) visualization includes clonotype abundance, diversity, and V(D)J usage. Advanced IR visualization also includes repertoire overlap, clustering, specialized analysis of individual clonotypes, and generation of publication-ready figures. "Gex" indicates gene expression data.

Supplementary Note
A clonotype designates a collection of T or B cells that descend from a common antecedent cell and therefore bear the same adaptive immune receptors and recognize the same epitopes. In single-cell RNA-sequencing (scRNAseq) data, T cells sharing identical complementarity-determining region 3 (CDR3) nucleotide sequences of both α and β TCR chains are considered a clonotype.
Contrary to what would be expected based on the previously described mechanism of allelic exclusion [15], scRNAseq datasets can feature a considerable number of cells with more than one pair of TCR α and β chains. Since cells with more than one productive CDR3 sequence for each chain did not fit the common understanding of T-cell biology, most TCR analysis tools ignore these cells [16,17] or select the CDR3 sequence with the highest expression level [14]. While in some cases these double-TCR cells might represent artifacts (e.g. cell doublets), there is increasing evidence in support of a bona fide dual-TCR population [18,19].
Scirpy allows investigating the composition and phenotypes of both single- and dual-TCR T cells by leveraging a T-cell model similar to the one proposed in [12], in which each T cell is allowed to have a primary and a secondary pair of α and β chains. For each cell, the primary pair consists of the α and β chains with the highest read counts. Likewise, the secondary pair consists of the α and β chains with the second-highest read counts. Based on the assumption that each cell has only two copies of the underlying chromosome set, if more than two variants of a chain are recovered for the same cell, the excess TCR chains are ignored by Scirpy and the corresponding cells are flagged as "multichain" (Supplementary Figure 1). Moreover, Scirpy flags as "orphan chain" the cells that lack either the α or the β chain, possibly due to sequencing inefficiencies resulting in chain dropouts. This filtering strategy leaves to the user the choice of discarding or including multichain and orphan-chain cells in downstream analyses.
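The chain-assignment rules described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not Scirpy's implementation: the Chain record, the assign_chains helper, and the keys of the returned dictionary are hypothetical names introduced here, and ties in read count are assumed to be broken by input order.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Chain:
    """One TCR chain recovered for a cell (hypothetical record type)."""
    locus: str  # "TRA" (alpha chain) or "TRB" (beta chain)
    cdr3: str   # CDR3 sequence of the chain
    reads: int  # read count supporting the chain


def assign_chains(chains: List[Chain]) -> Dict[str, object]:
    """Pick primary/secondary alpha and beta chains and set the two flags."""
    result: Dict[str, object] = {"multichain": False}
    for locus in ("TRA", "TRB"):
        # Rank the chains of this locus by read count, highest first.
        # sorted() is stable, so equal read counts keep their input order.
        ranked = sorted(
            (c for c in chains if c.locus == locus),
            key=lambda c: c.reads,
            reverse=True,
        )
        result[f"{locus}_1"] = ranked[0] if len(ranked) > 0 else None
        result[f"{locus}_2"] = ranked[1] if len(ranked) > 1 else None
        # More than two variants of a chain: ignore the excess, flag the cell.
        if len(ranked) > 2:
            result["multichain"] = True
    # Orphan chain: exactly one of the two chain types was recovered.
    has_alpha = result["TRA_1"] is not None
    has_beta = result["TRB_1"] is not None
    result["orphan"] = has_alpha != has_beta
    return result


# Hypothetical cell with two alpha chains and one beta chain.
example = assign_chains([
    Chain("TRA", "CAVSDLEPNSSASKIIF", 12),
    Chain("TRB", "CASSLGQAYEQYF", 30),
    Chain("TRA", "CAASGGSYIPTF", 3),
])
```

In this sketch the flags are only recorded, never used to drop cells, which mirrors the idea that the decision to discard or keep multichain and orphan-chain cells is deferred to downstream analysis.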
Scirpy implements a network-based clonotype analysis that enables clustering cells into clonotypes or clonotype clusters based on the following options: (a) identical CDR3 nucleotide sequences; (b) identical CDR3 amino acid sequences; (c) similar CDR3 amino acid sequences based on pairwise sequence alignment.
The latter approach is inspired by studies showing that similar TCR sequences also share epitope targets [16,20,21]. While convergence of the nucleotide-based clonotype definition to the amino acid-based one hints at selection pressure, sequence alignment-based networks offer the opportunity to identify cells that might recognize the same epitopes.
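The network-based definition can be illustrated with a small sketch: cells are nodes, an edge joins two cells whose CDR3 sequences lie within a distance cutoff, and each connected component is reported as a clonotype or clonotype cluster. The code below is a minimal sketch under stated assumptions, not Scirpy's implementation: define_clonotypes, identity_distance, and edit_distance are hypothetical helpers, each cell is represented by a single CDR3 sequence rather than by its α and β chains, and a plain Levenshtein edit distance stands in for the alignment-based distance.

```python
from itertools import combinations
from typing import Callable, Dict, List


def identity_distance(a: str, b: str) -> int:
    """0 for identical sequences, 1 otherwise (options a and b)."""
    return 0 if a == b else 1


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, a stand-in for an alignment-based distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def define_clonotypes(
    cdr3_by_cell: Dict[str, str],
    distance: Callable[[str, str], int],
    cutoff: int = 0,
) -> List[List[str]]:
    """Group cells into connected components of the distance network."""
    cells = list(cdr3_by_cell)
    parent = {c: c for c in cells}  # union-find forest over the cells

    def find(c: str) -> str:
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    # Connect every pair of cells whose sequences are within the cutoff.
    for u, v in combinations(cells, 2):
        if distance(cdr3_by_cell[u], cdr3_by_cell[v]) <= cutoff:
            parent[find(u)] = find(v)

    groups: Dict[str, List[str]] = {}
    for c in cells:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical cells; cell3 differs from cell1 by a single substitution.
cells = {
    "cell1": "CASSLGQAYEQYF",
    "cell2": "CASSLGQAYEQYF",
    "cell3": "CASSLGQGYEQYF",
    "cell4": "CAWSVGNTEAFF",
}
by_identity = define_clonotypes(cells, identity_distance, cutoff=0)
by_similarity = define_clonotypes(cells, edit_distance, cutoff=1)
```

With the identity metric, cell1 and cell2 form one clonotype and the other cells remain singletons; with the edit-distance stand-in and a cutoff of 1, cell3 joins the cluster of cell1 and cell2. The all-pairs loop is quadratic in the number of sequences, which matches the scaling reported for the alignment metric in the runtime analysis.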