SCIGA: Software for large-scale, single-cell immunoglobulin repertoire analysis

Abstract Background B-cell immunoglobulin repertoires with paired heavy and light chain can be determined by means of 10X single-cell V(D)J sequencing. Precise and quick analysis of 10X single-cell immunoglobulin repertoires remains a challenge owing to the high diversity of immunoglobulin repertoires and a lack of specialized software that can analyze such diverse data. Findings In this study, specialized software for 10X single-cell immunoglobulin repertoire analysis was developed. SCIGA (Single-Cell Immunoglobulin Repertoire Analysis) is an easy-to-use pipeline that performs read trimming, immunoglobulin sequence assembly and annotation, heavy and light chain pairing, statistical analysis, visualization, and multiple sample integration analysis, which is all achieved by using a 1-line command. Then SCIGA was used to profile the single-cell immunoglobulin repertoires of 9 patients with coronavirus disease 2019 (COVID-19). Four neutralizing antibodies against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) were identified from these repertoires. Conclusions SCIGA provides a complete and quick analysis for 10X single-cell V(D)J sequencing datasets. It can help researchers to interpret B-cell immunoglobulin repertoires with paired heavy and light chain.


Background
The diversity of B-cell immunoglobulin is an important characteristic of the adaptive immune system. It is developed through the rearrangement of variable (V), diversity (D), and joining (J) gene segments, which is referred to as V(D)J; the pairing of heavy and light chains; and somatic hypermutation (SHM) [1]. Expo-sure to infections and environmental factors shapes the repertoire of B-cell immunoglobulins [2][3][4] and leads to clonal expansion of immune cells, allowing them to change into different types of cells to respond to a specific antigen. Understanding these immunoglobulin repertoires can help researchers to discover antibodies, monitor vaccination responses, and infer Bcell trafficking patterns [5,6].
10X single-cell V(D)J sequencing is a powerful tool for investigating paired heavy and light chain repertoires of B-cell immunoglobulins [7]. It has been used in the identification of neutralizing antibodies against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [8], the virus that causes coronavirus disease 2019 (COVID-19) [9]. However, accurately analyzing 10X single-cell immunoglobulin repertoires remains a challenge owing to the high diversity of immunoglobulin repertoires and the lack of specialized software that can analyze such diverse data.
Here, we developed SCIGA (Single-Cell Immunoglobulin Repertoire Analysis), a software package for quickly analyzing the data of 10X single-cell immunoglobulin repertoires. SCIGA performs read trimming, immunoglobulin sequence assembly and annotation, and heavy and light chain pairing by means of a 1-line command. It also computes the statistics of repertoires, including gene usage frequency, SHM rate, length of complementarity-determining region 3 (CDR3), and clonality, and further implements visualization. We profiled the immunoglobulin repertoires of peripheral blood mononuclear cells (PBMCs) from 9 patients with COVID-19 using SCIGA. Finally, we identified 4 neutralizing antibodies against SARS-CoV-2 from these repertoires.

Methods
SCIGA is a software package for the analysis of 10X single-cell immunoglobulin repertoires. It integrates several tools and algorithms into a single workflow. The input data can be raw reads or the output of Cell Ranger (RRID:SCR 017344) [10]. The details of the SCIGA algorithm can be found in the Supplementary Methods and Materials. Briefly, the workflow, which is summarized in Fig. 1, is as follows. (i) Quality control of reads: trim the reads of low quality using Trimmomatic (RRID:SCR 011848) [11]. (ii) Call cell: the 10X system generates a large amount of Gel Beads-in-Emulsion (GEMs) that contain no cell. We need to identify the cell-containing GEMs before further analysis. SCIGA considers the GEMs containing cell(s) when the read number of the GEMs exceeds a threshold (see Supplementary Methods and Materials). (iii) Immunoglobulin sequence assembly: the immunoglobulin sequences for each cell were assembled using SSAKE (RRID:SCR 010753) [12], which is a reliable de novo assembler for short reads. (iv) Gene call: to detect the usage of the V(D)J gene and C gene (isotype), SCIGA aligns the assembled immunoglobulin sequences against the V-, D-, and J-gene reference database using IgBLAST (RRID:SCR 002873) [13] and against C-gene reference database using BLAST (RRID:SCR 004870) [14]. The V(D)JC reference databases for humans, mice, and rats were downloaded from the international ImMunoGeneTics information system (IMGT) [15] and embedded in the SCIGA software. (v) Quality control of the immunoglobulin sequence: only the immunoglobulins that are complete, in the correct reading frame, and have no stop codon are retained. (vi) Quality control of the cells: after immunoglobulin sequence assembly and filtering, some cells have multiple heavy or light chains, whereas some cells have only 1 chain. SCIGA reports the heavy and light chain with the highest number of unique molecular identifiers (UMIs) for each cell. A certainty score is calculated for each reported chain (see Supplementary Methods and Materials). The chains with a certainty score less than a given threshold are discarded. Next, the cells without paired heavy and light chains are filtered out. (vii) Clonal lineage grouping: clonal lineage is defined as the cells that have identical V H , J H , V L , and J L genes and identical H-CDR3 length and exceed a given similarity threshold of H-CDR3 nucleotide sequences. (viii) Statistical analysis and visualization: SCIGA calculates a list of statistics, including gene usage frequency, SHM rate, CDR3 length, Simpson index, Shannon entropy, and others. SCIGA subsequently generates figures to show the features of the repertoires. (ix) Multiple sample integration analysis: after analyzing each sample, SCIGA consolidates all of the outputs into one. It identifies the shared immunoglobulins that are potential public antibodies suitable for use against a specific pathogen. Shared immunoglobulins are defined as immunoglobulins from different samples that can be clustered into the same clonal lineage. Clustering is performed as described in Step 7 with cells of all samples.

Comparing SCIGA to existing software
At the time of this study, Cell Ranger is the only existing software for processing raw data generated by 10X single-cell V(D)J sequencing. A test dataset was therefore built to compare SCIGA to Cell Ranger. The PBMCs from 9 patients with COVID-19 (B1-B9) were collected and 10X single-cell V(D)J sequencing was performed ( Fig. 2A and Supplementary Table S1). The raw data were analyzed using SCIGA and Cell Ranger (v3.1.0) with the default parameters. The comparison mainly focused on the following aspects: (i) Cell quality control: for Cell Ranger, the final results still included low-quality cells that had either multiple heavy or light chains or only 1 chain. In our test dataset, the mean percentage of low-quality cells was 29% (range, 16.8-47.6% per sample, Fig. 2B). SCIGA implements the cell quality control process (Step 6 described in Methods) and only outputs the high-quality cells. The cell count was generally less in the output of SCIGA compared to Cell Ranger owing to the strict quality control process ( Fig. 2C). (ii) Detecting B cell clonal lineage: Cell Ranger clusters B cells into a clonal lineage when cells have identical nucleotide sequences of CDR3. However, it will break up the clonotypes that are clonally related in fact when the SHM falls within the CDR3 region. SCIGA uses a popular clonal grouping method (Step 7 of Methods), which considers the SHM and allows mismatch in the CDR3 region. Therefore, SCIGA could detect a larger clonal lineage than Cell Ranger (Fig. 2D). (iii) Output information: the output of Cell Ranger is quite limited, and some important information, such as SHM rate, is not included. SCIGA outputs the necessary information, including gene frequency, clone frequency, clonality, SHM rate, CDR3 length, the immunoglobulin variable region sequence, and others (Fig. 2E). Moreover, SCIGA is able to implement visualization to display the features of the repertoires. (iv) Detecting shared immunoglobulin: this is a specific function in SCIGA, and it could detect the shared immunoglobulin across samples.

Determining the features of immunoglobulin repertoires of COVID-19 using SCIGA
We performed a trial study to show the usage and performance of SCIGA. The 10X V(D)J sequencing data of the 9 patients with COVID-19 were analyzed, and features of the immunoglobulin repertoires were determined. A total of 8,358 B cells were detected (range, 571-2,371 cells per sample, Fig. 2C). We focused on the genes used in ≥1% of B cells for the V-gene usage ( Fig. 3A and Supplementary Fig. S1). The top 3 gene families were IGHV4-34 (12.51%), IGHV3-30 (7.95%), and IGHV3-23 (6.30%) for the heavy chain and IGLV3-19 ( (Fig. 3B). The SHM levels were low in the repertoires of most patients (<2%, Fig. 3C and Supplementary Fig. S2). However, Patient B2 showed a high SHM level in the IGH chain (7.48%) and IGL chain (7.13%). Moreover, Patient B2 had an elevated CDR3 length for its immunoglobulin repertoire ( Fig. 3D and Fig. S3).

Clonal lineage analysis using SCIGA
Clonal lineages were grouped using SCIGA with the default parameters. We used the Simpson index and Shannon entropy to determine the clonality of the immunoglobulin repertoires ( Fig. 4A and B). Both indices showed that the repertoires of Patient B2 underwent clonal expansion. The top 10 largest clones of each patient were reviewed ( Fig. 4C and Fig. S4). The frequency of the largest clone for most patients was <8%. However, the largest clone in Patient B2 reached a frequency of 61.21%. This strongly expanded clone used the IGHV4-34 and IGLV3-19 genes, with an 8.82% mean SHM rate and 23-amino acid H-CDR3 length. We determined the most used V-genes in the top 10 largest clones of all patients. It was observed that IGHV4-34 (9 clones) was the most used gene in the heavy chain, IGKV1-39 (9 clones) and IGKV3-20 (9 clones) were the most used genes in the light chain, and IGHV4-34: IGLV3-19 (5 clones) was the most used gene pair (Fig. S5A-C). Next, the shared immunoglobulin sequences across patients were determined using SCIGA. There were 12 immunoglobulins shared between Patients B1 and B2, 1 immunoglobulin was shared between Patients B5 and B8, and 26 immunoglobulins were shared between Patients B6 and B9 ( Fig. 4D and Supplementary Table S2).

Identification of neutralizing monoclonal antibodies
It was hypothesized that IgGs with higher clonal expansion may be SARS-CoV-2-specific antibodies in the patients with COVID-19. Thus, monoclonal antibodies (mAbs) were screened for the following criteria: IgG antibodies in the clone with fraction ≥1% and cell number ≥20 (see Supplementary Methods and Mate-rials). Four mAbs met the criteria and were expressed: B2-C1, B6-C2, B6-C3, and B8-C1 ( Fig. 4C and Supplementary Table S3).

Discussion
In this study, we developed the SCIGA software for 10X singlecell immunoglobulin repertoire analysis. SCIGA is easy to use and allows researchers to quickly perform advanced analysis on 10X V(D)J sequencing datasets. Cell Ranger has previously been used for 10X single-cell immunoglobulin repertoire analysis. However, this software includes low-quality cells in the output and disregards the effect of SHM when defining clonal lineage. In addition, some important information about repertoires, including, e.g., the level of SHM, is not included in the output of Cell Ranger. SCIGA performs the quality control process for cells and defines the clonal lineage including the effect of SHM. Larger clones can be detected by using SCIGA. Moreover, SCIGA generates the needed statistical outputs and implements visualization. It is therefore a more efficacious tool for researchers. In this study, SCIGA was used to analyze the single-cell immunoglobulin repertoires of the PBMCs of patients with COVID-19. Large-scale clone expansion was not observed in most patients. In Patient B2, however, B cells expanded, which was indicated by a large size of clonal lineage with the IgG isotype. This   indicates that Patient B2 likely generated neutralizing antibodies against SARS-CoV-2. Finally, we tried to identify the SARS-CoV-2-responding antibodies. In previous work, immunoglobulins with an SHM rate <2% were excluded in screens for neutralizing antibodies [8]. However, some studies have shown that several potent neutralizing antibodies against SARS-CoV-2 have low SHM rates [16][17][18]. Therefore, we included the antibodies with low SHM levels in our work. Four neutralizing antibodies with different potency were identified using our criteria. This demonstrates that SCIGA is useful for 10X single-cell immunoglobulin repertoire analysis.

Data Availability
The datasets supporting the results of this article are available in the NCBI repository and can be accessed with PRJNA682839.
All additional supporting data and materials are available in the GigaScience GigaDB database [19].  Figure S1:The usage frequency of V-genes in the repertoire of each patient. Only the genes with frequency >1% are shown. Figure S2: Distribution of the SHM rate in the repertoire of each patient. Color denotes the chain, with IGH shown in red, IGK in yellow, and IGL in blue. Figure S3: Distribution of the CDR3 length in the repertoire of each patient. Color denotes the chain, with IGH shown in red, IGK in yellow, and IGL in blue. Figure S4: Fraction of all clones in the repertoire of each patient. The x-axis captures the clone rank and y-axis captures the clone fraction. * Selected antibody candidate.  Table S1: Information on the 9 patients with COVID-19. Table S2: Information on the shared immunoglobulins. Columns 7-11 denote the number of shared immunoglobulin sequences.

Ethics, Consent, and Permissions
This study was conducted according to the ethical principles of the Declaration of Helsinki. Ethical approval was obtained from the Research Ethics Committee of Shenzhen Third People's Hospital (2020-207). All participants provided written informed consent for sample collection and subsequent analyses. Raw data has been approved for release by the Ministry of Science and Technology (MOST) of People's Republic of China, with the reference number 2021BAT3238.