Single-cell transcriptomic landscape of nucleated cells in umbilical cord blood

Abstract Background For both pediatric and adult patients, umbilical cord blood (UCB) transplant is a therapeutic option for a variety of hematologic diseases, such as blood cancers, myeloproliferative disorders, genetic diseases, and metabolic disorders. However, the level of cellular heterogeneity and diversity of nucleated cells in UCB has not yet been assessed in an unbiased and systemic fashion. In the present study, nucleated cells from UCB were subjected to single-cell RNA sequencing to simultaneously profile the gene expression signatures of thousands of cells, generating a rich resource for further functional studies. Here, we report the transcriptomes of 17,637 UCB cells, covering 12 major cell types, many of which can be further divided into distinct subpopulations. Results Pseudotemporal ordering of nucleated red blood cells identifies wave-like activation and suppression of transcription regulators, leading to a polarized cellular state, which may reflect nucleated red blood cell maturation. Progenitor cells in UCB also comprise 2 subpopulations with activation of divergent transcription programs, leading to specific cell fate commitment. Detailed profiling of cytotoxic cell populations unveiled granzymes B and K signatures in natural killer and natural killer T-cell types in UCB. Conclusions Taken together, our data form a comprehensive single-cell transcriptomic landscape that reveals previously unrecognized cell types, pathways, and mechanisms of gene expression regulation. These data may contribute to the efficacy and outcome of UCB transplant, broadening the scope of research and clinical innovations.


INTRODUCTION
Human umbilical cord blood (UCB) is an excellent source of hematopoietic progenitor cells. It has been widely used for bone marrow reconstitution for decades [1,2]. The progenitor cells contained in UCB are capable of regenerating the entire lympho-hematopoietic compartment in the host. The most notable advantage of UCB transplant is the low risk of developing graft-versus-host disease (GVHD), even when donor and recipient are partially mismatched [3]. The immune cells in cord blood are virtually free from external stimulant and infection and thus are in a relatively more naïve stage. Such immunological immaturity is the key to alleviate the severity of GVHD by decreasing the alloreactive potential of lymphocytes [2,4]. These advantages expand the clinical potential of UCB transplant in many cases including some fatal diseases. The major limitation of UCB transplant, however, is the limited and inconsistent cell dose. It has been shown that the success rate of engraftment was critically dependent on the number of nucleated cells in the donor UCB [4][5][6].
Although UCB is now widely used for important clinical applications, we know surprisingly little about its cellular and molecular characteristics. Especially, the composition of progenitor, lymphocyte and other nucleated cells that affect the reconstitution potency after UCB engraftment is poorly understood. Recent advances in single-cell transcriptomics technology enable the exploration of cellular heterogeneity and deduction of functional relevance [7,8]. Single-cell RNA-seq (scRNA-seq) studies of human peripheral blood (PB) cells have revealed new insights into immune cell composition and disease-related functional abnormalities [9][10][11]. 4 Previous studies conducted in mouse and human have focused on hemopoietic stem cell, erythroblast and certain T cell subtypes and unveiled novel biological properties at single-cell level [12][13][14][15][16][17]. However, scRNA-seq studies have not thoroughly to different cell enrichment approaches used (Methods). The abundance of the common cell types also varied in PB versus that of UCB, suggesting a specific immunological capacity of UCB (Fig. 1C, Supplementary Fig. 4B). We focus the scope of current study 7 in a few cell types that have profound clinical applications. However, the cellulome landscape of UCB data constitute a rich resource that can be used as a reference to complement transcriptomics analysis performed in bulk or single-cell settings, as well as a guide to future functional studies.

Polarity of cord nucleated red blood cell
In mammal hematopoiesis, NRBCs, or erythroblast, undergo several developmental stages in the bone marrow and progressively decrease cellular volume and RNA content, while accumulating specific functional proteins such as hemoglobin ( Supplementary Fig. 5A). The cell ordering along the trajectories deduced by the two algorithms showed remarkable concordance ( Supplementary Fig. 5B).
Next, we modeled gene expression along the Monocle2-inferred trajectory to identify genes characterized by a wave-like pattern. The most prominent of these were the genes encoding surface markers and proteins critical to the function of red blood cells, such as CD47, CD36, hemoglobin and glycophorins [30] (Fig. 2B) (Fig. 2D) and expressed genes ( Fig. 2E) across the pseudotime axis was observed, reflecting the decrease of global gene expression activity due to the NRBC enucleation, supporting the correlation between linear polarity and the cord blood NRBCs maturation. These lines of evidence further corroborated the polarity identified in the NRBC population in UCB, and strongly indicated that the differential activation of transcriptional programs was one of the underlining mechanisms.

Molecular signatures of UCB progenitor cell
A distinct progenitor population was found in the UCB that shared a similar transcriptome profile with the hematopoietic stem cells (HSCs) in the PB dataset ( Fig.   1A). However, when the tSNE clustering was performed with the progenitor population in a finer resolution, a secondary subpopulation emerged, demonstrating the heterogeneity of progenitor population in the UCB (Fig. 3A). One subpopulation of UCB progenitor cells overlapped with HSCs in PB and specifically expressed the canonical HSC marker genes such as CD34, SOX4 and FLT3 (CD135) (Fig. 3B, triangles), suggesting their identity as cord blood HSCs. Interestingly, the other subpopulation consists cells only from the UCB (Fig. 3A, dots) and did not express the HSC canonical markers (Fig. 3C, 3D) despite the similarity in overall spectrum of gene expression, which drove the clustered embeddings of these cells in the tSNE space.
Permutation tests were performed to estimate the significance of the four-way intersection in both cases and the resulted p values were both < 3x10 -16 . These two sets of signature genes (31 and 22) that we found were defined as GZMB and GZMK co- Similar analysis was performed in the PB cells, and we found the core modules largely consistent with that in UCB, though the GZMK core module was less prominent (Fig.   5D, red labeled genes). These enriched genes in the two programs that we identified represent common features of the GZMB + and GZMK + subtypes of cytotoxic cells.
They may serve as specific selection markers and targets for perturbation in further functional studies.

DISCUSSION
For the first time, we present here a single-cell level transcriptomic landscape of nucleated cells in UCB. By analyzing the expression pattern of known marker genes, we identified UCB cells belonging to almost all of the major hematopoietic lineages in PB, covering lymphoid, myeloid and hematopoietic progenitor cells. We also observed that certain cell populations were highly enriched in UCB cells, such as NRBCs, uIBCs and GZMB + NKT cells. The features we discovered regarding these cells were consistent in both UCB donors. However, it is important to keep in mind that the UCB donors' shared factors, such as genetic background, could contribute to the enrichment of these UCB-specific cell subtypes. A related technical challenge in the current study that we encountered was the severe batch effect among sample types and donors. To minimize the technical variance that could lead to misinterpretation of the data, we rigorously tested three widely used algorithms for batch effect correction, namely, CCA, SVA and MNN. Based on a quantitative evaluation of cell segregation in the tSNE space, performance of MNN and CCA appeared comparable and effective for our datasets, though MNN scored marginally higher.
In adults, red blood cells are generated mainly in the bone marrow from nucleated cells identified as erythroid precursors. These cells undergo morphological changes through cell divisions and gradual decrease in cell size and RNA species, increase in chromatin condensation and hemoglobin protein accumulation. Such changes have been associated with the early stages of maturation of red blood cell. In 17 our dataset we also observed such a dynamic cellular state in a linear polarity. While it is possible that the erythroid precursors at different stages in UCB may be migrated from the bone marrow, our finding also suggested the possibility that the erythroid precursors may undergo a similar maturation process in the UCB.
Progenitor cell populations in UCB also appeared to be a mixture of at least two distinct subpopulations. It is conceivable that the HSC subpopulation (CD34 + ) we identified may be a mixture of hematopoietic stem cells and various early multipotent progenitors committed to differentiation, which was termed primed progenitors and extensively discussed in a recent study profiling UCB HSC at single-cell level [16].
Taken together, our data provides the first single-cell transcriptomic references for UCB, which could be used as a standard dataset for comparative analysis. We expect that this dataset will prove useful in uncovering the novel molecular signatures that define the cellular heterogeneity in UCB and provide markers for targeted enrichment of certain cell types of interest to researchers in multiple fields. Our dataset is a rich resource to formulate hypothesis of signaling pathway activation, transcription control and other mechanistic studies in the field of functional immunology at single cell level .   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 20

Sample collection
Two umbilical cord blood samples were collected from healthy donors immediately after caesarean section with informed consents. Samples were stored in EDTA anticoagulant tubes and transported to laboratory within 1 hour. CD45 + and CD45cells were isolated from 1 mL cord blood by positive and negative selection, https://support.10xgenomics.com/single-cell-gene-expression/datasets.

UCB library construction and sequencing
Single-cell suspension of UCB samples was loaded to Single-cell 3'Chips (10×

Alignment and initial processing of sequencing data
CellRanger toolkit (10X Genomics, USA, version 2.0.0) was employed to align the cDNA reads to GRCh38 transcriptome. Filtered UMI expression matrices of both samples were generated with the default parameters and an additional "--force-cells=4000" parameter [78]. The expression matrices of all samples were first normalized by "cellranger aggr" function in the CellRanger toolkit, with the parameter "--normalize=mapped". As a result, raw expression data of total ~32,000 single cells of UCB sample was generated.

Cells clustering in individual UCB samples
Next, the filtered expression matrices of UCB1 and UCB2 were used for variable genes were used for "RunPCA" function. Subsequently, the top 10 PCs were subjected to "FindClusters" and "RunTSNE" function with high resolution setting at 2.0 ( Supplementary Fig. 1A). In the dimensional reduced tSNE space, the clusters of NRBCs were identified on the basis of the concerted expression of hemoglobin genes, such as HBG1 and HBM ( Supplementary Fig. 1B). Then we bioinformatically isolated the total of 672 NRBCs from UCB1 and UCB2 as a sub-dataset for further analyses.
The NRBC-excluded data were then subjected to merging and batch effect removal.

Batch effects correction
Strong technical bias introduced by sample preparation, library construction and/or sequencing was observed in the merged data (Supplementary Fig. 2A We tested different parameters when processed CCA analysis, and observed best performance while chose 15 canonical vectors and 1,500 shared high variable genes. For MNN, we first created a SingleCellExperiment object to store the counts and metadata together for each sample, using SingleCellExperiment package (1.3.10).
These cells were pre-clustered by quickCluster function. Size factors was computed for the endogenous genes using the deconvolution method by computeSumFactors function [79]. We then acquired the normalized log-expression values and distinguished highly variable genes by trendVar function and decomposed the genespecific variance into biological and technical components by decomposeVar function.

Evaluation of batch correction
The alignment scores of the methods above were calculated based on tSNE plots according to the strategy of previously study [19]. First, neutrophil and eosinophil that were only present in UCB datasets were masked from the datasets. Then, we randomly sampled cells from the four datasets with same number of cells and constructed a nearest-neighbor graph based on their relative positions in tSNE space. For each sampled cell, we calculated the cell numbers from the dataset sample in the k nearestneighbors and average with total cells to obtain . The alignment score was then calculated as following: The alignment scores were normalized by size of the datasets and scaled to range from 0 to 1. For Supplementary Figure 2E, the parameters used were k = 800, N = 4. As shown, alignment score of MNN was marginally higher than that of CCA. To   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 25 rule out the potential bias from the arbitrary selection of k, we tested different k from 100 to 1,000, and observed that the high scores by MNN was independent of k selection ( Supplementary Fig. 2F).

Cell type annotation
After batch-correction by MNN, the merged expression matrix was further filtered following the typical Seurat pipeline. Specifically, ribosomal genes were removed and cells with mitochondria gene UMI percentage high than 10%, and cells with more than 11,000 total UMI counts were removed. Then the expression matrix was normalized by NormalizeData function. The corrected expression matrix was used to perform dimensionality reduction following the typical Seurat pipeline. Next, 3,556 variable genes in the batch-corrected expression matrix were used for RunPCA, ProjectPCA, FindClusters and RunTSNE functions with default parameters, except dims.use = 1:13 and resolution = 2.
Subsequently, the feature genes for each cluster were identified using normalized data by the Seurat FindAllMarkers function with parameter min.pct = 0.25, thresh.use = 0.25. Four minor clusters with ~5% (same as estimated by 10X Genomics, USA) of total cells were suspected as doublets as they share feature genes from two adjacent large clusters were removed from the datasets. The identity of each cell cluster was manually annotated by the specific expression of commonly known markers.
Unsupervised annotation by comparing averaged single cell expression levels with bulk RNA-seq data of sorted immune cells was also performed to validate the results as  Fig. 4A).  Fig. 5B).

Clustering and pseudotime analysis of UCB progenitor cells
UCB progenitor cells were re-clustered using Seurat packages, same as in the global clustering described above. In order to visualize the potential transition of cell

Cytotoxic cell clustering and profiling
Cytotoxic cells of interest were selected by unsupervised clustering at resolution=2 by the FindClusters function in Seurat package ( Supplementary Fig. 6D).

Signature gene selection in GZMK + and GZMB + subtypes
To identify the common features of the GZMK and GZMB programs in the cytotoxic cells (Fig. 5), the GZMB/GZMK expressing NK, NKT and CTL subtypes were used to create a new Seurat object by SubsetData function. The function FindAllMarkers was used to identify corresponding features genes of each clusters with parameter min.pct = 0.25, thresh.use = 0.25.
The four-way Venn diagrams of feature genes shown in Fig.5A and B were generated using R package VennDiagram. To verify the statistical significance of the enrichment of the four-way-overlapped genes (GZMB/GZMK program genes), One Sample t-test was carried out by testing the mean number of overlapping genes from randomly sampled pools of genes, sizes of which was kept the same as the original feature genes in the four subtypes. Co-expression modules in Figure 5C

ACKNOWLEDGMENTS
We thank the two donors who generously provided the UCB samples. We also thank Liqin Xu, Zhikun Zhao for helpful discussions and BGI colleagues who have helped producing the high-quality data. This work was supported by Shenzhen Municipal Government of China (JCYJ20170817145404433 and JCYJ20170817145428361)

Figure 3: Heterogeneous molecular signatures of progenitor cells in UCB
A. The re-clustered tSNE projection of progenitor cells from UCB and PB samples.
After extensive re-analysis and detailed processing and comparisons of various pipelines, including significant efforts to minimize the batch effect of the data, we are now quite confident that we have comprehensively addressed all the concerns from the reviewers, and the quality of the manuscript has been significantly improved. Therefore, we sincerely hope you can re-evaluate our manuscript for publication in your journal.
We believe our work will be of broad interests to the readers of Gigascience. The manuscript has been approved by all the authors, and is in adhere to all the ethical guidelines. A detailed point-topoint response letter is attached to fully address all the concerns from the reviewers. We have also re-written the manuscript and highlighted the major additions and revisions from the former edition.
Thank you for all your help to improve the quality of the manuscript and we look forward to your favorable reply.