Shortcut barcoding and early pooling for scalable multiplex single-cell reduced-representation CpG methylation sequencing at single nucleotide resolution

Abstract DNA methylation is essential for a wide variety of biological processes, yet the development of a highly efficient and robust technology remains a challenge for routine single-cell analysis. We developed a multiplex scalable single-cell reduced representation bisulfite sequencing (msRRBS) technology. It allows cell-specific barcoded DNA fragments of individual cells to be pooled before bisulfite conversion, free of enzymatic modification or physical capture of the DNA ends, and achieves read mapping rates of 62.5 ± 3.9%, covering 60.0 ± 1.4% of CpG islands and 71.6 ± 1.6% of promoters in K562 cells. Its reproducibility is shown in duplicates of bulk cells with close to perfect correlation (R = 0.97–0.99). At a low 1 Mb of clean reads, msRRBS provides highly consistent coverage of CpG islands and promoters, outperforming the conventional methods with orders of magnitude reduction in cost. Here, we use this method to characterize the distinct methylation patterns and cellular heterogeneity of six cell lines, plus leukemia and hepatocellular carcinoma models. Taking 4 h of hands-on time, msRRBS offers a unique, highly efficient approach for dissecting methylation heterogeneity in a variety of multicellular systems.


Introduction
Cytosine methylation at DNA CpG dinucleotides, or DNA methylation ( DNAme ) , is an important epigenetic mechanism that regulates gene expression in health and disease.Whole genome bisulfite sequencing ( WGBS ) ( 1 ) and reduced representation bisulfite sequencing ( RRBS ) ( 2 ) accurately quantify the methylation of cytosine at single-base resolution genomescale, and remain the gold standard for CpG methylation analysis.More recently, single-cell CpG methylation sequencing has shown its unique ability to explore the heterogeneity of the epigenetic state among cells in a variety of biological systems (3)(4)(5)(6)(7)(8) .In addition, it is highly beneficial for profiling of the methylomes of cells from rare sources, such as mammalian blastomeres and some liquid biopsies, which are difficult to examine with bulk analysis methods ( 9 ,10 ) .The first singlecell RRBS ( scRRBS ) report digitally detected the methylation status of single CpG sites in haploid sperm, covering up to 1.5 million CpG sites in the mouse genome ( 9 ) .Then, single-cell WGBS, scBS ( 11 ) , was developed using post-bisulfite adapter tagging ( PBAT ) (11)(12)(13) , which could detect up to 48.4% ( 10.1 of 20.87 million ) CpG sites within the mouse genome.All of these methods ( 9 , 11 , 13 ) are based on independent processing of each single cell from cell isolation to library construction with PCR.
Early sample-pooling greatly reduces DNA damage and the cumbersome cell-by-cell bisulfite conversion and post processing.There have been only a few techniques that are able to pool multiple single cells before bisulfite conversion until recently.The outstanding methods for high-throughput singlecell WGBS are represented by snmC-seq2 ( 14 ) , sci-METv2 ( 5 ) and sciEM ( 8 ) , showing their power in the dissection of cellular heterogeneity of the DNA methylome despite the shallow depth of coverage.Although the WGBS strategy theoretically profiles the entire genome at each cytosine, it is costprohibitive when a large number of individual cells require analysis and high coverage of CpG sites is desired.Moreover, PBAT, used in scWGBS, prevents the labeling and pooling of every single cell before bisulfite conversion, which results in a practical pattern where only sparse and usually inconsistent CpG sites are detectable across single cells.
In contrast, RRBS enriches the CpG dinucleotide-dense regions that contain key elements in CpG methylation regulation, like CpG islands, promoters and other important gene regulatory elements ( or regions ) .This makes RRBS a more sequencing-cost efficient method than WGBS, and it allows for deeper coverage of areas of interest, albeit with a relatively lower alignment rate.Adapting the RRBS principle (15)(16)(17) , we have developed an extraordinary strategy for scalable, multiplex single-cell RRBS ( msRRBS ) .msRRBS enables a panel of single cells to be pooled and bisulfite converted in a single tube, immediately after cell lysis and restriction enzyme digestion, followed by a novel shortcut ligation of barcodeadapters without any enzymatic modification or physical capture of the target DNA fragments.msRRBS is highly efficient in library construction, with consistent coverage of the key elements from both strands, as well as orders of magnitude reduction in the cost per cell, in addition to highly sequencing efficient, comparing to conventional low throughput scRRBS and scWGBS, while robust results are obtained as long as the genomic integrity of the cell is retained.Furthermore, sequencing data is tracible to each input cell, and the multiplex scale is extendable to high throughput.Our results are based on tests of hundreds of single cells and low-input tissue samples from both humans and mice generated from eight cell lines, cytarabine-resistant myeloid leukemia cells and hepatocellular carcinoma cells from a mouse model.

Preparation of samples
Cell lines K562, GM12878, 293T, MGC803, MDA-MB-231, HT22 and CT26 were grown in their suggested media supplemented with 1% penicillin and streptomycin and maintained in a 37 • C humidity-controlled incubator with 5% CO 2 .
The cytarabine ( Ara-C ) -resistant KG-1a cell line ( KG-1a_R ) was generated by direct high-dose drug shock on KG-1a, which is closer to the practical scenario in clinics.Specifically, the KG-1a cell line cultured in vitro with DMEM medium containing 10% fetal bovine serum was treated with 1 μM Ara-C for 3 days, followed by a 9-day drug holiday to restore cell numbers, while the medium was changed every 3 days.Subsequently, KG-1a_R was resistant to 1 μM Ara-C, and 1 μM Ara-C was added to cell culture to maintain the resistance.
Along with three healthy mouse controls, two hepatocarcinoma C57BL / 6 mice aged 4-5 months were used, which were generated with liver-specific MYC overexpression, approved by the Southern Medical University Experimental Animal Ethics Committee and conducted in strict accordance with the Committee's guidelines.DNA was extracted from cell line K562 and liver tissues of HCC mice and healthy mice with the cell / tissue genomic DNA extraction kit ( Tiangen, cat.no.DP304 ) according to the manufacturer's instructions.

msRRBS library preparation
Firstly, two adapters were generated.For barcode adapter ( AdBc ) , oligonucleotide AdBcS ( S = short ) four times the number of moles of the AdBcL ( L = long ) were annealed with 1 × T4 DNA Ligase Reaction Buffer ( New England Biolabs, Cat.no.B0201S ) and heated to 94 • C for 3 min, and then rapidly cooled to 80 • C, followed by slow cooling to 4 • C at a rate of 0.1 • C s −1 .For sequencing primer adapter AdSp, an equal volume of moles of two oligonucleotides was used for the annealing process described above.All adapters are available in Supplementary Tables S1 and S2.
To digest the genome of each cell, 2.5 μl lysis mixture, containing 0.125 μl protease ( Qiagen, cat.no.19155 ) , 0.075 μl 10% Triton X-100, 10 mM Tris-EDTA, 20 mM KCl and 2 μl nuclease-free water, was added into each well ( or tube ) with a single cell contained in 1 μl PBS.After incubation at 50 • C for 15 min, the wells were heated at 70 • C for 30 min.In each of the lysis mixture ( 3.5 μl ) wells, 36.5 μl was added, containing 27.5 μl nuclease-free water, 1 μl carrier RNA ( 1 μg, from Qiagen, cat.no.59824 ) , 4 μl of Dr.GenTLE precipitation carrier ( Takara, cat.no.9094 ) and 4 μl sodium acetate.After mixing the contents of the well, 112 μl of pre-cooled 100% ethanol ( −20 • C ) was supplied to each well.The strips of the wells or plates were then incubated at −20 • C for 10 min, and centrifuged at 13 300 rpm at 4 • C for 15 min.Then, the supernatant in each well was removed carefully to avoid disturbing the white precipitate, and the genomic DNA pellet was washed with 200 μl of 80% ethanol ( −20 • C ) and then the wells were centrifuged at 10 000 rpm at 4 • C for 10 min.The residual liquid was discarded, and the genomic DNA in each well was air dried with caps open until its color turned transparent.The genomic DNA was then incubated in a 3 μl PAGE 3 OF 14 mixture that contained 0.1 μl nuclease-free water, 1 μg carrier DNA, 1 × Tango buffer ( Thermo Fisher, cat.no.BY5 ) , 6 U MspI ( Thermo Fisher, cat.no.ER0541, 10 U / μl ) and 60 fg unmethylated lambda-DNA ( Promega, cat.no.D1521 ) at 37 • C for 2.5 h.
AdBc ligation was performed in 5 μl solution containing the 3 μl of digested DNA from the previous step, 0.6 μM AdBc, 0.2 μl of Fast-Link DNA Ligase ( Lucigen, cat.no.LK6201H ) , 1.2 mM ATP, 0.3 μl 10 × fast link ligation buffer and 0.6 μl of nuclease-free water at 25 • C for 20 min, followed by incubation at 16 • C for 15 min ( the temperature can be held up to 14 h ) and 20 min at 25 • C, followed by 75 • C for 20 min.Finally, 25 mM EDTA ( Thermo Fisher, cat.no.AM9260G ) was added to the tubes, which were then incubated at 37 • C for 15 min with the lid set at 50 • C; after incubation the samples were spun down.Each batch of processed samples was pooled in a new tube, 200 μl or 1.5 ml, according to sample number.The pooled DNA was purified using 1.5 × volume of AMPure XP beads ( Beckman coulter, cat.no.A63880 ) following the manufacturer's instructions, and a total of 18 μl DNA solution was collected.Then, 2 μl adapter filling mixture containing 0.375 mM 5-Methylcytosine dNTP Mix ( Zymo Research, cat.no.D1030 ) , 0.75 μl of Thermopol Reaction buffer ( New England Biolabs, cat.no.B9004S, 10 ×) and 0.5 μl of Sulfolobus DNA polymerase IV ( New England Biolabs, cat.no.M0327S ) was added to the 18 μl DNA solution and incubated at 55 • C for 30 min.
The bisulfite conversion mixture was made of 1 μg carrier DNA ( a synthesized 24 bases of artificial oligonucleotides ) , 19 μl nuclease-free water, 85 μl bisulfite solution and 15 μl of protect buffer ( Qiagen, cat.no.59824 ) .The volume of 20 μl of DNA solution ( after filling ) was incubated in 120 μl bisulfite conversion mixture at 95 • C for 5 min, 60 • C for 10 min, 95 • C for 5 min and 60 • C for 20 min.Purification was performed according to manufacturer's instructions and the converted DNA was then eluted in 32.5 μl nuclease-free water ( 17 μl × 2 times ) through a column from the DNA Clean & Concentrator kit ( Zymo Research, cat.no.D4003 ) .
To generate the final library, a 50 μl mixture for the firstround of PCR was prepared with 32.5 μl bisulfite converted DNA solution, 2.5 U T akara Epi T aq HS ( T akara, cat.no.R110A, 5 U / μl ) , 2.5 mM MgCl 2 , 5 μl 10 × Takara Taq PCR buffer ( Mg 2+ Free ) , 0.1 mM dNTP and 1 μM PreAP.Thermal cycling reaction was performed as follows: 95  C for 2h, followed by heating at 65 • C for 20 min.After that, purification was performed using the DNA Clean & Concentrator kit and the DNA was eluted with 15 μl nuclease-free water.A 25 μl ligation mixture was prepared by adding the 15 μl DNA solution from the previous step, 5 μM sequencing adapter AdSp, 1 μl Fast-Link DNA ligase ( Lucigen, cat.no.LK6201H ) , 3 μl 10 × Fast link ligation buffer and 1mM ATP.Ligation reaction and inactivation of enzymes were performed as described above.Finally, a total of 20 μl DNA solution was collected using DNA Clean & Concentrator kit.DNA fragments between 125 and 300 bp ( size extendable up to 700 bp ) were recovered by 2% E-gel ( Invitrogen, cat.no.G402002 ) and purified using Zymo Gel DNA recovery kit ( Zymo Research, cat.no.D4007 ) following the manufacturer's instructions.A Qubit fluorometer ( Thermo Fisher, cat.no.Q33238 ) was used to measure the concentration of the eluted DNA solution.The PCR mixture was prepared with approximately 10 ng of DNA recovered by size selection on gel, 12.5 μl GC-rich Phusion High-fidelity 2 × master mix ( Thermo Fisher, cat.no.F532S ) , 1 μM primer P5, 1 μM primer P7 and nuclease free water up to 25 μl final volume.A final library amplification was carried out with the program: 95 • C for 1 min; 4-5 cycles of ( 95 • C 30 s, 57 • C 30 s, 72 • C 45 s ) ; 72 • C for 10 min; 4 • C hold.The DNA Clean & Concentrator kit was used to clean up excess library primers, and 20 μl of the eluted library was used for the final size selection, as descried above.Finally, a total of 25 μl of DNA fragments of 175-350 bp in length ( the upper limit is extendable to 550 or 750 bp when specified ) was obtained.Optimal concentration of the DNA solution from the second round of DNA size selection was ≥2 ng / μl.All libraries were sequenced using the Illumina HiSeq X Ten platform with 10-30% PhiX spike-in and 150-bp paired-end reads.

Data analysis
The 4-bp of marker sequence, GGTG, or a random sequence, NNNN and 6-bp sample barcode were located at the beginning of both reads.FASTQ files were first split by pre-assigned sample barcodes.No mismatches were allowed.If the average sequencing quality of the barcode was lower than Q20, the whole read was filtered out.The marker sequence, sample barcode and Illumina adapters were trimmed from read 1 and read 2 with Cutadapt ( 18 ) and Trimmomatic ( 19 ) .Notably, the 3 end of the reads often contained the inverse complementary sequence of the fixed sequence and sample barcode, because the recovered fragments of the library were approximately 30-200 bp if it was not specified ( otherwise, 30-400 bp or 30-600 bp ) , and the sequencing strategy was 2 × 150 bp.Here, the clean reads were obtained.
The clean reads were then aligned to the reference genome ( GRCh38 or GRCm38 ) in paired-end mode using the Bismark ( 20 ) tool.Calculation of the conversion rate and DNAme calling were performed as descried ( 21 ) .CGmapTools ( 22 ) software package was used to annotate the genome elements of CpG sites, so as to obtain the corresponding information of different CpG sites on chromosomes.The data of genomic elements were downloaded from the UCSC Genome Browser, including CGI ( CpG islands: GC content of 50% or greater, contains at least 200 bp, ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment ) , CGI shores ( regions flanking and up to 2000 bp away from CpG islands ) , CGI shelves ( regions flanking and up to 2000 bp outwards from a CGI shore ) , promoters ( containing CGIs and 1500 bp upstream and 500 bp downstream of all transcription start sites of protein-coding genes ) , genes, exons, transcripts, CTCF binding sites and enhancers.Duplication or redundant reads of genomic elements were removed according to the coordinates of the chromosomes; that is, only one of the genomic elements with identical genomic coordinates remained.
A methylation signal matrix was used to calculate the correlation between different samples or groups.Cluster analysis through principal component analysis ( PCA ) , tdistributed Stochastic Neighbor Embedding ( tSNE ) , and multidimensional scaling ( MDS ) was also performed.Calcu-lateDiffMeth ( 23 ) function or DSS package was used for differential analysis of CpG sites or elements ( DMC or DMR ) , and the obtained results were visualized by heat map or volcano map to display the differential DNAme information between different samples or groups.Homer ( 24 ) was used to annotate the DMRs according to reference genomic elements to explore the potential phenotypes and functions corresponding to differential methylation levels in different samples.Differentially methylated genes were screened for GO or KEGG pathway enrichment to reveal the potential relationship between methylation-level information and phenotype.

msRRBS procedure overview
We sought to provide a middle to high-throughput RRBS protocol ( Figure 1 and Supplementary Figure S1a ) through early labeling of each cell for multiple single cells in parallel, followed by subsequent pooling of the panel of cells for singletube transformation.Thus, it makes the procedure highly efficient and robust in operation with lower cost, while generating high quality data consistent among the cells.A summary of the protocol follows.First, a panel of single cells in different wells or tubes obtained by pipette aspiration, flow sorting, or microfluidic delivery are lysed to fully expose genomic DNA.Second, CpG-rich DNA regions of the genome for each cell are fragmented with the enzyme MspI.Third, the double-stranded genomic fragments with sticky ends from each cell are directly ( without end repairing and A-tailing ) ligated to a specifically designed adapter ( AdBc ) , composed of a longer strand and a shorter strand ( AdBcL and AdBcS ) .AdBcL is built with the cell-barcode at its 3 end consisting of a combination of six nucleotides to label each cell, and the BciVI recognition site ( 5 -GT A TCC-3 ) follows the barcode; its 5 end is built with a tag for priming PCR amplification after bisulfite conversion, in which all the cytosines ( Cs ) in the AdBcL are methylated.AdBcS is complementary to the 3 end of AdBcL.After ligation, the ligase is inactivated by heating and AdBcS, whose 5 end is modified such that it does not ligate to the 3 end of the digested DNA fragments, is dissociated away, while AdBcL keeps its covalent bond with the genomic fragments.Fourth, all constructs of adapter-DNA fragments barcoded for each cell for the panel of cells, in which each is built with different barcode, are then pooled into a single reaction tube for subsequent processes.Fifth, Sulfolobus DNA polymerase IV is employed to fill in the adapter by an extension of the 3 end of the DNA fragments along AdBcL; the reaction includes methylated C but no unmethylated C. Sixth, bisulfite is used to convert unmethylated C to U. Seventh, the first round of PCR is performed with a limited number of cycles.Eighth, the adapter / primer portion of the amplicon is excised with BciVI, and the barcodes of the cell are retained.Importantly, this cleavage leaves an 'A' tail at the 3 end of the dsDNA to enable the ligation of the conventional library adapter that has a 'T' tail at its 3 end for the second round of PCR.Ninth, a second round of PCR amplification is performed incorporating an index for each pool of cells; library products of 175-350 bp ( alternatively 550 or 750 bp as the upper limits when specified ) are then isolated by size selection through agarose gel or its alternatives.After purification, the obtained library is sequenced ( Supplementary Figure S1b, c ) .An aliquot of with a cell-specific barcode composed of a relatively long strand ( AdBcL ) and a short strand ( AdBcS ) .Because AdBcS with its 5 end is just adjacent spatially to, but no co v alent ligation to the genomic DNA fragment at 3 end, it falls off the AdBcL when the ligase is inactivated at 75 • C. Then multiple single cells are pooled into a single tube.With the addition of 4 nucleotides including methylated C, while no unmethylated C is included, the new AdBcS is re-synthesized along AdBcL with a DNA polymerase, Sulfolobus DNA polymerase IV, to re-construct the double strand adapter.Here bisulfite con v ersion is emplo y ed, and the products are reco v ered b y PCR amplification.T he amplicon is digested with BciVI, and a sticky end with an ' A ' tail protruding at 3 end is generated, which enables this product to be directly ligated to a con v entional Y-shaped adapter that builds a 'T' tail protruding at 3 end.Finally, a minimum cycle number of PCR is f ollo w ed to finalize the sequencing library.
purified genomic DNA or a small amount ( micro-bulk ) of cells can be alternatively processed as a sample instead of a single cell.

Improved mapping rate and coverage of single cells achieved by msRRBS
We analyzed the basic features of msRRBS using the cell line K562.The average conversion rate was obtained by spiking 60 fg of unmethylated lambda DNA in the pooled samples for the msRRBS assay, which was 99.6% ( ranging from 99.5% to 99.9% ) , indicating that unmethylated Cs were efficiently converted to Ts in both single cell and bulk samples ( Supplementary Tables S3 and S4 ) .Therefore, the methylation status of CpG sites obtained in the sample data is reliable.Indeed, the average DNAme levels of cytosines in the CHG and CHH context, which are known to be nearly unmethylated in most mammalian cell types, were detected as being < 0.4%, further confirming the complete conversion.
To choose an appropriate fragment size range for the most cost-effective strategy in library construction, we generated libraries with four fragment ranges and a silicon-merging library after sequencing, which were 175-350, 350-550, 175-550, merged 175-550 and 175-750 bp.The merged 175-550 bp data was generated by silicon-merging the data of 175-350 and 350-550 bp libraries.The ideal and actual distributions of the insert sizes across sequencing libraries with different sizes were analyzed ( Supplementary Figure S2a-j ) .We also found that the increased sizes of target fragments in the sequencing library did not significantly raise the mapping rate.When the library sizes were recovered as aforementioned, accordingly the average mapping rates of raw reads in single cells ( msRRBS ) were 57.3 ± 3.7%, 58.8 ± 0.6%, 58.1 ± 1.2%, 62.5 ± 3.9% and 67.1 ± 1.8%.In bulk cells ( mRRBS ) , the average mapping rates were higher than that of single cells.However, the long range of fragments sequenced in bulk cells overall did not significantly result in higher mapping rates than the short range of fragments ( Figure 2 A ) .For individual samples, the mapping rate was up to 79.9% in single cells and to 89.5% in bulk cells over the whole study.
Importantly, when library size range increased from 175-350 to 175-750 bp, the coverage was raised substantially.For the two ranges of library size, the best detection values were respectively up to 2 196 284 and 3 791 693 CpG sites, and up to 20 312 and 22 383 CGIs ( corresponding to 72.5% and 79.9% CGIs ) in single-cell samples.These values were 4 673 052 and 7 188 812 CpG sites, and 24 051 and 25 115 CGIs ( corresponding to 85.9% and 89.7% CGIs ) in bulk samples.
When the library size ranges were recovered between 175-350, 350-550, 175-550, merged 175-550 and 175-750 bp, the average detection rates of CpG sites in single cells were 1.9, 1.3, 1.7, 2.9 and 3.1 million, respectively.While in bulk cells, these rates were correspondingly 3.8, 2.7, 4.2, 6.1 and 6.6 million ( Figure 2 B ) .Compared with 175-350 bp, the detection of CpG sites of merged 175-550 bp libraries increased by 37.3%.Interestingly, the merged 175-550 bp msRRBS data covered dramatically more CpG sites than that was obtained in the directly generated 175-550 bp library.It is also noted that the CpG sites detected in the 175-750 bp libraries did not significantly outperform the merged 175-550 bp libraries.Moreover, msRRBS in single cells recovered on average 31.7%± 0.7%, 39.3% ± 3.7, 35.5% ± 1.2% and 33.3% ± 5.8% of CpG sites across the whole genome from the libraries with variants of sizes aforementioned.With libraries of these sizes, the coverage increased with bulk cells, reaching to 55.5% ± 4.8%, 58.7% ± 2.8%, 59.3% ± 1.2% and 56.4% ± 2.0% of the whole genome, respectively ( Supplementary Figure S2k  and l ) .Therefore, when more CpG sites are desired in practice, particularly when analyzing precious and rare cells, the merged 175-550bp msRRBS ( recovery of 175-350 bp and 350-550 bp fragments separately and merged the sequencing data afterward ) is preferred.
Conversely, the coverage of the genomic elements, particularly CGIs ( Figure 2 C ) and promoters ( Figure 2 D, Supplementary Figure SS3 ) , showed just a slight extension when the library sizes were increased, either in single cells or bulk cells.Obviously, more CpG sites detection, such as in the merged 175-550 bp library than in 175-350 bp library, should improve the DNAme measurement representation and confidence of these genomic elements.
Furthermore, we compared msRRBS with conventional scRRBS, WGBS, XRBS and scCGI-seq to assess the proportion of reads associated with promoters, CpG islands, gene bodies and other genomic elements.It showed that msR-RBS obtained the highest proportion of unique reads associated with CGIs and promoters than all other methods compared ( Supplementary Figure S4 and Supplementary Table S5 ) .These results implied that in terms of efficient recovery of CGIs and promoters with a limit low sequencing depth, msRRBS won the game, followed by conventional scRRBS.

Outstanding reliability, reproducibility and efficiency enabled by msRRBS
As most CpG sites are enriched in short fragments with the 4-nucleotide restriction endonuclease MspI, for the ultimate sake of high throughput msRRBS with a minimal sequencing cost, we focused on the library sizes of 175-350 bp for the following analyses.At first, we tested 15 high-quality single cells in a batch as well as its corresponding bulk cells.For single cells, micro-bulk quantities of 10-200 cells and 10-200 ng of DNA extracted from K562 cells, there was no significant difference in CpG site methylation levels between groups ( Figure 2 F, P > 0.05 ) .It was found that the correlation coefficient ( R ) of the methylation profiles of two typical single K562 cells was 0.79 ( Figure 2 G ) , while this factor did reach up to 0.88 ( Figure 2 H ) when the correlation analysis was calculated between a single cell and a micro-bulk sample of approximately 50 cells.In addition, the correlation ( R ) reached 0.91 between the merged single cells analyzed by the msRRBS procedure, and true-bulk cells ( Figure 2 I ) .Considering the heterogeneity among single cells, we randomly merged the data of four single K562 cell samples and identified that the methylation pattern of CpG islands ( Supplementary Figure S5b ) and promoters ( Supplementary Figure S5c ) among cells were also basically similar.

PAGE 7 OF 14
To further assess the reproducibility of msRRBS and rule out the effect of epigenomic heterogeneity and quality variability among single cells, the correlation between the two bulk samples ( 50 ng ) of K562 cells was measured, with 0.97 ( R ) at coverage ≥10 × ( Figure 2 J ) .When analyzing CpG sites at depths of ≥3 ×, ≥5 ×, ≥15 × and ≥20 ×, the correlation coefficients ( R ) were 0.93 ( Supplementary Figure S5d ) , 0.95 ( Supplementary Figure S5e ) , 0.97 ( Supplementary Figure S5f ) and 0.98 ( Supplementary Figure S5g ) , respectively, indicating high robustness of the method.The overall methylation pattern of the co-detected CpG sites by mRRBS was highly consistent with that obtained by EPIC array analysis ( Figure 2 K, R = 0.95 ) of KG-1a, and a publicly available WGBS dataset of K562 ( Figure 2 L, R = 0.94 ) .
To analyze the coverage limitation and methylation level accuracy with different amounts of input, we compared microbulk samples versus single cells and a very low number of cells.When 200 K562 cells were pooled together, there were no significant differences in detection rate of CpG sites compared with pooled 100 cells, 10, 50 and 100 ng DNA ( P > 0.05, ttest ) , but there was significant difference when compared with 10 cells ( P < 0.01, t -test ) ( Supplementary Figure S5h ) .As expected, single-cell and bulk-cell samples displayed a consistent methylation level for each genomic element over the genome ( Supplementary Figure S5i ) .
To evaluate any possible cross-contamination between cells or barcodes, four single cells of cell line HT22 ( mouse ) , four single cells of K562 ( human ) and one sample of nuclease-free water as a negative control were pooled into the same conversion reaction and library amplification steps.In addition, four aliquots of 100 cells of CT26 ( mouse ) and four aliquots of 100 cells of K562 ( human ) were pooled into the same reaction tube.Two batches of each mix were performed.Mapping of msRRBS data to the genomes from both species confirmed that, technically, there was no evident barcode crosscontamination ( Figure 2 m ) .Notably, the mapping rate of the negative control to both human and mouse genomes was < 1.0%.
With different sequencing depths, it was also recognized that msRRBS reached plateau earlier ( 0.5-1M ) , in both bulkcell and single-cell samples, than other competing methods, such as EM-seq ( 25 ) , which is an alternative whole genome DNAme sequencing method that enzymatically converts unmethylated C genome-wide, XRBS ( 15 ) -an extended scR-RBS method, in additional to the conventional WGBS and scRRBS.msRRBS, at a relatively low sequencing depth, also outperforms WGBS and scRRBS in overall coverage and consistency ( Figure 2 N and Supplementary Figure S5j, k ) .This is attractive for sequencing a large number of single cells for pair-wise comparison and clustering analysis.This result may be attributed to the fact that msRRBS more specifically enriched the elements rich in CG sites, such as CpG islands and promoters.Thus, the coverage of the key epigenetic regulatory elements in msRRBS is more consistent among cells.At a low depth, for example at 0.5 MB reads, msRRBS obtained approximately 1.5 times the CGIs, and 1.3 times the promoters than obtained by XRBS.While bulk cells ( mRRBS ) had an even higher efficiency than single cells ( msRRBS ) when compared to XRBS for the coverage of these elements, i.e. approximately 3.5 times the CGIs, and 3 times the promoters than obtained by XRBS.At 1 Mb clean reads, msRRBS still outperformed XRBS.With further, deeper sequencing of single cells, XRBS detected more CpG sites and elements, but msRRBS / mRRBS covered increased CpG sites when sequencing a library with increased range from 170-350 bp to 170-550bp, as demonstrated above.Interestingly, when the sequencing depth is less than 4-5 Mb, bulk-cell sequencing with mRRBS also detects more CpG sites than XRBS.Therefore, msRRBS with 175-350 bp library size is suitable for sequencing a large number of single cells with less sequencing depth, while when more CpG sites are desired, particularly when analyzing precious and rare cells, the merged 175-550 bp msR-RBS is preferred, and a relatively increased sequencing depth is a further boost.

msRRBS profiling of methylation characteristics of single cells from distinct cell lines
The relationship of CpG methylation profiles was assayed for 6 cell lines: GM12878, MGC803, MDA-MB-231, 293T, KG-1a and K562, using msRRBS methylation level as the parameter to build correlation maps.We calculated the successful rate of the msRRBS procedure in several batches of the experiment: 86 cells ( ≥0.2 M CpG sites were detected ) out of 101 cells tested were used in the following analysis ( Supplementary Table S6 ) .As expected, 6 different clusters ( obtained using tSNE with genomic 0.2-kb windows ) were generated with each cell clustered to its corresponding cell line ( Figure 3 A ) , and a high correlation was observed among the single cells with each cell line ( Figure 3 B, on average R = 0.8 ) .We found that 0.2-kb windows best reflect the correlation of the six cell lines, followed by three genomic elements: CpG sites, CGIs, promoters and gene bodies ( Supplementary Figure S6a-e ) .
Furthermore, the methylation characteristics of the three elements were examined using randomly merged data of 2, 4 and 8 single cells.The average correlation coefficients ( R ) of methylation levels based on these three elements in the six cell lines were 0.97, 0.95 and 0.91, respectively ( Supplementary Figure S6f-h ) .In addition, each of the six cell lines were distinguished and formed isolated clusters based on methylation profile of CGIs, with obvious distinct patterns among the cell lines ( Figure 3 C ) ; a similar pattern was observed with promoters ( Supplementary Figure S6i ) .At the same time, the methylation profile of single cells from the same cell line was relatively consistent with existence of heterogeneity among cells.Average DNAme levels were similar across single cells of a given cell line ( 34.6 ± 1.3%, 26.8 ± 0.9%, 39.0 ± 2.5%, 37.9 ± 1.3%, 43.6 ± 1.2% and 46.9 ± 2.8 for GM12878, K562, MDA-MB-231, MGC803, 293T and KG-1a, respectively ( Figure 3 d ) .
The methylation pattern of particular genomic elements was also analyzed with the msRRBS data.Three single cells of each cell line were randomly selected as representatives to draw cross-gene methylation profiles ( Figure 3 E ) ; the results were consistent with early reports ( 21 ) .The methylation level of genes across the genome was higher than that in the adjacent regions.There was an expected hypomethylation valley at the transcription start site ( TSSs ) and methylation levels increased gradually from the 5 end ( TSS ) of the gene to the 3 end ( transcriptional end site, or TES ) .The same result was obtained when 86 cells from 6 cell lines were plotted ( Supplementary Figure S7a ) .Meanwhile, we also mapped the methylation levels of CpG islands over the whole genome and revealed that the methylation levels of CpG islands were lower than that of adjacent genomic regions, as shown in the randomly plotted data of three single cells or with the whole panel of 86 single cells ( Supplementary Figure S7b, c ) .In addition, the promoters ( Supplementary Figure S7d ) , the first exons ( Supplementary Figure S7e ) and a combination of all exons ( Supplementary Figure S7f ) demonstrated consistent hypomethylation patterns within each type of element among different cell lines.The average number of the eleven genomic elements detected in each single cell from the six cell lines remained roughly consistent ( Figure 3 F ) .
Next, we mapped the entire genome ( Supplementary Figure S7g, left ) and chromosome 1 ( Supplementary Figure S7g, right ) for the msRRBS status for six cell lines each with 8 single cells merged.The cell line KG-1a was derived from acute myeloid leukemia and displayed high methylation levels.Both maps revealed that the methylation level of KG-1a was the highest of the six cell lines, at about 55.0%, while K562 had the lowest methylation.Subsequently, corresponding heatmaps of methylation status genome-wide were generated ( Supplementary Figure S7h ) , which were consistent with those above.In the region 1 664 226 to 1 664 541 of chromosome 7, a representative lollipop map illustrates the different levels of methylation in the six cell lines ( for each, data from a single cell was shown ) , and each cell line displayed distinct features of its own ( Supplementary Figure S7i ) .

Single-cell methylation profiling by msRRBS of cytosine arabinoside ( Ara-C ) resistant versus the original myeloid leukemia cell line KG-1a
To evaluate the performance of the method in a practical biological system, msRRBS was applied to Ara-C resistant KG-1a cells ( KG-1a_R ) .Thirty-two single-cell libraries, 16 each for KG-1a_R and KG-1a, in addition to the corresponding bulk-cell libraries, were constructed.We filtered out the data for 4 unqualified cells and used data from 28 cells for the downstream analysis, which covered > 0.2 million CpG sites per genome.A correlation analysis determined that the correlation coefficient ( R ) between two single cells from same KG-1a_R group was up to 0.87 ( Figure 4 A ) .The correlation coefficients between a single cell and its corresponding bulksample, between the pseudo-bulk ( 14 cells merged in-silico ) and its corresponding bulk, and between biological replicates of bulk samples, were 0.93 ( Figure 4 B ) , 0.93 ( Figure 4 C ) and 0.99 ( Figure 4 D ) , respectively.The correlation results with the cells of KG-1a were similar ( Supplementary Figure S8a-d ) .Two distinct clusters were shown when unsupervised clustering was performed based on the methylation level of either CGIs ( Figure 4 E ) or promoters ( Supplementary Figure S8e ) for KG-1a_R ( n = 11 ) and KG-1a ( n = 12 ) .We further generated the methylation curve of gene bodies, promoters, and CGIs separately over the genome for single cells ( Figure 4 F ) and bulk cells ( Supplementary Figure S8f ) from the KG-1a versus its counterpart KG-1a_R.Overall, there was no significant

PAGE 10 OF 14
Nucleic Acids Research , 2023, Vol.51, No. 21, e108 difference in methylation levels between these two groups, either in single cells or bulk cells.However, this analysis identified that the fluctuation of the methylation signal for single cells was greater than that of bulk-cell samples in the three elements or regions, CGIs, promoters and gene bodies, probably reflecting the heterogeneity among cells and the limited coverage of CpG sites.
Differential methylation analyses for DMCs and DMRs were performed on the merged single cells ( Supplementary Figure S8g ) and bulk cells ( Supplementary Figure S8h ) .We obtained the differential hyper-methylated ( Figure 4 G ) and hypo-methylated ( Supplementary Figure S8i ) regions for KG-1a_R versus KG-1a, using msRRBS ( merged data of 11 or 12 single cells ) and mRRBS ( bulk-cell data ) compared to the bulk-cell EPIC array, and we calculated the overlap rate between the two methods.We illustrated that merged msR-RBS covered more DMRs than the mRRBS bulk results; that merged msRRBS covered more DMRs than the bulk EPIC results; and that bulk mRRBS covered more DMRs than the bulk EPIC results, significantly.
The DMRs obtained above were inspected for functional significance by KEGG analysis using DMRs obtained from bulk cells of KG-1a_R versus KG-1a.We identified enrichment of the MAPK signaling pathway ( 26 ) , Calcium signaling pathway ( 27 ) , AXON guidance ( 28 ) among others ( hyper: Figure 4 H, hypo: Figure 4 I ) , which have been associated with the occurrence and development of myeloid leukemia.These KEGG pathways that were enriched specifically in DMRs for KG-1a in the bulk samples overlapped in the corresponding KEGG list from single cells.As representative results, the regulatory elements for DSCAM and SBF2 were hyper-methylated, while GSE1 and GCNT2 were hypo-methylated, in KG-1a_R in contrast to KG-1a ( Figure 4 J and Supplementary Figure S8j ) .The role of these genes in leukemia or other cancers is consistent with our result.Specifically, GCNT2 is overexpressed in highly metastatic breast cancer cell lines of human and mouse origin and breast tumor samples ( 29 ) .GSE1 employs protumorigenic activity in AML cells and is a downregulation reliever of the differentiation block in AML blast cells ( 30 ) .The change of DSCAM is a risk factor for Down Syndrome leukemia ( 31 ) .Finally, SBF2-AS1 is significantly upregulated in NSCLC compared with corresponding nontumor tissues, and a high expression level of SBF2-AS1 is correlated with lymph node metastasis and the advanced TNM stage of NSCLE ( 32 ) .

R epresentative meth ylation pattern of mouse hepatocellular carcinoma ( HCC ) revealed by mRRBS
Recognizing that msRRBS was highly efficient in bulk-cell analysis, we applied it to characterize the CpG methylation profile of HCC in mice.This HCC was initiated by the overexpression of the oncogene c-Myc constructed in our laboratory.Two or three tumor nodules were taken from each of the two HCC mice, while a corresponding part of liver was retrieved from 3 healthy mice sacrificed as the negative controls.Four technical replicates were performed with the purified genomic DNA derived from each nodule and the controls.A sum of 32 samples were processed as one healthy sample was lost during library construction.With tSNE analysis applied on methylation levels of bins, the HCC nodules and healthy livers clearly formed 3 distinct clusters ( Figure 5 A and Supplementary Figure S9a, b ) .The heatmap analysis showed that the average correlation coefficients ( R ) were as high as 0.98 within the technical replicates of both healthy livers and HCC nodules ( Figure 5 B and Supplementary Figure S9c, d ) .In addition, mRRBS remained consistent in detecting multiple functional components in mice, and focuses on CGIs and promoters ( Supplementary Figure S10a ) .
DMR analyses were performed to investigate the differential methylation status between healthy livers and HCC nodules.It was evident that the clusters were clearly identified both before ( Supplementary Figure S10b ) and after ( Supplementary Figure S10c ) the combination of technical replicates.We found that there were more DMRS in hypermethylated states than in hypomethylated states on each chromosome, and in particular, almost all DMRS in chromosome X and chromosome 3 were hypermethylated ( Supplementary Figure S10d ) .KEGG analysis based on the DMRs hypermethylated in healthy livers and hypomethylated in HCC nodules highlighted signaling pathways related to the occurrence and development of liver cancer, such as axon guidance ( 33 ) and the Wnt signaling pathway ( 34 ) ( Figure 5 C ) .As examples, we highlight three DMRs on chromosomes 2, 4 and 7, where the healthy livers were hypermethylated, while all nodules of the two HCC livers were hypomethylated in Rbm46 and Creb5 ( Figure 5 D and Supplementary Figure S10e ) .Conversely, Znf385b displayed the opposite methylation status ( Supplementary Figure S10f ) .

Discussion
DNAme at CpG sites regulates transcriptional activity and the stability of the genome ( 35 ) , and is altered during differentiation ( 36 ) , development ( 37 ) and aging ( 38 ) .Single-cell analysis plays an important role in elucidating the genomic heterogeneity in multicellular organisms.In spite of other efforts to assess methylation profile free of bisulfite conversion ( 25 , 39 , 40 ) , single-cell bisulfite sequencing techniques have been the focus over recent years, allowing researchers to precisely analyze the epigenomic heterogeneity among single cells ( 3 , 4 , 6 , 7 , 9-11 ,41 ) .Recent research has revealed the fundamental cellular and molecular patterns particularly in neurons and brains, embryos, stem cells, cancers and leukemia, as well as dissecting the precise mechanism underneath DNA replication ( 42 ) .No doubt, single-cell methylation analysis methods will deepen our understanding of genomic imprinting and Fragile X Syndrome ( 43 ) , plus a variety of other fundamental processes in life, and speed up the effort for precise screening of biomarkers for cancers and other disorders, being benefit drug development and disease therapy in clinics.Yet, the efficiency and robustness of these techniques are still a huge challenge.In this study, we have established msRRBS to analyze the methylation status of multiple single cells in-parallel at single nucleotide resolution, with inventive designs for early shortcut barcoding and pooling of the single cells.We believe that msRRBS has the potential to work well when the bisulfite conversion is switched to enzymatic conversion ( 25 ) .
The innovative design of msRRBS is reflected in the following aspects.Firstly, different single cells, are combined immediately after specific barcodes are added to the genome of each cell with a shortcut procedure at the beginning, so that all subsequent steps are carried out in a single tube, thus achieving multiplicity and scalability.msRRBS greatly reduces the cost, complexity and labor intensity of library construction when handling many single cells, and greatly improves the efficiency.It is also worth noting that the whole experimental process of msRRBS library construction takes 14.5 h, with only 4 h of manual operation, while conventional scRRBS for a few samples takes 2-3 days and much more hands-on time.
Importantly, the data quality of msRRBS is greatly improved, attributing to several unique features of the method.The possible damage of the original template at the singlecell level is substantially reduced by the shortcut procedure, which is free of enzymatic DNA modifications, such as endrepair and deoxyadenosine nucleotide-adding to the MspI digested DNA fragments, in addition to the direct ligation of a short barcode-containing adapter instead of a long adapter to the MspI digested DNA fragments.This single-strand ligation allows the double-strand adapter being covalently ligated to the restriction fragments by only the long strand of the adapter, while the short strand is subsequently dissociated away from the ligation product, which avoids hard-toremove adapter-dimers.A special non-displacement and non-nick DNA-polymerase, Sulfolobus DNA polymerase IV, is introduced to create the complementary strand of the long strand adapter ligated to the DNA fragments.This enzymatic single-tube process is more efficient than any physical enrichment ( such as biotin-streptavidin ) of the MspI digested DNA fragments, and also empowers early pooling of multiple cells, taking advantage of the barcoding nucleotide combination in the adapter for each single cell.This construct enables the efficient amplification of the desired fragments after bisulfite conversion.The whole process in combination minimizes the possible damage of the DNA and enables the capture of the double-stranded MspI digested DNA fragments from both ends for best capture of CpG methylation message in the target fragments.
In msRRBS, the mapping rate of the raw data runs up to 79.9% with an average of 62.5% with the optimal protocol.This rate may vary with different quality of cells, associated with the genome integrity of the cells.It is similar to the latest protocols of scWGBS / scBS ( 11 , 14 , 41 ) , which ranges from 64-68%, and is significantly improved over the conventional scRRBS methods, which is approximately 25% ( 9 ,21 ) .Additionally, in msRRBS, no cross-contamination among cells is observed, and the identity of each individual cell is traceable from the output data, ensuring the independence and reliability of the measurement for each cell.This is especially valuable for precious and rare cells, such as preimplantation embryo blastomeres, fetal cells from non-invasive prenatal testing ( NIPT ) and liquid biopsies of cancers, such as fine needle aspirates and circulating tumor cells ( CTCs ) or minimal residual diseases ( MRDs ) .
Another advantage in the performance is that the sequencing output of msRRBS consistently enriches the key elements involved in methylation regulation, particularly CGIs and promoters.At 0.5-1 Mb reading depth, the coverage of CGIs and promoters is close to the plateau, which results in much higher consistency in coverage than scWGBS with the same sequencing depth.In addition, the silico-merged data of a very limited number of single cells ( i.e. pseudo-bulk built from data of 15 single cells ) detects approximately equivalent CpG sites and key elements to the corresponding bulk cells.Many more DMRs are detectable from in-silico merged single cells than in real bulk cells, probably because of the independent detection of the elements in single cells, rather than the detection of the elements from the average of the bulk samples.These results favorably support the importance of a panel of single cells over a bulk of cells to detect the heterogeneity and subpopulations of a multicellular system ( 7 ,41 ) .
In recent years, some efforts have been made to perform multiplex, single-cell bisulfite sequencing, mostly using the TruSeq adapter after digestion with a restriction enzyme and cell barcoding before bisulfite conversion, yet very few have succeeded.Among them, Msc-RRBS ( 42 ) successfully employed modified TruSeq adapters with additional barcodes for multiple single-cell RRBS, and resulted in Smart-RRBS for co-analysis of RNA and CpG methylation at the single-cell level ( 17 ) .While our data was being analyzed, XRBS was reported using streptavidin to capture biotinylated adapters that are covalently ligated to ( and later recovered ) one strand of MspI-digested DNA fragments from the 5 end, followed by bisulfite conversion, and the use of a semi-random primer to build the second handle for target amplification.This report enriched extra elements, particularly CTCF binding motifs, beyond CGIs and promoters, with a sequencing depth much more than RRBS ( 15 ) .Our msRRBS focuses on CGIs and promoters and uses a different strategy circumventing the laborintensive procedures, while rescuing both strands of the MspI digested DNA fragments, minimizing DNA damage, and being flexible in fragment size recovery.Furthermore, with wider ranges of library size ( switched from 175-350 bp to 175-550 bp, particularly separately recovering of the fragments in two ranges of sizes: 175-350 bp and 350-559 bp, and siliconmerging the data after sequencing ) , the coverage of CpG sites and genomic elements were greatly improved, correspondingly with a slight increase in cost of sequencing.For rare and precious cells, the silicon-merging 175-550 bp msRRBS aforementioned is preferred to best characterize the methylome in a single cell.Furthermore, other non-methylation sensitive restriction enzymes that enrich for CpG-rich sequence may be applied instead of MspI, and two enzymes in combination or separately for the same sample may improve the coverage ( 44 ) .
As an off-the-shelf reagent and equipment based, highly efficient, multiplexed approach, msRRBS aims to provide an ultimate practical resolution for single-cell DNAme profiling, targeting informative CpG islands and promoters at genomescale.However, msRRBS has some shortcomings and areas for further improvement.Firstly, sparsity and incomplete coverage remains an issue, due to the nature of single-cell sequencing, and because RRBS only uses MspI or its alternatives for target enrichment.The unique restriction site makes the theoretical detectable CpG sites with a high sequencing depth toward saturation relatively limited in comparison to WGBS and XRBS.This is the downside of the advantage of high efficiency .Secondly , we have designed 96 barcodes for single-cell traceable msRRBS, but more barcodes need to be designed and validated so as to increase the throughput of msRRBS.More systematic random barcodes could be applied in combination with droplets or nanowells for truly high-throughput analysis of thousands of single cells in a run.Thirdly, the annotation of the RRBS profile of specific, single-cell subpopulations with a functional basis is a necessity and requires independent investigation.
In conclusion, we provide a highly efficient approach for medium-throughput analysis of scRRBS based on technological innovation, which is not only practically valuable when the cells are precious and rare, but also scalable to highthroughput analysis of a large number of cells in a variety of systems.This method will promote single-cell epigenomics and multiomics ( 10 , 17 , 45 ) , which is essential for understanding the epigenomic regulation of cellular heterogeneity.

Figure 1 .
Figure 1.Procedure scheme of msRRBS.After lysis and Msp I digestion of multiple single cells in parallel independently, the genomic DNA fragments of each single cell are directly ligated to a DNA adapter ( AdBc )with a cell-specific barcode composed of a relatively long strand ( AdBcL ) and a short strand ( AdBcS ) .Because AdBcS with its 5 end is just adjacent spatially to, but no co v alent ligation to the genomic DNA fragment at 3 end, it falls off the AdBcL when the ligase is inactivated at 75 • C. Then multiple single cells are pooled into a single tube.With the addition of 4 nucleotides including methylated C, while no unmethylated C is included, the new AdBcS is re-synthesized along AdBcL with a DNA polymerase, Sulfolobus DNA polymerase IV, to re-construct the double strand adapter.Here bisulfite con v ersion is emplo y ed, and the products are reco v ered b y PCR amplification.T he amplicon is digested with BciVI, and a sticky end with an ' A ' tail protruding at 3 end is generated, which enables this product to be directly ligated to a con v entional Y-shaped adapter that builds a 'T' tail protruding at 3 end.Finally, a minimum cycle number of PCR is f ollo w ed to finalize the sequencing library.

Figure 2 .
Figure 2. Basic features of msRRBS.The mapping rate ( A ) , detection rate of CpG sites ( B ) , CGIs ( C ) and promoters ( D ) associated with different library sizes in single cells ( n = 3-4, i.e. 3-4 single cells as biological replicates, within the same experiment, similarly hereinafter ) and bulk cells ( n = 4 ) .( E ) Co v erage of genomic elements or regions along the genome associated with different library sizes, detected in K562 single cells ( n = 3-4 ) .( F ) Overall meth ylation le v el among single cells ( n = 15 ) , 10-200 cells ( n = 4 ) and 10-200 ng DNA ( n = 2 ) e xtracted from K562 cells ( P v alue = 0.42 ) .( G-J ) Pearson correlation compares methylation profiles of CpG sites acquired by msRRBS between two single cells ( G, R = 0.79 ) , a single cell and a bulk of cells ( H, R = 0.88 ) , merged single cells and bulk cells ( I , R = 0.91 ) , technical replicates of bulk cells ( J , co v erage = 10 ×, R = 0.97 ) for cell line K562, mRRBS and EPIC methylation array ( K , R = 0.95 ) for KG-1a cells, and mRRBS and public data of WGBS ( L , R = 0.94 ) for K562 cells.( M ) Mapping rates that align specifically to the mouse genome ( x-axis ) versus the human genome ( y-axis ) , confirming no detectable cross-contamination between human and mouse cells prepared in the same reaction pool.Each dot represents a sample ( a bulk or a single cell ) .Duplicate pools of CT26 and K562 bulk cells ( ∼100 cells ) , and duplicate pools of HT22 and K562 single cell were independently tested.( N ) Number of CGIs with at least 1 CpG site as a function of sequencing depth for msRRBS, mRRBS, con v entional RRBS, WGBS, EM-Seq and XRBS, detected in K562.T he v ertical gra y dashed lines indicate 0.5 and 1 M clean reads serving as reference lines for the comparison.

Figure 3 .
Figure 3. Feasibility and reproducibility of msRRBS profiling for methylation characteristics of a panel of cell lines.( A, B ) Single cells from the six cell lines were analyzed in 0.2-kb windows over the genome.( A ) Unsupervised clustering with tSNE.Six cell lines: 293T ( n = 15 ) , GM12878 ( n = 20 ) , K562 ( n = 22 ) , KG-1a ( n = 13 ) , MDA-MB-231 ( n = 8 ) and MGC803 ( n = 8 ) .( B ) Pearson correlation heatmap at 0.2-kb windows.The color key from blue to red indicates low to high correlation.( C ) Methylation profile of CGIs for 2, 4 and 8 cells randomly in silico-merged from each of the six cell lines.The number 0 represents unmethylated, and the number 1 represents fully methylated.The color changes from blue to red to indicate a gradual increase in meth ylation le v els.( D ) B ar plot sho ws the a v erage genome-wide DNAme le v el of CpG sites f or the six cell lines.( E ) Av erage DNAme le v els along the gene body and 5-kb upstream and downstream of the TSSs and the TESs for all RefSeq genes of the six cell lines.( F ) Average detection rates of various genomic elements in each of the six cell lines.

PAGE 9 OF 14 Figure 4 .
Figure 4. Epigenomic and biological significance of a m y eloid leuk emia cell culture resist ant to cyt arabine re v ealed b y msRRBS.( A-D ) Pearson correlation heatmap compares methylation values of individual CpG sites acquired by msRRBS between two single cells ( A , R = 0.87 ) , between a single cell and a bulk of cells ( B , R = 0.93 ) , between a pseudo-bulk ( merged 14 cells ) and a bulk of cells ( C , R = 0.93 ) , and between technical replicates from a bulk of cells ( D , R = 0.99 ) for cyt arabine-resist ant KG-1a ( KG-1a_R ) .( E ) Unsupervised clustering based on single-cell CpG island methylation for KG-1a_R ( n = 11 ) and original KG-1a ( n = 12 ) with tSNE.( F ) Average DNAme patterns across gene body, promoters and CGIs from the data set of single cells by msRRBS.( G ) Hyper-DMR intersection of KG-1a_R versus KG-1a among the methods msRRBS, mRRBS and EPIC methylation array.Hyper ( H ) and hypo ( I ) DMRs enriched in the KEGG pathw a y of a bulk of cells are shown for KG-1a_R versus original KG-1a.( J ) Two windows each contain a DMR of merged KG-1a single cells from KG-1a_R versus the corresponding original KG-1a population ( MWU-test P -value < 0.05, number of CpG sites in one DMR ≥ 3, the distance between two independent DMRs ≥ 100 bp, length of DMR ≥ 50 bp ) .

PAGE 11 OF 14 Figure 5 .
Figure 5. Methylation patterns of hepatocellular carcinoma ( HCC ) in mice uncovered by msRRBS.( A, B ) Analysis of HCC bulk cells based on methylation le v els in 50-kb windows over the genome for two HCC mice with two or three liver cancer nodules versus three liver biopsies from healthy mice.Except for H2 with three technical replicates, all the other samples were measured with four technical replicates.( A ) Unsupervised clustering with tSNE.( B ) Pearson correlation heatmap.Group 1, representing the detailed samples, includes each of the three mice from the healthy group, of the two nodules from the first HCC mouse, and of three nodules from the second HCC mouse.Group 2 is the grouping of the healthy group and HCC group, similarly hereinafter.The color key from blue to red indicates low to high correlation.( C ) KEGG pathways enriched on HCC hyper-DMRs.( D ) One of the HCC Hypo-DMR regions, labeled as the pink background, being highlighted by DSS package, is hypermethylated in healthy liver biopsies and hypomethylated in HCC nodules.The Y-axis from 0 to 1 represents methylation level.Blue vertical bars: methylation level.Gray line: depth of reads.