CpG Island Definition and Methylation Mapping of the T2T-YAO Genome

Precisely defining and mapping all cytosine positions and their clusters, known as CpG islands (CGIs), as well as their methylation status are pivotal for genome-wide epigenetic studies, especially when population-centric reference genomes are ready for timely application. Here we first align the two high-quality reference genomes, T2T-YAO and T2T-CHM13, from different ethnic backgrounds in a base-by-base fashion and compute their genome-wide density-defined and position-defined CGIs. Second, mapping some representative genome-wide methylation data from selected organs onto the two genomes, we find that there are about 4.7–5.8% sequence divergency of variable categories depending on quality cutoffs. Genes among the divergent sequences are mostly associated with neurological functions. Moreover, CGIs associated with the divergent sequences are significantly different with respect to CpG density and observed CpG/expected CpG (O/E) ratio between the two genomes. Finally, we find that the T2T-YAO genome not only has a greater CpG site coverage than that of the T2T-CHM13 genome when whole-genome bisulfite sequencing (WGBS) data from the European and American populations are mapped to each reference, but also show more hyper-methylated CpG sites as compared to the T2T-CHM13 genome. Our study suggests that future genome-wide epigenetic studies of the Chinese populations rely on both acquisition of high-quality methylation data and subsequent precision CGI mapping based on the Chinese T2T reference.


Introduction
Since the position-sensitive and co-methylated CpG island (CGI) is one of the key epigenetic mechanisms for gene expression regulation and chromosome integrity [1,2], it becomes an important research direction for epigenomics [3][4][5][6][7] to accurately map the precise location of all CGIs and their methylation status onto the human genome.
Human reference genomes provide single-base-resolution coordinates for genome-wide genetic and epigenetic mapping with an ultimate precision [8].The release of the Telomere-to-Telomere CHM13 (T2T-CHM13) genome in 2022 has not only filled the previous 8% gap in the human genome [9], but also provides one of such benchmarks.Therefore, after the recent assembly of a Chinese reference genome T2T-YAO, we have attempted to detail some of the obvious inter-population deviations between T2T-YAO and T2T-CHM13 [10].Here, we raise two questions; one is whether the two reference genomes are significantly different in epigenetic parameters, and the other whether T2T-YAO is more appropriate for the Chinese population.The latter is of importance for subsequent genome-wide epigenetic studies based on Chinese populations under different physiological and pathological conditions.We choose to accurately predict genome-wide CGIs for both references and validate their applications in methylation studies by testing them on a limited dataset.
According to the divergent sequences of the two reference genomes, Aganezov et al. [9] compared the sequence differences between the classical GRCH38 genome and T2T-CHM13 in detail, which found that the T2T-CHM13 genome have introduced nearly 200 million base pairs (Mbp) of sequence.Since Chinese T2T precision genome T2T-YAO has been assembled, and the sequence differences between T2T-YAO and T2T-CHM13 may represent the epigenetic characteristics for Chinese population, exploring the divergent sequences between T2T-YAO and T2T-CHM13 genomes and whether the divergent sequences are associated with specific biological functions, became our first scientific questions.
According to the genome-wide CGIs prediction for the two reference genomes, our previous studies indicated that distance-based clustering method can localize more short CGIs (position-defined CGIs) for potential gene regulatory functions and are closely correlated with gene expression specificity [2,11,12].Since we neither investigate the position-defined CGIs for T2T-CHM13, nor sure whether there is a difference between the CGI features of T2T-YAO and the T2T-CHM13, what is the position-defining CGI distribution of the T2T-YAO and T2T-CHM13 genomes, and whether there is any specificity between these two genomes according to CGI features, become our second scientific question.
According to the CGI methylation specificity investigation for these two reference genomes, several relevant studies have been carried out [2,8,11,12].For example, our previous studies already analyzed the relationship between position-defined CGIs associated with core promoters and the methylation level of the human genome [2,11,12].Also, Gershman et al. [8] constructed a complete human genome methylation profiles using nanopore methylation data based on T2T-CHM13, and compared the methylation profiles with the methylation profiles using whole genome bisulfite sequencing (WGBS) data.However, since the differences in methylation patterns between the Chinese genome and T2T-CHM13, especially those CpG sites and CGIs associated with the divergent sequences, have not been well studied, using public genome-wide bisulfite sequencing data to analyze whether T2T-YAO and T2T-CHM13 are different at the CpG and CGI methylation levels become our third scientific question.
To address the above scientific questions, as shown in Figure 1, we firstly compared the difference of sequence between T2T-YAO and T2T-CHM13, and then analyzed the genes associated with the divergent sequences of these two reference genomes.Secondly, we computed the density-defined and position-defined CGIs of T2T-YAO and T2T-CHM13, and then compared the differences of their CGI features.Finally, we not only map the genome-wide methylation WGBS data with representative organs from authoritative databases onto T2T-YAO and T2T-CHM13 genome respectively, but also analyze the differences from the methylation level of CpG sites and CGIs between these two reference genomes.The following is the major findings of the study: 1) There is about 4.5-5.5% (4.665%-5.751%)difference between the reference sequences of Chinese T2T-YAO and T2T-CHM13 genomes, and the divergent sequences related genes are closely correlated with neurological functions by GO enrichment analysis [13][14][15][16][17][18].
2) CGIs associated with Chinese T2T-YAO and T2T-CHM13 divergent sequences are statistically significantly different with respect to CpG density and O/E values [11,19].
3) Not only WGBS data of Chinese T2T-YAO have greater coverage of CpG sites than theT2T-CHM13, but also T2T-YAO shows more hyper-methylated CpG sites than T2T-CHM13.
In general, our study demonstrated the statistically significant differences between T2T-YAO and T2T-CHM13 in genome sequence and epigenetic aspects, such as CGI and methylation patterns, which suggests that the establishment of CGI and methylation profiles based on Chinese T2T reference genome is crucial for the subsequent genome-wide epigenetic studies for the Chinese population.

Sequence comparative analysis results
To answer our first scientific question, we firstly calculated sequence differences between T2T-YAO-hp and T2T-CHM13, and then analyzed the categories and functional enrichment of divergent sequences related genes of T2T-YAO-hp.

Sequence differences between T2T-YAO-hp and T2T-CHM13
Firstly, Table 1 identifies the similar and divergent sequences between the two genomes by sequence alignment (the alignment results are detailed in Supplementary Data S1 and S2), as well as computes the proportion of divergent sequences for both T2T-YAO-hp and T2T-CHM13 genome.

Table 1. Sequence differences between T2T-CHM13 and T2T-YAO-hp
Table 1 demonstrates that T2T-CHM13 is about 54.6 mega-base pairs (Mbp) longer than T2T-YAO-hp.Meanwhile, the divergent sequences account for 5.751% of the full length for T2T-CHM13, whereas the divergent sequences account for 4.665% of the full length for T2T-YAO-hp.Thus, the difference of the reference sequences between Chinese T2T-YAO-hp and T2T-CHM13 genomes is about 4.5-5.5%.
Secondly, Figure 2A describes the mapping of similar sequences between Chinese T2T-YAO-hp and T2T-CHM13 genomes by using NGenomeSyn [20] to make a collinearity map.We observed that most of the T2T-CHM13 sequences are very similar to the sequences of the corresponding chromosomes of T2T-YAO, which is consistent with the expected chromosome correspondence.However, Figure 2A also shows some T2T-CHM13 sequences are very similar to the sequences of the different chromosomes of T2T-YAO.For example, chromosomes 21 and 22 of T2T-CHM13 have similar sequences to chromosomes 13, 14, and 15 of T2T-YAO-hp.
The parts of sequences of chromosome 22 of T2T-YAO-hp are as same as that of chromosome 14 of T2T-CHM13.

Divergent sequences related genes
Firstly, we calculated the divergent sequence related genes for the two reference genomes, and the specific gene are listed by Supplementary Data S3.By categorizing these divergent sequences related genes, Figure 3A indicates that most divergent sequence related genes are protein-coding genes accounting for 44.26% and IncRNA accounting for 25.93%.Furthermore, we did GO enrichment analysis for protein-coding genes in the divergent sequences related genes.Figure 3B not only shows that the most significant terms were " homophilic cell adhesion via plasma membrane adhesion molecules", "ion channel complex", and "gated channel activity" in the biological process (BP), cellular component (CC), and molecular function (MF) [21], but also it demonstrates that most of the GO terms of divergent sequence related protein-coding genes are associated with the function of the nervous system.

CGI prediction and analysis
To answer our second scientific question, we firstly carried out the statistical analysis for the CGI prediction for these two genomes, and then we compared the CGI features between these two genomes.

CGI prediction results
Firstly, we employed rule-based method [12,19] and distance-based clustering method [12,19] to predict the number of CGIs and the average CGI length for T2T-CHM13 and T2T-YAO-hp, respectively, which are detailed by Table 2 and Supplementary Data S4.We also predicted all CGIs related with divergent sequences of the two genomes, which are detailed by Supplementary Data S5.
Table 2 indicates that the number of density-defined CGIs and the average length predicted by T2T-YAO-hp is slightly smaller (84471 vs 87788) and shorter (391 vs 407 bp) than T2T-CHM13; the number of position-defined CGI predicted by position-defined CGIs for T2T-YAO-hp is greater than the T2T-CHM13 under the default parameter d=50 (220,736 vs. 222,421); the CGI number predicted by T2T-YAO-hp is less than the T2T-CHM13 not matter with density-defined CGIs or position-defined CGIs, but the average length of the position-defined CGIs predicted by T2T-YAO-hp is relatively longer than T2T-CHM13 (e.g., 60 vs. 58 bp with d=25) under the parameter d=25 and 12.

Table 2. Statistics of CGIs for T2T-CHM13 and T2T-YAO-hp
Secondly, to study the relationship between position-defined CGIs and density -defined CGIs for the same genome, we investigated the sequence intersection of density-defined CGIs and position-defined CGIs (d = 25bp) for both genomes.
Figure 4 lists the two consistence between T2T-YAO-hp and T2T-CHM13.One is that the total length of density-defined CGIs is much longer, approximately three times, than that of position-defined CGIs for each genome (e.g., (24,991,625+8,062,542) / (8,062,542 + 1,735,619) =3.37 for T2T-YAO-hp); The other is although most of the position-defined CGIs overlap with the density-defined CGIs, there is still a fraction (16.5% and 17.7%, respectively) that is not predicted by the rule-based CGI predictive methods.

CGI features comparison
To compare the differences of CGIs between T2T-CHM13 and T2T-YAO-hp, we calculated four commonly used CGI features, which are CGI length, GC-content, CpG Density, and CGI Observed/Expected ratios [2,22], and then Figure 5 presents their distribution for all CGIs (Figure 5A-5D) and the CGIs related to the divergent sequences (Figure 5E-5H) for two reference genomes.In addition, we showed the features distribution of genome-wide and divergent sequence related position-defined CGIs (including d=12bp, 50bp) in Supplementary Figure S1.S1).
Finally, Figure 5 indicates that no matter for all CGIs or divergent sequences CGIs, distance-based clustering method is more likely to find such CGIs that have shorter length, but greater GC content, O/E values, and CpGDensity than rule-based CGI predictive method.

Methylation analysis and comparison
To answer our third scientific question, we not only compared the number of cytosine sites (C sites, Supplementary Figure S2) and the methylation levels of all CpG sites, but also the methylation levels of genome-wide CGIs for the two reference genomes.

CpG methylation level comparison
We compare the difference at different CpG methylation level for similar sequences (Supplementary Figure S3) and divergent sequences (Figure 6) between T2T-CHM13 and T2T-YAO-hp, Figure 6 maps the WGBS data of two representative tissues (ovary and testis) onto the divergent sequences of the two genomes, respectively.In addition, based on the WGBS data of these two tissues, we also performed the methylation profiles of whole genome and the divergent sequences of T2T-YAO-hp and T2T-CHM13 which shows in Supplementary Figure S4. Figure 6 demonstrates that the number of divergent sequences CpG sites that failed to be mapped by the same WGBS methylation data is much less for T2T-YAO-hp than the T2T-CHM13 in both tissue Ovary and testis.For example, the number of divergent sequences CpG sites for Ovary that failed to be mapped to T2T-YAO-hp is 2.16 × 10 6 , whereas the T2T-CHM13 is 3.49 × 10 6 (Figure 6).The number of divergent sequences CpG sites that can be mapped for T2T-YAO-hp is greater than the T2T-CHM13 at all three methylation levels.And in terms of proportions, the most notable difference is in hyper-methylation (10.59% vs. 5.77% for Ovary, 10.99% vs. 5.89% for Testis).

CGI methylation level comparison
We used Eq.6 to calculate the methylation level of all density-defined CGIs and position-defined CGIs for the two reference genomes (Figure 7 and Supplementary  Firstly, Figure 7 turns out that the percentage of CGIs that failed to be mapped by WGBS data of T2T-CHM13 for all CpGs (invalid) is greater the T2T-YAO-hp for both density-defined CGI or position-defined CGI methods.For example, there are 10.46% vs 7.27% for density-defined CGI, and 23.63% vs 14.87% for position-defined CGI in Ovary (Figure 7A)。 Secondly, the proportions of hyper-methylated, intermediate methylated, and hypo-methylated of CGIs, predicted by density-defined CGIs and position-defined CGIs, are slightly greater for T2T-YAO-hp than the T2T-CHM13.For example, Figure 7B describes that the proportions of hyper-methylated, intermediate methylated, and hypo-methylated for T2T-YAO-hp predicted by position-defined CGI are respectively 19.73%, 13.29%, and 53%, whereas the corresponding proportions for T2T-YAO-CHM13 are 17.32%, 11.77%, and 48.02%, respectively.
Finally, Figure 7 shows that density-defined CGIs have the greatest proportion of hyper-methylation for both T2T-YAO-hp and T2T-CHM13, while position-defined CGIs highlight the proportion of hypo-methylation.For example, Figure 7A shows that the density-defined CGIs of both two reference genomes have a higher proportion of hyper-methylated (50.02% and 48.40%, respectively) than that of intermediate methylated and hypo-methylated, whereas the position-defined CGIs have the greatest proportion of hypo-methylation, which are 53.65% and 48.57% for T2T-YAO-hp and T2T-CHM13, respectively.

Discussion and conclusion
This research aims to compare the differences of sequence, CGI, and methylation patterns between two reference genomes, T2T-YAO and T2T-CHM13.And then, we can further determine the necessity of using T2T-YAO as a reference for future genome-wide epigenetic studies for Chinese populations.
For this purpose, we firstly compared the similar and divergent sequences of the high-quality T2T-YAO-hp haploid reference genome with T2T-CHM13 genome, respectively.And then, we analyzed the categories and functions of the divergent sequences related genes.Secondly, we computed position-defined CGIs and density-defined CGIs for two reference genomes and compared the differences for four commonly used CGI features between these two genomes.Finally, we compared the methylation level of CpGs and CGIs for these two genomes by public WGBS data.
In terms of sequence divergence of the two reference genomes, firstly, sequence alignment indicated that there is about 4.5-5.5% (4.665%-5.751%)difference between the reference sequences of Chinese T2T-YAO and T2T-CHM13, which further validated the great interracial genomic difference since T2T-CHM13 cells are from Europe [9,10].Secondly, although a few sequences of certain chromosomes appeared to be like those of other chromosomes possibly due to mutations (Figure 2A), most of the similar sequences are consistent with the expected chromosome correspondence.
And Figure 2B turns out that most of the differential sequences of the two genomes are mainly concentrated on a few chromosomes, such as specific regions of chr9 and chrY (Figure 2C), which implies that different races represented by these two genomes have retained ancestral genetic traits during their evolution.Meanwhile, the GO enrichment analysis demonstrated (Figure 3B) that divergent sequences related genes of T2T-YAO show a strong preference for several specific molecular mechanisms of biological functions of neuronal cells and synapses in the nervous system, which provides a series of clear information for the future study.
In terms of genome-wide CGI differences between the two reference genomes, firstly, since Table 2 indicates that distance-based clustering method with default d = 50bp (Eq.4) predicts opposite trends in the number and average length of CGIs for two reference genomes compared to the other three methods, it is necessary for us to optimize the parameter d (Eq.4) for different genomes.Secondly, when we compare the density-defined CGI with position-defined CGI (d=25), Figure 4 shows that density-defined CGI may not be able to predict such CGIs that are shorter but have potential regulatory functions.Third, we compare four commonly used CGI features for two reference genomes (Figure 5). Figure 5A turns out that the length distributions of all CGIs for two genomes are very similar, whereas the GC content, O/E value and CpGDensity predicted by distance-based clustering method shows less sensitivity than the rule-based method, which suggest that highly homologous T2T-YAO and T2T-CHM13 have certain diversity for the shorter position-defined CGI.Also, we found the position-defined CGIs related to the divergent sequences of the two reference genomes demonstrate more obvious differences of features rather than all CGIs (Figure 5E-5H), and since the genes related to the divergent sequences show strong preference for several specific gene functions (Figure 3B), it is worthwhile to further investigate whether these divergent sequence related CGIs are associated with the regulation of corresponding gene functions.
In terms of methylation pattern specificity of the two reference genomes, we firstly found that the number of CpG methylation sites of T2T-YAO that can be mapped onto the divergent sequence of two reference genomics is greater than the T2T-CHM13 for the same public WGBS data (Figure 6).In addition, the methylation level of the CpG sites of the divergent sequences shows a prominent proportion of hyper-methylation (Figure 6), which not only indicates that the methylation patterns of two reference genomics are quite different, but also further demonstrates that T2T-YAO sequenced by nanopore has greater sequencing accuracy and assembly precision than the T2T-CHM13 [10].Secondly, by comparing the methylation level (Eq.6) of all CGIs for two reference genomes, Figure 7 shows that the methylation level predicted by position-defined CGIs has greater difference than the density-defined CGIs due to the high sensitivity of position-defined CGIs.Also, we found that the hypo-methylated CGIs has the highest proportions predicted by position-defined CGIs for two genomes, which are completely different from the density-defined CGIs with great hyper-methylated proportions.This not only suggests that CGI methylation patterns are very different for the two reference genomes, but we also urgently need a more precise and comprehensive annotation for all CpG sites and CGIs in the human genome.
In general, our study demonstrated that T2T-YAO and T2T-CHM13 have statistically significant differences (Figure 6, Figure 7 and Supplementary Table S1) in genome sequence and epigenetic aspects such as CpG and CGI methylation patterns, which suggests that it is important for the subsequent genome-wide epigenetic studies of Chinese population to establish CGI and methylation profiles according to Chinese T2T reference genome.
However, there are still several questions that need further exploring.Firstly, it is crucial for genome-wide epigenetic regulation study to know how to use long-read sequencing technology to sequence genome-wide methylation data for more people and construct complete human methylation atlas of different populations.Secondly, since artificial intelligence (AI) has been gradually used in genome-wide methylation-related studies [23,24], how to use AI techniques such as neural networks to investigate the relationship between sequence features, CGI, methylation, and expression specificity of T2T-YAO becomes one of the future research directions.
Lastly, genome-wide CGI prediction and methylation analysis are not only data-intensive, but also time-consuming.However, there is currently no such a web service that collects and manages the CGI and methylation annotation data for Chinese T2T genome.Therefore, we will build up a web service for online computation, data sharing and visualization based on high-performance computing such as GPU acceleration and distributed computing to provide a data foundation and an online platform for the further study.

Data source
According to the divergent sequences of the two reference genomes, we not only downloaded the sequences of the classical human genomes GRCh38 and T2T-CHM13 (v2.0) from the NCBI database [25], but also obtained the Chinese T2T-YAO-hp genome sequenced by He et al. [10] using Nanopore sequencing and the corresponding genome annotation data in GFF3 format.
According to genome-wide CGIs prediction for both genomes, we used the T2T-CHM13 genomic CGI data, which are predicted by the Gardiner-Garden and Frommer (GGF) [26] standard in UCSC [27].
According to the CGI methylation specificity investigation for two reference genomes, we downloaded the WGBS (whole genome bisulfite sequencing) data for two representative tissues, which are ovary (ENCSR417YFD) and testis (ENCSR806NNG) from the ENCODE database [28].

Methods
Figure 1 describes the workflow of the study with three main aspects, which are sequence comparative analysis, CGI detection and analysis, and CpG methylation

Figure 1 .
Figure 1.Workflow of the study.

Finally,
Figure 2C counts the comparison of the length of the divergent sequences for different chromosomes.As shown in Figure 2B, the divergent sequences between T2T-CHM13 and T2T-YAO genomes are mainly concentrated in specific regions of each chromosome.

Figure 7 .
Figure 7. CGI methylation level of two reference genomes.

Figure 1 .
Figure 1.Workflow of the study.

Figure 2 .
Figure 2. Sequence alignment between T2T-CHM13 and T2T-YAO.(A) Sequence alignment collinearity map.Here, colored links between two genomes indicate genomic fragments with similar sequences, and the thickness of the link represents the similarity between two genomic fragments.(B) Visualization of sequence comparison across chromosomes, where blue and orange sequences represent similar sequences of T2T-CHM13 and T2T-YAO-hp in the current chromosome, respectively, and the blank sections represent the divergent sequences of this reference genome relative to the other.(C) Divergent sequence length statistics of T2T-CHM13 and T2T-YAO-hp genomes, where the blue and orange bars represent the divergent sequence lengths of T2T-YAO-hp and T2T-CHM13 in different chromosomes, respectively.

Figure 3 .
Figure 3. Divergent sequences related genes analysis.(A) Proportion of the types of divergent sequences related genes in these two reference genomes.(B) GO enrichment analysis of divergent sequence related protein-coding genes, where the size and color of the circle represent the number of enriched genes and the statistical significance value (p.adjust) [31, 35-43] of the current GO term, respectively.

Figure 4 .
Figure 4. Sequence intersection of density-defined CGIs with position-defined CGIs (d = 25bp).Here, the white parts represent the sequence lengths of the intersection for position-defined CGI and density-defined CGI, while the orange and blue parts represent the sequence lengths unique to density-defined CGI and position-defined CGI, respectively.

Figure 5 .
Figure 5.Comparison of CGI features between T2T-CHM13 and T2T-YAO-hp.(A) Length of all CGIs; (B) GC content of all CGIs; (C) O/E ratios of all CGIs; (D) CpGDensity of all CGIs; (E) Length of divergent sequences CGIs; (F) GC content of divergent sequences CGIs; (G) O/E ratios of divergent sequences CGIs; (H) CpGDensity of divergent sequences CGIs.

Figure 6 .
Figure 6.CpG methylation level of divergent sequences between T2T-CHM13 and T2T-YAO-hp.Here, green, blue, and orange represent the number of hyper-methylated, intermediate methylated and hypo-methylated, respectively; gray represents the number of CpG sites that failed to be mapped by WGBS data; dark and light colors are used to distinguish T2T-YAO-hp and T2T-CHM13.

Figure 7 .
Figure 7. CGI methylation level of two reference genomes.(A) CGI methylation level of two reference genomes using the WGBS data of Ovary.(B) CGI methylation level of two reference genomes using the WGBS data of Testis.Here, green, blue, and orange represent the proportion of CGIs that are hyper-methylated, intermediate methylated, and hypo-methylated, respectively.Gray represents the proportion of CGIs where all CpGs failed to be mapped by WGBS data which named "invalid".Darker and lighter colors are used to distinguish the T2T-YAO-hp and T2T-CHM13.