Deep whole-genome sequencing of 90 Han Chinese genomes

Abstract Next-generation sequencing provides a high-resolution insight into human genetic information. However, the focus of previous studies has primarily been on low-coverage data due to the high cost of sequencing. Although the 1000 Genomes Project and the Haplotype Reference Consortium have both provided powerful reference panels for imputation, low-frequency and novel variants remain difficult to discover and call with accuracy on the basis of low-coverage data. Deep sequencing provides an optimal solution for the problem of these low-frequency and novel variants. Although whole-exome sequencing is also a viable choice for exome regions, it cannot account for noncoding regions, sometimes resulting in the absence of important, causal variants. For Han Chinese populations, the majority of variants have been discovered based upon low-coverage data from the 1000 Genomes Project. However, high-coverage, whole-genome sequencing data are limited for any population, and a large amount of low-frequency, population-specific variants remain uncharacterized. We have performed whole-genome sequencing at a high depth (∼×80) of 90 unrelated individuals of Chinese ancestry, collected from the 1000 Genomes Project samples, including 45 Northern Han Chinese and 45 Southern Han Chinese samples. Eighty-three of these 90 have been sequenced by the 1000 Genomes Project. We have identified 12 568 804 single nucleotide polymorphisms, 2 074 210 short InDels, and 26 142 structural variations from these 90 samples. Compared to the Han Chinese data from the 1000 Genomes Project, we have found 7 000 629 novel variants with low frequency (defined as minor allele frequency < 5%), including 5 813 503 single nucleotide polymorphisms, 1 169 199 InDels, and 17 927 structural variants. Using deep sequencing data, we have built a greatly expanded spectrum of genetic variation for the Han Chinese genome. 
Compared to the 1000 Genomes Project, these Han Chinese deep sequencing data enhance the characterization of a large number of low-frequency, novel variants. This will be a valuable resource for promoting Chinese genetics research and medical development. Additionally, it will provide a valuable supplement to the 1000 Genomes Project, as well as to other human genome projects.


Ethics Statement
All participants in our study were drawn from the 1000 Genomes Project. All individuals consented to the use of their genomic data in the analyses of the project and to its free distribution for future studies.
Public distribution of the sequencing data, genetic variations, and genotypes was also explicitly consented to. The study was also approved by the Institutional Review Board on Bioethics and Biosafety.

Sequencing
Library preparation followed the manufacturer's instructions (Illumina). We performed cluster generation on the Illumina cluster station with the following workflow: template hybridization, isothermal amplification, linearization, blocking, denaturation, and sequencing primer hybridization. The fluorescent images were converted to sequences using the standard Illumina base-calling pipeline. We built 5 ranks of lanes with different insert sizes (170 bp, 500 bp, 2 kb, 5 kb, 10 kb, 20 kb) (Table 1). The average sequencing depth was 71.87 ± 23.52 for the Southern Han Chinese (CHS) samples and 82.36 ± 14.13 for the Northern Han Chinese (CHB) samples. The average genome coverage was 99.65% ± 0.34% for CHS and 99.60% ± 0.30% for CHB (Table 2).

Genome assembly
We used the SOAPdenovo2 algorithm [12] to assemble each individual genome de novo. Before assembly, we applied several steps to each individual's data to filter low-quality reads and correct base-calling errors. We removed reads containing adapters (match length ≥ 10 bp, mismatches ≤ 3), reads in which more than 10% of bases were N, and reads with more than 40% low-quality bases; we deduplicated reads to remove probable PCR duplicates; and we calculated the k-mer frequency of all reads to generate frequency tables and removed reads containing low-frequency k-mers.
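The read-level filters described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it covers only the N-content, low-quality, and duplicate filters, and the names `passes_filters`, `filter_reads`, and `LOW_QUAL_PHRED` are hypothetical. The text does not state the Phred cutoff for a "low-quality" base, so the value below is an assumption, as is the Phred+33 quality encoding. Exact-sequence matching stands in for PCR deduplication.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str]  # (sequence, Phred+33 quality string); encoding is an assumption

MAX_N_FRACTION = 0.10         # from the text: drop reads with more than 10% N
MAX_LOW_QUAL_FRACTION = 0.40  # from the text: drop reads with more than 40% low-quality bases
LOW_QUAL_PHRED = 7            # assumption: the text does not give the quality cutoff


def passes_filters(seq: str, qual: str) -> bool:
    """Return True if a read survives the N-content and low-quality filters."""
    if not seq:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    if n_fraction > MAX_N_FRACTION:
        return False
    low_quality = sum(1 for ch in qual if ord(ch) - 33 <= LOW_QUAL_PHRED)
    return low_quality / len(seq) <= MAX_LOW_QUAL_FRACTION


def filter_reads(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield reads that pass the filters, keeping the first copy of each sequence."""
    seen = set()
    for seq, qual in reads:
        if not passes_filters(seq, qual):
            continue
        if seq in seen:  # exact duplicate: a simplification of PCR deduplication
            continue
        seen.add(seq)
        yield seq, qual
```

Adapter trimming and k-mer frequency filtering are omitted here because they depend on adapter sequences and frequency thresholds that the text does not specify.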
Finally, we used reads with insert sizes below 2 kb to assemble the contigs and used all reads to assemble the scaffolds. The k-mer size for de novo assembly was 63, and the merge level was 2. In total, 90 genome assemblies were generated. The average genome size was 2,951,301,058 ± 12,168,854 bp. The average N50 was 2,865 ± 97 bp. The average contig size was 49,339 ± 6,088 bp.
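For readers unfamiliar with the N50 statistic reported above, the following sketch shows the usual definition: the length of the sequence at which the cumulative length, counted from the longest sequence downward, first reaches half of the total assembly length. The function name `n50` is illustrative, and the text does not say which tool computed the reported values.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total length is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

For lengths 2 through 10 (total 54), the three longest sequences sum to 27, which is half the total, so the N50 is 8.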

Structural variations calling and genotyping
We aligned the contigs to the hg19 reference genome to discover structural variants (SVs). More reliable contigs with fewer assembly errors make SV calling more reliable. SV calling involved two stages. The first stage was fast mapping, which used BWA-SW to locate contigs in the reference genome. The second stage used the LASTZ aligner, which can accommodate large gaps in the alignments. We discovered SVs by detecting large gaps in the sequence alignments, and we genotyped SVs by combining read depth and coverage in the SV regions. On average, 3102 ± 190 SVs were detected per sample.
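The idea of discovering SVs from large alignment gaps can be illustrated with a small sketch. It assumes that a contig's alignment has already been reduced to gap-free blocks on the same strand, with half-open coordinates; the names `Block`, `GapCall`, and `call_large_gaps` are hypothetical, and the 50 bp minimum size is an assumption, since the text does not state a threshold. It does not reproduce the BWA-SW or LASTZ steps themselves.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """One gap-free aligned segment (half-open coordinates)."""
    ref_start: int
    ref_end: int
    ctg_start: int
    ctg_end: int


class GapCall(NamedTuple):
    kind: str     # "deletion" or "insertion", relative to the reference
    ref_pos: int  # reference position where the gap opens
    size: int     # net number of bases missing from, or added to, the contig


def call_large_gaps(blocks: List[Block], min_size: int = 50) -> List[GapCall]:
    """Report large gaps between consecutive alignment blocks of one contig."""
    calls = []
    ordered = sorted(blocks)  # sort by reference start
    for prev, nxt in zip(ordered, ordered[1:]):
        ref_gap = nxt.ref_start - prev.ref_end
        ctg_gap = nxt.ctg_start - prev.ctg_end
        if ref_gap < 0 or ctg_gap < 0:
            continue  # overlapping blocks are ignored in this sketch
        diff = ref_gap - ctg_gap
        if diff >= min_size:
            calls.append(GapCall("deletion", prev.ref_end, diff))
        elif -diff >= min_size:
            calls.append(GapCall("insertion", prev.ref_end, -diff))
    return calls
```

A reference gap that has no counterpart in the contig suggests a deletion, and an unaligned stretch of contig with no reference gap suggests an insertion. Inversions and other rearrangements would need strand and ordering information that this sketch leaves out.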
In total, 108,434 candidate SVs were called. We used Genome STRiP [13] to genotype all of these SVs in the Chinese population, and 96,637 SVs were successfully genotyped.
Most of the deletion breakpoints were distributed in simple repeat and Alu regions (Figure 1). Of the SVs, 36,831 were located in gene regions, 35,723 in introns, 7,134 in CDS regions, 5,100 in 5′-UTR regions, and 4,686 in 3′-UTR regions.
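Annotation counts of this kind typically come from testing each SV interval for overlap with annotated features. The sketch below shows one such test on a single chromosome with half-open coordinates. The function name `annotate` and the feature tuples are illustrative assumptions, and the text does not describe the annotation tool or gene model actually used.

```python
from typing import List, Sequence, Tuple

Feature = Tuple[int, int, str]  # (start, end, label), half-open, same chromosome


def annotate(sv_start: int, sv_end: int, features: Sequence[Feature]) -> List[str]:
    """Return the sorted labels of all features that overlap the SV interval."""
    return sorted(
        {
            label
            for start, end, label in features
            if start < sv_end and sv_start < end  # half-open interval overlap
        }
    )
```

Because one SV can overlap several feature types at once, per-category counts obtained this way need not sum to the total number of SVs.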

Figure 1. The annotation results of deletion breakpoints.