De novo assembly of a Tibetan genome and identification of novel structural variants associated with high-altitude adaptation

Abstract Structural variants (SVs) may play important roles in human adaptation to extreme environments such as high altitude but have been under-investigated. Here, combining long-read sequencing with multiple scaffolding techniques, we assembled a high-quality Tibetan genome (ZF1), with a contig N50 length of 24.57 mega-base pairs (Mb) and a scaffold N50 length of 58.80 Mb. The ZF1 assembly filled 80 remaining N-gaps (0.25 Mb in total length) in the reference human genome (GRCh38). Markedly, we detected 17 900 SVs, among which the ZF1-specific SVs are enriched in GTPase activity that is required for activation of the hypoxic pathway. Further population analysis uncovered a 163-bp intronic deletion in the MKL1 gene showing large divergence between highland Tibetans and lowland Han Chinese. This deletion is significantly associated with lower systolic pulmonary arterial pressure, one of the key adaptive physiological traits in Tibetans. Moreover, with the use of the high-quality de novo assembly, we observed a much higher rate of genome-wide archaic hominid (Altai Neanderthal and Denisovan) shared non-reference sequences in ZF1 (1.32%–1.53%) compared to other East Asian genomes (0.70%–0.98%), reflecting a unique genomic composition of Tibetans. One such archaic hominid shared sequence—a 662-bp intronic insertion in the SCUBE2 gene—is enriched and associated with better lung function (the FEV1/FVC ratio) in Tibetans. Collectively, we generated the first high-resolution Tibetan reference genome, and the identified SVs may serve as valuable resources for future evolutionary and medical studies.

Strategy-2: with the contigs assembled from PacBio reads, we first hybridassembled them into preliminary scaffolds using BioNano optical map using hybrid scaffold pipeline (same as above). Then we mapped 10X Genomics linked-reads with the preliminary scaffolds based on the supported linked-reads (Supplementary Fig. 2b and Supplementary Table 3).
Next, we aligned the Hi-C reads using BWA [2] with default parameters. Long scaffolds within each chromosomal linkage group were then assigned based on the Hi-C-based proximity-guided assembly. The original cross-linked long-distance physical interactions were then processed into paired-end sequencing libraries. First, all the reads from the Hi-C libraries were filtered by the HiC-Pro software (v2.8.1) [3], and the paired-end reads were uniquely mapped onto the draft assembly scaffolds, which were then grouped into 24 chromosome clusters using SALSA software [4] (Supplementary   Table 23 and 24). The clustering errors were corrected by referring to GRCh38.
PBJelly v.15.8.24 [5] was used to close gaps of draft genomes. Briefly, all the gaps (length ≥25 bp) on the assembly were identified. Then, the long Pacbio reads were aligned to the scaffold genome using PBJelly. After read alignment, the supporting procedure was parsed by checking the multi-mapping information. After the gapsupporting sequence reads are identified, PBJelly assembles the reads for each gap to generate a high-quality gap-filling consensus sequence. Finally, the assembly was polished with Illumina reads by aligning the paired-end short-reads to the assembly using BWA. Picard was used to remove duplications within reads, and base-correction of the assembly was performed using Pilon [6] (Supplementary Fig. 2).

Gap closure in the human reference genome GRCh38
We closed the gaps in the human reference genome (GRCh38) by using the approach of the previous study [8]. A region consisting of continuous runs of Ns in the GRCh38 was defined as a gap. We extracted these GRCh38 gaps based on the BED file format, and the 5kb flanking sequences upstream and downstream of the gaps were mapped to the assembly by MUMmer (nucmer -f -r -l 15 -c 25). A gap is defined as closed only if the two flanking sequences in GRCh38 could both be aligned to the ZF1 assembly with consistent orientation, and the aligned length is over 2.5kb. The added bases were precisely counted according to the position of the two flanking sequences on the ZF1 assembly.

Evaluation of consensus quality and sequence quality
Consensus quality of the ZF1 assembly was evaluated by comparing each chromosome with the reference genome GRCh38 using MUMmer [9] (arguments: nucmer --mum -c 1000 -l 100; delta-filter -i 85 -l 1000 -1).

Detection of SVs
For long-read PacBio data, we used mapping software NGLMR [15] to align the errorcorrected reads ('preads') from Falcon output to the human reference genome GRCh37.
We used GRCh37 instead of GRCh38 for SV detection because the majority of the previously reported SVs were based on GRCh37, and the downstream analyses included the comparisons of these SVs among different populations. Then we used Sniffles [15] to call SVs from the bam file and we required each variant with support from at least ten reads. For the NGS short-read Illumina data, we mapped the reads to the human reference genome GRCh37 using BWA. After sorting and removing duplicates, we used CNVnator [16], Pindel [17], Lumpy [18], BreakDancer [19] and BreakSeq2 [20] to call SVs. We further merged the results from five algorithms by MetaSV [21]. We also used PopIns [22] to call the non-reference non-repetitive insertions from Illumina data. For BioNano data, we detected SVs by Irys Solve (v3.1).
For the 10X Genomics data, Long Ranger (v2.1.6) was applied to call genetic variants including SVs, SNVs and INDELs. The SNVs and INDELs were used in the phasing analysis.

Estimation of SV mutation rate
We used the Watterson's to estimate the mutation rate, which is calculated as below: where K is the total number of segregating SV sites and n is the number of haploid genomes. Then we assumed the effective population size Ne=10,000 [23] and calculated the mutation rate µ as below:

SV annotation and enrichment analysis
Annotation for ZF1 SVs were defined by VEP (http://www.ensembl.org/info/docs/tools/vep/index.html) [24]. Repeat analysis for SVs region were used by RepeatMasker v4.0.1. Function enrichment analysis was perform by DAVID v6.7 [25]. In addition, we calculated the odds ratio to evaluate the enrichment of the genes affected by SVs in a set of priori candidate genes (the hypoxia regulatory genes or previously reported adaptive genes in Tibetans).
Odds Ratio = ( 9 ( 9 where S 1 denotes the number of SV genes presenting in the priori candidate gene list; S 0 denotes the number of SV genes absent from the priori candidate gene list; N 1 denotes the number of non-SV genes presenting in the priori candidate gene list; N 0 denotes the number of non-SV genes absent from the priori candidate gene list. The sum of S 1 , S 0 , N 1 and N 0 is the total number of genes across the genome. An odds ratio significantly above 1 (p < 0.05, the Chi-squared test) indicates that the SV genes are enriched in the priori gene set.

Overlap enrichment analysis of SVs versus genomic elements
We performed permutation tests for functional genomic elements overlapped with four different class of SVs (DEL, DUP, INS and INV). The genomic elements used in this study were from previous study [23]. The null distribution (random background) of the overlap counts was calculated from the true genomic elements overlapped with the random shuffled SV locations. We shuffled each type of SV for 1,000 times, and generated 1,000 random SV sets. The same number of SVs and the same length distribution of SVs of each SV type was kept in each shuffled set. We adopted log2 fold change of the observed overlap statistic versus the mean of the null distribution to present the enrichment of genomic element-SV overlap.

SV genotype estimation using NGS data
We

Simulation of SV frequency differentiation between Tibetans and Han Chinese
We employed ms [31] to generate a null V ST distribution to assess the SV frequency differentiation between Tibetans and Han Chinese under neutral evolution. Following previous studies [26], we assumed that Tibetans and Han Chinese split 10,000 years ago (T3), and after the divergence, a bottleneck event in Tibetans occurred till 9,000 years before present (T2). We also considered an exponential growth of effective population size (Ne) for Han Chinese starting at 2,000 years before present (T1). We assumed the Ne of the Tibetan-Han Chinese common ancestry (N1) to be 20,000, the Ne at T2 in Tibetan to be 5,000 (N2), and the Ne for present Tibetans and Han Chinese at T3 to be 20,000 (N3) and 50,000 (N4) respectively ( Supplementary Fig. 21). We assumed generation time of 25 years and the mutation rate of SV to be 10 -5 per generation [32]. The following ms command was used to perform the simulation:

SV validation using PCR and Sanger sequencing
We genotyped the candidate SVs in ~900 unrelated Tibetans and ~100 unrelated Han Chinese samples using PCR and Sanger sequencing. The primers were designed by Primer Premier 5, the extended 200bp sequences were included at SV breakpoints as PCR target and sequenced using Sanger sequencing.

Physiological traits measurement and association analysis of Tibetan populations
We collected physiological traits data and blood samples from 1,039 Tibetan volunteers who are native residents at the sampled locations, and they were from three different altitude regions in Tibet, including Lhasa (elevation: 3,658m), Bange (elevation: 4,700m) and Langkazi (elevation: 5,108m). We also sampled a Han Chinese population (n=100) from Dalian, China (elevation: 60m) as the reference. We filtered the samples based on the following criteria: 1) healthy (by physical examination and self-report); 2) normotensive, non-anemic, normal pulmonary function and non-pregnant; 3) 18≤age≤70; 4) nonsmoking (by self-report). The ethnic identity was confirmed by selfclaims and by report of the first language learned, and related individuals were excluded.
Written informed consents were obtained from all participants. The HB concentration and other blood parameters were measured immediately using an automated hematology analyzer (Sysmex pocH-100i, Japan). SPO2 was measured at forefinger tip with a hand-held pulse oximeter (Nellcor NPB-40, CA) at rest, and the fingertip was cleaned with alcohol swab before measurement. Serum NO levels were measured by protocol as we described before [33]. For lung function test, we performed on a Microlab Spirometer 3500K, version 5.X.X Carefusion (Micro Medical Ltd.,

Rochester, United Kingdom) according to the ATS (American thoracic society)
recommendation and the international standardized guideline [34]. Subjects were kept sitting position with a nose clip, and after two or three slow vital capacity tests, we collected the results of at least three forced vital capacities. The highest of the recorded FVC values was reported in the present study. PEF, MVV, FEF and FEV1 were measured as primary data. Stringent quality control was conducted in the entire procedure.
Genetic association analysis of physiological traits was performed using PLINK 1.07 [35]. We used additive model to evaluate the association between SVs and or ZF1 contigs) using bwa aln [2]. We treated all the archaic pair-end RURs as singleend reads during mapping. After mapping, we considered the regions with archaic RURs reaching an average depth with range between 1/3 and 1.5 folds of the genomewide depth of the archaic reads mapped to reference human GRCh37 (i.e. Neanderthal: (17,75), Denisovan: (10, 45)), as such depth range indicates that the archaic genome might likely contain one or two copies of these sequences. We further used Lastz (-notransition --nogapped -step=20 -filter=identity:90 -filter=coverage:90) [30] to align the sequences of these regions to GRCh37 and removed the sequences with highsimilarity in the human reference genome. The proportion of each individual genome sharing the novel sequences with archaic hominids was calculated as the total region length with the depth falling the range above divided by the total size of the individual genome.
To obtain the positions of these sequences regarding the human reference genome, we aligned individual contigs (ZF1, HX1 and AK1) with GRCh37 using MUMmer [9], and we only considered the sequences where their flanking positions could be determined based on GRCh37 coordinates. As shown in Supplementary Fig. 22, we required both the aligned segments larger than 500-bp (c>500 & d>500) and filtered out the contigs with less than 50% coverage of alignments (a/(c+d)<0.5). We required the gap between two alignments on ZF1 contig to be larger than that on human reference genome (a>b). The region with archaic reads mapping must contain more than five reads, and the region length must be greater than half of the gap between two alignments on ZF1 contig (a'>a/2). The inserted sequences meeting all the above conditions were considered as novel sequences with clear positions on the human reference genome. As the sub-telomeric and sub-centromeric regions contain lots of repeats, we further removed the sequences located within 1Mb of telomeres or centromeres. We focused on the sequence that the archaic RUR could only align to one of the three Asian individual contigs but not the other two individual contigs. Such sequences were referred as individual-specific novel sequences shared with archaic hominids. In order to check whether these individual-specific novel sequences could be found in other modern humans other than East Asians, by using Lastz (--notransition --nogapped -step=20 -filter=identity:95 -filter=coverage:95), we further aligned the ZF1-specific sequences to two additional modern human de novo assemblies (NA12878 and NA19240 downloaded from NCBI with accession PRJNA323611) [12], which represent European and African genomes respectively. To assess whether the ZF1specific sequences present in non-human primates, using Lastz (--notransition -nogapped -step=20 -filter=identity:80 -filter=coverage:80), we aligned the sequences to the genomes of chimpanzee, gorilla and orangutan [12].

Estimation of EHH
To examine whether there is a selective sweep around the SCUBE2 622-bp insertion (Chr11: 9068607) and the MKL1 163-bp deletion (Chr22:40935468-40935631), we first phased the genotypes in the 1-Mb region around these variants (Chr11: 8568607-whole-genome sequences of 39 Tibetan highlanders and 38 Han Chinese lowlanders, using SHAPEIT v2 [39] (r837) without any reference populations. We then calculated the EHH statistics for the focal SVs compared to the alternative alleles in these two populations, and the result was visualized using an R package rehh [40] with default parameters. The ancestral allele for each locus was determined according to the ancestral sequences released by the 1000 Genomes Project.