Assessment of human diploid genome assembly with 10x Linked-Reads data

Abstract

Background: Producing cost-effective haplotype-resolved personal genomes remains challenging. 10x Linked-Read sequencing, with its high base quality and long-range information, has been demonstrated to facilitate de novo assembly of human genomes and variant detection. In this study, we investigate in depth how the parameter space of 10x library preparation and sequencing affects assembly quality, on the basis of both simulated and real libraries.

Results: We prepared and sequenced eight 10x libraries with a diverse set of parameters from standard cell lines NA12878 and NA24385 and performed whole-genome assembly on the data. We also developed the simulator LRTK-SIM to follow the workflow of 10x data generation and produce realistic simulated Linked-Read data sets. We found that assembly quality could be improved by increasing the total sequencing coverage (C) and keeping physical coverage of DNA fragments (CF) or read coverage per fragment (CR) within broad ranges. The optimal physical coverage was between 332× and 823×, and assembly quality worsened if it increased to >1,000× for a given C. Long DNA fragments could significantly extend phase blocks but decreased contig contiguity. The optimal length-weighted fragment length (Wμ_FL) was ∼50–150 kb. When broadly optimal parameters were used for library preparation and sequencing, ∼80% of the genome was assembled in a diploid state.

Conclusions: The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing.

We thank the reviewer for these positive comments and address each point below.
That said, I think there are some analyses missing that should be included: 1. I think you should variant call off of the de novo assemblies to see if there are any differences you are missing because you're only looking at things at a very high structural level.
We have now called SNVs and SVs from our de novo assemblies and from other methods. Please find our results in the responses to points 2-4 of Reviewer 2.
2. How is phasing affected? I don't see any data on that other than total diploid regions. You should include the changes to the phase block N50. It's mentioned in the abstract, but I don't see it anywhere else.
We have shown the trend of phase block N50 across the different Linked-Read sets in Figure S14, and we now also provide the phase block N50 values in Table S6.

3. Besides NA50 you should include assembly errors such as breakpoints, translocations, inversions, relocations, etc. You have a nice dataset here, you should try to get more out of it.
Thank you for the suggestions. We have re-run QUAST and generated several detailed statistics which are now shown in Table S4. These results are consistent with the contig N50s reported in Figure 3.
Minor comments:

58-66: Probably should add this reference for PacBio CCS sequencing, contig N50 is 15 Mb, https://www.biorxiv.org/content/10.1101/519025v2

We have added this reference.

65-66: I'd argue that this statement is a bit strong; cost is lowering, and throughput is increasing for these systems.

This is now lines 70-72. We have rephrased the sentence and now write: "However, long-fragment sequencing suffers from extremely high cost (in the case of PacBio CCS), or low base quality (in the case of single-pass reads of either technology), hampering its usefulness for personal genome assembly."

68: Not a complete sentence.

We fixed this.

Ref 27 isn't our stLFR paper; the doi for that is 10.1101/gr.245126.118, and it is commercially available now in some parts of the world.

We have added the new reference and deleted the confusing words in this sentence.
Reviewer #2: Zhang and co-authors present a parameter study for 10x linked-read sequencing experiments with the objective of evaluating the influence of experimentally controllable parameters on the final diploid assembly quality. The authors perform basic performance evaluation in terms of common metrics such as N50 values and provide technical recommendations for designing linked-read sequencing experiments. Additionally, Zhang et al. implemented a software tool for simulating linked-read sequencing data, which they use for parameter assessment given the known (simulated) truth.
While such studies that provide guidance to users of a sequencing technology are very valuable in principle, I have a number of concerns that should be addressed: 1. There is a closely related article by Luo et al. (2017, DOI: 10.1016/j.csbj.2017.002) that has been missed. The authors should clarify what the added value of their study is beyond the work by Luo et al. This comment applies to both aspects: guidance to users in terms of 10x sequencing experiments and the utility/features of their data simulation tool (note that Luo et al. also provide a simulator).
We appreciate and cite the work by Luo et al. However, our study provides (1) a more flexible simulation tool and (2) an extensive set of new sequence data.
Regarding (1): A. We explicitly allow users to input CF, CR, Wμ_FL and μ_FL, which have strong connections with library preparation and Illumina sequencing. For example, CF is driven by the input DNA amount and μ_FL by DNA preparation and potential size selection. LRSIM only lets the user set the total number of reads. B. The usability of LRTK-SIM is better than that of LRSIM. LRSIM requires many third-party packages and programs to be installed first, such as the Inline::C perl library, DWGSIM, etc., which is inconvenient for users with limited computing experience. LRTK-SIM is written in Python and requires no third-party software, so it is easy to install and get started with. LRTK-SIM can also simulate multiple libraries with a variety of parameters in parallel, so users can compare the performance of different parameters in one run.
Regarding (2) Luo et al. compared the influence of different parameters by simulation only, which does not always reflect the situation in real sequencing. In our study, we prepared six real libraries with different parameters and could validate our observations from simulation data.
2. The focus of this manuscript is on guiding researchers who are after a cost-effective characterization of individual human genomes. In my view, Zhang et al. should go the full distance and additionally compare to standard Illumina sequencing followed by mapping and variant calling as a baseline. The assembly metrics employed are not so very informative when it comes to the question of which variation (relative to the reference genome) is being missed/captured in standard approaches.
While human assembly is the focus, we believe that much of the interest in our work will come mainly from researchers who are interested in assembling novel genomes. We use human as an assembly model because assembly quality can be gauged by comparison to the reference sequence. Nonetheless ...

Beyond comparing to standard Illumina sequencing, including a detailed comparison to reference-based processing of 10x data (e.g. using Long Ranger) would be interesting. In this way, this study would be much more helpful for planning sequencing studies.
... in response to this comment, we now systematically investigate SNV and SV calls from our assemblies. We compare with standard Illumina data and reference-based processing of our 10x data. The standard Illumina data were downloaded from Genome In A Bottle and analyzed with SVABA to generate SV calls, and with BWA and FreeBayes to generate SNV calls. Long Ranger was used to generate SNVs and SVs (only deletions) for 10x reference-based analysis. We noted that R9 failed to be analyzed by Long Ranger due to its extremely large CF. We compared SNV and SV calls among the different approaches using vcfeval (https://github.com/RealTimeGenomics/rtg-tools) and truvari (https://github.com/spiralgenetics/truvari), respectively.
We found that SNVs from reference-based processing of Illumina and 10x data were comparable, and both of them were better than assembly-based SNV calls. For SVs, our assemblies generated many calls that were missed by the reference-based strategy.
We now provide several additional supplementary tables (Tables S7-S12) to present these results.
3. The main reason (in my view) for pursuing de novo assembly of human genomes is to access structural variation that is missed otherwise. An evaluation of how much structural variation is (accurately) captured would be of interest to many readers. This is actually something that the authors point out in the Discussion themselves: "Arguably, the metric that matters most in the context of a personal genome is the discovery of variation that lower-cost approaches do not enable."

As implied by the quote, we agree with the reviewer's comment. Consequently, we now compare three Linked-Read sets from HG002 with the Tier 1 SV benchmark from Genome in a Bottle using truvari (https://github.com/spiralgenetics/truvari). The results are summarized in Table S13.

4. PacBio CCS reads are available for HG002 (see Wenger et al., http://dx.doi.org/10.1101/519025). Mapping those CCS reads back to your diploid assemblies and calling variants provides an easy and powerful opportunity to assess the sequence quality from an independent technology.
These data became available while our manuscript was in review. We note that the PacBio CCS calls on HG002 are generally reasonably accurate but are not guaranteed to be correct in the absence of a gold standard. Therefore, we prefer to compare them in an overlap analysis with our calls, as opposed to implying that they are a gold standard by using the term "validation". We used vapor (https://github.com/millslab/vapor) to validate our SV calls based on PacBio CCS reads from HG002 and include Table S14 to show the validation rates.
Beyond this, your evaluation could be improved by also adding an assembly evaluation perspective that is more biologically motivated, e.g., number of recovered genes/disrupted genes or similar (this should be supported by Quast-LG/BUSCO).
We have added this analysis in Table S4.
Minor comments -line 51: pedigree-based phasing is quite powerful even for trios (where it is able to phase all variants that are homozygous in at least one individual), so I disagree with the statement that this is only feasible in large pedigrees. We fixed this and removed the confusing words.
-lines 60ff: it is unclear which study you are referring to here, please add the citation at the end of the sentence (N50 31.1Mb) We included a new reference here.
-line 68: broken sentence; also, putting the citation at the end of the sentence increases readability We fixed this issue.
-lines 71/72: again, unclear which study you are referring to ("Long Fragment Read") We included a new reference here.
-lines 125ff: is there a specific reason why five and three? (And not, e.g., five and five?) Also, the meaning of L, M, and H in the subscript of L should be explained. We generated two additional libraries (L_1L and L_1M for NA12878) to evaluate the effects of CF and CR on assembly, and we believe the trend should be consistent across the two samples. L, M and H represent low, medium and high CF in the experiments. We have clarified this in the manuscript.
-line 129: percent of what? The percentage of GEMs in the 10x Chromium system.
-line 151: please be more specific about which version of hg38 was used (detail once if identical hg38 was used throughout the rest of the paper [lines 165, 171, 195 and so on...]) The reference was downloaded from the 10x website (version GRCh38 Reference 2.1.0).
-line 172: please provide an exact reference for the high confidence regions that you used (e.g., file URL) We have added the URL in the manuscript.
-line 208: "in in" We fixed this.
-line 208: this sentence is talking about real data, so the reference to Fig 2C and 2D does not match. We clarified this in the manuscript.
-line 209: "...but not dramatically... [...] ...appreciably" -this is subjective language, please rephrase and be more fact-oriented (for instance by including the numbers you refer to in parentheses). We included the numbers and rephrased the sentence to be more fact-oriented.
-line 251: what is the denominator for these 91% all bases that are not Ns in the reference genome? (Note that for this analysis, the version of hg38 matters, see comment above). "N"s do not contribute to the denominator.
-The authors mention stLFR in line 278. There's a new preprint that's worth citing/discussing: http://dx.doi.org/10.1101/324392 We have cited their latest version.
-line 296: "extremely long" please say what extremely long means here We defined "extremely long" as DNA fragments longer than 200 kb.
-line 570: please be more specific what you mean by "in-house programs", and where the respective sources are available (is that the "Evaluate_diploid_assembly" github?) All source code for assembly evaluation is available at https://github.com/zhanglu295/Evaluate_diploid_assembly. We added this information to the sentence.
-please add a -preferably open source -license file to your github repositories We added the license files to the GitHub repositories.
-"sample prep" is jargon and should be replaced by "sample preparation" (eg. line 41, but also elsewhere) We have updated all instances of "sample prep" to "sample preparation" in the manuscript.


Introduction
The human genome holds the key for understanding the genetic basis of human evolution,

The lack of long-range contiguity between end-sequenced short fragments limits their application for reconstructing personal genomes. Long-range contiguity is important for phasing variants and dealing with complex genomic regions. For haplotyping, variants can be phased by population-based methods [4,5] or family-based recombination inference [6,7]. However, such approaches are only feasible for common variants in single individuals or when a trio or larger pedigree is sequenced. Furthermore, highly polymorphic regions such as the HLA, in which the reference sequence does not adequately capture the diversity segregating in the population, are refractory to mapping-based approaches and require de novo assembly to reconstruct [8]. Short-read/short-fragment data are challenged by interspersed repetitive sequences from mobile elements and by segmental duplications, and only support highly fragmented genome reconstruction [9,10].

In principle, many of these challenges can be overcome by long-read/long-fragment sequencing [11,12]. Assembly of Pacific Biosciences (PacBio) or Oxford Nanopore (ONT) data can yield impressive contiguity of contigs and scaffolds. In one study [13], scaffold N50 reached 31.1 Mb by hierarchically integrating PacBio long reads and BioNano for a hybrid assembly, which also uncovered novel tandem repeats and replicated the structural variants that were newly included in the updated hg38 human reference sequence. Another study [14] produced human genome assemblies with ONT data, in which a contig N50 of ~3 Mb was achieved, and long contigs covered all class I HLA regions. A recent whole-genome assembly of NA24385 [15] with high-quality PacBio CCS reads generated contigs with an N50 of 15 Mb. However, long-fragment sequencing suffers from extremely high cost (in the case of PacBio CCS), or low base quality (in the case of single-pass reads of either technology), hampering its usefulness for personal genome assembly.

Hierarchical assembly pipelines, in which multiple data types are combined, are another approach to genome assembly [16]. For example, in the reconstruction of an Asian personal genome, fosmid clone pools and Illumina data were merged; but because fosmid libraries are highly labor-intensive to generate and sequence, this approach is not generalizable to personal genomes. The "Long Fragment Read" (LFR) approach [17], where a long fragment is sequenced at high depth via single-molecule fragmented amplification, reported promising personal genome assembly and variant phasing by attaching a barcode to the short reads derived from the same long fragment.

However, because LFR is implemented in a 384-well plate, many long fragments are labelled with the same barcode, making it difficult to bin short reads, and the high sequencing depth required renders LFR not cost-effective.

An alternative approach is offered by the 10x Genomics Chromium system, which distributes the DNA preparation into millions of partitions where partition-specific barcode sequences are attached to short amplification products that are templated off the input fragments. Because of the limited reaction efficiency in each partition, the sequencing depth for each fragment is too shallow to reconstruct the original long fragment, distinguishing this approach from LFR [18].

However, to compensate for the low read coverage of each fragment, each genomic region is covered by hundreds of DNA fragments, giving overall sequence coverage that is in a range

Library preparation, physical parameters and sequencing coverage

We made six DNA preparations that varied in fragment size distribution and amount of input DNA, three each from NA12878 and NA24385. From these, we prepared eight libraries, five from NA12878 and three from NA24385 (Table S1).

Linked-Reads subsampling

The high sequencing coverage in the libraries allowed subsampling to facilitate the matching of parameters among the different libraries, for purposes of comparability; these subsampled Linked-Read sets are denoted R1-R11 (Figure 1). We aligned the 10x Linked-Reads to the human reference genome (hg38, GRCh38 Reference 2.1.0 from the 10x website), followed by removal of PCR duplicates by barcode-aware analysis in Long Ranger [21]. Original input DNA fragments were inferred by collecting the read pairs with the same barcode that were aligned in proximity to each other. A fragment was terminated if the distance between two consecutive reads with the identical barcode was larger than 50 kb. Fragments were required to have at least two read pairs with the same barcode and a length of at least 2 kb. Partitions with fewer than three fragments were removed.

We subsampled short-reads for each fragment to satisfy the expected CR.
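The fragment-inference procedure above (collect read positions per barcode, terminate a fragment at a >50 kb gap, then filter fragments and partitions) can be sketched in Python. The function name and input layout below are illustrative, not taken from the Long Ranger pipeline:

```python
GAP = 50_000      # terminate a fragment at a >50 kb gap between consecutive reads
MIN_READS = 2     # require at least two read pairs per fragment
MIN_LEN = 2_000   # require fragments of at least 2 kb
MIN_FRAGS = 3     # drop partitions (barcodes) with fewer than three fragments

def infer_fragments(positions_by_barcode):
    """Infer original long DNA fragments from aligned read-pair positions
    (in bp, on one chromosome), grouped by 10x barcode."""
    fragments_by_barcode = {}
    for barcode, positions in positions_by_barcode.items():
        fragments, current = [], []
        for pos in sorted(positions):
            if current and pos - current[-1] > GAP:
                fragments.append(current)  # gap too large: start a new fragment
                current = []
            current.append(pos)
        if current:
            fragments.append(current)
        # keep fragments passing the read-count and length filters
        kept = [f for f in fragments
                if len(f) >= MIN_READS and f[-1] - f[0] >= MIN_LEN]
        if len(kept) >= MIN_FRAGS:
            fragments_by_barcode[barcode] = kept
    return fragments_by_barcode
```

For example, a barcode whose reads fall into three well-separated clusters yields three inferred fragments, while a barcode with a single short cluster is discarded by the partition filter.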

Generating 10x simulated libraries by LRTK-SIM

To compare the observations from real data with a known truth set, we developed LRTK-SIM, a simulator that follows the workflow of the 10x Chromium system and generates synthetic Linked-Read data.

From this diploid reference genome, LRTK-SIM generated long DNA fragments by randomly shearing each haplotype, with multiple copies, into pieces whose lengths were sampled from an exponential distribution with mean μ_FL. These fragments were then allocated to pseudo-partitions, and all the fragments within each partition were assigned the same barcode. The … appear to influence assembly quality (Figure S10). In total, we generated 17 simulated Linked-Read datasets to explore the overall parameter space (Tables S2-S3) and 11 to match the parameters of the abovementioned real libraries (Figure 1).
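The shearing and partitioning step can be sketched as follows. This is a minimal illustration assuming fragment lengths are drawn from an exponential distribution with mean μ_FL; the function and parameter names are ours, not LRTK-SIM's:

```python
import random

def shear_and_partition(hap_len, mu_fl, n_partitions, seed=None):
    """Shear one haplotype copy into fragments whose lengths follow an
    exponential distribution with mean mu_fl, then deal the fragments
    into pseudo-partitions; each partition shares a single barcode."""
    rng = random.Random(seed)
    fragments, pos = [], 0
    while pos < hap_len:
        length = max(1, int(rng.expovariate(1.0 / mu_fl)))
        fragments.append((pos, min(pos + length, hap_len)))
        pos += length
    rng.shuffle(fragments)
    # round-robin assignment: partition i gets fragments i, i+n, i+2n, ...
    return [fragments[i::n_partitions] for i in range(n_partitions)]
```

Because the fragments tile the haplotype contiguously, the summed fragment lengths equal the haplotype length, so physical coverage scales directly with the number of haplotype copies sheared.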

Human genome diploid assembly and evaluation

The scaffolds were generated by the "pseudohap2" output of Supernova2, which explicitly generates two haploid scaffolds simultaneously. Contigs were generated by breaking the scaffolds wherever at least 10 consecutive 'N's appeared, per the Supernova2 definition. For the simulations of human chromosome 19, we used the scaffolds from the "megabubbles" output.
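The contig definition above (break scaffolds at runs of at least 10 consecutive 'N's) amounts to a simple split; a sketch:

```python
import re

def scaffold_to_contigs(scaffold, min_n_run=10):
    """Break a scaffold sequence into contigs at runs of >= min_n_run 'N's,
    per the Supernova2 definition described in the text."""
    pieces = re.split("N{%d,}" % min_n_run, scaffold)
    return [p for p in pieces if p]  # drop empty strings at the ends

# short N runs (<10) remain inside a contig; runs of >=10 Ns split it
print(scaffold_to_contigs("ACGT" + "N" * 10 + "TT" + "N" * 5 + "GG"))
# → ['ACGT', 'TTNNNNNGG']
```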

Contig and scaffold N50 and NA50 were used to evaluate assembly quality. Contigs longer than 500 bp were aligned to hg38 by Minimap2 [29]. We calculated contig NA50 on the basis of contig

Genomic variant calls from diploid assembly

We compared single nucleotide variants (SNVs) and structural variants (SVs) from the diploid regions of our assemblies with the ones from standard Illumina data and reference-based processing of our 10x data.

Performance of diploid assembly: influence of total coverage

Diploid assembly by Linked-Reads requires sufficient total read coverage (C = CR × CF) to generate long contigs and scaffolds. In this experiment, to explore the roles of both physical coverage (CF) and per-fragment read coverage (CR), we first generated eight simulated libraries whose total coverage C ranged from 16X to 78X: four with CR fixed and increasing CF, and four with CF fixed and increasing CR (Table S2). Contig and scaffold N50s increased with increasing either CF or CR (Figure 2A and 2B). To investigate whether the trend was also present in the real datasets, we analyzed six real libraries (three varying CF, and the other three varying CR; Figure 1): as C increased, we varied CF and CR independently by fixing the other parameter. Contig and scaffold N50s also increased in these simulated (Figure 2C and 2D) and real Linked-Read sets (Figure 2E and 2F) as a function of total coverage C. Contig lengths increased only slightly (621.4 kb to 758.1 kb for simulation; 110.7 kb to 119.6 kb for real data) when C was increased beyond 56X. Accuracy, which we define as the ratio between NA50 (N50 after breaking contigs or scaffolds at assembly errors) and N50 (Figure 2C and 2E), changed by 18% for simulation and 7% for real data (587.5 kb to 713.3 kb for simulation; 97.1 kb to 104.5 kb for real data). For scaffolds in the real data sets, when C increased from 48X (R3) to 67X (R4), both scaffold N50 and NA50 improved significantly (N50: 13.4 Mb to 30.6 Mb; NA50: 6.3 Mb to 12.0 Mb), but the accuracy dropped slightly from 46.6% to 39.1%, which indicated that scaffold accuracy may be refractory to extremely high C (Figure 2F). These results indicated that assembly length and accuracy were comparable over a broad range of CF and CR at constant C, which implied that assembly quality was mainly determined by C.
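The two quantities used throughout this section follow directly from their definitions (C = CR × CF; accuracy = NA50/N50). A quick check with the scaffold numbers quoted above; the CR/CF values in the first example are illustrative:

```python
def total_coverage(cr, cf):
    """Total sequencing coverage from per-fragment read coverage (CR)
    and physical fragment coverage (CF): C = CR * CF."""
    return cr * cf

def accuracy(na50, n50):
    """Assembly accuracy: N50 after breaking at assembly errors (NA50)
    divided by the raw N50."""
    return na50 / n50

# e.g. CR = 0.2X per fragment at CF = 300X physical coverage gives C = 60X
print(round(total_coverage(0.2, 300), 6))    # → 60.0
# real-data scaffolds at C = 67X: N50 30.6 Mb, NA50 12.0 Mb
print(round(100 * accuracy(12.0, 30.6), 1))  # → 39.2 (percent)
```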

Performance of diploid assembly: influence of fragment length and physical coverage

To investigate if input weighted fragment length (as measured by Wμ_FL) influenced assembly quality,

Overlapping the diploid regions from the assemblies of the same individual revealed that 50.24% and 67.27% of the genome for NA12878 and NA24385 (Figure S13), respectively, were diploid in all three assemblies. NA12878 was lower because of the low percentage of diploid regions in assembly R6 (…). Phase block lengths were mainly determined by total coverage C and increased in real data with increasing fragment length (Figure S14, Table S6).

Performance of diploid assembly: quality of variant calls

The ultimate goal of human genome assembly is to accurately identify genomic variants. We compared the SNVs and SVs from our assemblies with the calls from reference-based processing of standard Illumina and 10x data, and benchmarked them using gold standards from Genome in a Bottle and PacBio CCS reads.

We found the SNVs from reference-based processing of standard Illumina and 10x data were comparable, and both were better than assembly-based calls (Tables S7 and S8). For SVs, our assemblies generated many calls that were missed by the reference-based strategy (Tables S9-S12) and even by the Tier 1 benchmark of Genome in a Bottle (…).

For standard Illumina sequencing, library complexity is usually sufficient to generate tremendous numbers of reads from unique templates, and read coverage can be increased simply by sequencing more. However, the 10x Chromium system performs amplification in each partition,

and generally only about 20% to 40% of the original long-fragment sequence can be captured as short fragments and eventually as reads, resulting in shallow sequencing coverage per fragment.

Sequencing more deeply does not increase the per-fragment coverage much, as most of the extra reads are from PCR duplicates. The solution is to sequence multiple 10x libraries constructed from the same DNA preparation and merge them for analysis. This means that CR remains in the standard range where PCR duplicates are relatively rare, but CF increases proportionally to the number of libraries used. A practical limitation to this approach is that Supernova2 limits the number of barcodes to 4.8 million.
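The merging arithmetic above can be made explicit. In this sketch, the 4.8 million barcode cap is the Supernova2 limit mentioned in the text, while the function name and the per-library barcode count are illustrative assumptions of ours:

```python
SUPERNOVA2_BARCODE_LIMIT = 4_800_000  # barcode cap mentioned in the text

def merged_library_params(cr, cf_per_library, n_libraries,
                          barcodes_per_library=1_500_000):
    """Merging n 10x libraries from the same DNA preparation keeps CR in
    the normal range (PCR duplicates stay rare) while CF, and hence
    C = CR * CF, scales with the number of libraries merged."""
    total_barcodes = barcodes_per_library * n_libraries
    if total_barcodes > SUPERNOVA2_BARCODE_LIMIT:
        raise ValueError("barcode count exceeds the Supernova2 limit")
    cf = cf_per_library * n_libraries
    return {"CR": cr, "CF": cf, "C": cr * cf}
```

With the illustrative default, three merged libraries stay under the barcode cap while tripling CF; a fourth library would exceed it.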

Our results showed that in practice, CF should be between 335X and 823X, but no larger than 1,000X, given the optimal coverage of C = 56X recommended by 10x and the requirement for sufficient per-fragment read coverage. Surprisingly, we observed that including more extremely long fragments was detrimental to assembly quality, possibly due to the loss of barcode specificity for fragments spanning repetitive sequences. From a computational perspective, too many long fragments also complicate deconvolution of the de Bruijn graph, as more complex paths need to be picked out. In our experiments, Wμ_FL between 50 kb and 150 kb was the best choice to generate reliable assemblies.

Parameters driving assembly quality

Our results regarding assembly quality, and the 10x parameters that influence it, may be useful for efforts in which de novo assemblies are important for generation of an initial reference sequence. We show that maximization of N50 does not necessarily reflect assembly quality, which we were able to compare to NA50 because there exists a high-quality human reference genome. Contig and scaffold lengths mostly increased with ascending sequencing coverage, and at sufficient overall sequence coverage it did not matter much whether the increasing coverage C was accomplished by increasing CR or CF. However, both contig and scaffold accuracy decreased with increasing C. We also found, counterintuitively, that contig and scaffold length mostly decreased with increasing fragment length, a phenomenon that may be due to the specific implementation; however, until there is another assembler that can be compared to Supernova2 it will not be possible to reason about this effect. In addition, intrinsic properties of the genome matter greatly, as removal of repeats or lack of variation dramatically improves assembly quality.

Diploid assembly is the appropriate approach for assembly of genomes of diploid organisms that harbor variation. Therefore, an important metric to evaluate diploid assembly is the fraction of the genome that is assembled in a diploid state. The short input fragment length of R6 resulted in roughly 20% less of the genome in a diploid state (<60% vs <80%) compared to the other libraries of the same individual. This observation suggests that in addition to metrics such as N50, evaluation of assembly quality should also include the fraction of the genome (or the assembly) that is in a diploid state.

Cost-benefit analysis

Overall, we have attempted to give practical guidelines for assembly of 10x data with Supernova2 and to evaluate the performance across a wide range of metrics. Arguably, the metric that matters most in the context of a personal genome is the discovery of variation that lower-cost approaches do not enable. We estimate that the cost increase over standard Illumina sequencing is about 2x, given the 10x preparation cost and the higher level of sequence coverage required. There may be many applications for which this combination of excellent single-nucleotide variant detection (via barcode-aware read mapping) and precise structural variant discovery (via assembly), achieved by the same data set, is worth the price.

Comparison with hybrid assemblies

Hybrid assembly strategies have been applied successfully to produce human genome assemblies of long contiguity [13,14,41]. In these studies, long contigs are first produced by single-molecule …

[Figure 2 legend, fragment: … to 300X in C and D; CR was fixed to 0.2X and CF was fixed to 300X in E and F.]

Table 1. Genomic coverage of contigs generated by Supernova2. Non-PAR: non-pseudoautosomal regions of the X chromosome. R6, R7 and R8 are female; R9, R10 and R11 are male.