Identifying, understanding, and correcting technical biases on the sex chromosomes in next-generation sequencing data

Mammalian X and Y chromosomes share a common evolutionary origin and retain regions of high sequence similarity. This sequence homology can cause the mismapping of short sequencing reads derived from the sex chromosomes and affect variant calling and other downstream analyses. Understanding and correcting this problem is critical for medical genomics and population genomic inference. Here, we characterize how sequence homology can affect analyses on the sex chromosomes and present XYalign, a new tool that: (1) facilitates the inference of sex chromosome complement from next-generation sequencing data; (2) corrects erroneous read mapping on the sex chromosomes; and (3) tabulates and visualizes important metrics for quality control such as mapping quality, sequencing depth, and allele balance. We show how these metrics can be used to identify XX and XY individuals across diverse sequencing experiments, including low and high coverage whole genome sequencing, and exome sequencing. We also show that XYalign corrects mismapped reads on the sex chromosomes, resulting in more accurate variant calling. Finally, we discuss how the flexibility of the XYalign framework can be leveraged for other use cases including the identification of aneuploidy on the autosomes. XYalign is available open source under the GNU General Public License (version 3).


Introduction
Out of the urgent need to understand the effects of sex chromosome homology on 103 next-generation sequencing analyses, in this paper we first test whether sequence 104 homology between sex chromosomes can confound aspects of read mapping and lead to 105 downstream errors in sequence analysis. We then present XYalign, a tool developed to 106 perform three major tasks: (1) aid in the characterization of an individual's sex 107 chromosome complement; (2) identify and correct for technical artifacts arising from sex 108 chromosome sequence homology; and (3) tabulate and visualize important metrics for 109 quality control such as mapping quality, sequencing depth, and allele balance. We show 110 how XYalign can be used to identify XX and XY individuals across sequencing depths 111 and capture techniques. We also show that the default steps taken by XYalign correct 112 many mismapped reads on the sex chromosomes, resulting in more accurate variant 113 calling. Finally, because XYalign is designed to be both scalable and customizable, we 114 discuss how it can be used in a variety of situations including genetic sex identification in 115 both XX/XY and ZZ/ZW systems, identification of sex-linked sequences and 116 pseudoautosomal regions in new draft genomes, correction of technical biases in genomic 117 and transcriptomic data, detection of aneuploidy, and investigation of mapping success 118 across arbitrary chromosomes. 119 120 Methods 121 Implementation 122 XYalign is implemented in Python and uses a number of third-party Python 123 packages including Matplotlib (Hunter, 2007), NumPy (Oliphant, 2006 (https://github.com/pysam-developers/pysam), and SciPy (Jones et al., 2001). It further 126 wraps the following external tools: repair.sh and shuffle.sh from BBTools 127 (https://sourceforge.net/projects/bbmap/), BWA (Li, 2013), Platypus (Rimmer et al.,128 2014), Sambamba (Tarasov et al., 2015), and SAMtools (Li et al., 2009). 129 XYalign is composed of six modules that can be called individually or serve as 130 steps in a full pipeline: PREPARE_REFERENCE, CHROM_STATS, ANALYZE_BAM, 131 CHARACTERIZE_SEX_CHROMS, STRIP_READS, and REMAPPING. Below, we 132 discuss each module as a step in the full XYalign pipeline using human samples (XX/XY 133 sex determination) as an example. Note, however, that XYalign will work with other sex 134 chromosome systems (e.g., ZZ/ZW) and on arbitrary chromosomes (e.g., detecting 135 autosomal aneuploidy). We describe examples of XYalign commands in the section titled 136 "Use Cases." 137 The PREPARE_REFERENCE module generates two versions of the same 138 reference genome: one for the homogametic sex (e.g., XX) and one for the heterogametic 139 sex (e.g, XY). In the simplest case, it will completely hard-mask the Y chromosome with 140 Ns in the XX version of the reference. Optionally, it will also accept one or more BED 141 files containing regions to hard mask in both reference versions. If pseudoautosomal 142 regions (PARs) are present on both sex chromosome sequences in the reference, we 143 strongly suggest masking the PARs on the Y chromosome, allowing reads from these 144 regions to map exclusively to the X chromosome in XY individuals. In XYalign, we use 145 hard masks, rather than omitting the Y chromosome in the XX reference version because 146 these hard masks allow files from both references to share the same sequence dictionaries 147 and indices, thus permitting seamless integration of files from both references into 148 downstream analyses (e.g., joint variant calling).

149
The CHROM_STATS module provides a relatively quick comparison of mapping 150 quality and sequencing depth across one or more chromosomes and over multiple BAM 151 files. While this provides a less detailed perspective than ANALYZE_BAM or 152 CHARACTERIZE_SEX_CHROMS (detailed below), we envision it to be especially 153 useful in at least two different cases. First, in well-characterized systems (e.g., human), 154 comparing chromosome-wide values of mean mapping quality and depth represent a 155 quick and often sufficient way to identify the sex chromosome complement (e.g., XX or 156 XY) of individuals across a population. Second, in uncharacterized systems, the 157 CHROM_STATS output provides information that can help with the identification of 158 sex-linked scaffolds. It is important to note, however, that results for both cases will vary 159 based on ploidy and with differences in the degree of sequence homology between the 160 sex chromosomes. 161 The ANALYZE_BAM module runs a series of analyses designed to aid in the 162 identification of sex-linked sequence and characterize the sex chromosome content of an 163 individual. In doing so, it provides more detailed metrics than CHROM_STATS. site, as well as mean read balance and variant count per genomic bin or window across a 171 chromosome. We anticipate these data will not only be useful for masking regions 172 containing incorrect genotypes but will also aid in the identification of PARs as well.

173
XYalign next traverses the BAM file, calculating mean mapping quality and an 174 approximation of mean depth in windows across the genome. During traversal, secondary 175 and supplementary read mappings are ignored, and depth is calculated as the total length 176 of all reads mapping to a genomic window divided by the total length of the window. We 177 have found that this heuristic approximation is very similar to calculations of exact depth, 178 particularly as window sizes increase, and is much faster to compute across entire 179 chromosomes. XYalign will output a table containing genomic coordinates, mean depth, 180 and mean mapping quality for each window. It will then filter windows based on user-181 defined thresholds of mean depth and mapping quality and output two BED files 182 containing windows that passed and failed these thresholds, respectively, which can be 183 used for additional masking in downstream applications. Finally, XYalign will output 184 plots of mapping quality and depth in each window across each chromosome. 185 After running ANALYZE_BAM, the windows meeting thresholds can be used by 186 the CHARACTERIZE_SEX_CHROMS module to systematically compare mean depth 187 in pairs of chromosomes using three different approaches. The first is a bootstrap analysis 188 that provides 95% confidence intervals of mean window depth for each of the 189 chromosomes in a given pair to test for overlap. The second is a permutation analysis to 190 test for differences in depth between the two chromosomes. The third is a two-sample 191 Kolmogorov-Smirnov test (Massey Jr., 1951 genome output from PREPARE_REFERENCE. Finally, XYalign will re-run the 215 ANALYZE_BAM module to analyze the remapped BAM file and provide metrics to 216 allow a before-and-after comparison. 217 While we anticipate that this full pipeline will be useful in certain situations, it is 218 neither the only nor the best-suited option for most users. Rather, we expect that most 219 users will call modules individually. We give examples of other implementations below 220 and provide recommendations for incorporating XYalign into bioinformatic pipelines in 221 the discussion.

223
Operation 224 XYalign is available via PyPI (https://pypi.python.org/pypi), Bioconda (Grüning 225 et al., 2017), and Github (https://github.com/WilsonSayresLab/XYalign), with 226 documentation hosted at Read the Docs (http://xyalign.readthedocs.io/en/latest/). A full 227 environment containing all dependencies can be most easily installed and managed using 228 Anaconda (https://www.continuum.io/) and Bioconda (Grüning et al., 2017). It has been 229 tested on a variety of UNIX operating systems (including Linux and MacOS), but it is not 230 currently supported for the Windows operating system. 231 XYalign is typically invoked from the command line, but it can be imported into 232 Python scripts for more customized use cases. The next section lists a number of example 233 commands that illustrate how to call the full pipeline as well as individual modules.

235
Use Cases 236 To highlight some features of XYalign and its flexibility, we used two datasets 237 from publicly available sources (Supplemental Table S1): (1) exome, low-coverage 238 whole-genome, and high-coverage whole-genome sequencing data from one male 239 (HG00512) and one female (HG00513) from the 1000 Genomes Project ( Finally, in each region, we counted variants present in a given filtered VCF file using the 337 BEDtools (Quinlan and Hall, 2010) "intersect" command: where <BED_file> is the BED file containing genomic coordinates (Supplemental Table  342 S2).

344
Specific commands 345 We This region retains more than 98% sequence similarity between the X and Y chromosome 361 (Ross et al., 2005), likely leading to the reduction in mapping quality. Interestingly, we 362 observe a similar decrease in mapping quality on the Y chromosome beginning near 2.9 363 Mb and ending near 6.6 Mb, corresponding with known coordinates of the XTR on the Y 364 chromosome (Figure 3). In fact, integrating mapping quality and depth recapitulates 365 established genomic features of both sex chromosomes (e.g., ampliconic regions, PARs, 366 and XTRs) described in previous studies (Figures 1-3 reference genome, we observed clear improvements in read mapping (Figures 1-2). On 398 the X chromosome, all metrics exhibited striking improvements in PAR1, PAR2, and 399 XTR (Figures 1 and 2). Furthermore, the Y chromosome of the XX individual no longer 400 exhibited any variant calls or mapped reads, though many passed filters before processing 401 (variants before: 4266; variants after: 0; mapped reads before: 5,729,007; reads mapped 402 after: 0). While this is expected given the hard masking of the Y chromosome, it is worth 403 emphasizing that this is consistent with the biological state of the individual. 404 We found that these improvements in mapping on the X chromosome after 405 masking the Y chromosome substantially impacted downstream variant calling (  Inferring Genetic Sex 438 In our analyses, the most striking measure for assessing an individual's sex 439 chromosome complement was the distribution of read balances across a chromosome 440 (Figure 4). Specifically, when we plotted the distribution of the fraction of reads 441 containing a nonreference allele at a given variant site, we observed that diploid 442 chromosomes (e.g., autosomes, and chromosome X in XX individuals) exhibited peaks 443 both around 0.5 and 1.0, consistent with the presence of heterozygous sites and sites 444 homozygous for a nonreference allele, respectively (Figure 4). In the case of the X 445 chromosome in XY individuals, we observed a single peak near 1.0, consistent with an 446 expected haploid state (i.e., no heterozygous sites; Figure 4). We observed one exception 447 to this pattern: the Y chromosome exhibited a peak around 0.2 in addition to the one near 448 1.0 ( Figure 4). All variants included in analyses met thresholds for depth, site quality, and 449 genotype quality, so quality does not appear to be a driving factor of this pattern. This 450 pattern also remained after genomic windows of low mapping quality and irregular depth 451 were removed. We are currently unable to explain these results and more work is thus 452 required to understand the factors responsible for this pattern and whether similar results 453 are obtained on the W chromosome in ZW systems. 454 Across datasets, we observed variation in relative depth of the X and Y 455 chromosomes in XX and XY individuals, particularly among different sequencing 456 strategies: exome, low-coverage whole-genome, and high-coverage whole-genome 457 sequencing ( Figure 5A). However, within datasets, XX and XY individuals were clearly 458 differentiated ( Figure 5; Supplemental Figure S1). This pattern suggests that a general 459 threshold for assigning different genetic sexes across a range of organisms and 460 sequencing experiments might be difficult to implement. That being said, within species, 461 some combination of depth, mapping quality, and read balance is likely to be informative.

462
For example, in humans, relative mapping quality appears to be informative in some 463 sequencing strategies, particularly exome sequencing ( Figure 5B). This should be 464 explored in each experiment, however, as we did not observe this differentiation in the 465 uncorrected 1000 Genomes high-coverage samples (Supplemental Figure S2). 466 Generating these results for all individuals in a study is easy to do with XYalign: XYalign might not be the most appropriate option for detecting local phenomena such as 544 copy number variants. Finally, XYalign may also be extended to other types of data, 545 including RNA sequencing data, where the same fundamental challenge (gametologous 546 sequence between the X and Y) can affect mapping and variant calling. In particular, we 547 expect biases to manifest in differential expression and biased-allelic expression, and 548 suggest that the PREPARE_REFERENCE module be considered for all RNA sequencing 549 experiments in systems with sex chromosomes. 550 551 Conclusion 552 We showed that the complex evolutionary history of the sex chromosomes creates 553 mapping biases in next-generation sequencing data that have downstream effects on 554 variant calling and other analyses. These technical artifacts are likely present in most 555 genomic datasets of species with chromosomal sex determination. However, many of 556 these biases can be corrected through the strategic use of masks during read mapping and 557 the filtering of variants. We developed XYalign, a tool that facilitates the characterization 558 of an individual's sex chromosome complement and implements this masking strategy to 559 correct these technical biases. We illustrated how XYalign can be used to identify the 560 presence or absence of a Y chromosome, characterize mapping biases across the genome, 561 and correct for these mapping biases. XYalign provides a framework to generate more 562 robust short read mapping and improve variant calling on the sex chromosomes. 563 564 Software Availability 565 XYalign is available on Github (https://github.com/WilsonSayresLab/XYalign) under a 566 GNU General Public License (version 3). We have also deposited a static version of the 567 source code used for analyses in this paper at Zenodo (Webster et al., 2018). 568 569 Author Contributions 570 MAWS and THW conceived the research. All authors participated in the initial design of 571 the software. THW was responsible for subsequent design, development, and 572 implementation of the software. BG, EK, TNP, WW, and THW tested the software.

573
THW analyzed the data. THW and MAWS wrote the manuscript. All authors were 574 involved in the revision of the manuscript and have agreed to the final content. 575 576 Competing Interests 577 578 No competing interests were disclosed.