hapCon: estimating contamination of ancient genomes by copying from reference haplotypes

Abstract Motivation Human ancient DNA (aDNA) studies have surged in recent years, revolutionizing the study of the human past. Typically, aDNA is preserved poorly, making such data prone to contamination from other human DNA. Therefore, it is important to rule out substantial contamination before proceeding to downstream analysis. As most aDNA samples can only be sequenced to low coverages (<1× average depth), computational methods that can robustly estimate contamination in the low coverage regime are needed. However, the ultra low-coverage regime (0.1× and below) remains a challenging task for existing approaches. Results We present a new method to estimate contamination in aDNA for male modern humans. It utilizes a Li&Stephens haplotype copying model for haploid X chromosomes, with mismatches modeled as errors or contamination. We assessed this new approach, hapCon, on simulated and down-sampled empirical aDNA data. Our experiments demonstrate that hapCon outperforms a commonly used tool for estimating male X contamination (ANGSD), with substantially lower variance and narrower confidence intervals, especially in the low coverage regime. We found that hapCon provides useful contamination estimates for coverages as low as 0.1× for SNP capture data (1240k) and 0.02× for whole genome sequencing data, substantially extending the coverage limit of previous male X chromosome-based contamination estimation methods. Our experiments demonstrate that hapCon has little bias for contamination up to 25–30% as long as the contaminating source is specified within continental genetic variation, and that its application range extends to human aDNA as old as ∼45 000 and various global ancestries. Availability and implementation We make hapCon available as part of a python package (hapROH), which is available at the Python Package Index (https://pypi.org/project/hapROH) and can be installed via pip. The documentation provides example use cases as blueprints for custom applications (https://haproh.readthedocs.io/en/latest/hapCon.html). The program can analyze either BAM files or pileup files produced with samtools. An implementation of our software (hapCon) using Python and C is deposited at https://github.com/hyl317/hapROH. Supplementary information Supplementary data are available at Bioinformatics online.


Calculating Confidence Interval via 14.7% Likelihood Region
Our model is defined only for contamination within range 0 to 1. Therefore, when the maximum 7 likelihood estimateĉ for the contamination rate c is 0 the first derivative there might not be zero. 8 In this case, we approximate the likelihood curve using quadratic interpolation. Specifically, let 9 l(x) be the likelihood of the model at contamination rate c = x, then, for x close enough to 0, we The log of likelihood ratio −2(l(θ) − l(θ M LE )) asymptotically approaches χ 2 1 as a result of Wilk's estimates with little bias for all coverages and contamination rates tested here (Fig. S1), thereby 45 confirming the correctness of our implementation in under the model simulations.

47
In this simulation scenario, we randomly sampled one sequence for each marker covered by at 48 least one sequence. This procedure simulates how the so-called pseudohaploid data is gener-49 ated from aDNA data. We note that, even though each site is only covered by one sequence, a 50 high coverage sample will have more sites covered and therefore still contain more information 51 than a low coverage sample. Our results show that our method produces robust contamina-52 tion estimates even for such pseudohaploid data (Fig. S2), which none of the existing male X 53 chromosome contamination estimation tools can do. Compared with Fig. S1 where the full read 54 counts data is used, the estimates obtained from pseudohaploid data only have minimal up-55 ward bias. In practice, we recommend using our method on read counts directly generated 56 from a BAM file to make full use of all information, but this simulation of pseudohaploid data 57 demonstrates the power of utilizing haplotype structure for estimating contamination.  we simulated 10% contamination as described in Section 2.1 except that we varied the miscopy-97 ing rate from 1e −4 to 1e −2 , evenly spaced on a log scale. We then estimated contamination rates 98 when setting ϵ r = 1e −3 for all simulated samples.

99
The results indicate that the contamination estimate is robust to mis-specified copying error 100 rate (Fig. S4) Figure S3: Effect of Mis-specified Error Rate at Various Coverages assumed, upward bias remains small. Similar to the case of mis-specified error rate discussed 102 above, we observe that bias remains on the same order of magnitude as the mis-specification 103 of mis-copying rate. In empirical data the error rate is usually on the order of 1e −3 − 1e −4 , 104 depending on the genetic distance between the endogenous haplotype and the modern haplo-105 types, whereas one wishes to estimate contamination on the order of 1e −2 or higher. Therefore, 106 mis-specified mis-copying rate should not introduce substantial biases in empirical analyses.

108
The haplotype copying jump rate models the rate of jumping to a different haplotype to copy 109 from. Following Ringbauer et al. [2021], who described that ρ = 300 yields good performance 110 for most modern human ancient DNA data when using Li&Stephens model for inferring ROH, 111 we set ρ = 300 as the default value. We then assessed whether our model is robust with respect 112 to a mis-specified jump rate. To do so, we simulated 10% contamination as described in Sec-   tion 2.1 except that we varied the jump rate from 100 to 1000, evenly spaced on a log scale. We 114 then estimated contamination rate when setting the jump rate to ρ = 300. 115 The results show that the estimated contamination remains unbiased for a wide range of 116 simulated jump rates when using the default value ρ = 300 (Fig. S5). Only at high jump rates 117 close to ρ ′ = 1000, we observe some upward bias. This upward bias remains minimal at rela-

125
Once an organism dies, its DNA begins degrading. In particular, the deamination process turns   Our results indicate that, post-mortem damage leads to little bias for our contamination 155 estimates ( Fig. S7, S8, S9), even for data without any UDG treatment and with 5' terminal C→T 156 transition rate as high as 36.6% (e.g, in Bacho Kiro CC7-335). 157 We also examined how different levels of post-mortem damage affect the parameter ϵ g , 158 which is the error rate per aligned aDNA sequence base as described in the main text. This er-159 ror rate is intended to model several sources of errors, including sequencing error, post-mortem 160 damage and mismappings. Therefore, the ϵ g inferred from sites adjacent to the polymorphic 161 sites increase with increasing levels of aDNA damage. Indeed, as shown in Fig.S10,S11 and 162 S12, the estimated ϵ g increases as the damage level increases. We note that in general the av-163 erage ϵ g estimated at low coverage is the same as the average ϵ g estimated at higher coverage. 164 Specifically, we use a short red horizontal line to indicate the ϵ g averaged over 100 replicates at 165 5x coverage for each of the damage levels, and at low coverage we observe that the blue dots 166 center around the horizontal red bar, but with higher variance at low coverages as expected.

167
Comparing the estimated ϵ g at different contamination rates for each of the damage levels, we 168 observe that it remains stable across contamination rates (see Table.S1), indicating that our as-169 sumption that the sites adjacent to the polymorphic sites are fixed (and thus also not altered by    these samples contain highly divergent genetic ancestries (Fig.S15, S16, S17).

212
In the main text we tested hapCon on Mota, an ancient African genome from present-day 213 Ethiopia that predates the Eurasian genetic backflow. We observed a small amount of over-    To provide further support (aside from Mota) that our method can work for African aDNA . 242 We used I10871 as the endogenous source and B French-3 from SGDP as the contamination 243 source, and we visualized the results in Fig.S18. We observe that hapCon's performance on  We observe that, for the same contamination level and sample, contamination estimates with 252 hapCon tend to increase for higher coverage. Here we explore whether this phenomenon lead 253 to substantial biases in the high coverage regimes and whether it would affect the qualitative 254 assessment of a sample being highly contaminated or not. 255 We used Ust Ishim as the endogenous source and B French-3 from SGDP as the contaminant 256 sources. We simulated contamination rate ranging from 0% to 25% in steps of 5% at coverages 257 1x, 2x, 5x, 10x, 15x. As throughout, coverages always refer to average coverage on male chrX. 258 We note that the full data of UstIshim on chrX is 18x (after quality filtering), thus the replicates 259 are not fully independent anymore in the high coverage regime. 260 We applied hapCon with both the 1240k and 1000G panel on these simulated data, and vi-261 sualized the results in Fig.S19,S20,S21. We observe that both hapCon (Fig.S19,S20) and ANGSD 262 (Fig.S21) display some degree of varying biases associated with coverage. In most cases the 263 biases decrease as the coverages increases, but also generally do not seem to converge to 0,   Average Coverage on chrX Estimated Contamination Figure S19: Performance of hapCon with 1240k Panel at High Coverage. Note the changing Y axis scale (adapted to the range of estimates). The red horizontal line represents the simulated contamination rate plus the ANGSD estimate of background contamination in the Ust Ishim BAM file (0.675%, 95% CI: 0.637%-0.713%). We also visualized trend of estimated contamination as a function of coverage by connecting the dots that represent the mean value of estimated contamination (averaged across 100 replicates).  Endogenous Source Figure S22: Zoom-in for Fig.2 in main text This is a zoom-in into simulated contamination in the range of 0%-10% for the Fig.2 in the main text.    Figure S25: Comparing hapCon on 1000G Panel with Different MAF Cutoff at Various Contamination Level. We performed mixed BAM simulation (use Loschbour as the endogenous source and B French-3 as the contaminant source) at coverage 0.1x as described in the main article. We compared hapCon's performance on the 1000G panel with varying minor allele frequency cutoffs. S _ H u n g a r ia n -2 ( 2 3 .9 7 % ) S _ G e o r g ia n -2 ( 2 4 .9 1 % ) S _ S p a n is h -1 ( 2 4 .9 3 % ) S _ K o r e a n -1 ( 2 5 .7 8 % ) S _ H u n g a r ia n -2 ( 2 3 .9 7 % ) S _ G e o r g ia n -2 ( 2 4 .9 1 % ) S _ S p a n is h -1 ( 2 4 .9 3 % ) S _ K o r e a n -1 ( 2 5 .7 8 % ) Simulated Coverage on chrX: 0.1x S _ S a r d in ia n -1 ( 2 2 .4 6 % ) S _ F r e n c h -1 ( 2 3 .4 2 % ) S _ H u n g a r ia n -2 ( 2 3 .9 7 % ) S _ G e o r g ia n -2 ( 2 4 .9 1 % ) S _ S p a n is h -1 ( 2 4 .9 3 % ) S _ K o r e a n -1 ( 2 5 .7 8 % ) Simulated Coverage on chrX: 0.05x Contamination Source Estimated Contamination Figure S29: Effects of Genetic Similarity between the Endogenous and Contaminant Source on Contamination Estimation for Simulated 0% Contamination We used B French-3 as the endogenous source and used as contamination source S Sardinian-1, S French-1, S Hungarian-2, S Georgian-2, S Spanish-1, S Korean-1, which are indicated on the x-axis. The percentage number in the parenthesis of x-labels are genetic distances between the endogenous and contaminant sources, calculated as described in the main article. We mixed the BAM files of the endogenous and contaminant source and downsampled to desired genome-wide coverage. We then analyzed the mixed BAM files using hapCon with 1240k panel. For S Korean-1 we used CHB allele frequency as a proxy, and for all the others we used CEU allele frequency. This figure visualized the results for simulated 0% contamination. S _ H u n g a r ia n -2 ( 2 3 .9 7 % ) S _ G e o r g ia n -2 ( 2 4 .9 1 % ) S _ S p a n is h -1 ( 2 4 .9 3 % ) S _ K o r e a n -1 ( 2 5 .7 8 % ) Simulated Coverage on chrX: 0.1x S _ S a r d in ia n -1 ( 2 2 .4 6 % ) S _ F r e n c h -1 ( 2 3 .4 2 % ) S _ H u n g a r ia n -2 ( 2 3 .9 7 % ) S _ G e o r g ia n -2 ( 2 4 .9 1 % ) S _ S p a n is h -1 ( 2 4 .9 3 % ) S _ K o r e a n -1 ( 2 5 .7 8 % ) Simulated Coverage on chrX: 0.05x Figure S30: Effects of Genetic Similarity between the Endogenous and Contaminant Source for Simulated 5% Contamination Same as Fig.S29, but with 5% simulated contamination. Simulated Coverage on chrX: 1x S _ S a r d in ia n -1 ( 2 2 .4 6 % ) S _ F r e n c h -1 ( 2 3 .4 2 % ) S _ H u n g a r ia n -2 ( 2 3 .9 7 % ) S _ G e o r g ia n -2 ( 2 4 .9 1 % ) S _ S p a n is h -1 ( 2 4 .9 3 % ) S _ K o r e a n -1 ( 2 5 .7 8 % ) Simulated Coverage on chrX: 0.5x S _ S a r d in ia n -1 ( 2 2 .4 6 % ) S _ F r e n c h -1 ( 2 3 .4 2 % ) S _ H u n g a r ia n -2 ( 2 3 .9 7 % ) S _ G e o r g ia n -2 ( 2 4 .9 1 % ) S _ S p a n is h -1 ( 2 4 .9 3 % ) S _ K o r e a n -1 ( 2 5 .7 8 % )

Contamination Source Estimated Contamination
Simulated Coverage on chrX: 0.1x S _ S a r d in ia n -1 ( 2 2 .4 6 % ) S _ F r e n c h -1 ( 2 3 .4 2 % ) S _ H u n g a r ia n -2 ( 2 3 .9 7 % ) S _ G e o r g ia n -2 ( 2 4 .9 1 % ) S _ S p a n is h -1 ( 2 4 .9 3 % ) S _ K o r e a n -1 ( 2 5 .7 8 % ) Simulated Coverage on chrX: 0.05x   Figure S34: Visualizing long IBD on chrX between Two Karitiana Samples We downsampled the BAM files of B Karitiana-3 and S Karitiana-1 both to 15x and merged them together. As the two Karitiana samples are both males, this results in a synthetic diploid X chromosome. We then applied hapROH [Ringbauer et al., 2021] to detect ROH blocks in this synthetic diploid X chromosome, which is equivalent to IBD between the two haploid X chromosome of the two male Karitiana samples. We found a 19.98cM(163.79cM-183.77cM) long IBD block and visualized it here. The brown curve depicts the posterior probability of being in non-IBD state at each of the 1240 target SNPs, and the blue dots at the bottom depicts the marker density along the chromosome. The blue dots at the top represent potentially heterozygote sites. Each site that have at least one sequence supporting both the reference and alternative alleles is represented by a blue dot at the top. Due to several sources of errors (e.g, sequencing error, mismappings), there are several apparent "heterozygous" sites in the IBD region; however, such "heterozygous" sites in IBD region are much more sparse than that in the non-IBD region. The horizontal blue bar at the very top of the figure is the inferred IBD block.  We performed the simulation using the default setting as described in Section 2.1 except that we used the CEU as the contamination source (rather than the global allele frequencies). Panels a-c show the results for no simulated contamination and panels d-f for simulated 10% contamination. We explored several different settings for inference by removing divergent haplotypes (AFR) from the reference panel and by using different allele frequencies as the proxy of the contamination source (CEU vs. OOA, where OOA denotes the allele frequencies of all populations in the 1000Genome except for AFR). Settings are indicated in the upper right corner of each subpanel. Simulated Contamination Rate: 0% Figure S36: Misspecified Contamination Ancestry in Mixed BAM Simulation with 0% Simulated Contamination. We used the same mixed BAM simulation as described in section "Simulated whole genome sequencing data" in the main article (using Loshcbour as the endogenous source and B French-3 as the contaminant source). For each coverage, we used the 1240k reference panel with CEU, FIN, GBR, IBS, TSI, YRI, CHB, PEL as the contamination ancestry to test the robustness of our method with respect to mis-specified contamination ancestry.   Figure S38: Comparing runtime of hapCon and ANGSD. We measured runtime of hapCon and ANGSD on BAM files of individual I1496(5211-4958 calBCE, Hungary, obtained from Allen Genome Diversity Project), down-sampled to eight target coverages. For hapCon, we used two different reference panels (1240k and 1000G panel). Each point represents the runtime averaged over 10 independent runs.