Abstract

Summary

Leveraging local ancestry and haplotype information in genome-wide association studies and downstream analyses can improve the utility of genomics for individuals from diverse and recently admixed ancestries. However, most existing frameworks for simulation, visualization and variant analysis operate at the level of individual variants and do not automatically handle these features. We present haptools, an open-source toolkit for performing local ancestry-aware and haplotype-based analysis of complex traits. Haptools supports fast simulation of admixed genomes, visualization of admixture tracks, simulation of haplotype- and local ancestry-specific phenotype effects, and a variety of file operations and statistics computed in a haplotype-aware manner.

Availability and implementation

Haptools is freely available at https://github.com/cast-genomics/haptools.

Documentation

Detailed documentation is available at https://haptools.readthedocs.io.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Existing frameworks for complex trait analysis are typically based on variant-level analysis. However, phenotypic effects may also be mediated by haplotypes, i.e. combinations of variants on the same chromosome (Corder et al., 1993; Williams et al., 2014), or by the local ancestry background on which a variant falls (Atkinson et al., 2021; Naslavsky et al., 2022). Incorporating these effects may improve the utility of genomic information for diverse and recently admixed individuals, but current tools have limited support for including these features. Here, we present haptools, an open-source toolkit for facilitating local ancestry-aware and haplotype-based analysis of complex traits. Haptools supports fast simulation of admixed genomes, visualization of admixture tracks, simulation of haplotype- and local ancestry-specific phenotype effects, and computation of a variety of common file operations and statistics in a haplotype-aware manner. Overall, haptools provides a valuable set of utilities for developing and benchmarking methods for ancestry-aware analysis of complex traits.

2 Features and methods

Haptools consists of a suite of command-line utilities and a corresponding Python library for performing simulations and common file operations on haplotypes, local ancestry labels and individual variants (Supplementary Figs S1 and S2, Table 1). Haptools reads and writes standard file formats, including VCF, PLINK and the newer PGEN format, which greatly improves computational performance (Supplementary Fig. S3). In the following sections, we summarize the current core functionality available in haptools.

Table 1.

Summary of current haptools utilities

Command        Description
simgenotype    Simulate admixed genomes
karyogram      Generate chromosome paintings for admixed individuals
simphenotype   Simulate phenotypes for complex traits with variant-, haplotype- or local ancestry-specific effects
transform      Obtain a VCF of pseudo-genotypes from a set of haplotypes
ld             Compute linkage disequilibrium between haplotypes (or genotypes) and a specific target haplotype
index          Sort, compress and index .hap files

2.1 .hap file format

Haptools implements a custom file format (*.hap) for flexible representation of haplotype-level and other information. Each file consists of a collection of haplotypes. A haplotype is defined by a set of one or more variants, together with their alleles, that tend to be inherited together on an individual chromosome, and optionally by a local ancestry label (Supplementary Fig. S2). Unlike previous haplotype representations, the format is compatible with tabix (Li, 2011) and can be easily sorted and queried at the variant or haplotype level. Details and additional motivation for the .hap format are given in the Supplementary Methods.
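To make the haplotype definition above concrete, the following is a minimal in-memory sketch of a haplotype as a set of variant alleles with an optional local ancestry label, together with a simple region query. It is not the on-disk .hap layout (which is described in the Supplementary Methods) and it does not use the haptools Python library; all class, field and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VariantAllele:
    """One variant and the allele it must carry (hypothetical structure)."""
    variant_id: str
    position: int
    allele: str


@dataclass
class Haplotype:
    """A set of variant alleles plus an optional local ancestry label."""
    hap_id: str
    chrom: str
    variants: List[VariantAllele]
    ancestry: Optional[str] = None  # None means no ancestry restriction

    @property
    def start(self) -> int:
        return min(v.position for v in self.variants)

    @property
    def end(self) -> int:
        return max(v.position for v in self.variants)


def query_region(haplotypes: List[Haplotype], chrom: str, start: int, end: int) -> List[Haplotype]:
    """Return haplotypes overlapping [start, end] on `chrom`, sorted by position."""
    hits = [h for h in haplotypes if h.chrom == chrom and h.start <= end and h.end >= start]
    return sorted(hits, key=lambda h: (h.start, h.end))


# Hypothetical toy haplotypes; IDs and coordinates are made up for illustration.
haplotypes = [
    Haplotype("hap2", "1", [VariantAllele("snpC", 300, "G"), VariantAllele("snpD", 400, "C")]),
    Haplotype("hap1", "1", [VariantAllele("snpA", 100, "A"), VariantAllele("snpB", 200, "T")], ancestry="AFR"),
    Haplotype("hap3", "2", [VariantAllele("snpE", 150, "T")]),
]
overlapping = [h.hap_id for h in query_region(haplotypes, "1", 200, 300)]
```

Sorting by coordinate before querying mirrors, in spirit, why a position-sorted file can be indexed and queried by region.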

2.2 Haptools simgenotype

The simgenotype utility simulates random mating between individuals of ancestral populations under a user-specified population history model, which defines admixture proportions and the number of generations of admixture. It outputs haplotype breakpoints and genotypes of simulated admixed individuals in VCF or PGEN format. simgenotype is adapted from admix-simu (Williams, 2016) with minor modifications to improve run time (Supplementary Material). We benchmarked simgenotype against admix-simu and AdmixSim2 (Zhang et al., 2021) (Supplementary Fig. S4). Although AdmixSim2 has the fastest simulation step, both AdmixSim2 and admix-simu take longer overall because genotypes must first be preprocessed into a custom input format. By contrast, simgenotype requires no additional preprocessing and simulates directly from the file formats (VCF and PGEN) used by large existing datasets such as the 1000 Genomes Project (Auton et al., 2015).
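The idea of simulating random mating and recording haplotype breakpoints can be illustrated with a small forward simulation. The sketch below is not the simgenotype or admix-simu implementation. It assumes a single admixture pulse, one chromosome, a uniform recombination rate of one expected crossover per 100 cM, no crossover interference and a constant population size. All function names and the population labels are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Segment = Tuple[float, str]      # (end position in cM, ancestry label)
Haplotype = List[Segment]        # segments sorted by end position
Individual = Tuple[Haplotype, Haplotype]


def slice_haplotype(hap: Haplotype, start: float, stop: float) -> Haplotype:
    """Return the segments of `hap` overlapping [start, stop), clipped at `stop`."""
    pieces, prev_end = [], 0.0
    for end, ancestry in hap:
        if end > start and prev_end < stop:
            pieces.append((min(end, stop), ancestry))
        prev_end = end
    return pieces


def merge_adjacent(segments: Haplotype) -> Haplotype:
    """Merge neighbouring segments that share the same ancestry label."""
    merged: Haplotype = []
    for end, ancestry in segments:
        if merged and merged[-1][1] == ancestry:
            merged[-1] = (end, ancestry)
        else:
            merged.append((end, ancestry))
    return merged


def make_gamete(parent: Individual, chrom_len: float, rng: random.Random) -> Haplotype:
    """Recombine the two haplotypes of `parent` into one transmitted haplotype."""
    crossovers, pos = [], rng.expovariate(1 / 100.0)  # mean 100 cM between crossovers
    while pos < chrom_len:
        crossovers.append(pos)
        pos += rng.expovariate(1 / 100.0)
    copy, start, gamete = rng.randrange(2), 0.0, []
    for stop in crossovers + [chrom_len]:
        gamete.extend(slice_haplotype(parent[copy], start, stop))
        start, copy = stop, 1 - copy  # switch to the other parental copy
    return merge_adjacent(gamete)


def simulate_admixture(proportions: Dict[str, float], generations: int,
                       n_individuals: int, chrom_len: float = 100.0,
                       seed: int = 0) -> List[Individual]:
    """Simulate `generations` rounds of random mating after one admixture pulse."""
    rng = random.Random(seed)
    labels = list(proportions)
    weights = [proportions[label] for label in labels]
    population: List[Individual] = []
    for _ in range(n_individuals):  # generation 0: unadmixed founders
        ancestry = rng.choices(labels, weights=weights)[0]
        population.append(([(chrom_len, ancestry)], [(chrom_len, ancestry)]))
    for _ in range(generations):
        offspring = []
        for _ in range(n_individuals):
            mother, father = rng.sample(population, 2)
            offspring.append((make_gamete(mother, chrom_len, rng),
                              make_gamete(father, chrom_len, rng)))
        population = offspring
    return population


# Hypothetical model: 80% / 20% admixture proportions, 8 generations of mating.
admixed = simulate_admixture({"AFR": 0.8, "EUR": 0.2}, generations=8,
                             n_individuals=20, chrom_len=150.0, seed=1)
print(admixed[0][0])  # breakpoints of the first haplotype of the first individual
```

Each haplotype is stored as a list of (end position, ancestry) pairs, which is one simple way to represent the breakpoints that a karyogram would later paint.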

2.3 Haptools karyogram

karyogram takes breakpoints generated by simgenotype as input and generates a karyogram to visualize chromosome segments. It is adapted from an existing script (Martin, 2017). Example karyograms for individuals simulated under demographic models for admixed populations in the Americas are shown in Figure 1a and Supplementary Figure S5.

Fig. 1.

Example analyses performed using haptools. (a) An example karyogram depicting local ancestry tracts simulated by the simgenotype command. (b) Manhattan plot showing association summary statistics (−log10 P-values) for a trait with a single SNP (circled) simulated to be causal only when it occurs on an African haplotype. The SNP (rs12740374) is highly significant in simulated African but not European individuals. It has an intermediate P-value in a sample of simulated admixed individuals. (c) Manhattan plot showing association summary statistics for a trait simulated with either two causal SNPs (rs36046716 and rs1046282; left) or a single causal haplotype (composed of alleles from the two SNPs; right). Red = SNP-level P-values and orange = haplotype-level P-values for the variants of interest. When the haplotype is causal (right), it has a more significant P-value than the SNPs it is composed of. This large effect could be missed by variant-level association tests. Detailed methods underlying results shown in the figures are in the Supplementary Material.

2.4 Haptools transform

transform takes a set of haplotypes in .hap format and produces a VCF file of pseudo-genotypes, in which each haplotype is encoded as a bi-allelic variant record. This operation can facilitate downstream tasks that require variant-level information as input. For example, to perform association testing based on haplotypes, one could first use transform to generate a VCF encoding haplotypes as bi-allelic variants, and then use a standard framework such as PLINK (Chang et al., 2015) to perform association tests.
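The pseudo-genotype encoding can be sketched in a few lines. The example below assumes that a chromosome copy is coded 1 when it carries the required allele at every variant of the haplotype, and 0 otherwise. That rule is an assumption made for illustration; the function name and toy data are hypothetical, and the code does not read or write VCF or .hap files.

```python
from typing import Dict, Tuple

# Phased calls: sample -> variant ID -> (allele on copy 0, allele on copy 1).
PhasedGenotypes = Dict[str, Dict[str, Tuple[str, str]]]


def haplotype_pseudo_genotypes(
    genotypes: PhasedGenotypes,
    haplotype: Dict[str, str],
) -> Dict[str, Tuple[int, int]]:
    """Encode one haplotype as a bi-allelic pseudo-variant for every sample.

    A chromosome copy is coded 1 only if it matches the haplotype's allele
    at every variant in the haplotype (assumed rule); otherwise it is coded 0.
    """
    encoded = {}
    for sample, calls in genotypes.items():
        copies = []
        for copy in (0, 1):
            carries = all(calls[var][copy] == allele for var, allele in haplotype.items())
            copies.append(int(carries))
        encoded[sample] = (copies[0], copies[1])
    return encoded


# Hypothetical toy data: two SNPs and three phased samples.
genotypes = {
    "sample1": {"snpA": ("A", "G"), "snpB": ("T", "T")},
    "sample2": {"snpA": ("A", "A"), "snpB": ("T", "C")},
    "sample3": {"snpA": ("G", "G"), "snpB": ("C", "T")},
}
haplotype = {"snpA": "A", "snpB": "T"}  # alleles that define the haplotype

pseudo = haplotype_pseudo_genotypes(genotypes, haplotype)
# Summing the two copies gives a 0/1/2 dosage usable by variant-level tools.
dosages = {sample: sum(copies) for sample, copies in pseudo.items()}
```

Note that sample3 carries both alleles of the haplotype but on different chromosome copies, so it is coded (0, 0). This is why phased input matters for a haplotype-level encoding.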

2.5 Haptools simphenotype

The simphenotype utility simulates phenotypes for complex traits with variant-, haplotype- or local ancestry-specific effects. Causal haplotypes are loaded from a VCF file output by transform and their effect sizes are specified in a .hap file. By default, simphenotype simulates quantitative traits. Users may instead simulate case–control traits by providing a disease prevalence; individuals whose trait values fall in the top fraction corresponding to that prevalence are labeled as cases. We evaluated simphenotype by simulating quantitative traits under various scenarios, including individual causal variants (Supplementary Fig. S6), local ancestry effects (Fig. 1b, Supplementary Fig. S7), and haplotype-level effects (Fig. 1c).
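The two trait types described above can be illustrated with a minimal sketch. It assumes a purely additive model with Gaussian noise for the quantitative trait, and converts it to a case–control trait by labeling the top fraction of trait values, given by the prevalence, as cases. The function names, effect sizes and toy dosages are hypothetical, and simphenotype itself may scale effects and noise differently.

```python
import random
from typing import List, Sequence


def simulate_quantitative_trait(
    dosages: Sequence[Sequence[float]],
    betas: Sequence[float],
    noise_sd: float = 1.0,
    seed: int = 0,
) -> List[float]:
    """Additive model: trait_i = sum_j(beta_j * dosage_ij) + Gaussian noise."""
    rng = random.Random(seed)
    traits = []
    for row in dosages:
        genetic = sum(beta * x for beta, x in zip(betas, row))
        traits.append(genetic + rng.gauss(0.0, noise_sd))
    return traits


def label_cases(traits: Sequence[float], prevalence: float) -> List[bool]:
    """Mark the individuals with the highest trait values as cases."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    n_cases = round(prevalence * len(traits))
    # Rank individuals from highest to lowest trait value.
    ranked = sorted(range(len(traits)), key=lambda i: traits[i], reverse=True)
    cases = set(ranked[:n_cases])
    return [i in cases for i in range(len(traits))]


# Hypothetical toy input: 6 individuals with dosages (0, 1 or 2) at one causal
# haplotype and one causal SNP, and assumed effect sizes.
dosages = [[0, 1], [1, 0], [2, 2], [0, 0], [1, 2], [2, 1]]
betas = [0.8, -0.3]
traits = simulate_quantitative_trait(dosages, betas, noise_sd=0.5, seed=42)
is_case = label_cases(traits, prevalence=0.5)
```

Here the haplotype dosage would come from pseudo-genotypes such as those produced by transform, so haplotype-level and SNP-level effects enter the model in the same way.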

3 Discussion

Accounting for ancestry and other more complex effects is becoming increasingly critical in association testing and downstream analysis pipelines. Haptools helps fulfill an unmet need to handle ancestry- and haplotype-level information in a standardized, computationally efficient way that is compatible with existing file formats and workflows. It makes it easy to perform tasks such as genotype and phenotype simulation, local ancestry visualization and power analyses, which previously have been done primarily using a variety of custom scripts. Overall, haptools will help enable more systematic incorporation of ancestry and haplotype-level features in future workflows.

Data availability

The datasets used for validation and example generation are available from the haptools documentation page: https://haptools.readthedocs.io/en/stable/project_info/example_files.html, and the haptools-paper repository, https://github.com/CAST-genomics/haptools-paper.

Acknowledgements

We thank Jonathan Margoliash for helpful discussions and Tara Mirmira and Wilfredo Gonzalez-Rivera for feedback on haptools utilities. We also acknowledge Bogdan Pasaniuc and Noah Zaitlen, who helped design the original admixture model file format.

Funding

This work was supported in part by the National Institutes of Health [1RM1HG011558 to M.G. and R35GM133805 to A.L.W.]. A.R.M. was also supported by the National Science Foundation Graduate Research Fellowship [DGE-2038238] and grants from the National Institutes of Health [T32GM008806 and T32GM139790].

Conflict of Interest: A.L.W. is an employee of and holds stock in 23andMe, Inc. and is the owner of HAPI-DNA LLC.

References

Atkinson E.G. et al. (2021) Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat. Genet., 53, 195–204.

Auton A. et al. (2015) A global reference for human genetic variation. Nature, 526, 68–74.

Chang C.C. et al. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4, 7.

Corder E.H. et al. (1993) Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer’s disease in late onset families. Science, 261, 921–923.

Li H. (2011) Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics, 27, 718–719.

Martin A. (2017) Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet., 100, 635–649.

Naslavsky M.S. et al. (2022) Global and local ancestry modulate APOE association with Alzheimer’s neuropathology and cognitive outcomes in an admixed sample. Mol. Psychiatry, 27, 4800–4808.

Williams A. (2016) admix-simu: program to simulate admixture between multiple populations. https://doi.org/10.5281/zenodo.45517.

Williams A.L. et al. (2014) Sequence variants in SLC16A11 are a common risk factor for type 2 diabetes in Mexico. Nature, 506, 97–101.

Zhang R. et al. (2021) AdmixSim 2: a forward-time simulator for modeling complex population admixture. BMC Bioinformatics, 22, 506.

Author notes

The authors wish it to be known that, in their opinion, Arya R Massarat and Michael Lamkin should be regarded as Joint First Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Associate Editor: Russell Schwartz
