Gerald van Eeden, Caitlin Uren, Marlo Möller, Brenna M Henn, Inferring recombination patterns in African populations, Human Molecular Genetics, Volume 30, Issue R1, 1 March 2021, Pages R11–R16, https://doi.org/10.1093/hmg/ddab020
Abstract
Although several high-resolution recombination maps exist for European-descent populations, the recombination landscape of African populations remains relatively understudied. Given that there is high genetic divergence among groups in Africa, it is possible that recombination hotspots also diverge significantly. Both limitations and opportunities exist for developing recombination maps for these populations. In this review, we discuss various recombination inference methods, and the strengths and weaknesses of these methods in analyzing recombination in African-descent populations. Furthermore, we provide a decision tree and recommendations for which inference method to use in various research contexts. Establishing an appropriate methodology for recombination rate inference in a particular study will improve the accuracy of various downstream analyses including but not limited to local ancestry inference, haplotype phasing, fine-mapping of GWAS loci and genome assemblies.
Introduction
Genetic recombination is defined as the exchange and rearrangement of genetic material between successive generations. Homologous recombination during meiosis arises when this exchange is between homologous pairs of chromosomes and is initiated by the induction of double-strand breaks (DSBs). When these DSBs are repaired, only a few result in the exchange of genetic material, termed crossover events (1). Crossover events result in a contrasting combination of genotypes in gametes that is passed on to the next generation.
Not all sections of the chromosome are equally likely to contain DSBs or a resulting recombination event. This is governed by numerous factors including sex, age, autosomal versus sex chromosomes, proximity to the telomeres or centromeres, various regulatory enzymes etc., while levels of identity-by-descent, linkage disequilibrium and varying degrees of admixture impact our ability to accurately measure the recombination rate (2–7).
We are particularly interested in the factors that influence our ability to accurately measure not only the rate of recombination but the location and extent of recombination hotspots. This is noteworthy given the vast genetic diversity of the recombination hotspot regulatory protein PRDM9 in African populations (8), the difference in hotspot diversity between European and African populations (9) and both recent and ancient admixture events extending to, in some circumstances, populations with 5-way admixture (10). Therefore, care should be taken when selecting one of the various recombination inference methods that have been developed in recent years.
In this review, we therefore discuss the principles underlying a variety of recent methods to infer recombination in human populations and suggest which are most appropriate (Fig. 1 and Table 1) in light of the genetic diversity and levels of admixture in African populations.

Figure 1. A decision tree for recombination inference method selection. The figure is an extreme oversimplification and serves only as a starting point. Use the series of questions to find the recombination rate inference method that would be the most likely fit for a given use case. The questions should not necessarily exclude any method, but serve as a guide. For many use cases there will be more than one appropriate method.
Recombination Inference Methods
Gamete-based inference
Gamete-based inference uses the phased genetic information of a donor and the genetic information derived from the donor’s gametes to infer crossover events. A crossover is said to have occurred if there is a shift in phase between haplotypes. However, a phase shift caused by gene conversion is not considered a true crossover (11). True crossovers are counted to calculate the recombination fraction, which can then be converted to genetic distances. Peñalba and Wolf (12) provide a detailed explanation of how this is done, including the use of the three main mapping functions employed to calculate additive measures of genetic distance. The authors also provide a well-written review of the factors that affect recombination rate variation.
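The conversion from a recombination fraction to an additive genetic distance can be illustrated with a short sketch. It assumes that the three mapping functions in question are those of Morgan, Haldane and Kosambi, which are the ones most commonly cited; the function names and the example counts below are illustrative only and are not taken from reference (12).

```python
import math


def recombination_fraction(n_recombinant: int, n_gametes: int) -> float:
    """Proportion of typed gametes showing a crossover between two markers."""
    if n_gametes <= 0:
        raise ValueError("n_gametes must be positive")
    return n_recombinant / n_gametes


def _check(r: float) -> None:
    # Haldane and Kosambi are only defined for 0 <= r < 0.5.
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")


def morgan_distance(r: float) -> float:
    """Morgan: distance (in Morgans) equals r; ignores multiple crossovers."""
    _check(r)
    return r


def haldane_distance(r: float) -> float:
    """Haldane: assumes crossovers occur independently (no interference)."""
    _check(r)
    return -0.5 * math.log(1.0 - 2.0 * r)


def kosambi_distance(r: float) -> float:
    """Kosambi: allows for a moderate degree of crossover interference."""
    _check(r)
    return 0.25 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


if __name__ == "__main__":
    # Hypothetical counts: 10 recombinant gametes out of 100 typed.
    r = recombination_fraction(n_recombinant=10, n_gametes=100)
    for name, fn in [("Morgan", morgan_distance),
                     ("Haldane", haldane_distance),
                     ("Kosambi", kosambi_distance)]:
        print(f"{name}: {100.0 * fn(r):.2f} cM")
```

For small recombination fractions the three functions agree closely; they diverge as r approaches 0.5, where the treatment of multiple crossovers and interference starts to matter.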
Gamete-based inference is a useful method to infer recombination at a high resolution and was used in early recombination hotspot studies (13). However, it has certain limitations. In humans, gamete-based inference in practice means sperm typing, because a large number of gametes is needed to produce adequate results (14). Thus, recombination can only feasibly be studied in males. Due to the relatively high level of heterochiasmy in humans (14), this might not be sufficient to study recombination at a fine scale across populations. It is also expensive to produce fine-scale recombination maps using this method (15). Most studies will therefore be limited to specific regions of the genome (16). Despite these limitations, gamete-based inference remains a valuable means of studying chromosomal abnormalities caused by abnormal recombination (17) and the evolutionary history of a genomic region (18).
Method | Pros | Cons |
---|---|---|
Gamete-based | Sex-specific (14); Produces high-resolution maps (14); Unaffected by past demographic changes (12); Can be used to study chromosomal abnormalities (17) | Not always feasible for both sexes (14); Genome-wide inference costly (15) |
Pedigree-based | Sex-specific (27); Unaffected by past demographic changes (12) | Requires pedigrees (29); Large sample size needed (19); Limited to very recent recombination (12) |
LAI-based | Produces high-resolution maps with a few thousand samples (43) | Limited to populations with admixture (29); Limited to recent recombination (29); Dependent on local ancestry inference (29) |
IBD-based | Produces high-resolution maps with a few thousand samples (29); Not limited by admixture (29) | Limited to recent recombination (29); Dependent on IBD estimates (29) |
LD-based | Does not require a large sample size (30); Produces high-resolution maps (16) | Computationally expensive (30); Produces time-averaged estimates (29); Biased by demographic changes (29,30) |
Regression-based | Does not require a large sample size (30); Computationally fast (30) | Produces time-averaged estimates (29); Biased by demographic changes (29,30) |
Demography-aware | Does not require a large sample size (43); Produces high-resolution maps (43); Accounts for demographic changes (43) | Produces time-averaged estimates (29); Requires knowledge of a population’s demographic history (43) |
Pedigree-based inference
Pedigree-based inference relies on having information about parent–offspring (PO) pairs within the data to detect recombination events between successive generations (19). Since recombination events are inferred from phase transitions in the offspring, inaccurate phase data can lead to incorrect inference of recombination events. Accurately determining the phase of an individual is therefore important.
There are various methods used for phasing, but the most common methods use hidden Markov models (HMM) (20–24). Some software implementations, like MaCH (25), choose a random subset of haplotypes to condition upon, whereas Impute2 (26) and SHAPEIT2 (21) select the most similar haplotypes to the region of the sample being considered. This makes both Impute2 and SHAPEIT2 ideal candidates for phasing in admixed individuals, because the subset of haplotypes chosen would most likely contain representative haplotypes from all the ancestries present in the admixed sample being considered. SHAPEIT2 also has a secondary algorithm, duoHMM (21), that uses pedigree information to correct switch errors after phasing. Conveniently, duoHMM’s output is a detailed log of regions in which recombination events occurred for each PO pair followed by the probability of the event having occurred in that region. After applying various filtering steps, the data can then be used to calculate the recombination fraction (the proportion of haplotypes for which a recombination event is inferred at a given locus) which can then be converted to genetic distances and normalized using the relevant mapping function.
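The filtering and counting step described above can be sketched as follows. The record layout (`Event`), the probability threshold, the interval-width filter and the binning by interval midpoint are all assumptions made for illustration; they do not reproduce duoHMM's actual output format or any published filtering protocol.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Optional


class Event(NamedTuple):
    """One inferred recombination event (hypothetical record layout)."""
    meiosis_id: str   # identifier of the parent-offspring transmission
    start: int        # left bound of the interval containing the event (bp)
    end: int          # right bound of the interval containing the event (bp)
    prob: float       # probability that an event occurred in this interval


def recombination_fraction_per_bin(
    events: List[Event],
    n_meioses: int,
    bin_size: int,
    min_prob: float = 0.5,
    max_width: Optional[int] = None,
) -> Dict[int, float]:
    """Fraction of meioses with an inferred event in each genomic bin.

    Events are dropped when their probability is below `min_prob` or when the
    interval is wider than `max_width`; each remaining event is assigned to the
    bin that contains the midpoint of its interval.
    """
    if n_meioses <= 0 or bin_size <= 0:
        raise ValueError("n_meioses and bin_size must be positive")
    counts: Counter = Counter()
    for ev in events:
        if ev.prob < min_prob:
            continue
        if max_width is not None and ev.end - ev.start > max_width:
            continue
        midpoint = (ev.start + ev.end) // 2
        counts[midpoint // bin_size] += 1
    return {b: c / n_meioses for b, c in sorted(counts.items())}


# Hypothetical events from 100 informative meioses, binned at 1 Mb.
example_events = [
    Event("m1", 100_000, 120_000, 0.90),
    Event("m2", 1_400_000, 1_450_000, 0.80),
    Event("m3", 1_600_000, 1_610_000, 0.30),   # removed by the probability filter
    Event("m4", 1_900_000, 1_950_000, 0.95),
]
fractions = recombination_fraction_per_bin(example_events, n_meioses=100,
                                           bin_size=1_000_000)
print(fractions)
```

The per-bin fractions would then be passed through a mapping function to obtain genetic distances.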
Pedigree-based inference is well suited to inferring recombination in populations with complex ancestries, because it is not affected by different patterns of LD within each ancestry (see below) (12). It is also a valuable method for inferring sex-specific recombination rates (27). Since, on average, only 26.2 recombination events in males and 39.6 recombination events in females occur between successive generations (28), a very large sample size and high-density SNP data are necessary to generate high-resolution recombination maps using this method (19,29).
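A back-of-the-envelope calculation shows why the sample size matters. The sketch below uses the per-meiosis averages quoted above and an assumed genome length of roughly 2.9 billion base pairs (an approximation introduced here, not a figure from the cited studies) to estimate the average spacing between observed events. Because crossovers cluster in hotspots, this is only a rough guide to achievable resolution.

```python
MALE_EVENTS_PER_MEIOSIS = 26.2     # average quoted in the text (28)
FEMALE_EVENTS_PER_MEIOSIS = 39.6   # average quoted in the text (28)
GENOME_LENGTH_BP = 2.9e9           # assumption: approximate genome length


def mean_event_spacing_bp(n_meioses: int,
                          events_per_meiosis: float,
                          genome_length_bp: float = GENOME_LENGTH_BP) -> float:
    """Average distance between observed crossovers pooled over all meioses."""
    if n_meioses <= 0 or events_per_meiosis <= 0:
        raise ValueError("n_meioses and events_per_meiosis must be positive")
    total_events = n_meioses * events_per_meiosis
    return genome_length_bp / total_events


if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 100_000):
        male = mean_event_spacing_bp(n, MALE_EVENTS_PER_MEIOSIS)
        female = mean_event_spacing_bp(n, FEMALE_EVENTS_PER_MEIOSIS)
        print(f"{n:>7} meioses: male ~{male / 1e3:,.1f} kb, "
              f"female ~{female / 1e3:,.1f} kb between events")
```

Under these assumptions, the pooled events from a hundred paternal meioses are spaced about a megabase apart on average, which is far coarser than the kilobase scale of hotspots.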
Linkage-disequilibrium-based inference
Linkage-disequilibrium-based (LD-based) methods use patterns of linkage disequilibrium in polymorphism data to detect historical recombination events that stretch into the distant past (30). Many techniques that utilize this approach have been developed over the years (31–34); however, LDhat (16) is by far the most widely used, and many publicly available maps, including the 1000 Genomes Project maps (35), were made using it. LD-based methods generally provide an estimate of the population recombination rate (ρ), which is the recombination rate per base pair per generation (r) scaled by the effective population size (Ne): ρ = 4Ner (30). Since Ne is often unknown, it has become standard practice to use a high-resolution recombination map generated by other means, like pedigree-based inference, to scale ρ to attain r (35), for instance by using the overlapping segments of the deCODE map.
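The scaling step follows directly from ρ = 4Ner. A minimal sketch, assuming a single constant Ne for the region and a reference genetic length taken from an independent map: summing ρ over the region gives 4Ne times the region's length in Morgans, from which Ne and then r can be recovered. The function name and the example numbers are hypothetical, and published maps may use a more elaborate procedure.

```python
from typing import List, Tuple


def scale_rho_to_r(rho_per_bp: List[float],
                   interval_lengths_bp: List[int],
                   reference_map_length_cM: float) -> Tuple[float, List[float]]:
    """Convert population-scaled rates (rho, per bp) into per-generation rates.

    The region's total rho equals 4 * Ne * (genetic length in Morgans), so a
    reference genetic length for the same region yields an implied Ne, and
    r = rho / (4 * Ne) for every interval.
    """
    if len(rho_per_bp) != len(interval_lengths_bp):
        raise ValueError("rho_per_bp and interval_lengths_bp differ in length")
    if reference_map_length_cM <= 0:
        raise ValueError("reference_map_length_cM must be positive")
    total_rho = sum(rho * length
                    for rho, length in zip(rho_per_bp, interval_lengths_bp))
    reference_morgans = reference_map_length_cM / 100.0
    ne = total_rho / (4.0 * reference_morgans)
    r_per_bp = [rho / (4.0 * ne) for rho in rho_per_bp]
    return ne, r_per_bp


# Hypothetical region: a 2 kb hotspot flanked by two 10 kb background intervals.
rho = [0.0004, 0.004, 0.0004]
lengths = [10_000, 2_000, 10_000]
ne_est, r_est = scale_rho_to_r(rho, lengths, reference_map_length_cM=0.04)
print(f"implied Ne ~ {ne_est:.0f}")
print([f"{r:.2e}" for r in r_est])
```

By construction, the scaled rates integrate to the reference genetic length, so the relative shape of the LD-based map is preserved while its absolute scale is borrowed from the reference.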
Unlike pedigree-based and gamete-based inference, LD-based inference provides a genome-wide, high-resolution map with a small number of individuals. Furthermore, having phased data is not a requirement for methods that can make use of genotype data (16). However, several aspects of LD-based methods need to be considered before they are chosen for recombination rate inference. Depending on the population being investigated, a demographic model might have to be specified (36). LD-based methods by default assume that certain demographic parameters, like the population size and mutation rate, remain constant over time (30,36–38). If the population has undergone drastic demographic changes and these changes are not accounted for, the resultant inference will be distorted (30,37–39). The work of Dapper and Payseur (38) explores this topic thoroughly. LD-based methods are also computationally expensive to run, especially when >50 chromosomes (or 25 individuals) are being analyzed (30,40). Therefore, LD-based methods generally require multinode computational clusters for genome-wide inference (40). Additionally, LD-based inference results in time-averaged (29) and sex-averaged recombination maps (16).
Regression-based inference
Methods that make use of regression based on LD summary statistics have been developed recently. Two of the prominent methods in this category are FastEPRR (40) and LDJump (30). These methods work by partitioning the genome into segments and calculating ρ for each segment by regression on specific summary statistics, for instance Watterson’s θ, Tajima’s D estimator or the haplotype heterozygosity. When <50 chromosomes are being used, LDJump (30) and FastEPRR (40) perform equally well, but when >50 sequences are used, FastEPRR is recommended (30). Both methods also perform on par with LDhat at large scales (30,40), but both are computationally faster than LDhat by several orders of magnitude (30). Furthermore, when regression-based methods include summary statistics that are dependent on demography, like Tajima’s D estimator, they yield more accurate estimates than LD-based methods (30). This improvement in accuracy and the increased computational efficiency are the primary benefits of regression-based methods over LD-based methods. However, regression-based methods share many of the limitations of LD-based methods. For instance, inferred maps are still time-averaged and sex-averaged, and a demographic model is still important in many cases. Some methods also struggle to infer recombination in windows smaller than two kilobases (30). Adrion et al. (41) recently developed a method, called ReLERNN, that makes use of a recurrent neural network and does not rely on summary statistics. It is worth taking note of this method due to its ability to make use of very small sample sizes, while maintaining a high level of accuracy even when the demographic model is misspecified (41).
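To make the inputs to such a regression concrete, the sketch below computes two of the summary statistics named above, Watterson’s θ and the haplotype heterozygosity, for a single genomic segment. It uses the standard textbook definitions and a toy set of haplotypes encoded as 0/1 strings; it does not reproduce the regression models of FastEPRR or LDJump, and the function names are illustrative.

```python
from collections import Counter
from typing import List


def watterson_theta(haplotypes: List[str]) -> float:
    """Watterson's estimator: segregating sites divided by the harmonic number a_n."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    n_sites = len(haplotypes[0])
    if any(len(h) != n_sites for h in haplotypes):
        raise ValueError("all haplotypes must have the same length")
    # A site is segregating when more than one allele is observed at it.
    segregating = sum(
        1 for j in range(n_sites) if len({h[j] for h in haplotypes}) > 1
    )
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / a_n


def haplotype_heterozygosity(haplotypes: List[str]) -> float:
    """Unbiased haplotype heterozygosity: n / (n - 1) * (1 - sum of p_i squared)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    counts = Counter(haplotypes)
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - homozygosity)


# Toy segment: four haplotypes typed at four sites.
segment = ["0010", "0110", "0010", "1000"]
theta_w = watterson_theta(segment)
hap_het = haplotype_heterozygosity(segment)
print(f"Watterson's theta = {theta_w:.4f}, haplotype heterozygosity = {hap_het:.4f}")
```

In a regression-based method, statistics like these would be computed for every segment and used as predictors of ρ in a model trained on simulated data.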
Demographic-model-aware inference
Building on the success of LD-based methods to infer recombination at fine scales, demographic-model-aware inference seeks to address the assumption that a population’s size remains constant over time. Not only does this assumption lead to biased estimates (30,37,39), but it can produce false positives when inferring recombination hotspots (36). Kamm et al. (42) developed LDpop, which can compute exact two-locus sampling probabilities under arbitrary piecewise-constant demographic histories. These likelihoods can then be used with other recombination inference software that uses likelihood lookup tables, like LDhat, to infer the recombination rate and account for variable population size over time.
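A piecewise-constant demographic history is simply a list of epochs, each with its own constant population size. The sketch below shows one way to represent and query such a history; the epoch boundaries and sizes are invented for illustration, and the representation is not the input format of LDpop or any other tool.

```python
from bisect import bisect_right
from typing import List


def population_size_at(t_generations: float,
                       epoch_starts: List[float],
                       epoch_sizes: List[float]) -> float:
    """Population size at a time in the past under a piecewise-constant history.

    `epoch_starts` lists, in increasing order, the time (generations before
    present) at which each epoch begins; the first epoch must start at 0 and
    the last epoch extends indefinitely into the past.
    """
    if len(epoch_starts) != len(epoch_sizes) or not epoch_starts:
        raise ValueError("epoch_starts and epoch_sizes must be non-empty and equal in length")
    if epoch_starts[0] != 0:
        raise ValueError("the first epoch must start at time 0")
    if t_generations < 0:
        raise ValueError("time must be non-negative")
    index = bisect_right(epoch_starts, t_generations) - 1
    return epoch_sizes[index]


# Hypothetical history: recent expansion, an older bottleneck, then an
# ancestral size. None of these values come from a published model.
starts = [0, 200, 2_000]
sizes = [100_000, 5_000, 15_000]
for t in (0, 199, 200, 5_000):
    print(t, population_size_at(t, starts, sizes))
```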
Spence and Song (43) extended the methodology in LDpop to include a computationally efficient recombination inference method called pyrho. Pyrho improves upon the runtime of LDhat by at least 10-fold by avoiding the use of Markov chain Monte Carlo methods and instead using a penalized likelihood framework and gradient-based optimization. LDpop and pyrho also allow an increase in sample size to a few hundred individuals. Thus, they can include more meioses in the inference than is computationally feasible with LDhat. Pyrho produces more accurate results than LDhat at fine scales, whether LDhat is used with or without a demographic model. However, maps produced by pyrho are sex- and time-averaged. The authors suggest that a larger sample size should favor recent recombination events, but it is unclear to what degree. Furthermore, pyrho and LDpop both require population size histories and accommodate a large number of epochs. More recently, Barroso et al. (44) developed a method, called iSMC, that simultaneously infers the recombination rate and the demography even with a single unphased diploid individual.
Local-ancestry-based inference
Populations with recent admixture can be used to develop a population-specific relative recombination map. Individuals who are recently admixed have a mosaic of different ancestral segments along their chromosomes. These segments are identified through local ancestry inference (LAI). The locations of the switches in ancestry along the chromosome are indicative of recombination events and can therefore be utilized to develop a recombination map (9,45). This method, however, relies on numerous upstream analyses and their accuracy. The first is the selection of proxy ancestral populations and how closely related these are to the true ancestral populations (45–47). The second is the accuracy of the phasing of the data; this is greatly improved if related individuals and a well-suited reference panel are used during this process (21,48). The third is the software chosen to infer ancestry switch-points and its robustness toward highly admixed populations; RFMix has been shown to be the most accurate tool for this purpose (46,49). The last upstream analysis that can affect the development of a population-specific recombination map is whether the method used to infer switch-points requires a recombination map or whether it infers recombination as part of its algorithm. Although software utilizing recombination maps is more common, it is our opinion that this could sway the placement of recombination events and thus a method that is independent of a recombination map is preferable. Once switch-points are inferred, the posterior mean number of ancestry switch-points is summed across individuals to yield a relative recombination rate (45). This approach does not, however, account for multiple hits within a defined window, and therefore a post-processing empirical Bayes framework has been implemented to correct for this (45).
This statistical method has been implemented in RASPberry and has proven to be accurate in a simulated African population, although there has not been any further accuracy testing on populations with differing degrees of admixture (45).
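The core counting idea can be sketched in a few lines. The example below is a deliberate simplification: it uses hard ancestry calls per haplotype rather than posterior mean switch counts, and it omits the empirical Bayes correction for multiple hits, so it is not the RASPberry procedure. The ancestry labels and function name are hypothetical.

```python
from typing import List, Tuple


def ancestry_switch_profile(ancestry_calls: List[str]) -> Tuple[List[int], List[float]]:
    """Count ancestry switches between adjacent markers across haplotypes.

    `ancestry_calls` holds one string per haplotype, with one ancestry label
    per marker. Returns the switch count for each inter-marker interval and
    each interval's share of all switches (a relative recombination signal).
    """
    if not ancestry_calls:
        raise ValueError("at least one haplotype is required")
    n_markers = len(ancestry_calls[0])
    if any(len(h) != n_markers for h in ancestry_calls):
        raise ValueError("all haplotypes must cover the same markers")
    counts = [0] * (n_markers - 1)
    for hap in ancestry_calls:
        for j in range(n_markers - 1):
            if hap[j] != hap[j + 1]:
                counts[j] += 1
    total = sum(counts)
    shares = [c / total if total else 0.0 for c in counts]
    return counts, shares


# Toy data: four haplotypes, four markers, ancestries labelled "A" and "E".
calls = ["AAEE", "AEEE", "AAAA", "EEAA"]
switch_counts, switch_shares = ancestry_switch_profile(calls)
print(switch_counts, switch_shares)
```

Because only one switch per haplotype can be seen in each interval, intervals with high recombination are undercounted in this naive version, which is the multiple-hits problem that the empirical Bayes step is meant to address.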
Identity-by-descent-based inference
More recently, Zhou et al. (29) have shown that identity by descent (IBD) can be used to infer high-resolution population-specific recombination maps. Their method, IBDrecomb (29), produces maps with an accuracy similar to that of LDhat, but is far more computationally efficient. IBDrecomb, like pedigree-based and LAI-based methods, uses the genomic consequences of recombination to infer the recombination rate. However, IBDrecomb uses the ends of IBD segments rather than phase switches and ancestry switches to infer past recombination events. First, IBDrecomb calculates the IBD coverage for a given interval across the chromosome. Then, the smallest IBD segments in each interval are removed until the IBD coverage of the interval being analyzed is equal to that of the interval with the lowest IBD coverage. The authors also employ coverage equalization to normalize underestimated IBD ends at chromosome ends. Thus, each interval now contains the same number of IBD segments. Finally, IBD ends within each interval are counted to estimate the relative recombination rate of the interval. The user needs to provide genetic map lengths in order to normalize the estimated relative recombination rates.
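The final counting and normalization steps can be sketched as below. This is a simplified illustration, not IBDrecomb itself: it omits coverage equalization and error correction, and the choice to ignore segment ends that coincide with the chromosome boundaries (where a segment is truncated rather than ended by recombination) is an assumption made here. Segment coordinates and the function name are hypothetical.

```python
from typing import List, Tuple


def rates_from_ibd_ends(segments: List[Tuple[int, int]],
                        chrom_length: int,
                        window: int,
                        map_length_cM: float) -> List[float]:
    """Distribute a known genetic length across windows by IBD end counts.

    `segments` are (start, end) positions in bp. Each end that falls strictly
    inside the chromosome is counted in its window, and the counts are scaled
    so that the windows sum to `map_length_cM`.
    """
    if chrom_length <= 0 or window <= 0:
        raise ValueError("chrom_length and window must be positive")
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for start, end in segments:
        for pos in (start, end):
            # Ends at the chromosome boundaries are truncations, not crossovers.
            if 0 < pos < chrom_length:
                counts[pos // window] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_windows
    return [map_length_cM * c / total for c in counts]


# Toy data: four IBD segments on a 3 Mb chromosome, 1 Mb windows, 3 cM in total.
ibd_segments = [(0, 1_500_000), (200_000, 1_700_000),
                (1_200_000, 2_500_000), (2_100_000, 3_000_000)]
window_cM = rates_from_ibd_ends(ibd_segments, chrom_length=3_000_000,
                                window=1_000_000, map_length_cM=3.0)
print(window_cM)
```

The scaling step mirrors the requirement noted above that the user supply genetic map lengths: the end counts determine only the relative rates, and the absolute scale comes from an external map.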
IBDrecomb includes methods that help correct errors in IBD estimation caused by phasing errors, gene conversion and genotype errors. IBDrecomb considers recent recombination and since it relies on IBD segments, it infers recombination that occurred before, during and after admixture. This would result in higher resolution maps over similar time scales. Furthermore, the recombination rate of populations that are not admixed can be inferred using this method.
Discussion
Due to the influence of recombination on evolutionary processes, it is important that we develop methods that accurately infer the recombination rate. To date there have been many attempts to do so and some differ drastically in their approach. Therefore, when choosing a method for recombination rate inference in African populations, one should ensure that the chosen method is compatible with the demographic history of the population under consideration.
LD-based, regression-based and demographic-model-aware inference are all based on using patterns of linkage disequilibrium to infer past recombination events. As a result, these methods all produce sex- and time-averaged maps and are affected by demography. Additionally, fewer than a couple of hundred samples are required by these methods in order to produce high-resolution recombination maps. Some notable differences between these methods are the ability to infer recombination accurately at fine scales and computational efficiency. These methods are ideal for investigating the fine-scale differences in the recombination landscape of different populations as well as the evolutionary history of specific regions in the genome. Given the complex demographic histories of the populations in Africa, these methods should be applied with caution.
IBD-based methods, LAI-based methods and pedigree-based methods infer recent recombination events rather than recombination events that occurred over 1–20 generations (29,43). Thus, these methods are useful in investigating contemporary recombination. However, each of these methods has aspects that could affect the accuracy and resolution of the inferred map if not accounted for. Pedigree-based methods require data from several orders of magnitude more individuals than IBD-based and LAI-based methods to attain a similar resolution (19,29). Some LAI-based methods require a recombination map for LAI inference, which could potentially affect the placement of recombination events. IBD-based methods rely on accurate IBD information, which could be difficult to obtain in populations that have undergone rapid expansion or contraction in the recent past (50).
In the context of African populations, special attention should be paid to the assumptions and limitations of each of these methods. The majority of these methods were fine-tuned to European demographics and although some of the assumptions may hold, others, like assuming a constant population size, should be avoided. Furthermore, a population-specific recombination map might not be necessary (51). A publicly available recombination map might be sufficient if fine-scale recombination rate information is not required. In this review, albeit far from exhaustive, we attempt to capture the essence of some of the prominent recombination inference methods, while highlighting popular software in each category. None of these methods is applicable to all situations in which an estimate of the recombination rate is needed. However, more than one might be appropriate depending on the research goal, the available sample size, the level of admixture, the level of inbreeding and the available budget. Therefore, it is important to consider each method on its merits in relation to the overall goal of the project to find the optimal solution.
Conflict of Interest statement: None declared.
Funding
This research was funded (partially or fully) by the South African government through the South African Medical Research Council and the National Research Foundation; the DST-NRF Innovation Doctoral Scholarship (to G.v.E.); a fellowship from the Claude Leon Foundation (to C.U.); and National Institutes of Health (NIH) funding of R35GM133531 from NIGMS (to B.M.H.).
References
Palamara, P.F.,
Author notes
Marlo Möller and Brenna M. Henn are co-senior authors.
Gerald van Eeden and Caitlin Uren are co-first authors.