An African origin for Mycobacterium bovis

Abstract Background and objectives Mycobacterium bovis and Mycobacterium caprae are two of the most important agents of tuberculosis in livestock and the most important causes of zoonotic tuberculosis in humans. However, little is known about the global population structure, phylogeography and evolutionary history of these pathogens. Methodology We compiled a global collection of 3364 whole-genome sequences from M.bovis and M.caprae originating from 35 countries and inferred their phylogenetic relationships, geographic origins and age. Results Our results resolved the phylogenetic relationship among the four previously defined clonal complexes of M.bovis, and another eight newly described here. Our phylogeographic analysis showed that M.bovis likely originated in East Africa. While some groups remained restricted to East and West Africa, others have subsequently dispersed to different parts of the world. Conclusions and implications Our results allow a better understanding of the global population structure of M.bovis and its evolutionary history. This knowledge can be used to define better molecular markers for epidemiological investigations of M.bovis in settings where whole-genome sequencing cannot easily be implemented. Lay summary During the last few years, analyses of large globally representative collections of whole-genome sequences (WGS) from the human-adapted Mycobacterium tuberculosis complex (MTBC) lineages have enhanced our understanding of the global population structure, phylogeography and evolutionary history of these pathogens. In contrast, little corresponding data exists for M. bovis, the most important agent of tuberculosis in livestock. Using whole-genome sequences of globally distributed M. bovis isolates, we inferred the genetic relationships among different M. bovis genotypes distributed around the world. The most likely origin of M. bovis is East Africa according to our inferences. While some M. bovis groups remained restricted to East and West Africa, others have subsequently dispersed to different parts of the world driven by cattle movements.

In addition, we added 81 publically available M.caprae genomes and eight previously unpublished sequences from M.bovis isolated in Ethiopia (n ¼ 7) and Burundi (n ¼ 1). The sequencing data have been deposited in the European Nucleotide Archive (EMBL-EBI) under the study ID PRJEB33773.
From this total of 3427 genomes, 63 sequences were excluded because they did not meet our criteria for downstream analyses (average whole-genome coverage below 7, ratio of heterogeneous SNPs to fixed SNPs above 1), yielding a final dataset of 3364 genomes (Supplementary Fig. S1 and Table S1). Geographical origin of the isolates, date of isolation and host metadata were recovered from EBI (Supplementary Table S1).
Whole-genome sequence analysis All samples were subject to the same whole-genome sequencing analysis pipeline, as described in [20]. In brief, reads were trimmed with Trimmomatic v0.33 [21]. Only reads larger than 20 bp were kept for the downstream analysis. The software SeqPrep (https://github.com/jstjohn/SeqPrep) was used to identify and merge any overlapping paired-end reads. The resulting reads were aligned to the reconstructed ancestral sequence of the MTBC [22] using the MEM algorithm of BWA v0.7.13 [23] with default parameters. Duplicated reads were marked using the MarkDuplicates module of Picard v2.9.1 (https://github.com/broadinstitute/picard). The RealignerTargetCreator and IndelRealigner modules of GATK v 3.4.0 were used to perform local realignment of reads around InDels [24]. Finally, SNPs were called with Samtools v1.2 mpileup [25] and VarScan v2.4.1 [26] using the following thresholds: minimum mapping quality of 20, minimum base quality at a position of 20, minimum read depth at a position of 7X and maximum strand bias for a position 90%. Only SNPs considered to have reached fixation within an isolate (frequency within isolate !90%) were considered. For SNPs with 10% frequency, the ancestor state was called. SNPs were annotated using snpEff v4.11 [27], using the M.tuberculosis H37Rv reference annotation (NC_000962.3) as the genome of M.bovis (AF2122/97) has no genes absent from H37Rv except for TbD1 (contains mmpS6 and the 5 0 region of mmpL6) [28].
In silico spoligotyping, genomic deletions and previously defined clonal complexes The spoligotype pattern of the 3364 genomes was determined in silico using KvarQ [29]. The results were submitted to the M.bovis spoligotype database https://www.mbovis.org/ [30] and SB numbers obtained.

Phylogenetic analyses
All phylogenetic trees were inferred with RAxML (v.8.2.12) using alignments containing only polymorphic sites. A position was considered polymorphic if at least one genome had a SNP at that position with a minimum percentage of reads supporting the call of 90%. Deletions and positions not called according to the minimum threshold of 7 were encoded as gaps. We excluded positions with more than 10% missing data, positions falling in PE/PPE genes, phages, insertion sequences and in regions with at least 50 bp identity to other regions in the genome [34]. Positions falling in drug resistance-related genes were also excluded. The alignment used to produce Maximum-likelihood phylogenies were computed using the general time-reversible model of sequence evolution (-m GTRCAT -V options), 1000 rapid bootstrap inferences, followed by a thorough maximum-likelihood search performed through CIPRES [35]. All phylogenies were rooted using a M.africanum Lineage (L) 6 genome from Ghana (SAMEA3359865).
Obtaining a representative dataset of M.bovis genomes: sub-sampling 1 Our phylogenetic reconstruction indicated that sequences belonging to clonal complex Eu1 and Eu2 were over-represented in the initial 3364 genome dataset, particularly from the USA, Mexico, New Zealand and the UK. To obtain a smaller dataset with a more even representation of the different phylogenetic groups, we pruned the 3364 genomes using the following criteria: (i) we removed all genomes with non-available country metadata (n ¼ 739), which resulted in 2625 genomes; (ii) we used Treemmer v0.2 [20] with the option -RTL 99 to keep 99% of the original tree length and the option -lm to include a list of taxa to protect from pruning. This list included all genomes belonging to clonal complexes Af1 and Af2, as well as any genome belonging to any unclassified clade; (iii) we visually identified monophyletic clades with all taxa from the same country and used Treemmer v0.2 [20], using options -lmc and -lm, to only keep a few representatives of each of these clades. To have representatives of the BCG clade, we kept 11 BCG genomes from [36]. This selection process rendered a dataset of 476 genomes.
Ancestral reconstruction of geographic ranges: subsampling 2 To infer the geographic origin of the ancestors of the main groups of M.bovis and M.caprae, we used the 476 genomes dataset (see Sub-sampling 1 section) and excluded all BCG genomes and all M.bovis from human TB cases or from unknown hosts, if the strains were isolated in a low incidence TB country (Europe, North America, Oceania). This is justified by the fact that the majority of such cases correspond to immigrants from high incidence countries that were infected in their country of origin, i.e. country of isolation does not correspond to the native geographic range of the strain and is thus not informative for the geographic reconstruction. M.bovis from patients in high incidence countries were kept (Supplementary Table S1). The resulting dataset was composed of 392 genomes.
For the ancestral reconstruction of geographic ranges, we used the geographic origin of the strains and the phylogenetic relationships of the 392 genomes. Geographic origin was treated as a discrete character to which 13 states, corresponding to UN-defined regions, were assigned. To select the best model of character evolution, the function fitMk from the package phytools 0.6.60 in R 3.5.0 [37] was used to obtain the likelihoods of the models ER (equal rates), SYM (symmetrical) and ARD (all rates different) [38]. A likelihood ratio test was used to compare the different log-likelihoods obtained. According to the former, the best fitting model was SYM, a model that allows states to transition at different rates in a reversible way, i.e. reverse and forward transitions share the same parameters (Supplementary Table S2). The function make.simmap in phytools package 0.6.60 in R 3.5.0 [37,39] was used to apply stochastic character mapping as implemented in SIMMAP [40] on the 392 genomes phylogeny inferred from the best-scoring ML tree rooted on L6, using the SYM model with 100 replicates. We summarized the results of the 100 replicates using the function summary in phytools package 0.6.60 in R [37].
Molecular dating of M.bovis and M.caprae: sub-sampling 3 For the molecular clock analyses, we considered only genomes for which the date of isolation was known (n ¼ 2058). For the eight genomes sequenced in this study, the date of isolation was retrieved at a later point, and these strains were not included in the dating analysis (Supplementary Table S1). We used a pipeline similar to that reported in [20]. We built SNP alignments including variable positions with less than 10% of missing data (alignment length; 24 828 variable positions). We added a L6 strain as outgroup (SAMEA3359865) and inferred the maximum-likelihood tree as described above. Since the alignment contained only variable positions, we rescaled the branch lengths of the trees: rescaled_branch_length ¼ ((bran-ch_length * alignment_length)/(alignment_length þ invariant_sites)). To evaluate the strength of the temporal signal, we performed root-to-tip regression using the R package ape [41]. Additionally, we used the least square method implemented in LSD v0.3-beta [42] to estimate the molecular clock rate in the observed data and performed a date randomization test with 100 randomized datasets. To do this, we used the quadratic programming dating algorithm and calculated the confidence interval (options -f 100 and -s).
We also estimated the molecular clock rates using a Bayesian analysis. For this, we reduced the dataset to 300 strains with Treemmer v0.2 in the following way: we randomly sub-sampled strains, maintaining the outgroup and at least one representative of four small clades of the tree that would have disappeared with simple random sub-sampling strategy (Af2 clonal complex: G42133; Af1 clonal complex: G02538; PZA_ sus_unknown1: G04143, G04145, G04147; M. caprae: G42152, G42153, G37371, G37372, G41838; Supplementary Table S1). The resulting alignment included 13 012 variable sites (subset1).
We used jModelTest 2.1.10 v20160303 [43] to identify the best fitting nucleotide substitution model among 11 possible schemes, including unequal nucleotide frequencies (total models ¼ 22, options -s 11 and -f). We performed Bayesian inference with BEAST2 [44]. We corrected the xml file to specify the number of invariant sites as indicated here: https://groups.goo gle.com/forum/#!topic/beast-users/QfBHMOqImFE, and used the tip sampling years to calibrate the molecular clock.
We used the uncorrelated lognormal relaxed clock model [45], the best fitting nucleotide substitution model according to the results of jModelTest (all criteria selected the transversional model as the best model), and three different coalescent priors: constant population size, exponential population growth and the Bayesian Skyline [46]. We chose a 1/x prior for the population size [0-10 9 ], a 1/x prior for the mean of the lognormal distribution of the clock rate [10 À10 -10 À5 ] and the standard Gamma distribution as prior for the standard deviation of the lognormal distribution of the clock rate [0-1]. For the exponential growth rate prior, we used the standard Laplace distribution [À1 to 1]. For all analyses, we ran two runs and used Tracer 1.7.1 [47] to evaluate convergence among runs and to calculate the estimated effective sample size (ESS). We stopped the runs when they reached convergence, and the ESS of the posterior and of all parameters were larger than 200. The number of generations ranged from 150 to 300 million depending on the run. We used Tracer [47] to identify and exclude the burn-in, which ranged from 4 to 30 million generations, depending on the run.
Since the BEAST analysis was based on a sub-sample of the data, we tested the robustness of the sub-sampling by repeating twice the sub-sampling with Treemmer [20]. This resulted in two alignments of 13 272 and 12 820 SNPs, respectively (sub-set2 and subset3). We then repeated all the BEAST analyses described above on these two additional datasets. For the BEAST analyses, all trees were summarized in a maximum clade credibility tree with the software Treeannotator (part of the BEAST package), after removing the burn-in and subsampling one tree in every 10 000 generations.

RESULTS AND DISCUSSION
Phylogenetic inference of M.bovis and M.caprae populations The phylogenetic reconstruction of all M.bovis and M.caprae sequences obtained (n ¼ 3364) confirmed that these two ecotypes correspond to two monophyletic groups, despite infecting similar hosts [48,49] (Supplementary Fig. S3). The range of host species from which M.bovis was isolated is broad, confirming that M.bovis can cause infection in many different mammalian species (Supplementary Table S1). Our collection of M.caprae included genomes from Japan (isolated in elephants from Borneo) [50], China (isolated in primates, Supplementary  Table S1) and Peru (host information unavailable, but possibly human [51]), suggesting that the host and geographic distribution of this ecotype ranges well beyond Southern-and Central Europe [52,53]. One group of M.caprae genomes with origin in Germany contained a deletion of 36-38 kb, which encompasses the region of difference RD4 usually thought to be specific to M. bovis (Supplementary Table S1). This is in agreement with a previous study reporting Alpine M.caprae isolates as RD4 deleted, using conventional RD typing [54].
For all M.bovis genomes, we determined in silico clonal complexes Eu1, Eu2, Af1 and Af2 and spoligotypes, and mapped them on the phylogenetic tree and onto a world map (Fig. 1, Supplementary Figs S2 and S3 and Table S1). All previously described clonal complexes corresponded to monophyletic groups in our genome-based phylogeny ( Supplementary Fig.  S3). The phylogenetic tree also revealed M.bovis representatives that did not fall into any of the previously described clonal complexes (n ¼ 175, 5.3%, Fig. 1, Supplementary Figs S2 and S3 and Table S1). These belonged to eight monophyletic clades with unknown classification and to a few singleton branches ( Supplementary Fig. S3). The tree topology showed a strong bias towards closely related strains, in particular among Eu1, which reflects the different sampling and WGS efforts in the different geographic regions ( Fig. 1 and Supplementary Fig. S3). Closely related genomes inform the local epidemiology but not the middle/long-term evolutionary history of the strains and were thus excluded from further analysis (Sub-sampling 1, see Methods section).
Two deep divergence events in M.bovis populations were notorious: one giving rise to an unclassified lineage we named M.bovis PZA_sus_unknown1 (RD4 deleted as other M.bovis), which included five samples from Uganda (isolated from Bos taurus cattle, Supplementary Table S1), three from Malawi (isolated from humans, Supplementary Table S1) and one isolated from an antelope in Germany (Supplementary Table S1). These M.bovis isolates lacked the PncA H57D mutation that is responsible for the intrinsic PZA resistance of canonical M.bovis as reported previously [55] (Fig. 2). This clade, retained the region of difference N-RD17, unlike all remaining M.bovis, and is probably related to a group of strains previously isolated from cattle in Tanzania and reported as 'ancestral' [31] (Fig. 2). In agreement with that, our PZA_sus_unknown1 clade had other deletions reported as specific to these Tanzanian isolates; a larger deletion encompassing RDpan (RDbovis(a)_Dpan) and RDbovis(a)_kdp [31]. The second deep branching lineage included all other M.bovis strains descendent from an ancestor that acquired the PncA H57D mutation and therefore encompasses all previously described clonal complexes [8,14], as well as the other previously unclassified clades we describe here.
From the M.bovis PZA-resistant ancestor strains, two main splits occurred; one split led to the ancestor of Af2 and its previously unclassified sister clade which we called unknown2 and which contains the BCG vaccine strains (Fig. 2). M.bovis strains with spoligotyping patterns similar to BCG have previously been referred to as 'BCG-like'. However, our genome-based phylogeny shows that BCG-like spoligotyping patterns are present in several clades and have thus little discriminatory power [10] (Fig. 2, Supplementary Table S4). In Af2 and unknown2, the region of difference RDpan is present [31] (Fig. 2). Otherwise, RDpan is deleted in all other M.bovis except for the unknown3 clade, in which it is polymorphic (Fig. 2). The other split led to the ancestor, from which all remaining M.bovis strains evolved, i.e. Af1, Eu2 and Eu1 as well as other groups (Fig. 2). Interestingly, Af1 does not share a MRCA with Af2 but with Eu1 and Eu2 as well as with another unclassified group, which we called unknown3 (Fig. 2). Clonal complexes Eu1 and Eu2 share a MRCA together with five other unknown clades (unknown 4unknown 8). Eu2 is more closely related to clades unknown4 and 5, than to Eu1 (Fig. 2). Eu1 in turn shares a common ancestor with three other clades unknown6, 7 and 8 (Fig. 2).

The temporal and geographic origin of M.bovis
Our reconstruction of ancestral geographical ranges points to East Africa as the most likely origin for the ancestor of all M.bovis ( Fig. 2 and Supplementary Fig. S4). This is supported by the fact that the basal clade M.bovis-PZA_sus_unknown1 has an exclusively East African distribution and is PZA susceptible. PZA susceptibility in M.bovis is probably an ancestral character given that all other lineages of the MTBC are PZA susceptible. Alternatively, the ancestral M.bovis PZA susceptible populations could have had a much broader geographic distribution, which later became restricted to East Africa. For M.caprae, the sampling was too small and biased (Supplementary Table S1) and no conclusions can be confidently drawn. We performed tip-dating calibration using the isolation dates of the strains with both Bayesian methods and LSD (see Methods section). Both the tip-to-root regression and the randomization tests performed indicated a temporal signal in the data (Supplementary Fig. S5). We estimated a clock rate of between 6.66 Â 10 À8 and 1.26 Â 10 À7 for the BEAST analyses (Supplementary Table S3), and between 6.10 Â 10 À8 and 8.29 Â 10 À8 for the LSD analysis. These results are in line with the results of previous studies [15,56]. The common ancestor of M.bovis was estimated to have evolved between the years 256 and 1125 AD by the Bayesian analysis (cumulative range of the 95% highest posterior density (HPD) of all nine BEAST analyses) and in the year 388 AD by LSD (Fig. 3, Supplementary Trees S1-S10). Together, these estimates suggest that M.bovis has emerged in East Africa sometime during the period spanning the third to the twelfth century AD (Fig. 3). However, the credibility intervals of the different analysis spanned several centuries. This was due to the intrinsic uncertainty of dating ancestral nodes when the clock calibration is based on the sampling time of recently sampled tips (all strains considered in this study have been sampled in the last 40 years). Moreover, by relying exclusively on recent calibration points to date older nodes, we ignored the issue of the time dependency of the estimated clock rates [57]. According to the time-dependency hypothesis, evolutionary rate estimates depend on the age of the calibration points, with older calibration points resulting in lower rates. In MTBC, this topic was discussed at length in other publications, and the available data to date does not allow to confidently accept or reject that hypothesis [56,[58][59][60][61]. Our analysis assumes that rates of evolution do not depend on the age of the calibration points. Therefore, we potentially underestimate the age of the older nodes of the tree. Molecular archaeological evidence suggests indeed that our dating analyses possibly underestimate the age of the MRCA of M.bovis. In particular, 2000 years old M.bovis DNA reported as having the RD4 and RD17 deletions was found in human remains in Siberia [62]. The region of difference RD17 corresponds to RDEu1 [13], and defines the clonal complex Eu1 of M.bovis, which we estimate has evolved between the years 1236 and 1603 AD (Fig. 3).
As discussed elsewhere [56], both tip dating and the analysis of ancient DNA have potential pitfalls, and these discrepancies cannot be reconciled without additional data. Nevertheless, the tip-dating calibration provided accurate results for the emergency of BCG strains and for the introduction of M.bovis to New Zealand [63,64] (Supplementary Trees S1-S10), indicating that the method can reliably infer divergence times at least for events occurred in the last 200 years.
Insights into the detailed population structure of M.bovis around the world Understanding the evolutionary history of the M.bovis populations requires understanding their geographic distribution at a continental scale. Our WGS data set has limited geographical resolution due to the biased sampling of certain regions of the world, and to the partial unavailability of associated metadata such as the origin  Table S4). Patterns SB0120 and SB0134, known as 'BCG-like' and reported to be relatively prevalent [9], as well as SB0944, are phylogenetically uninformative; SB0120 is present in several clades, and SB0134 and SB0944 have evolved independently in two different M.bovis populations (Fig. 2 and Supplementary  Table S4). Our results suggest that the sister clade of all contemporary PZA-resistant M.bovis, PZA_sus_unknown1, is restricted to East Africa. The same holds true for Af2, which is in accordance with and are coloured according to geographical UN region. Inferred spoligotype patterns from WGS described in M.bovis spoligotype database [30] are indicated for the unknown clades. The red circles at the tips correspond to the eight newly sequenced genomes. Regions of difference (RD) as in [31] are indicated; superscript þ and À refer to presence of the region or its deletion, respectively previous reports [8,12]. Our findings further suggest that the geographical distribution of the Af2 sister clade unknown2 includes East Africa (Eritrea, Ethiopia), but also Southern Europe (Spain and France). Informative spoligotypes of the un-known2 clade show that it also circulates in North Africa ( Fig. 2 and Supplementary Table S4). Of note, the original strain, from which all BCG vaccine strains were derived, was isolated in France [65]. Our inferences suggest that a common ancestor of Af2 and unknown2 evolved in East Africa, and while Af2 remained geographically restricted, its sister clade unknown2 has subsequently dispersed (Fig. 2). All remaining M.bovis descended from a common ancestor, for which the geographical origin was impossible to infer reliably with our data. However, the tree topology showed that from this ancestor several clades have evolved which are important causes of bovine TB today in different regions of the world (i.e. the clonal complexes Eu1, Eu2 and Af1; Fig. 2).
The most basal clade within this group is unknown3, which contained 25 genomes mostly isolated from humans (Supplementary Table S1). The in silico derived spoligotypes suggest that the geographical spread of unknown3 ranges from Western Asia to Eastern Europe, but also includes East Africa ( Fig. 2 and Supplementary Table S4). The next split in our phylogeny corresponds to Af1, which has been characterized extensively using the deletion RDAf1 and spoligotyping, and shown to be most prevalent in countries from West-and Central Africa [11]. Here, we could only compile nine Af1 genomes, of which five originated in Ghana [66], and the remaining had either a European or an unknown origin. The small diversity of Af1 spoligotypes found in our WGS dataset [11] indicates strong undersampling ( Fig. 2 and Supplementary Table S4). Nevertheless, it was possible to estimate the divergence of the Af1 clade from the remaining M.bovis to a period ranging from the year 921 to 1449 AD (Fig. 3), making it unlikely that Af1 was originally brought to West Africa by Europeans [67].
The next split comprised clades unknown4, unknown5 and Eu2. Clade unknown4 was composed of 33 genomes with little geographic information and for which the most common spoligotyping pattern was the uninformative SB0120 (n ¼ 19). Additional unknown4 spoligotypes indicate that strains belonging to this clade circulate in Southern Europe, Northern and Eastern Africa (Fig. 2 and Supplementary Table S4), supporting dispersion events between Africa and Southern Europe. Clade unknown5 comprised only nine genomes isolated mostly from Zambian cattle. Its corresponding spoligotype is also SB0120, limiting further geographical inferences. In contrast to the strains from clades unknown4 and un-known5, among the 323 Eu2 genomes, no genomes of East African origin were found, and Africa was only represented by nine South African genomes [68]. By far, most Eu2 were isolated in the Americas. Previous studies have shown that Eu2 dominates in Southern Europe, particularly in the Iberian Peninsula [14], thus possibly the source of Eu2 in the Americas. There were no representatives of Eu2 from the Iberian Peninsula in our dataset. However, our molecular dating analysis revealed that the common ancestor of Eu2 evolved during the period 1416-1705 AD (Fig. 3), which would be compatible with an introduction from Europe into the Americas.
Clonal complex Eu1, unknown6, unknow7 and unknown8, form a sister group to the previously described. Eu1 has previously been characterized based on the RDEu1 deletion and spoligotyping, showing that it is highly prevalent in regions of the world that were former trading partners of the UK [8,13]. That geographic range is well represented in our dataset, including many genomes from the UK (n ¼ 215) and Ireland (n ¼ 45) (Supplementary Table S1). The latter were very closely related, suggesting that there was probably fixation of just a few genotypes in this region as previously proposed [8]. In contrast, most branching events within Eu1 correspond to M.bovis isolated in North-and Central America as well as New Zealand, resulting from the expansion of clonal families not seen in the British Islands. Consequently, most of the genetic diversity of Eu1 exists outside of its putative region of origin. Our molecular dating is compatible with this view, indicating that the ancestor of Eu1 is likely to have emerged between the years 1236 and 1603 AD (Fig. 3), with several Eu1 sub-clades evolving in the last 200-300 years (Supplementary Trees S1-S10).
The closest relative to Eu1 is a genome from Ethiopia (un-known8) with the spoligotyping pattern SB1476, commonly found in Ethiopia [12]. Unknown6 comprised seven genomes from North America ( Fig. 2 and Supplementary Tables S1 and S4), whereas unknown7 included eight genomes, four of which were isolated in Western Europe and another four without country of origin available. Spoligotyping patterns indicate that the geographical range of unknown7 includes as well Southern Europe, Northern and Eastern Africa.

CONCLUSIONS AND IMPLICATIONS
We screened the public repositories and compiled 3364 genome sequences of M.bovis and M.caprae from 35 countries. Despite the biased geographic distribution of our samples, our results provide novel insights into the phylogeography of M.bovis and M.caprae. Our whole-genome-based phylogeny showed that although certain spoligotypes are associated with specific monophyletic groups, prevalent patterns such as the so-called "BCG-like" should not be used to infer phylogenetic relatedness. Moreover, our data extend the previously known phylogenetic diversity of M.bovis by eight previously uncharacterized clades in addition to the four clonal complexes described previously. Among those, Pza_sus_unknown 1 shares a common ancestor with the rest of M.bovis, has an exclusively East African distribution and does not share the PncA mutation H57D, conferring intrinsic resistance to PZA.
Our further inferences suggest that M.bovis evolved in East Africa. The evolutionary success of M.bovis is linked to the fact that it can infect and transmit very efficiently in cattle. Cattle have been domesticated twice independently; once in the Near East (B.taurus) and once in the Indus Valley (Bos indicus) approximately 10 000 year ago, and both were introduced to Africa at different time points and various locations, subsequently interbreeding with local wild species [69]. Whereas B.taurus was introduced probably during the six millennium BC possibly through Egypt, B.indicus was most likely introduced twice, first during the second millennium BC and later during the Islamic conquests [70]. M.bovis could have emerged after the introduction of cattle, benefiting from the development of African pastoralism and expanding within the continent. The timing of these events is difficult to estimate; the initial introductions of cattle predate by several thousands of years our inferred temporal origin of M.bovis. But as discussed, our estimates are possibly affected by the current uncertainty in dating deeper evolutionary events within the MTBC. Alternatively, M.bovis could have emerged in the Near East and been introduced to Africa together with cattle. We cannot test this hypothesis formally, as the Near East is poorly represented in our dataset. However, this scenario is difficult to reconcile with the restricted East African distribution of the Pza_sus_unknown_1 clade, as the Near East was also the origin of taurine cattle both in Europe and Asia, where no clade retaining the ancestral characteristic of PZA susceptibility was found.
While some M.bovis groups remained restricted to East Africa, others have dispersed to different parts of the world. The contemporary geographic distribution of M.bovis clades suggest that Eastand North Africa, Southern Europe and Western Asia have played an important role in shaping the population structure of these pathogens. However, these regions were not well represented in our dataset. Thus, more M.bovis genomes from these regions are necessary to generate better insights, particularly given the central role of these regions in the history of cattle domestication [71]. From a more applied perspective, our work provides a global phylogenetic framework that can be further exploited to find better molecular markers for studying M.bovis in settings where genome sequencing cannot be easily implemented.

Supplementary data
Supplementary data is available at EMPH online.