Phylogenetic analysis of HIV-1 shows frequent cross-country transmission and local population expansions

Abstract Understanding of pandemics depends on the characterization of pathogen collections from well-defined and demographically diverse cohorts. Since its emergence in Congo almost a century ago, Human Immunodeficiency Virus Type 1 (HIV-1) has geographically spread and genetically diversified into distinct viral subtypes. Phylogenetic analysis can be used to reconstruct the ancestry of the virus to better understand the origin and distribution of subtypes. We sequenced two 3.6-kb amplicons of HIV-1 genomes from 3,197 participants in a clinical trial with consistent and uniform sampling at sites across 35 countries and analyzed our data with another 2,632 genomes that comprehensively reflect the HIV-1 genetic diversity. We used maximum likelihood phylogenetic analysis coupled with geographical information to infer the state of ancestors. The majority of our sequenced genomes (n = 2,501) were either pure subtypes (A–D, F, and G) or CRF01_AE. The diversity and distribution of subtypes across geographical regions differed; USA showed the most homogenous subtype population, whereas African samples were most diverse. We delineated transmission of the four most prevalent subtypes in our dataset (A, B, C, and CRF01_AE), and our results suggest both continuous and frequent transmission of HIV-1 over country borders, as well as single transmission events being the seed of endemic population expansions. Overall, we show that coupling of genetic and geographical information of HIV-1 can be used to understand the origin and spread of pandemic pathogens.


Introduction
A high mutation rate, short generation time, lack of proofreading, and frequent recombination fuel HIV-1 molecular evolution with genetic variation (Castro-Nallar et al. 2012;Volz, Koelle, and Bedford 2013). This has resulted in the diversification of HIV-1 group M into nine distinct subtypes (A-D, F-H, J, and K). Recombination events between multiple subtypes have also resulted in new circulating recombinant forms (CRFs) of the virus. The prevalence of the different subtypes and CRFs differs markedly, with some more prevalent than others in specific geographic regions (Hemelaar et al. 2011(Hemelaar et al. , 2019. The divergence and expansion of the different subtypes of HIV-1 is elusive, as they are likely to have happened before the characterization of HIV-1 in 1983 (Gilbert et al. 2007).
Phylogenetics can be applied to infer the relatedness between HIV-1 sample genotypes, and geographical information contained in the genetic relationship can be used to infer the states of ancestral genotypes (Castro-Nallar et al. 2012). However, the retrospective nature of phylogenetics makes the approach dependent on the comprehensiveness of the sample collection, i.e. samples representing all genotypic states should ideally be included in the collection. In practice, it is not possible to sample all genotypic states, and especially it is difficult to include samples of ancestral genotypes only available from historical sample collections. Nonetheless, a few available historical collections of HIV-1 covering decades have been successfully sequenced and used to date origins of types of HIV-1 (Bletsa et al. 2019;Gryseels et al. 2020;Worobey et al. 2016). Beyond the geographical information contained in phylogenies, they can be analyzed with respect to other sample information, e.g. temporal, epidemiological, and demographic information, to further understand the observed genetic diversification and distribution of HIV-1 (Castro-Nallar et al. 2012).
Here, we sequenced HIV-1 genomes sampled from participants in the Strategic Timing of AntiRetroviral Treatment (START) trial (Group, Insight Start Study et al. 2015), which constitute a well-defined and demographically diverse cohort of ART-naïve HIV+ participants, and analyzed them together with a set of HIV-1 genomes defined by Los Alamos National Laboratory (LANL) to comprehensively reflect HIV-1 genetic diversity. With phylogenetic ancestry reconstruction, we used the sample collection to delineate and characterize the transmission of subtypes across countries.

Description of sample collection from START clinical trial
All participants are from the START trial, which is managed by the International Network for Strategic Initiatives in Global HIV Trials (INSIGHT). The study design and data collection plan for START have previously been reported (Group, Insight Start Study et al. 2015). A total of 4,685 antiretroviral therapy (ART)-naive participants from 35 countries (6 geographical regions) were included in the START study. The participants were enrolled between April 2009 and December 2013. Here, 3,785 participants with a viral load ≥1,000 cp/ml and with at least 2 baseline plasma samples available in the START biobank were included for HIV-1 genome sequencing.

Ethics
Samples included in this study were derived from participants who consented to the clinical trial, START (NCT00867048) (Group, Insight Start Study et al. 2015), run by INSIGHT. The study was approved by the institutional review board or ethics committee at each contributing center, and written informed consent was obtained from all participants. All informed consents were reviewed and approved by participant site ethics review committees.

Viral concentration and RNA extraction
Plasma samples were thawed at room temperature. Following this, 500 µl plasma was transferred to new RNAse-free tubes and centrifuged at 2,000×g for 15 minutes. The supernatant was then extracted and centrifuged at 21.000×g for 75 minutes, and 360 µl of the top supernatant was discarded. Viral RNA was extracted using QIAamp viral RNA extraction kit (Qiagen) on a QIAcube robot according to the manufacturer's guidelines.

Reverse-transcription and amplification of viral RNA
Reverse-transcription polymerase chain reaction (RT-PCR) was used to amplify two amplicons of HIV-1 separately, covering positions 1485-5058 and 5967-9517 (later referred to as amplicons A and B) in the HXB2 genome sequence (GenBank accession number K03455), as described previously (primer sequences are listed in Supplementary Table S1) (Gall et al. 2012). The reverse-transcription and amplification were performed using SuperScript III One-Step RT-PCR System with Platinum Taq High Fidelity (Thermo Fisher Scientific) according to the manufacturer's instructions with 10 µl viral RNA for each amplicon (Thermocycler settings are listed in Supplementary Table S7). The PCR products were purified using Ampure XP (Beckman Coulter) PCR purification according to the manufacturer's instructions. The two amplicons were pooled for each sample prior to DNA library preparation.

Amplicon DNA library preparation and sequencing
Libraries of DNA from pooled amplicons were prepared using a Nextera XT (Illumina, San Diego, CA, USA) sample preparation kit to target an insert size of 300 nucleotides (nts). A modified protocol was used, in which input DNA and reagent use was halved, except for normalization of libraries, where 1.5× magnetic normalization beads were used. DNA libraries were sequenced on an Illumina MiSeq machine using a MiSeq 150-cycle V3 reagent kit (Illumina), producing 75-nt paired-end reads.

Removal of sequence readouts from human genome
Sequencing reads from each sample were aligned against the human reference genome sequence Genome Reference Consortium Human Build 37 patch release 13 (GRCh37.p13) using Bowtie2 version 2.2.8 (Langmead and Salzberg 2012) (option '-X 1000') and only read pairs in which neither of the reads aligned to GRCh37.p13 were retained.

Alignment of reads against HXB2 reference to create consensus sequence
Cleaned reads from the previous step were aligned against the genome sequence of HIV-1 HXB2 (GenBank accession number K03455.1) with SSAHA2 version 2.5.3 (Ning, Cox, and Mullikin 2001) (option '-solexa' and '-pair 11000'), and resulting sequence alignment map (SAM) files were converted to sorted binary alignment map (BAM) files using SAMtools 1.1 (Li et al. 2009 for which a consensus sequence was determined for either 90 per cent of one of the amplicon regions or 50 per cent of both amplicon regions were included for further subtype and phylogenetic analysis.

HIV-1 subtyping
The consensus sequence of each sample was analyzed with REGA HIV-1 Subtyping Tool version 3 (Pineda-Pena et al. 2013). The output was manually inspected to check for the presence of subtype-specific sequences within the given consensus sequence. Shannon-Weaver and 1-Simpson diversity indexes were calculated using the vegan package in R (version 3.2.5). The diversity indexes were calculated based on the distribution of subtypes A, AB, B, BC, BF, C, and CRF01_AE across geographical regions.

Maximum likelihood phylogenetic reconstruction
A maximum likelihood tree was constructed using ExaML (Exascale Maximum Likelihood) version 3.0.16 that uses the RAxML search algorithm (Kozlov, Aberer, and Stamatakis 2015). The sequences of pure subtypes A, B, C, D, F, and G, as well as subtype CRF01_AE were used for tree construction. Recombination hampers phylogenetic analysis, so we focused our analysis of ancestry on subtypes A, B, C, and CRF01_AE, respectively (Felsenstein 2004). We used a generalized time-reversible model with a gamma distribution, and the analysis was parallelized using openmpi. The following three commands were used to produce the trees: (1) 'raxmlHPC-AVX -y -m GTRGAMMA -p 12345 [sample consensus sequences in phylip format] -n StartingTree', (2) 'examl-OMP -s maximum_likelihood_tree.unpartitioned.binary -t RAxML_parsimonyTree.StartingTree -m GAMMA -n Tree1', and (3) 'examl-AVX -s maximum_likelihood_tree.unpartitioned.binary -t TreeSet -f E -m GAMMA -n T3uE'. The maximum likelihood trees were visualized using the Interactive Tree of Life online tool ( Fig. 1) (Letunic and Bork 2019). Least-Squares Dating (LSD) was used to extract monophyletic clades containing subtypes A, B, C, and CRF01_AE, respectively, and produce separate subtype-specific rooted trees ).
Ancestral states were reconstructed with PastML using an F81like model, which generalizes states to the 4-state F81 model for nucleotide substitution (Felsenstein 1981). Under F81-like model, migration rate from a state i (e.g. location) to a different state j (i != j) is proportional to the equilibrium frequency of j, πj. In addition to the state equilibrium frequencies, PastML optimizes the rescaling factor (analogous to mutation rate under strict molecular clock), which is applied to all the tree branches and represents the average number of character changes per branch unit (e.g. year for dated trees).
We assessed the robustness of the ancestral state reconstruction by shuffling tip state annotations prior to PastML analysis. The random shuffling analysis was repeated three times for each subtype.

Genotyping of HIV-1 from the START clinical trial
We obtained HIV+ plasma samples from 3,785 START trial participants with a viral load ≥1,000 copies/ml. For each sample, we sequenced two 3.6-kb amplicons of the HIV-1 genome that cover partial gag (54 per cent) and pol (99 per cent), and full the env gene, respectively. While we achieved an average sequencing coverage of the two HIV-1 genomic regions of 7,188-fold (median 5,334fold) and 1,061-fold (median 303-fold), respectively, sequencing success varied across both samples and genomic regions. For further analysis, we only included those 3,197 samples for which sequence information was available for at least 90 per cent of either of the genomic regions or at least 50 per cent of both genomic regions.
We analyzed each of the 3,197 genomes with REGA HIV-1 Subtyping Tool (Pineda-Pena et al.) to identify subtype-specific sequences, and we assigned the samples to be either a pure subtype (n = 2,354 samples distributed on subtypes A-D, F, and G), or a recombinant subtype, in cases where the genome showed the presence of sequences specific to >1 pure subtypes (n = 843 samples). We did not detail the recombinants down to specific CRF references, except for CRF01_AE (n = 147). For example, if a genome showed to contain sequence identities specific to both subtype C and D, respectively, it was denoted as subtype 'CD' without further detailing of genomic mosaic structure defined by recombinant breakpoints.

Combined phylogenetic analysis of data from START and LANL
We combined our START sequences of pure subtypes (A, B, C, D, F, and G) and CRF01_AE (n = 2,501 sequences) with all sequences of the same subtypes from the LANL Filtered Web Alignment (n = 2,632) that consist of sequences that represent the fullest spectrum (diversity) of sequences in LANL (Table 1).
The LANL Filtered Web Alignment sequences spanned seventytwo countries (eight samples have no country information) of which twenty-four countries overlapped with the thirty-five countries in START (Supplementary Tables S2-S3). Sequences from START were sampled between April 2009 and December 2013 ( Supplementary Fig. S1). The majority (94 per cent) of sequences in  Fig. S1). We constructed a maximum likelihood phylogeny of all samples in the combined dataset consisting of subtypes A, B, C, D, F, G, and CRF01_AE (n = 5,133; Fig. 1). The phylogeny confirmed that all samples of each subtype clustered as monophyletic groups, i.e. there was concordance between the assigned subtypes and the evolutionary relationship of samples. Subtypes A, B, C, and CRF01_AE were the most frequent subtypes in both the START and the LANL datasets, and our further analysis was focused on extracted monophyletic clades for each of these four subtypes (i.e. a rooted phylogenetic tree for each subtype; Supplementary Figs S2-S5).

Origin and cross-country transmission of subtype B
Subtype B samples from LANL Filtered Web Alignment (n = 1,196) encompassed 43 countries of which 15 countries overlapped with the 32 countries with subtype B samples from START (n = 1,959; Supplementary Tables S2-S3).
We annotated the tips (samples) of the subtype B phylogenetic tree according to country of origin to infer the ancestral states (Fig. 2). We found that our collection of subtype B samples emerged in the USA (marginal probability 89 per cent). To assess  if our ancestral state estimates were a result of genuine structure in the data or due to collection skewness (subtype B samples were in general most often observed in the USA (25 per cent)).
We repeated the ancestral reconstruction with geographical annotation randomly shuffled between samples (Firth et al. 2010). Random shuffling of country annotation led to the estimate of ancestral state being unknown ( Supplementary Fig. S6). As such, the random shuffling of annotation supported the estimate that the ancestral state was in the USA and was a result of genuine structure in the data.
We used ancestral state reconstruction to infer events of transmission between countries to explain the observed country of origin of our samples (Fig. 2). We inferred 530 events of transmission of subtype B from the USA to Europe or Australia, and all transmission events led to clusters of one to eight samples in the recipient country. We also inferred sixty-one transmissions of subtype B back again to the USA, Europe, or Australia.
While we also inferred 179 transmission events from the USA to Latin America leading to small clusters represented by one to four samples, six transmission events from the USA to Latin America distinguished themselves as they led to larger clusters of 19, 19, 22, 39, 106, and 166 samples, respectively (Fig. 2). Five of these clusters included samples from both START and LANL datasets.
We found that transmission from the USA led to a large cluster represented by twenty-seven samples from South Korea (Fig. 2). Also, we found a cluster of twelve samples in Thailand to originate from the USA or Malaysia, and that the Thailand cluster seeded a large cluster in China represented by thirty-nine samples (Fig. 2).
While subtype B originally emerged in Central Africa (Gilbert et al. 2007), ancestral reconstruction estimates that all African subtype B samples in our collection are the results of introduction of subtype B from other geographical regions back to Africa.

Origin and cross-country transmission of subtype C
The second most prevalent subtype within our combined dataset was subtype C (n = 949). Subtype C samples from LANL Filtered Web Alignment (n = 720) encompassed 30 countries of which 10 countries overlapped with the 21 countries with subtype C samples from START (n = 229) (Supplementary Tables S2-S3). Ancestral state reconstruction estimated the ancestor of subtype C samples to be in South Africa (marginal probability 99 per cent), and root cluster encompassed 398 samples across both datasets (Fig.  3). The country of the ancestor became unknown when the country annotation was randomly shuffled (Supplementary Fig. S7). Accordingly, the annotation shuffling supported the estimation of the subtype C ancestral state to be in South Africa and was the outcome of genuine structure in the data.
From the ancestral cluster in South Africa, we inferred four events of transmission that led to clusters represented by >10 samples: (1) transmission to a thirty-nine-sample cluster in India; (2) transmission to a twenty-nine-sample cluster in Tanzania with transmission onward via Sweden (cluster with twenty-nine samples) to Brazil (cluster with twenty samples) and from Tanzania back to South African cluster represented by twenty-seven samples; (3) transmission to a four-sample cluster in Belgium with transmission onward to Brazil (cluster with eleven samples); and (4) transmission to a thirteen-sample cluster in Botswana with transmission onward to an Indian cluster (forty-two samples) (Fig. 3). The largest Brazilian cluster included samples from both START and LANL datasets.

Origin and cross-country transmission of subtype A
START subtype A samples (n = 106) encompassed 12 countries, whereas LANL subtype A samples (n = 231) encompassed 22 countries of which 4 countries overlapped with START (Supplementary Tables S2-S3). The ancestor of subtype A samples was estimated to be in Uganda (marginal probability 99.8 per cent), and the root cluster of sixty-six samples encompassed samples from both datasets (Fig. 4). Estimate of the ancestral state became unknown when the country annotation was randomly shuffled (Supplementary Fig. S8); thus, the shuffling of country annotation supports that the estimation of subtype A ancestral state to be in Uganda was the outcome of genuine structure in the data. From the ancestral cluster in Uganda, we inferred up to 11 transmissions leading to clusters of sizes 1-3 and a transmission leading to a large cluster in Kenya (53 samples) with onward transmission to clusters including a total of 184 samples (Fig. 4).
The large Kenyan cluster was the source of two introductions to Europe that led to clusters represented by a minimum of four samples: that is, we delineated clusters of transmission from the Kenyan cluster to (1) Tanzania (twelve samples) to Cyprus (nine samples) to Greece (eighteen samples) to Portugal (four samples) and (2) Congo (ten samples) with onward transmission to Ukraine (thirteen samples) that split into a Russian (fifteen samples) and a Polish cluster (four samples), respectively (Fig. 4). While both START and LANL datasets included Ugandan samples, the other before-mentioned countries were either unique to LANL (Congo, Ukraine, Russia, Poland, Tanzania, Cyprus) or START (Greece, Portugal) dataset, respectively.

Origin and cross-country transmission of subtype CRF01_AE
For subtype CRF01_AE, the LANL dataset encompassed samples (n = 292) from 14 countries of which three countries overlapped with the samples (n = 147) from 12 countries encompassed by the START dataset (Supplementary Tables S2-S3).
Most (234 of 439 samples) of subtype CRF01_AE samples were from Thailand, but while the majority (81 per cent) of subtype CRF01_AE samples in START dataset were from Thailand, subtype CRF01_AE samples in LANL dataset were more distributed in both China (42 per cent) and Thailand (39 per cent). The LANL subtype CRF01_AE dataset also included three samples taken in 1990 in the Central African Republic, and a cluster with these three samples placed the origin of CRF01_AE to be Central African Republic (marginal probability 64 per cent; Fig. 5). Estimate of the ancestral state became unknown when the country annotation was randomly shuffled (Supplementary Fig. S9); thus, the shuffling of country annotation supports that the estimation of subtype CRF01_AE ancestral state to be in Central African Republic was the outcome of genuine structure in the data.
A transmission cluster in Thailand (n = 232 samples, including samples from both LANL and START datasets) was the largest cluster identified for CRF01_AE. From Thailand, we inferred fortytwo transmissions that led to small clusters outside Asia, whereas we found transmission from Thailand to other Asian countries that led to six large clusters of seven to twenty-seven samples (Fig.  5).

Subtype distribution across regions for START clinical trial samples
Finally, we investigated the distribution of subtypes across geographical regions for START samples. Seven subtypes (A, AB, B, BC, BF, C, and CRF01_AE) were represented by at least 100 samples in the START collection ( Table 2). The seven subtypes showed differences in distribution across geographical regions (Pearson's Chi-squared test with simulated P-value based on 2,000 replicates; P-value = 0.0005). Subtype B was the most dominant in Australia, Europe and Israel, Latin America, and the USA; subtype C was most dominant in Africa; and subtype CRF01_AE was most dominant in Asia (Table 2). The population of HIV-1 in the USA was the most homogenous (Shannon-Weaver index 0.11), whereas the African population was the most diverse (Shannon-Weaver index 1.52; Table 2).

Discussion
In this study, we sequenced HIV-1 genomes from 3,197 participants in a clinical trial sampling across 35 countries and analyzed our data with another 2,632 genomes that comprehensively reflect the HIV-1 genetic diversity.
We used the genetic information to delineate and characterize cross-country transmission of subtypes. Our transmission analysis of subtype B showed many parallel transmissions out from the USA that led to small transmission clusters in other regions (fifty-one countries). We also found evidence for transmission back again to the USA from twenty-nine different countries. We find that this supports a model of continuous and frequent transmission of subtype B over country borders.
While we in general found transmission to be identified by clusters represented by a few samples, transmission of subtype B from the USA to Latin America showed examples of single transmissions that led to large clusters. We suggest that this is indicative that the HIV-1 population expansion within Latin America was local rather than being fueled by new transmissions from other regions. A similar indication of local clonal expansion was observed for subtype C for which an ancestral population in South Africa led to large clusters in India. This is in agreement with previous reports that the introduction of subtype C to India occurred a limited number of times (Neogi et al. 2012). We found the root of subtype CRF01_AE to be in central Africa, which is also the known origin for this subtype (Gao et al. 1996;Murphy et al. 1993). Also, the distribution of subtype CRF01_AE in the START clinical trial confirms that subtype CRF01_AE is endemic in Asia (Angelis et al. 2015), and while we found frequent transmissions to Europe and Australia, our phylogenetic analysis shows that there is no expansion of subtype CRF01_AE outside Asia.
A limitation of our analysis is that even though our sample collection includes 84 countries, many geographical regions are not well represented, and results from one country may not represent nearby countries, as there may exist large differences in subtype distribution between countries in the same region. This is illustrated by Greece that, unlike other European countries, had a high prevalence of subtype A in the START clinical trial. This agrees with a report on people living with HIV-1 in southwestern Greece   (Davanos et al. 2015), and we furthermore estimated that the Greek cluster of subtype A was transmitted from Uganda. Also, we found that subtype B ancestral state was estimated to be in the USA; nonetheless, we note that, as we have no Caribbean samples in our collection, our analysis is not able to detect the suggested cryptic subtype B circulation in the Caribbean (Worobey et al. 2016).
We found it confirmatory that transmission clusters often included both START and LANL samples and as such both datasets both supported the existence of these clusters. While some coun tries were represented in both datasets, most countries were unique to each of the two datasets. In this regard, we found it confirmatory that we were able to delineate transmission over countries that were one after the other unique to one of the datasets. Finally, we also found it confirmatory that inferred cross-country transmission often followed the geographical position of countries, e.g. we observed a north-eastwards transmission of subtype C: South Africa to Botswana to India to China or Nepal. Nonetheless, while we found that the combined dataset showed to mutually support our findings, we note that analysis of combined datasets should be interpreted with caution as results may be confounded by general differences between the datasets, e.g. differences in sequence quality.
We took advantage of the START sampling scheme to investigate the distribution of subtypes across geographical regions during 2009-13. While we identified the majority of START samples be pure subtypes (A-D, F, and G), we found 26 per cent recombinant subtypes including more than >100 samples of recombinants AB, BC, BF, and CRF01_AE, respectively. This is in line with that recombinant forms of HIV-1 have been estimated to account for 18-20 per cent of infections worldwide (Buonaguro, Tornesello, and Buonaguro 2007;Hemelaar et al. 2011Hemelaar et al. , 2019. It was beyond the scope of this study to perform a detailed characterization of the individual mosaics and their ancestry (except for CRF01_AE); nonetheless, we note that our observed subtype distributions confirm a high prevalence of subtypes BC and BF in Africa and Latin America, respectively (Carr et al. 2001;Melo, Jamal, and Zanotto 2012). Also, we found subtype AB in Uganda, Thailand, and Greece, but our sample collection does not cover Eastern Europe where an AB recombinant was first described in 1998 to circulate among injecting drug users in Eastern Europe (Liitsola et al. 1998), and further reports on distribution and transmission of AB recombinants remain scarce (Neogi et al. 2017).
Like other studies using the same amplicon-based sequencing approach (Cornelissen et al. 2017;Ratmann et al. 2017), we were limited by our ability to produce an evaluable result for all available START samples, and likewise, we observe that failure increased for the 3 ′ end of the genome and samples with low viral load. Also, our use of machine extraction as opposed to manual RNA extraction may affect our success rate (Cornelissen et al. 2017).
Finally, we note that our two datasets are fundamentally different in the sense that the START samples represent HIV-1 from ART-naïve clinical trial participants with CD4+ cell counts greater than 500 cells/mm 3 (Group, Insight Start Study et al. 2015), whereas LANL samples were selected to represent the fullest spectrum of HIV-1 genetic diversity.
Overall, our data both extend current knowledge and exemplify the limitations in phylogeographic analysis, where inference of transmission is dependent on both representative geographical and genotypic sampling, and as such the estimated countries of ancestors should only be regarded as the best proxies, given not all countries are represented. Also, we note that our combined analysis encompasses eighty-four countries, but forty-four of these are represented by less than ten sequences.
In conclusion, we have used a combined dataset of HIV-1 genomes to present an analysis on the ancestry and transmission of HIV-1 across multiple viral subtypes and geographical regions. We envision that the presented data on HIV-1 genotypes can facilitate comparative genetic studies and be analyzed in concert with other information, e.g. markers of HIV pathogenesis, to further understand HIV-1.

Data availability
All non-personally identifiable data will be made available to the readers of Virus Evolution Genome sequences and associated metadata have been deposited at Dryad repository: https://doi.org/ 10.5061/dryad.pnvx0k6k0.

Supplementary data
Supplementary data is available at Virus Evolution online.