Maintenance and reappearance of extremely divergent intra-host HIV-1 variants

Abstract
Understanding genetic variation in human immunodeficiency virus (HIV) is clinically and immunologically important for patient treatment and vaccine development. We investigated the longitudinal intra-host genetic variation of HIV in over 3,000 individuals in the US National HIV Surveillance System with at least four reported HIV-1 polymerase (pol) sequences. In this population, we identified 149 putative instances of superinfection (i.e. an individual sequentially infected with genetically divergent, polyphyletic viruses). Unexpectedly, we discovered a group of 240 individuals with consecutively sampled viral strains that were >0.015 substitutions/site divergent, despite remaining monophyletic in the phylogeny. Viruses in some of these individuals had a maximum genetic divergence approaching that found between two random, unrelated HIV-1 subtype-B pol sequences within the US population. Individuals with these highly divergent viruses tended to be diagnosed nearly a decade earlier in the epidemic than people with superinfection or virus with less intra-host genetic variation, and they had distinct transmission risk factor profiles. To better understand this genetic variation in cases with extremely divergent, monophyletic viruses, we performed molecular clock phylogenetic analysis. Our findings suggest that, like Hepatitis C virus, extremely divergent HIV lineages can be maintained within an individual and reemerge over a period of years.


Introduction
Intra-host genetic variation found in human immunodeficiency virus (HIV) infection is produced by complex evolutionary dynamics, including rapid evolution and genetic recombination (Shankarappa et al. 1999; Zanini et al. 2015). Within the HIV-1 protease and polymerase (pol) genomic region commonly used for drug resistance testing, the maximum divergence between intra-host variants tends to be <0.01-0.02 substitutions/site (Hightower et al. 2013; Poon et al. 2015; Zanini et al. 2015). In North America, typical HIV-1 subtype B strains from different individuals are between 0.03 and 0.08 substitutions/site divergent (Poon et al. 2016; Wertheim et al. 2017a). Within a given individual, HIV diversity, especially in the envelope (env) region, tends to be periodically purged by selective sweeps (Shankarappa et al. 1999; Laird Smith et al. 2016; Landais et al. 2017).
HIV-1 superinfection occurs when an individual is sequentially infected with HIV from two different sources (Ramos et al. 2002; Smith et al. 2004; Smith, Richman, and Little 2005; Koning et al. 2013), and it is often identified through a polyphyletic relationship in a phylogenetic tree (Wagner et al. 2014). The viral population subsequent to superinfection can reflect a mixture of the descendants of the two infecting strains, recombinant products of the infecting strains, or a single predominant strain. Superinfection can potentially affect the host immune response, disease progression, antiretroviral therapy (ART), and vaccine design and efficacy (Koelsch et al. 2003; Smith et al. 2005; Ronen et al. 2014; Wagner et al. 2017). The reported incidence of superinfection is high: 4.96 per 100 person-years in high-risk cohorts of men who have sex with men (MSM) (Wagner et al. 2014) and 2.2 per 100 person-years in people who inject drugs (PWIDs) (Hu et al. 2005).
We investigated the longitudinal intra-host genetic variation of HIV pol, with the intent of characterizing cases of superinfection in the US National HIV Surveillance System (NHSS). We employed a combined phylogenetic and genetic distance-based approach. As part of this investigation, we discovered a group of individuals with extremely divergent viral genotypes that were monophyletic in an HIV phylogeny. This finding suggests that extremely divergent HIV pol lineages can be maintained over the course of prolonged infection. Here, we characterize this unexpected pattern of HIV genetic variation and discuss implications for the detection of HIV molecular transmission clusters in a surveillance context.

Epidemiologic and sequence data
HIV-1 pol sequences reported to the US NHSS from 2000 through Fall 2015 were included in the study (see Oster et al. 2015, for a description of the development of this sequence database). Sequence and epidemiological data were included in our analysis if they were from an individual with at least four longitudinally reported pol sequences, each sampled at least 30 days apart. Sequences reported to the NHSS are generated using bulk Sanger sequencing, and each consensus sequence represents a snapshot of intra-host viral diversity at the time of sampling. All sequences were required to be a minimum of 500 nucleotides in length. In total, 3,655 people met these criteria, totaling 17,688 sequences.
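The inclusion criteria above can be sketched as a per-person filter. This is a minimal illustration, not the surveillance pipeline itself: the record layout (a list of date and sequence pairs), the function name, and the choice to drop short or closely spaced sequences rather than disqualify the whole person are all assumptions.

```python
from datetime import date

MIN_SEQUENCES = 4   # at least four longitudinal pol sequences
MIN_GAP_DAYS = 30   # each sampled at least 30 days apart
MIN_LENGTH = 500    # minimum sequence length in nucleotides


def eligible_sequences(records):
    """Return the usable (sample_date, sequence) pairs for one person.

    Hypothetical helper: returns an empty list if the person does not
    retain at least MIN_SEQUENCES sequences after filtering.
    """
    # Assumption: sequences shorter than 500 nt are dropped individually.
    usable = sorted((d, s) for d, s in records if len(s) >= MIN_LENGTH)
    kept = []
    for sample_date, seq in usable:
        # Assumption: keep a sequence only if it was sampled at least
        # 30 days after the previously kept one.
        if not kept or (sample_date - kept[-1][0]).days >= MIN_GAP_DAYS:
            kept.append((sample_date, seq))
    return kept if len(kept) >= MIN_SEQUENCES else []
```

A person whose second sample was taken two weeks after the first would lose that sample under this reading, and would remain eligible only if four adequately spaced sequences were left.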

Subtype classification and characterization of drug resistance associated mutations
HIV-1 subtypes and circulating recombinant forms were classified using a local installation of COMET v.1 (COntext-based Modeling for Expeditious Typing) (Struck et al. 2014). Non-B subtypes were included in phylogenetic analysis for rooting purposes, necessary to establish monophyly versus polyphyly. However, sequences from individuals with non-B subtypes (n = 152 individuals) were excluded from subsequent analyses given the variable substitution rates across HIV subtypes (Abecasis, Vandamme, and Lemey 2009; Wertheim, Fourment, and Kosakovsky Pond 2012). Drug resistance associated mutations (DRAMs) were identified using the HIV Drug Resistance Database via the Sierra Web Server 2.0 (https://hivdb.stanford.edu/page/webservice/) (Liu and Shafer 2006).

Calculating viral genetic divergence
To determine intra-host genetic distance, we used a local installation of HIV-TRACE (HIV TRAnsmission Cluster Engine) (Kosakovsky Pond et al. 2018). Briefly, HIV-1 pol sequences were aligned in pairwise fashion to a reference sequence (HXB2; coordinates 2,253-3,749). TN93 (Tamura and Nei 1993) genetic distances were calculated between each pair of sequences from a given individual. Unlike in previous HIV-TRACE analyses of the NHSS data, all nucleotide ambiguities were resolved when computing distances (e.g. Y is 0 substitutions from both C and T) to lessen the likelihood that sequences from mixed infections or those of poor quality would spuriously be flagged as being highly divergent. For each person, we determined whether consecutively sampled genotypes were more than 0.015 nucleotide substitutions/site divergent. This distance threshold was selected based on previous analysis of local and national HIV surveillance data in the USA (Oster et al. 2015; Wertheim et al. 2016; Wertheim et al. 2017a). In an HIV-1 surveillance context, if two individuals have HIV genetic sequences that are ≤0.015 nucleotide substitutions/site divergent, this similarity implies a direct or indirect epidemiological linkage (Wertheim et al. 2017a). Therefore, we queried the database for instances in which consecutive sequences from within a single individual would not be suggestive of epidemiological linkage.
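To make the distance step concrete, the following is a minimal sketch of a TN93 distance between two aligned sequences, followed by the consecutive-sample check against the 0.015 threshold. It is not the HIV-TRACE implementation. The function names are hypothetical, base frequencies are estimated from the pair being compared, and the ambiguity handling is a simplification: compatible codes (such as Y against C) count as 0 substitutions, and incompatible ambiguous sites are skipped.

```python
import math

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}
THRESHOLD = 0.015  # substitutions/site


def tn93_distance(seq1, seq2):
    """TN93 distance between two aligned sequences (simplified sketch)."""
    seq1, seq2 = seq1.upper(), seq2.upper()
    # Base frequencies from the unambiguous bases of both sequences.
    counts = dict.fromkeys("ACGT", 0)
    for base in seq1 + seq2:
        if base in counts:
            counts[base] += 1
    sites = ag = ct = tv = 0
    for x, y in zip(seq1, seq2):
        if x not in IUPAC or y not in IUPAC:
            continue  # gaps, N, and other uninformative characters
        sx, sy = set(IUPAC[x]), set(IUPAC[y])
        if sx & sy:
            sites += 1  # identical or compatible: 0 substitutions
        elif len(sx) == 1 and len(sy) == 1:
            sites += 1
            if {x, y} == {"A", "G"}:
                ag += 1  # purine transition
            elif {x, y} == {"C", "T"}:
                ct += 1  # pyrimidine transition
            else:
                tv += 1  # transversion
        # Incompatible ambiguity codes are skipped (simplification).
    total = sum(counts.values())
    if sites == 0 or min(counts.values()) == 0:
        raise ValueError("not enough data to estimate a TN93 distance")
    g_a, g_c, g_g, g_t = (counts[b] / total for b in "ACGT")
    g_r, g_y = g_a + g_g, g_c + g_t
    p1, p2, q = ag / sites, ct / sites, tv / sites
    k1 = 2 * g_a * g_g / g_r
    k2 = 2 * g_t * g_c / g_y
    k3 = 2 * (g_r * g_y - g_a * g_g * g_y / g_r - g_t * g_c * g_r / g_y)
    w1 = 1 - p1 / k1 - q / (2 * g_r)
    w2 = 1 - p2 / k2 - q / (2 * g_y)
    w3 = 1 - q / (2 * g_r * g_y)
    if min(w1, w2, w3) <= 0:
        return float("inf")  # too divergent for the correction to apply
    return -k1 * math.log(w1) - k2 * math.log(w2) - k3 * math.log(w3)


def divergent_consecutive_samples(seqs_in_time_order, threshold=THRESHOLD):
    """Indices of samples more than `threshold` from the previous sample."""
    return [
        i for i in range(1, len(seqs_in_time_order))
        if tn93_distance(seqs_in_time_order[i - 1], seqs_in_time_order[i]) > threshold
    ]
```

Treating a mixed base as compatible with either of its constituents is what keeps a mixed infection from inflating the distance, which is the motivation the text gives for resolving ambiguities.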

Phylogenetic analysis
A maximum-likelihood phylogenetic tree was inferred from the 17,688 sequences using FastTree2 under a GTR + CAT20 model (Price, Dehal, and Arkin 2010). Our inclusion criteria are biased towards individuals who are ART-experienced; therefore, we excluded 108 codons associated with DRAMs (Wheeler et al. 2010), as convergent evolution towards drug resistance can confound phylogenetic inference (Lemey et al. 2005). We used the ETE3 Toolkit (Huerta-Cepas, Serra, and Bork 2016) to determine whether the sequences from each of the 3,503 people with pure-subtype B virus were monophyletic or polyphyletic in the inferred phylogeny. A polyphyletic arrangement implies superinfection (Koelsch et al. 2003; Smith et al. 2004; Wagner et al. 2014), whereas monophyly suggests a single origin of infection (or potentially superinfection from a closely related source; see Section 4).
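The monophyly test can be illustrated without any tree library. The sketch below does not use the ETE3 API; it represents a rooted tree as nested tuples with string tip labels (an assumed toy format) and asks whether the smallest clade containing all of one person's tips contains anyone else's. The tip names are hypothetical.

```python
def tips(node):
    """Set of tip labels under a node (a tip is a plain string)."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= tips(child)
    return found


def is_monophyletic(tree, person_tips):
    """True if the smallest clade holding all of person_tips holds nothing else."""
    target = set(person_tips)
    node = tree
    # Walk down while a single child still contains every target tip.
    while not isinstance(node, str):
        below = [child for child in node if target <= tips(child)]
        if not below:
            break
        node = below[0]
    return tips(node) == target
```

For example, in the tree `((("p1_a", "p1_b"), "p1_c"), (("p2_a", "p3_a"), "p2_b"))` the three `p1` tips form a clade, whereas the two `p2` tips are interrupted by `p3_a`. The result depends on where the tree is rooted, which is why the non-B sequences were retained for rooting.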

Regression analysis
Based on the genetic distance and phylogenetic analysis, we identified three populations for analysis: (1) monophyletic viruses with no consecutive strains exceeding 0.015 substitutions/site divergence [n = 2,914 individuals], (2) monophyletic viruses with at least one consecutive strain exceeding 0.015 substitutions/site divergence [n = 240 individuals], and (3) polyphyletic viruses with at least one consecutive strain exceeding 0.015 substitutions/site divergence [n = 149 individuals]. We excluded individuals with monophyletic virus in which a single virus was >0.015 substitutions/site from all other viruses in that person, because these instances cannot be easily distinguished from poor sequence quality (n = 136 individuals). We also excluded individuals with non-monophyletic virus where the maximum genetic distance was <0.015 substitutions/site, because these instances cannot be easily distinguished from transmission within a local transmission cluster or poor resolution in a large phylogenetic tree (n = 64 individuals). Our final dataset comprised 3,303 individuals.
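The grouping rules above can be written as a small decision function. This is a sketch under stated assumptions: the input is a per-person pairwise distance matrix in sampling order plus a monophyly flag, the group labels are hypothetical, a "single virus" outlier is read as exactly one sequence exceeding the threshold against every other sequence, and the case the text does not define (polyphyletic, distant, but never between consecutive samples) is returned as unclassified.

```python
THRESHOLD = 0.015  # substitutions/site


def classify(dist, monophyletic, threshold=THRESHOLD):
    """Assign one person to an analysis group from a time-ordered distance matrix."""
    n = len(dist)
    consecutive_hit = any(dist[i][i + 1] > threshold for i in range(n - 1))
    max_dist = max(dist[i][j] for i in range(n) for j in range(i + 1, n))
    # Sequences that exceed the threshold against every other sequence.
    outliers = sum(
        all(dist[i][j] > threshold for j in range(n) if j != i)
        for i in range(n)
    )
    if monophyletic:
        if not consecutive_hit:
            return "monophyletic_low_divergence"
        if outliers == 1:
            return "excluded_possible_poor_sequence"
        return "monophyletic_high_divergence"
    if max_dist < threshold:
        return "excluded_possible_cluster_or_poor_resolution"
    if consecutive_hit:
        return "polyphyletic_putative_superinfection"
    return "unclassified"  # not covered by the stated criteria
```

As a check on the reported counts, 2,914 + 240 + 149 = 3,303, and adding the 136 and 64 excluded individuals recovers the 3,503 people with pure subtype B virus.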
We performed multivariate multinomial logistic regression analysis to investigate differences in these three populations. This regression analysis included year of diagnosis; transmission risk factor (MSM, PWIDs, people reporting high-risk heterosexual contact [heterosexual], perinatal, and other risk factors); and presence of common DRAMs (limited to M184V, K65R, K103N, Y181C, G190A, and L90M). MSM who reported injection drug use were classified as PWID. Regarding DRAMs, we considered mixed populations (i.e. sequence ambiguities indicating the presence of both DRAM and wild-type variants) to be presence of a DRAM.

Molecular clock analysis
We explored the viral dynamics in individuals with monophyletic, extremely divergent intra-host viruses. Sixty-three individuals had a maximum intra-host distance of ≥0.025 substitutions/site; we performed Bayesian molecular clock phylogenetic analysis on the eleven of these individuals with ≥10 viral genotypes using BEAST 1.8.2 (Drummond et al. 2012). For each individual, two independent runs of 5 million generations were performed, sampling every 500 generations and removing the first 10% as burn-in. A TN93 substitution model was implemented, including gamma rate variation. Month and year of genotype sampling were used to calibrate the molecular clock. Given the limited signal for calibrating a molecular clock in HIV trees of this size, we imposed a highly informative prior distribution on the substitution rate parameter of the strict molecular clock model, with a mean of 1.22 × 10⁻³ substitutions/site/year and standard deviation of 1 × 10⁻⁶. This calibration comes from previous molecular dating using NHSS data (Wertheim et al. 2017b). A Bayesian Skyline coalescent prior with two steps was used. Convergence was assessed using TRACER 1.7 (Rambaut et al. 2018). We also performed maximum likelihood phylogenetic inference on these eleven trees using RAxML (Stamatakis 2014). The BEAST and RAxML phylogenies are available as Supplementary Material.

Recombination detection
Using the recombination detection program (RDP) method implemented in RDP4, we scanned for genetic recombination in 134 sequences from the 11 individuals with monophyletic viruses with the greatest intra-host genetic divergence (Martin et al. 2010).

Scan for superinfection
We interrogated the NHSS for evidence of superinfection. We identified instances in which virus from within a single individual was polyphyletic in the phylogeny and had a consecutively sampled virus that was >0.015 substitutions/site divergent. Of the 3,303 individuals infected with pure subtype B strains, 149 (4.5%) met these criteria for defining superinfection. Of these 149 individuals, only 9 individuals had viruses in which the divergent virus was genetically similar (≤0.015 substitutions/site) to another virus in the same host.

Within-host genetic divergence
To investigate patterns of longitudinal viral divergence, we identified a group for whom there was no evidence of superinfection: individuals with monophyletic virus in which consecutive viruses are never more than 0.015 substitutions/site divergent from the previous virus. Of the 3,303 individuals infected with pure subtype B strains, we found 2,914 individuals (88.2%) who met these criteria. Unexpectedly, we found 240 individuals (7.3%) with monophyletic virus in which one or more consecutively sampled viruses was >0.015 substitutions/site from the previously sampled virus.

Maximum within-host genetic distance
The maximum within-host genetic distances in the 240 individuals who had highly divergent, consecutively sampled viruses resemble the upper extreme of the distribution for the other 2,914 individuals with monophyletic virus (gray and red bars in Fig. 1). In contrast, virus from the 149 individuals with polyphyletic virus and probable superinfection formed a separate, more extreme distribution (blue bars in Fig. 1). The maximum genetic distance among these polyphyletic cases resembled random, within-subtype B genetic distances in the US (Wertheim et al. 2017a). Individuals from all three groups had instances of within-host genetic divergence >0.03 substitutions/site, approaching random within-subtype B divergence. A similar pattern distinguishing these three groups can be seen in the mean within-host genetic distance (Supplementary Fig. S1).

Distinguishing individuals with monophyletic and polyphyletic viruses
The phylogenetic and genetic distance approach to characterizing superinfection is limited by the inherent difficulty in distinguishing within-host diversity from superinfection from another person with a closely related virus (i.e. superinfection from within a transmission cluster). Therefore, it is possible that the tail of the distribution of uppermost genetic distances for individuals with monophyletic virus is actually superinfection from a closely related source.
Individuals with monophyletic but extremely divergent virus were typically diagnosed significantly earlier than individuals in the other two groups (25th percentile: 1992; median: 1996; 75th percentile: 2002; Fig. 2B). These individuals were also more likely to have reported high-risk heterosexual activity or other risk factors. The proportion of PWID was not substantially different across these groups, which suggests that the significant adjusted odds ratio (AOR; Table 1) was attributable to earlier diagnosis years among PWID than among non-PWID (median 1998 vs. 2003; Mann-Whitney U test; P < 0.001). DRAMs were significantly more common in individuals with monophyletic virus with extremely divergent, consecutively sampled virus than in individuals with monophyletic virus without extremely divergent virus (Table 1; Fig. 2C).

Investigating the patterns of extreme within-host genetic divergence
To better understand the evolutionary patterns that gave rise to extremely divergent intra-host viral variants, we performed Bayesian molecular clock analysis on individuals who had monophyletic virus and a maximum genetic distance of at least 0.025 substitutions/site (the upper 2.5% tail of maximum intra-host divergence in people with monophyletic virus). We restricted this analysis to the eleven individuals with at least ten viral genotype sequences to more clearly understand patterns of viral genetic variation. Genotype sampling in these eleven individuals was dense over the observation period. The 3,303 individuals previously analyzed had a mean of 1.3 genotypes reported per person-year. Within these eleven individuals (denoted here as Cases A through K), there was an average of 2.2 viral genotypes per person-year (134 genotypes over 60.6 person-years; Table 2).
The molecular clock analysis suggested that the extreme genetic distance observed in these eleven cases was consistent with their long duration of infection. In eight of the eleven cases, the 95% highest posterior density for the inferred time of most recent common ancestor (TMRCA) included the year of diagnosis (Table 2). In three cases (Cases B, G, and J) the TMRCA was more recent than the year of diagnosis. In none of these cases did the TMRCA predate the year of diagnosis. However, we caution that date of diagnosis is necessarily the upper limit on the date of infection, which can precede the date of diagnosis by years. Further, the TMRCA should not a priori be expected to extend back to the date of infection.
Nine of these analyzed cases had a total of twenty-five instances of consecutively sampled viruses with >0.015 substitutions/site from the previous sequence (ranging between one and five instances per person) (Table 2; Figs 3 and 4). We note that in many instances, these highly divergent viruses alternate between the resistant (M184V) and wild-type states at codon M184. Case C exhibits five such events, alternating between M184V-resistant and distantly related wild-type clades. We found evidence for recombination in the pol region in only one of these individuals (Case I; Fig. 3). Nonetheless, we also observed fluctuation between M184V-resistant and wild-type virus in this same case. Only Cases A and J did not have any consecutively sampled virus that was >0.015 substitutions/site divergent. Nonetheless, the total genetic divergence detected in each of these two cases was over 0.025 substitutions/site.

Discussion
We report the results of an investigation into longitudinal genetic variation in HIV pol genotypes within the US NHSS. We found 149 (4.5%) individuals infected with highly divergent (i.e. >0.015 substitutions/site), consecutively sampled viral genotypes that were polyphyletic in a large HIV phylogeny. Surprisingly, we found >1.6 times as many individuals (240; 7.3%) with highly divergent, consecutively genotyped viruses that were monophyletic in the HIV phylogeny. This latter group was distinct from individuals with probable superinfection, comprising people who were diagnosed, on average, nearly a decade earlier than inferred cases of superinfection. Furthermore, the maximum genetic distance within these individuals with extremely divergent, monophyletic virus closely resembled the maximum genetic distance observed in individuals without evidence of superinfection or highly divergent, consecutively sampled viral genotypes. A phylogenetic examination of eleven cases exhibiting extremely divergent, monophyletic virus suggests that decades' worth of viral diversity is maintained within individuals. Many of these individuals had been infected for over 20 years, and this level of divergence is consistent with the evolutionary rate in this region of HIV-1 pol, of about 1 × 10⁻³ substitutions/site/year. However, this substitution rate is consistent with among-host evolutionary rates, and the intra-host substitution rate may be substantially faster (Lythgoe and Fraser 2012; Alizon and Fraser 2013; Landais et al. 2017). In addition, the unusually long duration of infection and the slowing of evolution due to ART (Kearney et al. 2014; Lorenzo-Redondo et al. 2016) make it difficult to determine the appropriate rate prior for these cases. Moreover, the long duration over which these individuals were surveilled raises the potential for a downward bias in viral substitution rate, inflating the TMRCA estimates (Ho et al. 2011).
Regardless of the exact rate of evolution, the breadth of this accumulated genetic diversity in the eleven cases investigated in depth here was often detectable in genotypes sampled over a span of only a couple years (see Cases B and K in Table 2).
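The link between divergence and elapsed time can be made explicit with a back-of-the-envelope calculation. The sketch below is not the BEAST analysis; it assumes a strict clock, two lineages sampled at the same time, and no correction for multiple substitutions, so that distance d, rate r, and time t back to the common ancestor satisfy d = 2rt. The rate is the prior mean stated in the molecular clock methods, and the function name is hypothetical.

```python
RATE = 1.22e-3  # substitutions/site/year (prior mean from the clock analysis)


def years_to_common_ancestor(distance, rate=RATE):
    """Years back to the common ancestor of two contemporaneous lineages.

    Assumes a strict clock: each lineage accumulates rate * t substitutions
    per site, so the pairwise distance is 2 * rate * t.
    """
    return distance / (2 * rate)


for d in (0.015, 0.025):
    print(f"{d:.3f} substitutions/site -> {years_to_common_ancestor(d):.1f} years")
```

Under these assumptions, the 0.015 threshold corresponds to roughly 6 years of separation and a 0.025 maximum distance to roughly 10 years. A faster intra-host rate, as discussed above, would shorten both figures.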
The maintenance of such highly genetically divergent strains, though common in chronic infection with another RNA virus, Hepatitis C virus (Gray et al. 2011, 2012; Raghwani et al. 2016), has not been previously described for HIV-1. Longitudinal studies of HIV genetic variation have focused on env due to its rapid evolutionary rate and immunological importance (Shankarappa et al. 1999; Laird Smith et al. 2016; Landais et al. 2017). A comprehensive investigation into longitudinal viral diversity across the entire HIV genome by Zanini et al. (2015) found that the env region underwent more frequent selective sweeps than the rest of the genome, resulting in the frequent purging of genetic diversity in the env region. However, Zanini et al. also documented rapid increases and decreases in pol diversity, though not to the extent reported here. Moreover, the only individual in the Zanini study who had pol divergence from baseline that approached the levels reported here (>0.02 substitutions/site) was assumed to be an instance of superinfection. Importantly, the mono-infected individuals in the Zanini et al. (2015) study were followed for <10 years since diagnosis, less than half as long as most of the cases with extremely divergent viruses described here.
We must consider the possibility that, rather than these cases representing the maintenance of extremely divergent populations, they are actually the result of superinfection from within a close-knit transmission cluster. This possibility would indicate that superinfection from within a transmission cluster occurs far more frequently than superinfection from an unrelated strain. Less plausibly, it would also imply that superinfection from individuals with closely related strains occurs preferentially in individuals with significantly earlier diagnosis dates and among people with heterosexual risk factors. Although it is likely that some fraction of the individuals with extremely divergent, monophyletic strains are actually the result of superinfection, the substantial differences in time since diagnosis between these monophyletic and polyphyletic groups suggest a different mechanism behind their genetic variation profiles. We note that in many instances of oscillation between genetically diverged clades within monophyletically infected individuals in this study, these clades can be distinguished by the presence or absence of drug resistance at M184V in reverse transcriptase (Fig. 3). Resurgence of drug-resistant HIV from latently infected cells after treatment modification or drug recycling is a well-documented phenomenon (Kijak et al. 2002; Deeks et al. 2003; Joos et al. 2008; Little et al. 2008; Hedskog et al. 2010; Rocheleau et al. 2017). Different cellular reservoirs (e.g. peripheral blood mononuclear cells) often harbor distinct viral populations that could be the source of these re-emergent strains (Rozera et al. 2012). Rather than generating de novo mutations after the re-introduction of ART, pre-existing viral variants encoding drug resistance emerge into dominance.
These pre-existing variants may also possess the necessary compensatory mutations to offset the fitness deficit arising from drug resistance mutations (Nijhuis et al. 1999; Gonzalez-Ortega et al. 2011). During this study, however, we did not have access to ART histories for any of these cases to determine whether this resurgence correlated with changes in, or adherence to, ART. Additionally, the M184 codon has biological importance beyond its potential for conferring drug resistance. This codon resides within a highly conserved sequence motif and is a known cytotoxic T-lymphocyte (CTL) epitope (Harrer et al. 1996). Therefore, genetic variants at this site are potentially subjected to dynamic CTL immune pressures as well as selection for and against drug resistance.
Conducting this study in a surveillance setting presents several limitations. Misattribution of samples to the wrong individual or unintentional merging of two individuals within a surveillance database could artificially increase our estimates of superinfection. Moreover, poor quality sequencing of a single viral genotype could artificially increase intra-host diversity yet preserve monophyly. Furthermore, this investigation was limited to the analysis of bulk Sanger consensus sequences, which are routinely reported as part of HIV molecular surveillance in the US. The lack of population-level resolution inherent in consensus sequences prevents us from obtaining a clear picture of longitudinal intra-host diversity in these cases of interest. If similar cases of extremely divergent monophyletic viruses can be found in well-documented research cohorts, more in-depth investigations into the intra-host viral population dynamics will likely be possible. Public health agencies within and outside the US are increasingly incorporating molecular sequence analysis into their HIV surveillance activities to identify growing transmission clusters (Poon et al. 2016; Monterosso et al. 2017). The discovery of individuals with extremely divergent viruses may complicate efforts to identify potential transmission links using Sanger sequencing. Genetic distance approaches for constructing molecular transmission clusters implicitly assume relatively low levels of intra-host HIV diversity (<1.5% nucleotide divergence). Whether using the earliest sampled genetic sequence (Oster et al. 2015; Wertheim et al. 2017b; Kosakovsky Pond et al. 2018; Wertheim et al. 2018) or all available sequences for a given person (Poon et al. 2015, 2016), HIV molecular epidemiological methods need to account for the presence of individuals within clusters whose intra-host genetic variation is as great as random intra-subtype variation.
Although this phenomenon appears relatively rare, given the number of people with decades-old diagnoses in the molecular surveillance database, a small number of problematic sequences can have large effects in genetic distance-based molecular transmission networks (Aldous et al. 2012; Kosakovsky Pond et al. 2018). Examination of viral populations within an individual at each specimen collection time point using next-generation sequencing may help to reveal these hidden variants and further our understanding of viral transmission dynamics.
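The sensitivity of distance-based clustering to the choice of sequence can be illustrated with a toy example. The sketch below is not HIV-TRACE; it links two people whenever their pairwise distance is at or below 0.015 substitutions/site and reports the connected components. The person labels and distances are invented for illustration, and the input format (a dictionary keyed by pairs) is an assumption.

```python
THRESHOLD = 0.015  # substitutions/site


def clusters(distances, threshold=THRESHOLD):
    """Connected components of the network linking pairs at or below threshold.

    distances: dict mapping (person_a, person_b) -> genetic distance.
    Returns a list of sets, omitting unlinked individuals.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        root_a, root_b = find(a), find(b)
        if d <= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return [g for g in groups.values() if len(g) > 1]


# Hypothetical distances using P2's earliest sequence versus a later,
# highly divergent sequence from the same person.
with_early = {("P1", "P2"): 0.010, ("P2", "P3"): 0.012, ("P1", "P3"): 0.020}
with_late = {("P1", "P2"): 0.030, ("P2", "P3"): 0.028, ("P1", "P3"): 0.020}
print(clusters(with_early))  # one cluster containing P1, P2, and P3
print(clusters(with_late))   # no clusters
```

In this invented case, P2 bridges P1 and P3 only when the earlier sequence is used, so a single divergent within-host variant is enough to dissolve the cluster.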

Data availability
All data included in this article were collected and analyzed as part of CDC routine surveillance activities. These data cannot be made publicly available; CDC is not permitted to share or distribute any surveillance data due to an assurance of confidentiality authorized under Section 308(d) of the Public Health Service Act (USA). Each state has primary authority for determining whether their laws and regulations permit the submission to GenBank or other open databases. Disclaimer: The findings and conclusions of this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention (CDC). The use of trade names and commercial sources is for identification only and does not imply endorsement by CDC.