National and International Dimensions of Human Immunodeficiency Virus-1 Sequence Clusters in a Northern California Clinical Cohort

Abstract Background Recent advances in high-throughput molecular epidemiology are transforming the analysis of viral infections. Methods Human immunodeficiency virus (HIV)-1 pol sequences from a Northern Californian cohort (NCC) of 4553 antiretroviral-naive individuals sampled between 1998 and 2016 were analyzed together with 140 000 previously published global pol sequences. The HIV-TRAnsmission Cluster Engine (HIV-TRACE) was used to infer a transmission network comprising links between NCC and previously published sequences having a genetic distance ≤1.5%. Results Twenty-five percent of NCC sequences were included in 264 clusters linked to a published sequence, and approximately one third of these (8.0% of the total) were linked to 1 or more non-US sequences. The largest cluster, containing 512 NCC sequences (11.2% of the total), comprised the subtype B lineage that traced its origin to the earliest North American sequences. Approximately 5 percent of NCC sequences belonged to a non-B subtype, and these were more likely to cluster with a non-US sequence. Twenty-two NCC sequences belonged to 1 of 4 large clusters containing sequences from rapidly growing regional epidemics: CRF07_BC (East Asia), subtype A6 (former Soviet Union), a Japanese subtype B lineage, and an East/Southeast Asian CRF01_AE lineage. Bayesian phylogenetics suggested that most non-B sequences resulted from separate introductions but that local spread within the largest CRF01_AE cluster occurred twice. Conclusions The NCC contains national and international links to previously published sequences including many to the subtype B strain that originated in North America and several to rapidly growing Asian epidemics. Despite their rapid regional growth, the Asian epidemic strains demonstrated limited NCC spread.

Several phylogeographic studies indicate that human immunodeficiency virus (HIV)-1 subtype B was introduced into the United States from the Caribbean approximately 1970 [1,2]. In the years since the epidemic was recognized, the subtype B HIV-1 lineage that initiated the US epidemic spread to many other countries [1,3]. Concurrently, non-subtype-B variants gradually increased in prevalence in the United States [4][5][6] and other predominantly subtype-B countries [7][8][9][10]. By the early 1980s, large numbers of persons suffering from HIV-related illnesses were reported from the New York City, Miami, Los Angeles, and San Francisco metropolitan areas. These regions, frequent destinations for both immigration and travel, were points of origin for the global spread of the subtype B viruses.
Recent advances in high-throughput molecular epidemiology analysis revealed international links to regional HIV-1 epidemics [11]. In this study, we use a large collection of HIV-1 pol sequences in a Northern Californian cohort sampled over 2 decades, and we published pol sequences from other parts of the United States and from outside of the United States to quantify the degree and nature of national and international links in this cohort.

Persons and Virus Sequences
We analyzed protease (PR) and reverse-transcriptase (RT) sequences from a cohort of antiretroviral therapy (ART)naive persons from the Kaiser Permanente Medical Care Program-Northern California (KPNC) undergoing genotypic resistance testing at Stanford University between 1998 and 2016 (Northern California cohort [NCC]). The KPNC is estimated to provide care to approximately 30% of the insured population in Northern California. Genotypic resistance testing for ART-naive individuals became routine in 2003 [12,13]. Protease and reverse-transcriptase nucleotide sequences from 4553 individuals in the Northern California cohort are available in GenBank and the accession numbers are listed in the Supplementary Files.
We searched the Los Alamos National Laboratories (LANL) HIV Sequence Database, which contains all published HIV-1 sequences, to identify sequences encompassing PR and at least the first 200 amino acids of RT (HXB2 nucleotides 2253-3151) [14]. When more than 1 sequence from an individual is available, the representative sequence was selected at random per LANL protocols. When more than 1 sequence was available for persons in the NCC, we selected the earliest sequence. All LANL sequences from individuals also in the NCC were excluded from the LANL dataset. Sequences annotated as problematic by LANL were also excluded: synthetic sequences and those with high non-ACTG content, hypermutation, and potential contamination [14,15].
The LANL reference dataset, generated July 1, 2018, included 139 060 HIV-1 group M sequences. The LANL and NCC sequences were annotated with subtype, country, sample year, and surveillance drug-resistance mutations (DRMs) [16]. The NCC sequences were subtyped using the COMET program [17]. The NCC sequences were also annotated with age, gender, race, and HIV acquisition risk factor. Within the NCC, a generalized binomial logistic regression model was used to assess the relationship between sample year and proportion of persons with a non-B subtype.

Transmission Network Analyses
The HIV-TRAnsmission Cluster Engine (HIV-TRACE) was used to infer a molecular transmission network designed to identify NCC sequences genetically similar to sequences in the LANL dataset [18]. The NCC and LANL sequences were aligned to a reference PR and RT sequence (HXB2; GenBank accession no. K03455) using the codon-aware program bealign (BioExt package, https://github.com/veg/BioExt). Tamura-Nei (TN93) pairwise nucleotide genetic distances were calculated between each pair of sequences in the combined datasets, and sequence pairs with TN93 distances ≤2% were recorded for subsequent analyses [18,19]. Ambiguous nucleotides were handled as described previously by resolving 2-way ambiguities (RYMSWK) to maximize matches, averaging all other ambiguities, and averaging all ambiguities in sequences that contained 5% or more of ambiguous nucleotide positions [18]. A phylogenetic test of conditional independence was used to remove some spurious transitive connections that result in cycles of transmission using an edge filtering method (described in Supplementary Methods).
We used a 2-tiered TN93 distance cutoff to define a link (edge) between 2 sequences (nodes): a 1.5% threshold for linking NCC sequences to other NCC sequences and to LANL sequences [18], and a more stringent 1.0% to bring in LANL sequences that connect only to other LANL sequences [19]. The NCC transmission network contained clusters having 2 or more NCC sequences. The NCC-LANL transmission network contained clusters having 1 or more NCC sequences linked to 1 or more LANL sequences.

Phylogenetic Analyses
Phylogenetic and molecular clock analyses were performed to characterize the phylogenetic structure and transmission characteristics of several of the largest clusters. Maximum likelihood (ML) trees were constructed in IQ-TREE version 1.6.5 using the best-fitting nucleotide substitution model according to Bayesian information criterion [20,21]. Evidence in support of the use of a molecular clock model was assessed in TempEst version 1.5.1 using linear regression of the relationship of rootto-tip genetic divergence from the best-fitting root within the maximum likelihood tree of each taxon with sampling time [22]. Regression analysis revealed a positive rate of genetic divergence and random residual pattern for each of the 4 major clusters, indicating sufficient measurable evolution among the sequenced cohort for divergence dating (Supplementary Figure  1). Bayesian phylogenetic inference of time-scaled trees was performed using Markov chain Monte Carlo (MCMC) sampling implemented in BEAST version 1.10 using parameters described in the Supplementary Methods [23,24]. To accelerate molecular clock analyses on clusters with large numbers of sequences, we systematically subsampled the sequences from the larger clusters using a principled heuristic described in the Supplementary Methods. Subtype B accounted for 95.5% of HIV-1 isolates from NCC members (n = 4350). The remaining 4.5% (n = 203) belonged to a non-B subtype: C (n = 55), CRF01_AE (n = 52), CRF02_AG (n = 24), A (n = 15) and other miscellaneous subtypes, circulating and unique recombinant forms (n = 57). There was a statistically significant increase in non-B sequences prevalence from 1.8% between 1998 and 2003 to 5.5% between 2011 and 2016 (odds ratio, 1.06; 95% confidence interval [CI], 1.03-1.10).

The Transmission Network
The inferred transmission network contained 679 clusters ranging in size from 2 to 3502 sequences including 415 clusters containing just NCC sequences and 264 clusters containing both NCC and LANL sequences. The 415 NCC-only clusters comprised 1240 NCC sequences (27.2% of all NCC sequences). The 264 NCC-LANL clusters included 1151 NCC sequences (25.3% of all NCC sequences) and 11 735 LANL sequences (8.4% of all LANL database sequences). Overall, 2391 of 4553 NCC sequences (52.5%) were in a cluster, and 1944 (81.3%) of those were directly linked to other NCC sequences. The majority (95.4%) of clusters contained 2 to 13 sequences, with 31 large clusters containing 14 or more sequences. Of the 28 largest clusters with both NCC and LANL sequences, 9 contained non-B viruses ( Figure 1A), and most contained primarily LANL sequences ( Figure 1B).
Of the 1151 sequences in the 264 NCC-LANL clusters, 791 (68.7%) were directly linked to a LANL sequence, whereas 360 (31.3%) were directly linked only to other NCC sequences. Of the 791 NCC sequences directly linked to an LANL sequence, 425 (53.7%) linked solely to 1 or more US sequences, 156 (19.7%) linked solely to 1 or more non-US sequences, and 210 (26.5%) linked to both US and non-US sequences. The intricate web of interconnectivity between the NCC and LANL sequences is illustrated in Figure 2.
The 326 clusters with 3 or more sequences were further categorized into broad geographic classes: 223 (68.4%) local clusters (>60% of cluster members were from NCC), 50 (15.3%) national clusters (>50% of cluster members were from the United States, but not from NCC), and 33 (10.1%) international (>50% of cluster members were from outside the United States). Twenty (6.13%) clusters with 3 or more sequences did not belong to 1 of the 3 classes. The 1151 NCC sequences clustering with an LANL sequence were significantly more likely than the 3402 that did not cluster to be from persons who were male (94.4% vs 89.3%; P < .001), younger (median 38 vs 40 years; P < .002), MSM (74.7% vs 68.3%; P < .001), and intravenous drug users (6.8% vs 3.5%; P < .001). Within the clusters containing 3 or more sequences, NCC persons in the international clusters were more likely to be older (median, 41 vs 34 years; P < .001), white (53.8% vs 43.5%; P < .001), and female (8.3% vs 4.8%; P < .001) and less likely to be MSM (71.2% vs 78.5%; P < .001) compared with those in local and national clusters.

Subtype B and Non-B Sequences
Of the 203 NCC non-B sequences, 57 (28.1%) were directly linked to other NCC sequences and 45 (22.2%) were clustered with 1 or more LANL sequences including 41 directly linked to an LANL sequence. Of these 41 sequences, 33 (80.5%) were directly linked to a sequence in 1 or more non-US countries. The proportions of NCC sequences directly linked to LANL sequences was not significantly different between subtype B and non-B sequences (17.2% vs 20.2%; P = .32, χ 2 test). Of the 23 non-B clusters with LANL sequences, 16 included 1 or more sequences from outside of the United States ( Table  2). Compared to the 158 NCC non-B sequences that did not cluster with LANL sequences, the 45 NCC sequences that did cluster were significantly more likely to be from persons who were male (77.8% vs 58.9%; P = .032), Asian (40.0% vs 17.7%; P = .003), MSM (44.4% vs 26.6%; P = .035), or intravenous drug users (6.7% vs 0%; P = .01) and less likely to be African American (22.2% vs 54.4%; P < .001).
Of the 264 NCC-LANL clusters within the inferred transmission network, 241 (91.3%) contained subtype B sequences and 23 (8.7%) contained non-B subtype sequences including 7 CRF01_AE clusters, 3 subtype C clusters, 2 subtype D clusters, and 11 additional clusters each containing sequences belonging to a different subtype, circulating recombinant form (CRF), or unique recombinant form. Table 1 summarizes the sizes and spatiotemporal characteristics of the 28 subtype B clusters with 10 or more sequences. Table 2 summarizes the sizes and the spatiotemporal characteristics of the 23 non-B clusters.

Sequences With Drug-Resistance Mutations
In 27 (10.2%) of the 264 NCC-LANL clusters, 42 NCC sequences shared 1 or more established DRMs with a linked LANL sequence; 11 of these sequences shared 2 or more DRMs between the NCC and LANL sequences. Eighteen sequences had shared nonnucleoside reverse-transcriptase inhibitor (NNRTI) resistance, 17 had shared nucleoside reverse-transcriptase inhibitor (NRTI) resistance, 4 had shared protease inhibitor (PI) resistance, and 3 had shared dual NRTI/PI resistance. Forty of the 42 sequences with shared DRMs were from the United States and 2 were from Europe; 40 comprised subtype B viruses. The most common shared mutations within these sequences were the NRTI-resistance mutations T215D/E/S/C/V (18 sequences) and M41L (11 sequences); the NNRTI-resistance mutations K103N (12 sequences) and Y188L (3 sequences); and the PI-resistance mutations L90M (4 sequences) and I54V (3 sequences).

Notable Clusters and Northern California Cohort-Associated Transmission Events
The largest subtype B cluster contained 3502 sequences including 512 NCC sequences and 2990 LANL sequences from 45 countries, including 1868 from outside the United States. Seven of the US sequences were isolated in 1978 and 1979, suggesting that this cluster originated from the founding viruses of the US subtype B epidemic [2]. Twenty-five NCC sequences shared a TN93 genetic distance ≤1.5% with 1 of the 7 1978/1979 US subtype B founder viruses. This cluster represents a lineage that originated in Haiti and the United States and then spread widely to many countries in Europe, Asia, and South America [1,3,25,26], which explains its designation as an international cluster. The overall mean 2  3  4  5  6  7  8  9  10  11  12  13  14  16  17  18  19  20  21  22  24  26  28  33  34  35  37  45  50  68  262  1404  1492  1640  3123  3502   2  3  4  5  6  7  8  9  10  11  12  13  14  16  17  18  19  20  21  22  24  26  28  33  34  35  37  45  50  68  262  1404  1492  1640  3123  3502 Other  Figure  1A shows the numbers of clusters according to their sizes. Subtype B sequences are indicated in dark gray and non-subtype B sequences are indicated in yellow. Figure 1B shows the counts of NCC and LANL sequences in clusters grouped by their sizes with NCC sequences indicated in gray and LANL sequences indicated in dark red.
distance between sequence pairs in this cluster was 3.5% with the maximum distance between sequence pairs of 9.4%. The large size of the cluster likely results from chaining caused by the dense sampling of viruses belonging to this lineage since the earliest of these strains were identified 35 years ago. The second largest subtype B cluster contained 1492 sequences including 2 NCC sequences and 1490 LANL sequences, 1471 of which were from Japan. This cluster contained sequences closely related to the previously reported JP.MSM.B-1 strain, which accounts for one third of infections in Japanese MSM [27] (Supplementary Figure 3). One of the 2 NCC sequences was found within a well supported subclade originating in 1995 (approximate high posterior density [HPD] interval: 1992-1998) (Supplementary Figure 4). This subclade contained 4 European and 4 Asian sequences described by Takebe et al [27] as evidence of global dissemination of the B-1 strain.
The largest non-B cluster contained 3123 CRF01_AE sequences including 13 NCC sequences, 58 LANL sequences directly linked to 1 of these 13 sequences, and 3052 additional LANL sequences. Most of the sequences in this cluster were from China, Vietnam, Thailand, or Japan. Two subclades involving 3 NCC lineages each provide evidence of at least   from China ( Table 2). The recent origin of this cluster (1990; HPD, 1984-1995) is consistent with previous descriptions of this subtype as a rapidly growing population [28]. There is no evidence for ongoing transmission of these viruses in the NCC (Supplementary Figure 5). The third largest non-B cluster included 1404 subtype A6 sequences including 4 NCC sequences, 250 LANL sequences directly linked to 1 of these 4 sequences, and 1150 additional LANL sequences [29]. Most of these sequences were from countries of the former Soviet Union (FSU) ( Table 2). Molecular dating of this cluster using relaxed clock methods proved difficult. Bayesian estimates of cluster tMRCA were sensitive to the model of population size dynamics and showed wide CIs (1939;HPD, 1901HPD, -1969. Our analyses included many sequences outside the FSU, hence its results are not directly comparable to previous, FSU-focused, estimates. No significant evidence of onward transmission was found within the NCC (Supplementary Figure 6).
Several additional examples of how the inclusion of published sequences provides insight into transmission dynamics are shown in Supplementary Figure 7. For example, a cluster of just 2 CRF24_BG NCC sequences was expanded to a cluster of 33 sequences involving 2 additional countries (Supplementary Figure 7A). Two NCC subtype C sequences  that were disconnected in the NCC-only network became hubs in a cluster of 50 sequences from 7 additional countries (Supplementary Figure 7B). A cluster of 2 NCC subtype B sequences and an unconnected subtype B NCC sequence were linked to a cluster of LANL isolates from the United States and Canada (Supplementary Figure 7C), and a cluster of 4 NCC sequences was augmented with a LANL US isolate and bridged to an older cluster of LANL sequences (Supplementary Figure  7D).

DISCUSSION
Standard practice in molecular epidemiology inference is to consider only sequences from a local or regional epidemic or to include a small subset of preselected reference sequences [30,31]. By expanding our analysis to all previously published sequences, we analyzed our local epidemic within the context of the global HIV-1 pandemic. Twenty-five percent of 4553 NCC sequences were in a cluster with 1 or more previously published sequences, and more than two thirds of these (17.4% of the total) had a direct link, defined as genetic distance ≤1.5%, to 1 or more previously published sequences. Approximately one half of these linked sequences (8.0% of the total) were linked to 1 or more non-US sequences. Subtype B accounted for 95.5% of the NCC sequences. The largest subtype B cluster contained many NCC sequences consistent with the fact that this lineage traces its origin to the United States [26]. Although most of the other large subtype B clusters were dominated by US sequences, the second largest subtype B cluster, which contained 2 NCC sequences, traced its origin to Japan.
Non-B subtypes accounted for 4.5% of the NCC sequences, and, unsurprisingly, these were more likely to cluster with a non-US sequence. The CRF07_BC cluster, which originated in East Asia, contained 3 NCC sequences, each of which appeared to represent a separate NCC introduction. The subtype A6 cluster, which originated in countries of the former Soviet Union, contained 4 NCC sequences, each of which also appeared to represent separate NCC introductions. In contrast, the largest CRF01_AE cluster, which was dominated by sequences from East and Southeast Asia, contained 13 NCC sequences, including 2 lineages, each containing 3 sequences, consistent with instances of ongoing local transmission. Although 2 of the 3 NCC individuals in the CRF07_BC cluster and 8 of the 13 NCC individuals in the largest CRF01_AE cluster were of Asian ethnicity, the countries of origin of NCC individuals were not available. The absence of data on country of origin is a limitation of our study that makes it impossible to distinguish between immigration and local dissemination as a cause of non-subtype B diversity in our cohort.
Approximately 1% of NCC sequences shared 1 or more established DRMs with a previously published sequence. The vast majority were subtype B sequences from other persons in the United States, consistent with the high prevalence of transmitted drug resistance (TDR) in the United States. However, the proportion of NCC TDR sequences linked to TDR sequences outside of the NCC was much lower than the 37% of NCC TDR sequences previously reported to link to 1 or more other NCC TDR sequences [12].
Despite a statistically significant positive correlation between the genetic distance between sequences and the length of time separating their isolation dates, there were many exceptions to this trend. Indeed, even sequences that differ from one another by ≤0.5% were often obtained several years apart, demonstrating that the presence of highly similar sequences between a pair of individuals should not be interpreted as evidence for direct transmission. Hypotheses about direct transmission make sense only when sequence analysis is combined with contact tracing [32][33][34].
As the number of sequences in transmission network analyses have risen, it has become increasingly common to first identify clusters using a rapid distance-based approach and to then use phylogenetic analysis to study individual clusters [19,35]. Because of how HIV-TRACE defines clusters, adding more sequences to the sample set can create new clusters, merge existing clusters, or add sequences to existing clusters. In contrast, clusters that are defined phylogenetically may disappear entirely with the addition of more sequence data, because, for example, the bootstrap support for subtending branches is degraded [36].
Including the totality of publicly available sequences in our analysis had several measurable and important benefits. First, by comparing NCC with previously published sequences, an additional 429 NCC sequences (9.4% of the total) were brought into the network, increasing the total number of clustered NCC sequences to 2391 or 52.5% of the total. This rate is similar to the rate of 52.3% reported for a smaller and intensively sampled cohort from San Diego [37], and it is notably higher than the rate of 32% reported from US national surveillance data [38]. In addition, the placement of individual sequences in the clusters was altered by the inclusion of previously published sequences, with NCC sequences becoming peripheral both in the large non-subtype B clusters displaying patterns of exponential growth and in many of the smaller clusters. Because HIV transmission dynamics can be strongly influenced by central nodes in transmission networks (and that there are relatively few of them), being able to more accurately identify such nodes is also important [39].

CONCLUSIONS
In conclusion, the NCC epidemic is dominated by subtype B viruses linked primarily to other NCC sequences. However, the largest subtype B cluster, which retained links to the earliest North American subtype B strains, contained large numbers of links to persons in other parts of the United States, Asia, and Europe. Several sequences belonged to well characterized international clusters responsible for rapidly expanding regional epidemics. However, despite their rapid regional spread, these viruses do not appear to have made significant inroads into the NCC, suggesting that HIV-1 transmission is influenced more by human behavioral factors than intrinsic viral factors. Continued monitoring of local epidemics in the context of global sequences will advance a better understanding the past, present, and potentially future courses of regional epidemics and support public health measures to optimally target prevention efforts.

Supplementary Data
Supplementary materials are available at Open Forum Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.