Genetic Diversity From Proviral DNA as a Proxy for Time Since HIV-1 Infection

Abstract HIV-1 RNA genetic diversity predicts time since infection, which is important for clinical care and research. It is unclear, however, whether proviral DNA genetic diversity sampled under suppressive antiretroviral therapy can be used for this purpose. We tested whether proviral genetic diversity from next-generation sequencing predicts time since infection and recency in 221 people with HIV-1 with known infection time. Proviral diversity was significantly associated with time since infection (P < 5×10⁻⁷, R² up to 25%) and predictive of treatment initiation during recent infection (area under the curve-receiver operating characteristic up to 0.85). This shows the utility of proviral genetic diversity as a proxy for time since infection.

diversity and proviral reservoir size. This has implications, for example, when deciding on simplifying antiretroviral therapy [1] or in investigations of immune responses [2]. However, estimating time since infection is often challenging because of the lack of a previous negative HIV test or of recall of unambiguous risk situations leading to an infection.
As HIV diversity increases with infection time, different diversity-based approaches have been developed for estimating time since infection and, in particular, for determining whether a PWH was infected recently (ie, less than 1 year previously). For example, Kouyos et al [3] used the ambiguous nucleotide frequency from Sanger sequences obtained in routine HIV drug resistance testing, and Carlisle et al and Puller et al [4,5] showed that an average pairwise diversity (APD) score based on next-generation sequencing (NGS) provides an even more accurate measure. In plasma virus-derived sequences from antiretroviral therapy (ART)-naive PWH, the APD score correlates well with time since infection and has a receiver operating characteristic (ROC) area under the curve (AUC) of over 95% for determining whether PWH were infected recently [4].
For a large number of PWH, the pre-ART sequences required for these approaches are not available. However, increasing numbers of PWH may have proviral DNA sequencing performed for research purposes, or to guide treatment simplifications or treatment with long-acting antiretrovirals [6]. Such proviral DNA sequences might in principle inform the time between infection and therapy initiation, as the diversity of the viral reservoir is expected to increase with the length of this time window and to stop increasing once ART has suppressed viral replication [7,8]. However, proviral diversity also differs in important ways from pre-ART viral diversity: it represents the diversity accumulated over the entire infection, and it may be affected by the decay of the reservoir and by hypermutations in proviral DNA caused by APOBEC3G/F [9].
As these differences may affect the association with, and prediction of, infection time, we evaluate in this study the utility of proviral sequences sampled post-ART as a proxy for the time between infection and ART. Given the role of APOBEC3G/F as a source of noise, we combine this approach with hypermutation filtering at the level of NGS reads.
[11]. These include PWH with a negative HIV-1 test within 1 year prior to the date of diagnosis and PWH with a clinical diagnosis of a documented primary HIV infection based on a comprehensive clinical assessment by a highly experienced research team. We determined the date of diagnosis as the earliest date of the following events: SHCS registration, first HIV-1 positive test, or first HIV-1 laboratory measurement. The date of infection was defined as described previously [3]: (1) for PWH in the ZPHI as the estimated date of infection, (2) for PWH with primary infections as the date of diagnosis minus 30 days (to account for incubation time), and (3) for all others as the midpoint between diagnosis date and last negative test. We selected proviral NGS sequences from those PWH without ART interruption or virological failure until the day of sampling. Samples were predominantly sequenced in a study that systematically sequenced the proviral DNA of all SHCS participants without HIV-RNA genotyping available [12]. We considered the length of 2 time windows for the analysis: the number of years from the date of infection until the date of ART start (t_InfectionToART), that is, time since infection, and the number of years from ART start until proviral NGS sequence sampling (t_ARTtoSampling).
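The three rules for the date of infection and the two time windows can be sketched as follows. This is an illustrative sketch only: the function and parameter names are hypothetical, and converting days to years by dividing by 365.25 is an assumption that the text does not specify.

```python
from datetime import date, timedelta

def estimate_infection_date(diagnosis, last_negative=None,
                            primary_infection=False, zphi_estimate=None):
    """Apply the three rules in order; all argument names are hypothetical."""
    if zphi_estimate is not None:
        # Rule 1: ZPHI participants use their estimated date of infection.
        return zphi_estimate
    if primary_infection:
        # Rule 2: documented primary infection, diagnosis minus 30 days.
        return diagnosis - timedelta(days=30)
    if last_negative is None:
        raise ValueError("a last negative test date is needed for rule 3")
    # Rule 3: midpoint between last negative test and diagnosis.
    return last_negative + (diagnosis - last_negative) / 2

def years_between(start, end):
    """Length of a time window in years (365.25 days per year is an assumption)."""
    return (end - start).days / 365.25
```

For example, `years_between(infection_date, art_start)` would give the time from infection to ART, and `years_between(art_start, sampling_date)` the time from ART start to sampling.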

NGS Sequencing
DNA was isolated from on average 5 million peripheral blood mononuclear cells, and proviral DNA was amplified by (1) near full-length polymerase chain reaction (PCR) followed by 2 nested hemilength PCRs [12]; (2) if unsuccessful, near full-length PCR followed by nested near full-length PCR; or (3) 2 hemilength PCRs amplifying a 5′ amplicon and a 3′ amplicon, followed by nested hemilength PCRs, as previously described [7]. NGS was performed for the near full-length HIV-1 genome using the MiSeq Reagent Kit version 2 (300-cycles). Majority consensus alignments were created from the NGS reads using SmaltAlign (https://github.com/medvir/SmaltAlign). From the majority consensus sequences, the respective genes (gag, pol, env) were extracted with BLAST, and codon alignments were made with the HIV-1 reference strain HXB2 using MACSE2 [13].

APOBEC Hypermutation Filtering
Hypermutation filtering was performed based on a previously published method [9,14]. We adapted this method to the level of single NGS reads, using 3 different P value thresholds to determine hypermutation status of a read and subsequent removal: (1) a constant threshold of P < .05; (2) a liberal dynamic threshold based on the bootstrapped upper 95% confidence interval (CI) of the mean from the hypermutation P value distribution of RNA sequences, randomly selected from the SHCS NGS database at the University Hospital Zurich, for each HIV-1 genome position (HXB2 as reference); and (3) a conservative dynamic threshold based on the bootstrapped upper 95% CI of the upper 90% percentile interval of the P value distribution of RNA sequences for each HIV-1 genome position (HXB2 as reference). Filters and their effect are shown as an example in Supplementary Figure 1. After filtering, we generated a new fastq file, reran SmaltAlign, and recalculated the APD.
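The liberal dynamic threshold can be illustrated with a minimal sketch. Several points are assumptions rather than the published procedure: the per-read hypermutation P values are taken as given inputs, the bootstrap is a simple percentile bootstrap, the "upper 95% CI" is read as the 97.5th percentile of the bootstrapped means, and all function names are hypothetical. How a read is matched to a position-specific threshold is not specified in the text and is left to the caller here.

```python
import random
from statistics import mean

def bootstrap_upper_bound_of_mean(p_values, n_boot=1000, level=0.975, seed=0):
    """Upper bound of a percentile-bootstrap CI for the mean of p_values.

    Assumption: the 'upper 95% CI' is the 97.5th percentile of bootstrapped means.
    """
    rng = random.Random(seed)
    n = len(p_values)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(mean(rng.choices(p_values, k=n)) for _ in range(n_boot))
    index = min(n_boot - 1, int(level * n_boot))
    return boot_means[index]

def liberal_thresholds(reference_p_values_by_position):
    """One threshold per HXB2 position, from P values of reference RNA sequences."""
    return {position: bootstrap_upper_bound_of_mean(p_values)
            for position, p_values in reference_p_values_by_position.items()}

def keep_read(read_p_value, threshold):
    """A read is removed as hypermutated when its P value is below the threshold."""
    return read_p_value >= threshold
```

With the constant threshold, the same check reduces to `keep_read(read_p_value, 0.05)`.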

Average Pairwise Diversity Score
We calculated the APD score as described by Carlisle et al and Puller et al [4,5] based on the third codon position of gag, pol, and env individually, both on the unfiltered NGS sequence reads and after applying each of the 3 hypermutation filters described above, with a coverage threshold of 100 reads for each position.
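A minimal sketch of a diversity score of this kind is given below. It assumes that per-site diversity is the probability that two randomly drawn reads differ at that site, that is, the sum of x(1 − x) over nucleotide frequencies x, averaged over third codon positions with sufficient coverage. The exact formula in [4,5] may differ (for example, by applying a minor-variant frequency cutoff), and the input layout and function names are hypothetical.

```python
def site_diversity(counts):
    """Probability that two randomly drawn reads differ at one site."""
    total = sum(counts.values())
    return sum((c / total) * (1 - c / total) for c in counts.values())

def apd_third_codon_positions(gene_counts, min_coverage=100):
    """Average diversity over third codon positions of one gene.

    gene_counts: list of per-position nucleotide count dicts, in frame,
    so that indices 2, 5, 8, ... are third codon positions (hypothetical layout).
    Positions covered by fewer than min_coverage reads are skipped.
    """
    diversities = []
    for index in range(2, len(gene_counts), 3):
        counts = gene_counts[index]
        if sum(counts.values()) >= min_coverage:
            diversities.append(site_diversity(counts))
    if not diversities:
        return None  # no position passed the coverage threshold
    return sum(diversities) / len(diversities)
```

Returning `None` when no position passes the coverage threshold keeps uninformative samples distinguishable from samples with zero diversity.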

Time From Infection to ART and Recent Infection Analysis
We used linear regression models to determine the fraction (R²) of the variance of t_InfectionToART (time since infection) explained by the APD score calculated on gag, pol, and env. We used ROC curve analysis to assess how well the APD score, calculated on gag, env, and pol separately and in combination, predicts recent infection for each of the hypermutation thresholds. We used 2 approaches: (1) partial length, including all NGS data sets comprising at least 100 codons of the respective gene (gag/pol/env); and (2) full length, including only NGS data sets covering nearly the entire gene, that is, > 95% of codons of the respective gene (gag/pol/env).
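For a single gene, the two summary statistics can be computed as in the following sketch. It assumes a simple linear regression with one predictor (where R² equals the squared Pearson correlation) and computes the AUC as the probability that a recently infected person has a lower APD than a non-recently infected person, with ties counted as one half. The function names are hypothetical, and the combination of several genes is not covered.

```python
def r_squared(apd_scores, times):
    """R^2 of a simple linear regression of time on APD (one predictor)."""
    n = len(apd_scores)
    mean_x = sum(apd_scores) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in apd_scores)
    syy = sum((y - mean_y) ** 2 for y in times)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(apd_scores, times))
    return sxy ** 2 / (sxx * syy)

def roc_auc(apd_scores, is_recent):
    """AUC for classifying recent infection from low APD scores."""
    recent = [s for s, r in zip(apd_scores, is_recent) if r]
    non_recent = [s for s, r in zip(apd_scores, is_recent) if not r]
    wins = 0.0
    for a in recent:
        for b in non_recent:
            if a < b:
                wins += 1.0   # recent case ranked correctly (lower diversity)
            elif a == b:
                wins += 0.5   # ties count as half
    return wins / (len(recent) * len(non_recent))
```

The pairwise AUC is quadratic in the number of samples, which is unproblematic for a few hundred participants.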
We found significant associations of APD with t_InfectionToART, but not with t_ARTtoSampling (Supplementary Table 2 and Supplementary Figures 2 and 3). Depending on the gene considered and the hypermutation-filtering threshold used, APD explained between 5% and 25% of the variance in t_InfectionToART (quantified as the R² in a linear regression model; Figure 2B and 2C), with the best performance (R² = 25%) obtained for pol full length and the dynamic conservative threshold. By contrast, APD explained only between 1% and 6% of the variance in t_ARTtoSampling (Supplementary Figures 4-6). Overall, across genes, hypermutation filtering increased the R² for t_InfectionToART, in particular for pol full length and env. For gag, however, R² was highest without any filtering (Figure 2C). When assessing the ability of APD to predict t_InfectionToART in leave-one-out cross-validation, we found the lowest mean absolute error (MAE) for pol with the dynamic conservative threshold and full length (MAE, 1.19 years), whereas the MAE was highest for env with the dynamic liberal threshold and full length (MAE, 2.19 years; Supplementary Table 3).
e632 • JID 2024:230 (15 September) • BRIEF REPORT
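Leave-one-out cross-validation of such a model can be sketched as follows: each participant is held out in turn, a simple linear regression of time from infection to ART on APD is fitted to the remaining participants, and the absolute prediction error for the held-out participant is recorded. The single-predictor model and the function names are assumptions made for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loocv_mae(apd_scores, times):
    """Mean absolute error of leave-one-out predictions of time from APD."""
    apd_scores, times = list(apd_scores), list(times)
    errors = []
    for i in range(len(apd_scores)):
        # Fit on everyone except participant i.
        slope, intercept = fit_line(apd_scores[:i] + apd_scores[i + 1:],
                                    times[:i] + times[i + 1:])
        predicted = intercept + slope * apd_scores[i]
        errors.append(abs(times[i] - predicted))
    return sum(errors) / len(errors)
```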

DISCUSSION
In this work we showed that a diversity score derived from proviral DNA HIV-1 NGS sequences from individuals on suppressive ART is associated with the time since infection (t_InfectionToART) and recent infection status. Its predictive accuracy is lower than that of viral diversity derived from plasma HIV-1 RNA [4], in particular when partial sequences are included. However, when restricting the analysis to full-length sequences and applying hypermutation filtering, predictive performances are in the range of what is achieved with treatment-naive plasma RNA for pol/env (AUC of 0.84/0.85 for proviral DNA compared to ≥0.95 for viral RNA). For gag, hypermutation filtering showed no improvements, which may be explained by the lower G→A substitution rates in gag [15]. The performance increase comparing partial pol to the entire pol gene is striking (Figure 2C). This may be explained by the absence of pol positions 3000 to 4000 in almost 50% of sequences (Supplementary Figure 9), which were previously shown to have the highest predictability for time since infection [5]. Finally, we show that the APD has only minor associations with t_ARTtoSampling, confirming our assumption and previous evidence [7] that there is almost no viral evolution under suppressive ART.
The main limitation of this work is the small number of recovered gene sequences, which is most likely due to low reservoir sizes in early treated PWH [1]. It may also result from the low specificity of the hypermutation filtering and the subsequent failure of NGS assembly due to a lack of reads. Another limitation is the between-sequence overlap in partial-length sequences, which may affect the comparability of APDs inferred from different regions within a gene. Further, we could not identify an overall optimal hypermutation filtering threshold across all genes. Nevertheless, we show improvements in both the explained variance and the AUC with hypermutation filtering compared to not filtering at all.