Motivation: HIV-1 antiretroviral drug resistance testing produces large amounts of HIV-1 protease and reverse transcriptase sequences. These provide an excellent resource to study the incidence, spread and clinical significance of HIV-1 subtypes. We have produced a program, Subtype Analyser (STAR) that rapidly and accurately subtypes HIV-1. Here we have determined a robust and statistically validated model for subtype assignment.
Results: We have significantly extended our HIV-1 subtyping tool (STAR), such that each query sequence when evaluated against subtype profile alignments, returns a discriminating score based on the ratio of subtype positive to negative amino acid positions. These scores were transformed into a Z-score distribution and evaluated. Of the 141 sequences used to define the subtype alignments, 98% were correctly reclassified. Inclusion of additional recombination detection within STAR increased the detection of known recombinant sequences to 95%.
Availability: STAR is available as compiled (Linux Fedora 3) or source code from http://pgv19.virol.ucl.ac.uk/download/star_linux.tar
Human immunodeficiency virus type-1 (HIV-1) is the retroviral agent that causes acquired immune deficiency syndrome (AIDS). As of December 2004 there were an estimated 40 million people living with HIV/AIDS, with 5 million new infections and 3 million deaths in 2004. One defining component of the retroviral life cycle is the reverse transcription of the viral RNA genome into DNA. The enzyme responsible for this, reverse transcriptase (RT), has a low level of fidelity resulting in an error rate of between 1/1000 and 1/10000 nucleotide misincorporations (Roberts et al., 1988; Takeuchi et al., 1988). Given a 9000–10 000 base pair genome, this indicates that progeny viruses contain between 1 and 10 polymorphisms per genome per round of replication. HIV-1 replicates to high levels in individual untreated patients (Ho et al., 1989), which coupled with the size of the HIV-1 pandemic results in viral quasi-species of HIV-1 within an infected individual and a global distribution of viruses that are genetically divergent (Wain-Hobson, 1993).
HIV-1 genetic diversity has been stratified and a subtype nomenclature based on sequence variation has been described. Initial HIV-1 phylogenies were inferred from viral envelope sequences (env) [subtypes A–F (Myers et al., 1992)], which were consistent when extended to group antigen sequences (gag) [except subtype E (Louwagie et al., 1993)]. Improvements in molecular biology, computational power and phylogenetic algorithms have subsequently allowed subtyping of the full length HIV-1 genome (Salminen et al., 1995; Salminen et al., 1996). Currently, there are 11 single subtypes (A–H, J, N, O) and a variety of circulating recombinant forms (CRF) of HIV-1 that represent recombination events between two or more single subtypes (Robertson et al., 1995; Robertson et al., 2000). The Los Alamos HIV-1 Sequence Database stores updated reference subtype sequences and provides phylogenetic tools, such as SUDI, that allow novel subtypes to be evaluated (http://hiv-web.lanl.gov/content/hiv-db/SUDI/sudi.html).
Although HIV-1 sequences and consequently subtypes evolve, HIV-1 subtyping allows clinical aspects of HIV-1 infection and management to be studied in the context of similar sequences. Web-based HIV-1 subtyping tools are available at both the NCBI (Rozanov et al., 2004) and at the Stanford HIV Drug Resistance Database http://hivdb2.stanford.edu/asi/deployed/hiv_central.pl?program=hivseq),both of which use reference datasets from Los Alamos. The NCBI subtyping tool uses the BLAST algorithm along a sliding window to compare HIV nucleotide sequences with reference subtype sequences, whereas the Stanford tool calculates a genetic distance between the test sequence and reference sequences. Neither method attempts to assign any confidence to the subtype prediction; both require nucleic acid as input and neither allow batch submission of sequences.
We have described previously the program STAR (Gale et al., 2004), which uses a region of HIV-1 polymerase gene (Pol) [99 amino acids of protease (PR), 340 amino acids of RT] to subtype sequences relative to a large set of accurately defined reference sequences. This region of HIV-1 is routinely used for drug resistance profiling thereby allowing subtyping of clinical sequences evaluated for drug resistance. STAR was designed to use amino acid sequences rather than nucleic acid, to allow retrospective subtyping of sequences in databases where information is stored as amino acid changes relative to a known reference strain. STAR utilizes position-specific scoring matrices (PSSMs) of each HIV subtype to perform profile subtype alignments. The frequency of amino acids at each position in the Pol fragment for a given subtype were compared against a sequence of unknown subtype with the maximum sum of the frequencies identifying the subtype of the unknown sequence. Any sequence with a score within 1.5 standard deviations from the mean of the reference sequence subtype score was used to highlight sequences that were difficult to classify or were putative recombinant subtypes. This scoring system was derived empirically and as such STAR required a more accurate means of assessing how well a query sequence fitted the data model. Here we describe an improved scoring mechanism for STAR allowing the statistical assignment of single subtype, recombinant subtype and unclassified subtypes.
2 SYSTEMS AND METHODS
A dataset of 174 sequences was obtained from GenBank and The Los Alamos Database. A dataset of 174 sequences comprising full-length HIV-1 genomes with annotated Gag, Pol and Env proteins was obtained from GenBank. Although the number of fully annotated HIV-1 genomes continues to increase, this dataset encompassed all single subtypes and the majority of CRFs. Of these 174 sequences, 65 were identified as Los Alamos HIV-1 subtype reference sequences (Supplementary Table 1). A dataset of 141 sequences were partitioned into 11 subtype alignments by assessing phylogenies derived from sequence identity/linkage, neighbour joining and maximum parsimony methods. The 33 sequences obtained with the original dataset were defined as recombinant or unclassified. This dataset was used to seed STAR, providing sufficient subtype coverage and allowing complex repetitive permutations of STAR program parameters. An additional 813 clinically derived sequences were obtained from the Health Protection Agency (HPA) Laboratories in the United Kingdom.
2.1 Scoring function
A multiple alignment of 141 HIV-1 Pol fragment sequences was performed using ClustalW (version 1.8) (Higgins and Sharp, 1988; Thompson et al., 1994). Owing to the high level of sequence conservation within HIV-1 Pol, it was possible to generate an alignment with only two gapped positions. In each case gaps were introduced as a result of single sequences. The positions that introduced gaps were excised from the overall alignment as they provided no phylogenetic or subtype determining information. The 11 subtype defined alignments were extracted, such that each subtype alignment corresponded with the structure of the complete 141 sequence alignment. Subtype alignments were converted into normalized positional amino acid frequency profiles and compared against a similarly normalized frequency profile of the complete dataset. Alignment of a query sequence of unknown subtype was performed using dynamic programming against a template sequence derived from a complete subtype consensus sequence. This, therefore, represented the alignment structure of the reference datasets. In this way the start and stop of the Pol fragment were determined, deletions in the query sequence (uncommon) could be gapped and insertions (usually owing to resistance mutations and therefore not subtype specific) could beexcised.
Each amino acid in the test sequence was compared with the odds ratio (subtype/complete dataset) of that amino acid at the corresponding position in each of the 11 subtype profile alignments. Odds ratio scores >1 indicated a positive discriminant at that position for the given subtype alignment, <1 indicated a negative discriminant. The sums of the number of positive and negative discriminants along the length of the sequence, expressed as a ratio (positive discriminant/negative discriminant) yielded an overall discriminant odds ratio (DOR) score for the test sequence against each subtype alignment. The subtype alignment with the largest DOR score indicated the subtype that the query sequence most closely resembled. The odds ratio scores at each amino acid position were simply counted as positive and negative discriminants (rather than summed or multiplied) to prevent overprediction of subtype by a few high scoring positions. To remove noise around a neutral odds ratio score of 1, positive and negative discriminant assigning thresholds were established (equations are detailed in Supplementary information). These meant that only positions with an odds ratio score above a positive or below a negative discriminant threshold were counted and included the overall discriminant ratio (Fig. 1).
To investigate the relationship between the level of positive and negative discriminant thresholds and their effect on the DOR score, the threshold levels were varied within ranges of 1.1–1.9/0.9–0.1 (positive/negative discriminant threshold) (Fig. 1). The 141 sequences used to define the 11-subtype alignments were reclassified using these varying discriminant thresholds, with the query sequence being removed from the subtype alignments prior to calculating the subtype odds ratio scores. Assessment of the DOR score for each of the 141 sequences showed the average highest score from the 141 member sequence set varied predictably; high negative discriminant (ND) and low positive discriminant (PD) thresholds generated higher summed scores and the converse producing lower scores, respectively (Fig. 2a).
As the subtype of the 141 sequences was known, the effect of varying discriminant thresholds could also be evaluated in terms of the number of sequences that were erroneously classified (Fig. 2b). Classification was adjudged incorrect if the highest scoring subtype alignment did not match the known subtype of the sequence. The analysis showed little overall misclassification with the highest level being 4–5% misclassification (6–8 sequences). More importantly, this analysis identified an island of discriminant threshold values between 1.5 and 1.6 (PD) and 0.4 and 0.5 (ND) that were optimal for classification. Using these discriminant thresholds resulted in correct classification of all except two sequences (98.6%).
2.2 Assessment of score
Despite the identification of the optimal discriminant thresholds there was no clear relationship between the magnitude of the DOR score from the maximal subtype alignment and likelihood of a correct subtype reclassification (data not shown). Rather than considering the maximal subtype ratio alone, the highest scoring subtype alignment (per query sequence) can be considered as the extremity of the distribution of the scores generated for the set of 11-subtype alignments. This was characterized using a Z-score transformation resulting in distribution with a mean of 0 and a standard deviation of 1 for the 11-point distribution of subtype scores (per query sequence). The same process of reclassification of the 141 sequences following prior removal from the subtype alignment was used to investigate Z-score transformation. The maximum score per sequence from the 11 possible scores was always >2. Analysis of Z-score frequency in terms of the percentage of sequences above a given Z-score at varying discriminant thresholds (analogous to Fig. 2a) was used to confirm the optimum discriminant threshold levels. This established a Z-score value that maximized correct classification of query sequence subtype (Fig. 2c).
As expected when the maximum Z-score threshold was reduced (Z-score 2.25, Fig. 2c), the overall percentage of sequences above the threshold was increased. At the lowest Z-score (2.25) a broad range of discriminant thresholds (PD range 1.4–1.9/ND range 0.5–0.3) resulted in >98% of sequences being above the Z-score threshold. At a Z-score of 2.5, only one pair of discriminant thresholds (PD 1.6/ND 0.5) resulted in >98% of sequences with Z-scores higher than the threshold. This threshold combination also correlated with the group of discriminant thresholds identified previously as resulting in only two erroneous subtype predictions (Fig. 2b). Again, as expected, a high Z-score threshold of 2.75 reduced the maximum percentage of sequences scoring above the threshold to 90–92%. In effect, a high Z-score threshold should have low coverage but high accuracy of subtype assignment with the converse being true of low Z-score thresholds.
Through this analysis we have identified discriminant thresholds of 1.6 (positive) and 0.5 (negative) that resulted in 139/141 sequences, matching the subtype alignment to which they had been assigned, with a maximal Z-score >2.5.
We investigated the distribution of the entire set of Z-scores for all query sequences (n = 1551) partitioning them into two sets, the top Z-score set (n = 141) and the remaining Z-score set (n = 10) (Fig. 3).
This analysis showed that there was excellent separation between the maximal Z-score (subtype predictive alignment) and the 10 other Z-scores (non-subtype predictive alignments). The two sequences that were not subtyped correctly produced maximal Z-scores <2.5. In these cases, the second highest scoring subtype alignment Z-scores were also relatively large (1.11, 1.51). This contrasts with the remaining 139 correctly assigned sequences with Z-scores >2.5, where the second highest scoring subtype PSSM scored <1. The Z-score based implementation of STAR (STAR-s) performed better than the original scoring threshold function of STAR (Gale et al., 2004). The number of sequences scoring above the respective thresholds increased from 93% with STAR (Gale et al., 2004) to 98% with STAR-s described here.
2.3 Assessment of recombinant or unclassified sequences
In cases where a query sequence returned a maximal Z-score below the threshold of 2.5, the subtype of that sequence could not be rigorously assigned. Unassigned query sequences were likely to be either recombinant sequences that contained regions of sequence similarity to more than one subtype alignment or sequences containing polymorphisms not defined within the subtype alignments. In cases where sequences previously defined within the Los Alamos database as recombinant [http://hiv-web.lanl.gov/content/index, (Robertson et al., 2000)] were analysed relative to single subtype sequences it was clear that, recombinant sequences generally produced low Z-scores (Fig. 4).
Two (of three) circulating recombinant forms with subtype B and subtype F recombinant regions (CRF12 BF) produced Z-scores >2.5; however, these sequences predominantly contain subtype B sequence with minimal subtype F recombinant regions (Carr et al., 2001). Consistent with the score distribution, when 814 clinical sequences were subtyped using the Z-score function (data not shown), the distribution of subtype B predicted sequences (maximum Z-score, subtype B PSSM) showed greater dispersion than the reference sequences used to define the subtype B alignment (Fig. 4).
Nevertheless, only 8.4% (n = 56) of clinical sequences predicted to be subtype B fall below the 2.5 Z-score threshold. This 8.4% of clinical sequences may contain polymorphisms that are not present within our initial subtype specific alignments or may suggest the presence of recombinant genomes.
To identify the recombinant genomes, all query sequences were reanalysed along the length of the sequence. Each query sequence was compared with the 11 HIV-1 subtype specific PSSMs; however the PSSMs were recalculated to give normalized percentage occurrence of each amino acid per position within the individual subtype alignments. The percentage occurrence of the amino acids present in the query sequence relative to a subtype PSSM were averaged within a sliding window of 50 and a step value of one amino acid. The subtype initially predicted for the query sequence provided a baseline score of 0 and the scores derived from the other 10 subtype PSSMs were compared with that baseline. The output of this analysis showed positions within the query sequence that were more similar to a different subtype rather than the predicted subtype. Summing the regions of each subtype trace from the first instance of a positive score to the first instance of a negative score (i.e. where the similarity to the alternative PSSM finishes), highlighted regions of the query sequence that were more similar to a subtype other than the predicted subtype (Fig. 5).
The recombinant regions within a query sequence were formally assessed by analysing their predicted length (if >50 amino acids) and the cumulative total percentage sequence identity difference relative to the initial predicted subtype. A ratio of summed percentage identity to sequence length was required to be >1 for the region to be considered recombinant. Assessment of the validity of mosaic recombinant architectures proved to be effective for simple dual subtype mosaic architectures (Fig. 5).
The main aim of recombination characterization was to separate reliable single subtype predictions from sequences which were less reliably subtyped. Therefore, query sequences were initially subtyped using the Z-score subtyping program and then reanalysed to detect the presence of recombination events. In the event of this second analysis phase detecting recombination, the initial Z-score was assigned a null value and the query sequence was reported as a recombinant with no statistical score.
A comparison of three implementations of STAR, the original STAR (STAR), the Z-score STAR (STAR-s), and the Z-score and recombination analysis STAR (STAR-rec) (Fig. 6a) showed that STAR-rec performed better than the other methods. STAR and STAR-rec correctly reclassify similar numbers of single subtype sequences when the method dependent confidence thresholds were applied (93 and 92% of 141 sequences, respectively). As described, STAR-s performs better during this reclassification (98%); however, both STAR-s and STAR potentially classify known recombinants as a single subtype. From 20 known recombinant sequences STAR and STAR-s assigned 13 and 8 sequences, respectively to a single subtype, scoring above the relevant threshold. However, STAR-rec identifies the majority of the known recombinant sequences (19 out of 20), although not significantly affecting the percentage reclassified known single subtype sequences (Fig. 6a). The receiver operated characteristic (ROC) (Fig. 6b) formally showed that STAR-rec had higher sensitivity and specificity that either STAR-s or the original STAR methods.
In summary, STAR-rec, combining Z-scoring and recombination analysis, provides a validated statistical framework for the accurate assignment of single HIV-1 subtypes while identifying potential recombinant sequences.
We have described here a validated scoring system for assigning HIV-1 sequence subtype to PR and RT sequences using the STAR program. This scoring system works effectively on HIV-1 sequences generated as a part of routine antiretroviral drug therapy monitoring. The scoring algorithm has been parameterized, such that each position in the test sequence was evaluated as being positively, negatively or neutrally associated with a given subtype alignment expressed as a PSSM. Optimal DOR score used to establish the association were determined, which coupled with an additional recombination analysis gave optimal performance in terms of predictive classification power, while minimizing false predictions. Hidden Markov models (HMM) were considered as an alternative to PSSMs; however, the hidden layers, use of a matched versus random score model and difficulty of detecting recombinant sequences within our implementation, rendered them less attractive than our PSSM model (addressed within Supplementary data, HMM).
STAR has been designed to allow rapid, accurate and user-friendly subtyping of HIV-1 sequences generated in a clinical context. It is our aim that clinical databases used to manage HIV-1 infected patients could incorporate subtype information produced from the sequences already used to monitor HIV-1 antiretroviral therapy resistance. In this context, STAR has advantages over currently existing subtyping methodologies. STAR does not require each query sequence to be submitted to a web-based subtyping tool and does not require the building or interpretation of phylogenetic trees. The use of amino acid based subtyping will allow retrospective subtyping of sequences already stored in databases as mutations from a subtype B consensus, which in some cases is the only sequence information available. Importantly, STAR also gives a statistical assessment of the accuracy of its subtype prediction.
The design of STAR as a clinical tool results in subtyping a relatively small fragment of the HIV-1 genome (1317 of the 9000 base pairs). Recombination events outside the Pol protein will not be detected. Recombination events at the 5′ or 3′ ends of the HIV-1 PR and RT fragment, resulting in a short region of one subtype and a longer region of a different subtype, also remain problematic. One of the three previously identified recombinant sequences CRF12 (B/F) was classified as subtype B with a Z-score >2.5, owing to the short nature of the F sequence in the region of HIV-1 examined. The remaining subtype recombinant sequences included in this analysis all generated a highest scoring subtype alignment below Z-score 2.5. It is envisaged that relative overprediction of sequences that appear recombinant is preferable to inaccurate prediction of genuinely recombinant sequences as confidently belonging to a single HIV-1 subtype. This will provide clean datasets of accurately subtyped sequences and a set of putative recombinant query sequences that will require subsequent downstream analysis by more sophisticated recombination analysis tools, such as LOHA (Bartosch et al., 2004) or RDP (Martin et al., 2005).
STAR was designed to partition HIV-1 polymerase amino acid sequences into subtypes defined by sequence variation, which complements, but does not replace HIV-1 phylogenetic analyses. This methodology is applicable to other datasets, which have been subgrouped by sequence variation. Full-length Hepatitis B virus (HBV) and HIV-1 nucleotide genomes have both been subtyped usingSTAR (unpublished data).
STAR is a tool that allows sutyping of a specific HIV-1 sequence resource (drug resistance monitoring) and as an aid to investigating the clinical implications of HIV-1 subtype. The majority of antiretroviral drug design and evaluation is undertaken using the HIV-1 subtype B virus, which predominates in both the US and western European epidemics. The influx of non-subtype B HIV-1 infected individuals into western Europe (Parry et al., 2001; Gale et al., 2004) and the increasing availability of therapy in Africa and Asia will generate large amounts of non-subtype B sequence data as a part of routine drug resistance screening. Accurate subtype prediction of this resource should, therefore, provide insights into the role of HIV-1 subtype and sequence variation, with respect to the clinical management and disease progression of non-subtype B infected individuals. Incorporating subtype information into HIV-1 patient management strategies will allow analysis of any potential subtype related differences in response to drug therapy, transmission and disease progression.
This work was funded by the Medical Research Council (MRC) (R.E.M.) and the Special Hospital Trustees of the Middlesex Hospital London (C.V.G.).
Conflict of Interest: none declared.