Improving HIV proteome annotation: new features of BioAfrica HIV Proteomics Resource

The Human Immunodeficiency Virus (HIV) is one of the pathogens that cause the greatest global concern, with approximately 35 million people currently infected with HIV. Extensive HIV research has been performed, generating a large amount of HIV and host genomic data. However, no effective vaccine that protects the host from HIV infection is available and HIV is still spreading at an alarming rate, despite effective antiretroviral (ARV) treatment. In order to develop effective therapies, we need to expand our knowledge of the interaction between HIV and host proteins. In contrast to virus proteins, which often rapidly evolve drug resistance mutations, the host proteins are essentially invariant within all humans. Thus, if we can identify the host proteins needed for virus replication, such as those involved in transporting viral proteins to the cell surface, we have a chance of interrupting viral replication. There is no proteome resource that summarizes this interaction, making research on this subject a difficult enterprise. In order to fill this gap in knowledge, we curated a resource that presents detailed annotation on the interaction between the HIV proteome and host proteins. Our resource was produced in collaboration with ViralZone and used manual curation techniques developed by UniProtKB/Swiss-Prot. Our new website also used previous annotations of the BioAfrica HIV-1 Proteome Resource, which has been accessed by approximately 10 000 unique users a year since its inception in 2005. The novel features include a dedicated new page for each HIV protein, a graphic display of its function and a section on its interaction with host proteins. Our new webpages also add information on the genomic location of each HIV protein and the position of ARV drug resistance mutations. Our improved BioAfrica HIV-1 Proteome Resource fills a gap in the current knowledge of biocuration. Database URL: http://www.bioafrica.net/proteomics/HIVproteome.html


Introduction
The objective of the BioAfrica HIV-1 Proteomics Resource is to provide accurate and comprehensive information on Human Immunodeficiency Virus (HIV) proteins. The first version of the resource was published in 2005 (1) and became popular across the scientific community. However, as with any resource, the information needs to be kept current and accurate. This is especially true in the fields of bioinformatics and proteomics, which have recently seen a massive increase in knowledge of both HIV-1 and host proteins as well as of the antiretroviral (ARV) drugs used for HIV treatment. In order to produce an accurate and comprehensive resource that complements other online protein resources, BioAfrica started to collaborate with the Swiss-Prot ViralZone group of the Swiss Institute of Bioinformatics (SIB).
The collaboration of ViralZone and BioAfrica was funded by the Swiss South Africa Joint Research Programme (SSAJRP) in order to allow much of the knowledge produced by their independent projects to be synergized and presented in a new online section of the Swiss and South Africa academic websites. One of our main goals was to add information on ARV drug resistance and to generate new knowledge of the pathogen and host interaction. We also aimed to represent current understanding in an innovative and interactive way with graphical images and with links to relevant protein databases. We applied biocuration methods developed by UniProtKB/Swiss-Prot, which included manual extraction and structuring of information from the literature, manual verification of results from computational analyses, and mining and integration of large-scale datasets. The resource now includes protein structure and function, gene expression, post-transcriptional and translational modification, protease cleavage sites, drug resistance information and HIV and host protein interactions. In this biocuration paper, we present the upgrade of the BioAfrica HIV Proteome Resource and its synergetic interaction with ViralZone/SIB.

Methods
The original BioAfrica HIV-1 Proteome Resource and ViralZone were used as a starting point for the biocuration process. A team of researchers from South Africa and Switzerland updated the information using manual curation in accordance with Swiss-Prot standards (2). This involved reading a large number of abstracts, identifying relevant publications by searching literature databases and reviewing related UniProt protein pages. Papers were read in full and important information was extracted and summarized as text and images. We also manually accessed information in protein databases. Our updated resource provides a webpage for each of the 22 HIV-1 proteins. Each webpage is divided into six sections. These are: (i) General Overview, (ii) Protein Function and Host-Virus Protein Interactions, (iii) Genomic Location and Protein Sequence, (iv) Protein Domains/Folds/Motifs, (v) HIV ARVs and Drug Resistance Mutations and (vi) Primary and Secondary Database Entries. All of the proteome pages end with a list of the referenced articles, which are linked to PubMed. Below we describe the methods used for each section.
(i) The General Overview contains information on the main function and localization of the protein in the HIV-1 replication cycle. This section describes: (a) the main function of the protein, (b) protein isoforms, (c) protein cleavage sites, when these exist, (d) localization of the protein within the virus and host cell and (e) additional information on protein function. The objective of this section is to allow users to access key information on the function of each HIV-1 protein as well as on key online resources. This was the last section to be curated as it summarizes information presented in the other sections of the webpage. It also provides key links to other online resources such as the ViralZone (2) HIV replication cycle, the Protein Data Bank (PDB), UniProt and GenBank.
(ii) The section on Protein Function and Host-Virus Protein Interactions was produced from a critical literature review of experimental and predicted host protein interaction for each HIV-1 protein. All interactions were manually verified for each protein sequence in UniProtKB. The interactions were summarized in graphical images that were created in Illustrator, following the standard process in ViralZone (2). One of the objectives was to create images that represented the function of the proteins and that used standard colours and shapes. Much of the work summarized in the images was produced by deep annotation made by the ViralZone group alone. The images are used on the ViralZone and BioAfrica websites. The dual display of this information allows synergy to be produced between the websites and information to be consistent across sites.
(iii) The Genomic Location and Protein Sequence section presents the genomic coordinates and amino acid sequence for each HIV-1 protein. The genomic location is presented as a graphical image that was produced by the Rega Subtyping Tool Version 3.0 (3). The graphical image and the position numbering follow the HIV-1 reference sequence (i.e. HXB2). The amino acid sequence of the protein is displayed and can be downloaded as a FASTA sequence.
(iv) The Protein Domains/Folds/Motifs section contains information from the analysis of the amino acid sequences of the reference protein by InterPro (4) and Prosite (5).
InterPro is a freely available database that is used to classify sequences into protein families and to predict the presence of important domains and sites. The link provided to InterPro contains information on the biological process and molecular function of the main domain in the protein.
For example, for HIV-1 protease, our resource links to the aspartic peptidase active site entry (IPR001969), which describes a wide family of proteolytic enzymes known to exist in vertebrates, fungi, plants and retroviruses. The Prosite motifs with a high probability of occurrence that are excluded from InterPro are also presented in the resource, such as N-glycosylation, N-myristoylation and protein kinase C phosphorylation site motifs.
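As an illustration of how a short Prosite motif of this kind can be located in a protein sequence, the following minimal sketch scans a sequence for the Prosite N-glycosylation pattern (PS00001, N-{P}-[ST]-{P}). It is not the procedure used to build the resource; the function name and the toy sequences are hypothetical, and a match only marks a candidate site, since such short patterns occur frequently by chance.

```python
import re

# PROSITE PS00001 (N-glycosylation site): N-{P}-[ST]-{P}
# i.e. Asn, any residue except Pro, Ser or Thr, any residue except Pro.
# The lookahead makes overlapping candidate sites visible as separate matches.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_n_glycosylation_sites(sequence):
    """Return (1-based start position, matched residues) for each candidate site.

    Hypothetical helper for illustration only.
    """
    seq = sequence.upper()
    return [(m.start() + 1, m.group(1)) for m in N_GLYCOSYLATION.finditer(seq)]


if __name__ == "__main__":
    # Toy sequence, not an HIV-1 protein.
    for position, motif in find_n_glycosylation_sites("ANGSAPNPTA"):
        print(position, motif)
```

Positions are reported with 1-based numbering, which matches the convention normally used for residue positions in protein annotation.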
(v) The HIV ARVs and Drug Resistance Mutations section is presented for the HIV-1 proteins that are targeted by ARVs. The section links information housed in ViralZone (2) and the Stanford HIV Drug Resistance Database (Stanford HIVDB) (6). The Stanford HIVDB contains detailed information on all of the licensed HIV drugs and how they interact with the HIV proteins (http://hivdb.stanford.edu). HIVDB also regularly provides an updated, in-depth, referenced summary of HIV drug resistance mutations. This is displayed in tables, with links to detailed information on how the mutations act on the HIV proteins in order to cause resistance (http://bioafrica.mrc.ac.za/hivdb/pages/download/resistanceMutations_hand Stanford HIVDB (6). We expect to continually update the mutation and ARV lists, as we currently collaborate with Stanford on the maintenance of the southern African Stanford HIVDB mirror (7).
(vi) The Primary and Secondary Database Entries section provides links to related databases within each protein page of the HIV-1 Proteome Resource. All of the links are provided to the HIV-1 reference genome (HXB2). All of the proteome pages end with a list of the referenced articles, which are linked to PubMed. All of the links provided are functional and a script has been developed to check them every quarter. All of the updated webpages in the resource kept the original BioAfrica HIV proteomics resource HTML links, because these webpages receive many incoming links. For example, all of the 22 webpages of the resource are cross-linked from UniProt in their HIV-1 reference sequence (i.e. HXB2) annotation page (e.g. http://www.uniprot.org/uniprot/P04608). The BioAfrica Proteome Resource is a part of the http://www.bioafrica.net website. All pages from the previous version of the BioAfrica Proteome Resource are still online. However, we modified their HTML address to include 'proteomics2005' (e.g. http://www.bioafrica.net//proteomics2005/POL-PRprot.html). This allows a comparison between the old and the newly upgraded resource (e.g. http://www.bioafrica.net/proteomics/POL-PRprot.html). In addition, Supplementary Figure 1 shows the old and upgraded protease pages.
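The link-checking script itself is not described here. The following is only a minimal sketch of how a periodic check of this kind could be written, assuming that links appear as absolute URLs in href attributes and that an HTTP status of 400 or above, or no response at all, counts as broken. All function names are hypothetical.

```python
import re
import urllib.error
import urllib.request

# Assumption: links of interest are absolute http(s) URLs inside href="...".
URL_PATTERN = re.compile(r'href="(https?://[^"]+)"')


def extract_links(html):
    """Collect absolute links in order of appearance, without duplicates."""
    links = []
    for url in URL_PATTERN.findall(html):
        if url not in links:
            links.append(url)
    return links


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, etc.


def find_broken_links(html, fetch=fetch_status):
    """Return (url, status) pairs for links that look broken."""
    broken = []
    for url in extract_links(html):
        status = fetch(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken
```

The fetch function is passed in as a parameter so that the logic can be exercised without network access. Scheduling the check every quarter would be left to an external scheduler such as cron.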

Results
Our results section starts by presenting information on the HIV-1 proteome (i.e. all HIV-1 proteins) and their interaction with host proteins. Each protein is also described in detail in the six sections of the manuscript, which mimic the sections of our online resource. We use the Env gp120/gp41 and Rev annotation pages as examples to present some of the results of the biocuration process. Other protein pages are available at http://www.bioafrica.net/proteomics/HIVproteome.html.
The HIV-1 genome is approximately 9.75 kb in length. It codes for only nine open reading frames (ORFs), including the Gag, Pol, Env, Tat, Rev, Nef, Vif, Vpr and Vpu genes. The HIV-1 genome is capable of producing 19 proteins via alternative splicing, alternative translation, ribosomal frameshift, alternative initiation and post-translational cleavages (8). Within the BioAfrica resource, there are 22 webpages dedicated to the 19 HIV-1 proteins and to the 3 polyproteins.

(i) General overview
Each HIV-1 protein page starts with a general overview. For example, the main function of the gp120 protein is viral attachment. This protein is an external membrane glycoprotein. It is localized at the host cell plasma membrane and virion envelope (more info at: http://bioafrica.net/proteomics/ENV-GP120prot.html and Figure 1). This section also contains a representative illustration of the protein in question. It also includes a list of links to key online resources such as ViralZone, the Protein Data Bank (PDB), UniProt, the HIV-1/Human Protein Interaction Database, the Los Alamos HIV Sequence Database and EMBL/GenBank/DDBJ.

(ii) Protein function and host-virus protein interactions
This section includes a detailed and well-annotated image illustrating the function of the protein in question within the host cell and its host protein interactions. For example, on the Env gp120 proteome webpage (http://bioafrica.net/proteomics/ENV-GP120prot.html and Figure 2), the human proteins CD4, CCR5 and CXCR4, which are HIV-1 entry receptors found at the cell membrane, are listed in the illustration and a link is provided to their UniProt pages. The gp120 webpage also links to a ViralZone webpage that describes in more detail the process of viral attachment to the host cell (http://viralzone.expasy.org/all_by_protein/3942.html). In addition, gp120 has been shown to interact with DC-SIGN/CD209 on the surface of dendritic cells to enhance virion transmission and infection. DC-SIGN also facilitates mucosal transmission by transporting HIV to lymphoid tissue (9,10).
Using the Rev protein page (http://bioafrica.net/proteomics/REVprot.html) as a second example, we show the Rev-mediated export of unspliced or incompletely spliced viral RNA transcripts from the host nucleus to the cytoplasm, facilitated by various Rev-host protein interactions (11-13). Host-virus protein interactions highlighted in the image on the webpage include CRM1/XPO1, Importin-beta 1, B23, DDX3X and Sam68. Importin-beta 1 (14) and B23 (15) form a complex with RanGTP and Rev to facilitate the transport of Rev from the host cell cytoplasm to the nucleus. Once inside the host nucleus, DDX1 binds to Rev and the Rev-responsive element (RRE) to facilitate their transport within the cell nucleus (16). Following this, CRM1, the Rev-RRE nuclear export receptor, is bound by RanGTP to form a CRM1-RanGTP complex. This induces the formation of a Rev-RRE-CRM1-RanGTP complex and initiates the export of Rev-RRE out of the nucleus (17). DDX3 (18) and Sam68 (19) bind to this complex and enhance the Rev-mediated nuclear export of viral RNA. Further host-Rev protein interactions not highlighted in the image include DDX5 and DDX24. It has been proposed that the Rev-DDX5 interaction plays a role in HIV-1 replication and that interfering with this association could reduce viral replication (20).
In addition, all host and virus proteins have been linked to their appropriate UniProt (http://www.uniprot.org/) pages. Together, the Protein Function and Host-Virus Protein Interactions section provides users with an illustrated description of the role of the HIV-1 protein within the virus life cycle as well as descriptions of host-virus protein interactions linked to relevant publications and resources. All interactions listed are supported by dozens of experiments, which were manually curated from the literature. Table 1 summarizes information on the host protein interactions for all HIV-1 proteins.

(iii) Genomic location and protein sequence
This section provides a graphical representation of the location of the HIV-1 protein sequence in question relative to the HIV-1 HXB2 reference genome (96). This is followed by the amino acid sequence data (FASTA format). For example, gp120 is a protein that contains 481 amino acids, with a molecular weight of 53 922 Da and a theoretical pI of 9.05 (Figure 3). This protein is formed after a 30 amino acid signal peptide is cleaved from the amino-terminal part of the Env protein.
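The tool used to derive the molecular weight quoted above is not specified here. As a minimal sketch of how such a value is commonly estimated from a FASTA record, the example below sums standard average residue masses and adds one water molecule for the free termini. The function names are hypothetical, and the result ignores glycosylation and other post-translational modifications.

```python
# Average residue masses in Da (amino acid minus one water molecule).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # added once for the free N- and C-termini


def read_fasta_sequence(text):
    """Return the sequence of a single-record FASTA string, without the header."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">")).upper()


def molecular_weight(sequence):
    """Estimate the average molecular weight (Da) of an unmodified polypeptide."""
    total = WATER_MASS
    for residue in sequence:
        if residue not in AVERAGE_RESIDUE_MASS:
            raise ValueError(f"Unsupported residue: {residue!r}")
        total += AVERAGE_RESIDUE_MASS[residue]
    return total


if __name__ == "__main__":
    # Toy record, not an HIV-1 sequence.
    record = ">toy_peptide\nGAG\n"
    print(round(molecular_weight(read_fasta_sequence(record)), 2))
```

Ambiguity codes such as X or B raise an error rather than being silently skipped, so that an incomplete sequence does not yield a misleadingly low weight.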

(iv) Protein domains/folds/motifs
As in the original version of the BioAfrica HIV-1 Proteomics Resource, the section on protein domains/folds/motifs includes information about the predicted motifs and structure of the protein as well as the protein functional domains (1). For example, gp120 has five variable loops (V1-V5). The V3 loop interacts with the CXCR4 and CCR5 chemokine receptors and is important for determining the preferential tropism for either T lymphocytes or primary macrophages (Figure 4). This section also includes information relating to protein secondary structure, low complexity regions, myristoylation, phosphorylation and glycosylation. For example, we list the highly conserved intrachain disulfide bonds between cysteine (Cys) residues Cys54-Cys74, Cys119-Cys205, Cys126-Cys196, etc. (http://bioafrica.net/proteomics/ENV-GP120prot.html).

(v) HIV ARVs and drug resistance mutations
The ARVs and Drug Resistance Mutations section is a new development in the BioAfrica Proteomics Resource. The section includes all ARVs targeting an HIV protein. The targeted proteins include protease (www.bioafrica.net/proteomics/POL-PRprot.html), reverse transcriptase (http://www.bioafrica.net/proteomics/POL-RTprot.html), integrase (http://www.bioafrica.net/proteomics/POL-INprot.html) and the envelope gp120/gp41 proteins (http://www.bioafrica.net/proteomics/ENVprot.html). For each of these protein pages, there is an additional section that provides users with an overview of the current ARVs. Information in this section also includes descriptions of the type and position of the mutation and relevant publications. Using Env as an example, [...] (Table 2). A further two potential mutations have been identified but need to be tested phenotypically. In addition to Env, there are ARVs and drug resistance sections on the reverse transcriptase, protease and integrase webpages. For example, the K103R mutation in reverse transcriptase affects the ARVs Nevirapine (NVP), Delavirdine (DLV) and Efavirenz (EFV) and reduces the virus susceptibility to these drugs (7,13). Furthermore, when in combination with the V179D mutation, K103R mutants can decrease HIV susceptibility to NVP, DLV and EFV by 15-fold (99). All drug resistance mutations are based on the HIV-1 subtype B reference sequence (HXB2); however, we also link from the resource to recent reviews that add information related to drug resistance in HIV-1 subtype C, which is the most prevalent HIV-1 strain in the world.
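Mutation labels such as K103R encode the reference residue, the position in the protein and the substituted residue. The sketch below shows one way to parse this notation and to test whether a query sequence carries a given substitution, assuming the query is already aligned to the reference numbering without insertions or deletions. The function names and the toy sequences are hypothetical.

```python
import re

# Reference residue, 1-based position, substituted residue (e.g. K103R).
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(label):
    """Split a label such as 'K103R' into ('K', 103, 'R')."""
    match = MUTATION_PATTERN.match(label.strip().upper())
    if match is None:
        raise ValueError(f"Not a simple substitution label: {label!r}")
    reference, position, substituted = match.groups()
    return reference, int(position), substituted


def carries_mutation(sequence, label):
    """Return True if sequence has the substituted residue at the labelled position.

    Assumes sequence uses the same numbering as the reference (no indels).
    """
    _, position, substituted = parse_mutation(label)
    if position < 1 or position > len(sequence):
        raise ValueError(f"Position {position} is outside the sequence")
    return sequence[position - 1].upper() == substituted


if __name__ == "__main__":
    print(parse_mutation("K103R"))
    # Toy three-residue sequences, not reverse transcriptase.
    print(carries_mutation("ARR", "K2R"), carries_mutation("AKR", "K2R"))
```

Real sequences frequently contain insertions and deletions relative to HXB2, so in practice a query would first need to be aligned to the reference before positions are compared.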

(vi) Primary and secondary database entries
The primary and secondary database entries section lists links to relevant online resources containing information about different aspects of the virus protein (Figure 5).
Options include specific databases that provide users with sequence, function and protein-protein interaction data for each HIV-1 protein, as well as protein family annotations and post-translational modification information. A graphical representation (PDB format) of the protein is also provided in this section with links to the Protein Data Bank entry. All of the proteome pages end with a list of the referenced articles, which are linked to PubMed. For example, for gp120 we provide 19 key references that were used in the curation process. The citations normally start with a link to the Los Alamos HIV Database Compendium and to the Retroviruses book, which are resources accessible online that contain detailed information about each protein. This is followed by the original publication on the function of each protein, which, in the case of the envelope, is a Nature publication from 1988 that describes how a glycoprotein of HIV-1 binds to the immunoglobulin-like domain of CD4 (100).

Discussion
Upgrading BioAfrica in collaboration with ViralZone and with the use of Swiss-Prot curation methods was a time-consuming but worthwhile undertaking. The process consisted of reading hundreds of manuscripts to critically review experimental and predicted data for each HIV protein as well as host proteins that interact with HIV. Curation included extracting and structuring information from the literature, manually verifying results from computational analyses and mining large-scale protein datasets. The process involved collaboration with professional curators from Switzerland and the training of South African researchers in biocuration. Furthermore, it provided synergy between BioAfrica and ViralZone information, which will allow users to access high-quality information that is available in two popular protein curation resources.
Prior to the upgrade of BioAfrica, the majority of resources only provided users with information about either the virus or the host proteins. In addition, no resource linked this information to ARVs and drug resistance mutations. Our online resource provides comprehensive detail about various aspects of each HIV-1 gene product. It now includes information about protein isoforms, localization, function, sequence data (based on the HIV-1 reference sequence HXB2), protein domains/folds/motifs and host and virus protein-protein interactions. We believe that easy access to well-curated and current information will advance HIV drug resistance and HIV vaccine research and will provide a better understanding of the interaction between the host and the virus.

Supplementary data
Supplementary data are available at Database Online.

Conflict of interest. None declared.