VectorBase (http://www.vectorbase.org) is a NIAID-supported bioinformatics resource for invertebrate vectors of human pathogens. It hosts data for nine genomes: mosquitoes (three Anopheles gambiae genomes, Aedes aegypti and Culex quinquefasciatus), tick (Ixodes scapularis), body louse (Pediculus humanus), kissing bug (Rhodnius prolixus) and tsetse fly (Glossina morsitans). Hosted data range from genomic features and expression data to population genetics and ontologies. We describe improvements and integration of new data that expand our taxonomic coverage. Releases are bi-monthly and include the delivery of preliminary data for emerging genomes. Frequent updates of the genome browser provide VectorBase users with increasing options for visualizing their own high-throughput data. One major development is a new population biology resource for storing genomic variations, insecticide resistance data and their associated metadata. It takes advantage of improved ontologies and controlled vocabularies. Combined, these new features ensure timely release of multiple types of data in the public domain while helping overcome the bottlenecks of bioinformatics and annotation by engaging with our user community.
VectorBase is a NIAID-funded Bioinformatics Resource Center (BRC) (1), which focuses on arthropod vectors of human pathogens. Our mission is to support the vector research community by providing access to genome assemblies, genome annotations and high-throughput data. VectorBase is involved in capturing community gene annotations, storing microarray expression studies and more recently population biology data. The collection of experimental and sample-related metadata has been aided through our development of ontologies and controlled vocabularies for vector-specific data, such as field-associated samples, pathogen transmission and insecticide resistance. VectorBase currently hosts nine genomes of which the majority are mosquitoes, reflecting their importance in disease agent transmission. The seven corresponding species are: Anopheles gambiae (three genomes, for the PEST, Mali-NIH and Pimperena colonies), Aedes aegypti, Culex quinquefasciatus, Glossina morsitans, Ixodes scapularis, Pediculus humanus and Rhodnius prolixus. We anticipate hosting genome clusters for a broader group of Anopheline mosquitoes, ticks and other important vector genera such as Glossina and Simulium. Full details about current and future genomes to be hosted by VectorBase can be found at http://www.vectorbase.org/organisms. Here, we highlight improvements and new features, and discuss genomes integrated since the last update (2). All information and data are available from our website at http://www.vectorbase.org.
Release cycles and early release of emerging genomes
VectorBase now releases data and software updates on a bi-monthly cycle; these include genome browser improvements inherited from the Ensembl project (3). Recent browser additions include tools for visualizing user data sources: read coverage plots from high-throughput mRNA-sequencing experiments [BAM (4) and WIG, http://genome.ucsc.edu/FAQ/FAQformat.html], gene models (GFF3, http://www.sequenceontology.org/gff3.shtml) and population resequencing/variation data sets [VCF (5)] (Figure 1). Searching and selection of evidence tracks have been simplified, and genome-based views can be customized to a greater degree.
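As an illustration of the gene-model format mentioned above, the following minimal sketch parses a single GFF3 feature line (nine tab-separated columns, with URL-escaped `key=value` pairs in the last column). It is not VectorBase or Ensembl code; the class name, function name and the example record are assumptions made for illustration only.

```python
from dataclasses import dataclass
from urllib.parse import unquote


@dataclass
class Gff3Feature:
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str
    attributes: dict


def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse one GFF3 feature line (nine tab-separated columns)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    seqid, source, ftype, start, end, _score, strand, _phase, attr_text = cols
    attributes = {}
    if attr_text != ".":
        # Column 9 holds semicolon-separated key=value pairs.
        for pair in attr_text.split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)
    return Gff3Feature(unquote(seqid), source, ftype,
                       int(start), int(end), strand, attributes)


# Hypothetical record; the identifiers are invented for illustration.
example = "chrX\tcommunity\tgene\t1200\t3400\t.\t+\t.\tID=gene0001;Name=example%20gene"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["Name"])
```

A user preparing a custom track could run a check of this kind before upload to confirm that coordinates are integers and that every feature carries the expected attributes.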
To make emerging genome sequences rapidly available to our communities, we have recently introduced preliminary sites, called pre-sites, for newly assembled genomes. These contain temporary, unarchived automated gene predictions and transcriptome and proteome alignments. These pre-sites improve vector community involvement during initial analysis, including highly valued community-aided annotation. Once an annotation is finalized, additional analyses are performed such as our standard orthology/paralogy relationship predictions (6) and cross-referencing to other resources. This system was trialled for the R. prolixus and G. morsitans genomes.
Integration of community data
VectorBase has a mandate to capture community annotations. Community appraisal of the reference genome annotations has been important to assess automatic gene predictions and ensure correct models for many gene families as part of the initial genome publication (7) and subsequent analyses (8). Most current annotation data correspond to specific genes and/or gene families and are provided by community members through a simple spreadsheet submitted to our Community Annotation Pipeline. Integration of these data with existing gene sets has greatly improved reference gene sets (e.g. An. gambiae) and has led to a new ‘patch’ build system that uses heuristics to merge manual and automated gene predictions, allowing more frequent gene set updates. Patch builds for three species (Ae. aegypti, C. quinquefasciatus and I. scapularis) were performed in 2011. To ensure timely release of community-sourced annotations, all community manual annotation data are made available as a Distributed Annotation System track within the genome browser (9). These data include corrections of gene structures and relevant metadata such as gene symbols and citations.
Community-generated transcriptome data from newer sequencing technologies, known as RNA-Seq, are also increasingly being produced for VectorBase species. We have been using these data to validate existing gene models and predict new ones. Alignment programs such as TopHat (10) and GSNAP (11) for short reads, or GMAP (12) for long reads, were used to map reads to the assembly and identify splice junctions. Gene models were then reconstructed using Cufflinks (13) and a custom pipeline.
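The heuristics used by the ‘patch’ build are not detailed here. As a hedged illustration of the general idea only, the sketch below applies one simple, assumed rule: every manually annotated model is kept, and an automated prediction is retained only if no manual model overlaps it on the same sequence and strand. The names `GeneModel`, `overlaps` and `patch_merge`, and all identifiers and coordinates, are hypothetical.

```python
from typing import NamedTuple


class GeneModel(NamedTuple):
    gene_id: str
    seq: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str


def overlaps(a: GeneModel, b: GeneModel) -> bool:
    """True when two models share sequence, strand and at least one base."""
    return (a.seq == b.seq and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def patch_merge(automated, manual):
    """Keep every manual model; keep automated models that none of them overlap."""
    kept = [g for g in automated if not any(overlaps(g, m) for m in manual)]
    return sorted(kept + list(manual), key=lambda g: (g.seq, g.start, g.end))


# Hypothetical gene models, invented for illustration.
automated = [
    GeneModel("auto1", "2L", 1, 100, "+"),
    GeneModel("auto2", "2L", 500, 900, "+"),   # superseded by manual1
    GeneModel("auto3", "2L", 500, 900, "-"),   # opposite strand, retained
]
manual = [GeneModel("manual1", "2L", 450, 950, "+")]

merged = patch_merge(automated, manual)
print([g.gene_id for g in merged])
```

A production system would need further rules, for example for partial overlaps, split or merged genes and stable identifier tracking, none of which this sketch attempts.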
VectorBase has improved its text-based search facility by increasing the speed and the scope of the underlying engine. Search terms now include gene identifiers and descriptions, microarray experiments and expression data. Indices are regenerated for each release using the open source Apache Lucene technology (http://lucene.apache.org) and served using a web service. Information can be retrieved from the search box on the main site or from the genome browser; results contain hyperlinks to genes, their locations and, where appropriate, their paralogs/orthologs. A custom interface, CVSearch, has been developed to search (by keywords or identifiers) and browse ontologies and controlled vocabularies. More recently, we have used our GDAV open source tool (http://www.vectorbase.org/Help/GDAV) to provide access to available RNA-Seq data. For example, assembled RNA-Seq data for eight Anopheline species whose genome sequencing is in progress are already available for download or BLAST, and are searchable using keywords, gene identifiers or InterPro domains.
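Text search engines such as Lucene are built around an inverted index, which maps each term to the documents that contain it. The toy sketch below illustrates that principle in plain Python; it does not use the Lucene API, and the document identifiers and descriptions are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document identifiers that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return identifiers of documents containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical records standing in for gene descriptions and experiments.
documents = {
    "gene_a": "odorant receptor 1",
    "gene_b": "odorant binding protein",
    "expt_1": "microarray expression after blood meal",
}
index = build_index(documents)
print(sorted(search(index, "Odorant")))
print(sorted(search(index, "odorant receptor")))
```

Rebuilding such an index once per release, as described above, keeps query-time work down to a few set intersections.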
VectorBase continues to develop and maintain ontologies relating to control of disease vectors (14). Specifically, we host anatomy ontologies [TGMA for mosquitoes and TADS for ticks (15)] and a BFO-compliant ontology of insecticide resistance [MIRO (16)]. Our most recent ontology is an extension of the Infectious Disease Ontology (IDO) called IDOMAL (17), which is a comprehensive malaria-focused ontology with more than 2300 unique terms, including most terms related to the disease vector (e.g. vector control). All VectorBase ontologies strictly follow the rules established by the OBO Foundry (18), and can be browsed either at VectorBase or the NCBO Bioportal (http://bioportal.bioontology.org). These ontologies have also been deposited into the publicly accessible OBO Foundry (http://www.obofoundry.org).
Insecticide resistance data
IRbase is a dedicated section of VectorBase that hosts data from both published studies and recently analyzed data for field populations. It used to depend on our MIRO ontology but now relies on the newer IDOMAL ontology described above. We are in the process of incorporating these data into the population biology resource described in the next section.
As anticipated in our previous update (2), analyses of populations and variations at the genomic level have increased significantly. To accommodate these data sets, VectorBase has continued to improve its Ensembl-based genome browser for visualizing genomic variation data. As of 2011, the current resource contains data from the dbSNP database (19), variations derived from the An. gambiae Mali-NIH (M molecular form) and Pimperena (S molecular form) sequencing project (20), and genotypes obtained with the AgSNP01 SNP-array (21). We expect to increasingly use this functionality with the completion of a number of planned large-scale population sampling projects.
POPULATION GENOMICS RESOURCE
Integral to handling both genomic variations and insecticide resistance data is the capture of metadata, such as field collection locations and methods. The original IRbase (16) and more recent AgPopGenBase data from UC Davis/UCLA (http://www.vectorbase.org/PopulationData) were highly valuable but were not designed to store more diverse data types. To allow more flexibility, we developed a unified population biology resource that can store all of these data while linking to the genome browser when useful, e.g. high-throughput genotyping data from stored AgSNP01 chip hybridizations (21). This new resource currently contains just over 15 000 mosquito samples originating from over 1600 field collections and more than 34 000 phenotype/genotype assay results.
Population genomics database
We participated in the development of a Chado Natural Diversity Module (22) in collaboration with the GMOD consortium (http://gmod.org) and specific members (23–25). This module is an extension to the Chado database schema that stores population and variation data. The module has a simple, ontology-centred design, which allows the processing of data from a wide range of experiments by extending existing ontologies or adopting new ones.
The standard display methods provide a wide variety of options that can be customized by a submitter to best suit their data. By using an open web service and providing the visualization code under an open source license, we hope third-party displays will be developed and we will support these efforts through outreach and through VectorBase-hosted development mailing lists. As a concrete example, we have tested a number of visualizations that retrieve data from our resource and from the web service at EuPathDB (26). Other examples of this approach include the display of climatic, economic or human disease data. This functionality could enable co-analysis of vector and pathogen data of this kind.
Data can be submitted to the VectorBase Population Biology Resource via spreadsheet forms using open source tools to assist with formatting and ontology term selection (ISA-Tab (27) and Phenote, http://www.phenote.org). Genotypes are submitted to the variation resource in standard VCF format (5).
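To illustrate the genotype format referred to above, the following minimal sketch reads one VCF data line and translates each sample's `GT` subfield from allele indices to bases (0 for the reference allele, 1 and above for alternate alleles, `.` for a missing call). The function name, sample names and the record itself are hypothetical, and the sketch ignores header parsing and all other FORMAT subfields.

```python
import re


def parse_vcf_genotypes(line, sample_names):
    """Return (chrom, pos, {sample: [allele, ...]}) for one VCF data line.

    Allele indices in the GT subfield are translated to bases
    (0 = REF, 1.. = ALT alleles); a missing call '.' becomes None.
    """
    cols = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = cols[:5]
    fmt_keys = cols[8].split(":")
    alleles = [ref] + alt.split(",")
    gt_index = fmt_keys.index("GT")
    calls = {}
    for name, sample_field in zip(sample_names, cols[9:]):
        gt = sample_field.split(":")[gt_index]
        # '/' separates unphased alleles, '|' separates phased alleles.
        calls[name] = [None if a == "." else alleles[int(a)]
                       for a in re.split(r"[/|]", gt)]
    return chrom, int(pos), calls


# Hypothetical record with three samples, invented for illustration.
record = "2L\t1500\t.\tA\tG,T\t50\tPASS\t.\tGT:DP\t0/1:12\t2|2:8\t./.:0"
chrom, pos, calls = parse_vcf_genotypes(record, ["s1", "s2", "s3"])
print(chrom, pos, calls)
```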
EXPANDING THE TAXONOMIC COVERAGE OF VECTORBASE
The decreasing cost of genome sequencing has had radical effects on the scope of genome projects. Previously, VectorBase partnered with large-scale sequencing centres to generate annotation and support single representatives from important vector genera, e.g. An. gambiae for Anopheles and Ae. aegypti for Aedes. Projects using newer generation sequencing methodologies can deliver assemblies at a fraction of the cost and have expanded to encompass multiple species from each genus. NIAID/NHGRI has approved several of these genome clusters including 15 Anopheline genomes, 11 Simulium genomes, 5 Glossina genomes, 2 tick genomes (including the improvement of the I. scapularis assembly) and a mite genome. In total, these represent a 4-fold increase in the number of genomes stored in VectorBase.
VectorBase will support these expanded genome clusters using many of the features described in this update. Each project will produce other data types such as RNA-Seq and variation data through population sampling. VectorBase has also developed a new genome annotation pipeline to infer gene structures from closely related orthologs via whole-genome alignment techniques. Thus a single, high-quality reference annotation set can be used to rapidly predict genes in the other members of a genome cluster. The improvements in the storage and visualization of RNA-Seq and variation data will be invaluable for supporting and augmenting these new genomes for our users.
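The annotation pipeline itself is not specified here. As a simplified sketch of the underlying idea, the code below projects exon coordinates from a reference genome onto a related target genome through ungapped, same-strand alignment blocks. The block representation, function names and all coordinates are assumptions for illustration; real whole-genome alignments also involve gaps, inversions and rearrangements that this sketch does not handle.

```python
from typing import NamedTuple, Optional


class AlignedBlock(NamedTuple):
    ref_start: int     # 1-based, inclusive start on the reference
    target_start: int  # matching start on the target genome
    length: int        # ungapped block length


def project_position(pos: int, blocks) -> Optional[int]:
    """Map a reference coordinate to the target, or None if unaligned."""
    for block in blocks:
        offset = pos - block.ref_start
        if 0 <= offset < block.length:
            return block.target_start + offset
    return None


def project_exons(exons, blocks):
    """Project (start, end) exons; drop exons with an unaligned boundary."""
    projected = []
    for start, end in exons:
        new_start = project_position(start, blocks)
        new_end = project_position(end, blocks)
        if new_start is not None and new_end is not None:
            projected.append((new_start, new_end))
    return projected


# Hypothetical alignment blocks and reference exons.
blocks = [AlignedBlock(100, 1100, 200), AlignedBlock(400, 1350, 300)]
exons = [(120, 180), (450, 600), (320, 350)]  # the last exon is unaligned
print(project_exons(exons, blocks))
```

Projected models of this kind would normally be treated as candidates to be checked against other evidence, such as RNA-Seq alignments, rather than as final annotations.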
In this update, we described improvements to existing features and integration of new data. Two significant advancements are the development of a bi-monthly release and pre-sites, providing the latest data at an early stage of their analysis, thus ensuring high community involvement. VectorBase also assists the community with a helpdesk system, on-line help (FAQs, forum, tutorials) and outreach at conferences. Decreasing sequencing costs are producing a wealth of vector-focused genomics data and expanding the taxonomic coverage far beyond mosquitoes. Although a first cluster of 15 Anopheline genomes is being sequenced, three clusters of related non-mosquito vectors are next in line. Re-sequencing or sequencing of individuals from the same species for population genetics studies is also becoming more common. The future of vector genomics appears to be an expansion of both taxonomic coverage (breadth) and within-species re-sequencing (depth). By continuously improving its resources, as has been done in the past years, VectorBase is in a good position to meet this exciting challenge.
National Institutes of Health/National Institute of Allergy and Infectious Diseases (grant numbers HHSN266200400039C, HHSN272200900039C); partial support from: the Evimalar network of excellence (grant number 242095); INFRAVEC from the FP7 program of the European Commission (grant number 228421); Transmalariabloc from the FP7 program of the European Commission (grant number HEALTH-F3-2008-223736). Funding for open access charge: National Institutes of Health/National Institute of Allergy and Infectious Diseases [grant number HHSN272200900039C].
Conflict of interest statement. None declared.
We would like to acknowledge the reviewers for their useful comments and the many researchers that have provided data to our community resources (gene annotations, expression, variation data) and provided feedback.
As well as the authors listed above, the VectorBase Consortium is composed of: European Bioinformatics Institute, UK: Ewan Birney, Martin Hammond, Paul Kersey, Nick Langridge; Harvard University, USA: Kathy S. Campbell, Madeline Corby, David Emmert, William M. Gelbart, Pinglei Zhou; Imperial College London, UK: George K. Christophides, Fotis C. Kafatos; University of California – Davis, USA: Travis Collier, Gregory C. Lanzaro, Yoosook Lee, Charles E. Taylor; University of New Mexico, USA: Phillip Baker, Margaret Werner-Washburne; University of Notre Dame, USA: Nora J. Besansky, Ryan Butler, Rory Carmichael, David Cieslak, Nathan Konopinski, Andrew Thrasher, Gregory Madey and Frank H. Collins.