Ensembl Genomes 2013: scaling up access to genome-wide data

Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species. The project exploits and extends technologies for genome annotation, analysis and dissemination, developed in the context of the vertebrate-focused Ensembl project, and provides a complementary set of resources for non-vertebrate species through a consistent set of programmatic and interactive interfaces. These provide access to data including reference sequence, gene models, transcriptional data, polymorphisms and comparative analysis. This article provides an update to the previous publications about the resource, with a focus on recent developments. These include the addition of important new genomes (and related data sets) including crop plants, vectors of human disease and eukaryotic pathogens. In addition, the resource has scaled up its representation of bacterial genomes, and now includes the genomes of over 9000 bacteria. Specific extensions to the web and programmatic interfaces have been developed to support users in navigating these large data sets. Looking forward, analytic tools that allow targeted selection of data for visualization and download are likely to become increasingly important as the number of available genomes increases within all domains of life, and some of the challenges faced in representing bacterial data are likely to become commonplace for eukaryotes.

OVERVIEW AND ACCESS
Ensembl Genomes (http://www.ensemblgenomes.org) is organized as five sites, each focused on one of the traditional kingdoms of life: bacteria (specific URL http://bacteria.ensembl.org), protists, fungi, plants and (invertebrate) metazoa. Vertebrate metazoa are the focus of the Ensembl project (http://www.ensembl.org) (1); Ensembl Genomes provides a complementary set of interfaces for non-vertebrate species. Core data available for all species include genome sequence and annotations of protein-coding and non-coding genes; additional data include transcriptional data, polymorphisms and comparative analysis. Interactive access is provided through a web interface with genome browsing capabilities: users can scroll through a graphical representation of a DNA molecule at various levels of resolution, seeing the relative locations of features, including conceptual annotations [e.g. genes, single nucleotide polymorphism (SNP) loci], sequence patterns (e.g. repeats) and experimental data that support the primary annotations (e.g. sequences and external sequence features mapped onto the genome). Functional information is provided through direct curation, import from the UniProt Knowledgebase (2) or imputation from protein sequence [using the classification tool InterProScan (3)]. Users can download much of the data available on each page in a variety of formats, and tools exist for upload of (various types of) user data, allowing users to see their own annotation in the context of the reference sequence. DNA- and protein-based sequence searches are also available.
The data are stored in a set of MySQL databases using the same schemas as those in use for the Ensembl project. Direct access to these is provided through a public MySQL server (mysql.ebi.ac.uk:4157; user 'anonymous') and additionally through well-developed Application Programming Interfaces (APIs) that provide an object-oriented framework for working with the data. Database dumps and common data sets (e.g. DNA, RNA and protein sequence sets and sequence alignments) can be directly downloaded in bulk via file transfer protocol (ftp://ftp.ensemblgenomes.org).
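As a minimal sketch of direct database access, the following Python snippet collects the connection settings quoted above and shows how a query might be issued against the public server. The `pymysql` driver, the class and function names, and the use of `SHOW DATABASES` are assumptions made for illustration; they are not part of the resource, and database names should be checked against the live server.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServerConfig:
    """Connection settings for the public MySQL server quoted in the text."""
    host: str = "mysql.ebi.ac.uk"
    port: int = 4157
    user: str = "anonymous"

    def cli_command(self) -> List[str]:
        # Equivalent invocation of the standard `mysql` command-line client.
        return [
            "mysql",
            f"--host={self.host}",
            f"--port={self.port}",
            f"--user={self.user}",
        ]


def list_databases(config: ServerConfig, pattern: str) -> List[str]:
    """Return the names of databases matching a SQL LIKE pattern.

    Needs network access and the third-party `pymysql` package (an
    assumption; any MySQL client library could be substituted).
    """
    import pymysql  # imported lazily so the rest of the sketch runs without it

    conn = pymysql.connect(host=config.host, port=config.port, user=config.user)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW DATABASES LIKE %s", (pattern,))
            return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()
```

Calling `list_databases(ServerConfig(), "%core%")` would, under these assumptions, list the databases whose names contain "core"; the pattern itself is hypothetical.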
Ensembl Genomes data are also made available through a series of data warehouses, optimized around common (gene and SNP-centric) queries, using the BioMart data warehousing system (4). The BioMart framework provides a series of interfaces, including web-based query building tools, for each of the Ensembl Genomes (eukaryotic) domains (e.g. at http://plants.ensembl.org/biomart/martview) and a variety of other interfaces for interactive and programmatic access. BioMarts are not currently available for Ensembl Bacteria.
Ensembl Genomes is released 4-5 times a year, in synchrony with releases of Ensembl, using the same software as the corresponding Ensembl release. The overall suite of Ensembl Genomes interfaces mirrors the interfaces provided for vertebrate genomes in Ensembl, and allows users access to genomic data from across the tree of life in a consistent manner.

A COLLABORATIVE MODEL FOR GENOME-SCALE DATA
The Ensembl Genomes project is driven by a number of domain-specific collaborations, each with a scientific community with its own focus of interest. By working in partnership with us, communities can benefit from a robust infrastructure and the integration of their data within a comprehensive service. These collaborations take a number of forms. In some domains, we work with our partners to develop a community-centric service, aimed at each community's specific needs, but also mirror key data within the central Ensembl Genomes portal. Examples of such collaborations include VectorBase (http://www.vectorbase.org) (5), a resource for the genomes of invertebrate vectors of human pathogens; WormBase (http://www.wormbase.org) (6), which maintains resources for nematode genomes, especially the model species Caenorhabditis elegans; PomBase (http://www.pombase.org) (7), the model organism database for the fission yeast Schizosaccharomyces pombe; and PhytoPath (http://www.phytopathdb.org), a resource for plant pathogens, with a focus on fungi and oomycetes. In other domains, we collaborate more broadly with other integrative centers, with a goal of developing high-quality networks of interlinked resources through the sharing of common reference data and standards for interoperability. In the context of Ensembl Plants, for example, we work closely with the Gramene database (http://www.gramene.org) (8) and a number of leading European plant genomics and informatics centers through the transPLANT project (http://www.transplantdb.eu). In addition, we contribute to many community-driven projects to sequence, assemble and annotate particular genomes, and make the resulting data available through the Ensembl Genomes site.
Ensembl Genomes prioritizes data for incorporation, according to scientific importance. The criteria for priority treatment are first, data relevant to our specific collaborations; second, data from other major experimental species; and third, data from other species that provide local or remote evolutionary context for the priority species, and which are used to strengthen the comparative analysis provided in the site. For the first category of genomes, we actively work with our collaborators to produce the primary community-recognized annotation. For the second category, we supplement the reference annotation (often maintained by model organism databases or other similar resources) with additional high-value data sets. For several species in these two categories, we have constructed variation databases, which store genotypes, loci and phenotypes from large-scale genome-wide array-based and resequencing studies, and have made the data available through specialized graphical views and an SNP-centric BioMart. Variation data are sourced from dbSNP (9) or Database of Genomic Variants archive (10) where available, or otherwise directly from the data producers. For the third category of genomes, annotation is generally incorporated from the original submitters with only limited enhancement (for example, the annotation of non-coding genes, if absent in the original submission).
At the time of writing, there have been 10 releases of Ensembl Genomes since the previous report was published in this journal (11). The current release is release 20, made public in September 2013. In this time, there has been a significant increase in the content of all five Ensembl Genomes sites.

Metazoa
Nineteen new genomes have been added, including the sponge Amphimedon queenslandica, the south and central American malarial mosquito Anopheles darlingi, the leafcutter ant Atta cephalotes, the silkworm Bombyx mori, the water flea Daphnia pulex, the Pacific oyster Crassostrea gigas, the owl limpet Lottia gigantea, the scuttle fly Megaselia scalaris, the centipede Strigamia maritima, the kissing bug Rhodnius prolixus, the red flour beetle Tribolium castaneum, the two-spotted spider mite Tetranychus urticae, two annelid worms, two butterflies and three nematodes. Additional variation data (12,13) have been introduced for Anopheles gambiae, and new DNA-based comparative analysis has been added for nematodes.

Plants
Twenty-two new genomes have been added, including Chinese cabbage (Brassica rapa), soy bean (Glycine max), barley (Hordeum vulgare), banana (Musa acuminata), barrel clover (Medicago truncatula), the club moss Selaginella moellendorffii, foxtail millet (Setaria italica), tomato (Solanum lycopersicum), potato (Solanum tuberosum), two species of rice, two diploid ancestors of hexaploid bread wheat and, as taxonomic outliers, two algal species. In addition, the preliminary genome assemblies, homeologous SNP calls and expressed sequence tag (EST) sequences available for bread wheat have been aligned to the genomes of barley and Brachypodium distachyon, and a sequence search has been implemented against the EST sequences that visualizes the results in the context of their alignments to these references. Variation databases have been provided for barley (14), maize (15), rice (16,17) and sorghum (18). The variation data set for Arabidopsis thaliana has been expanded to include additional data from the 1001 Genomes Project (19) and other work, including phenotypic data (20). Additional comparative alignments have been produced for cereal genomes.

Fungi
Twenty-four new genomes have been added, including 20 plant pathogens (Blumeria graminis, Botrytis cinerea, Fusarium oxysporum, Gaeumannomyces graminis, Glomerella graminicola, Leptosphaeria maculans, Melampsora larici-populina, Microbotryum violaceum, Nectria haematococca, Puccinia triticina, Sclerotinia sclerotiorum, Sporisorium reilianum, Trichoderma reesei, Ustilago maydis, two species of Gibberella, two species of Magnaporthe and two species of Pyrenophora). Other species added include the human pathogen Cryptococcus neoformans, the truffle Tuber melanosporum and two additional yeast species. RNA-seq alignments (to the genome) have been added for P. triticina; EST alignments have been added for Phaeosphaeria nodorum, S. pombe, T. melanosporum and Zymoseptoria tritici; and new comparative genomic alignments have been added for certain Pyrenophora and yeast species. For phytopathogenic fungal (and protist) species, information about genes affecting pathogenesis has been imported from the PHI-base database (21), and mutant and overexpression phenotypes are now represented in a color-coded form in the genome browser.

Protists
Eleven new genomes, including those of several important plant and human pathogens, have been added: Albugo laibachii, Entamoeba histolytica, Giardia lamblia, Guillardia theta, Hyaloperonospora arabidopsidis, Leishmania major, Paramecium tetraurelia, Pythium ultimum, Tetrahymena thermophila, Toxoplasma gondii and Trypanosoma brucei. New DNA alignments have been provided for the ciliates, the Peronosporales and the Trypanosomatidae. A variation database has been added for Phytophthora infestans.

Bacteria
Ensembl Bacteria has been comprehensively expanded since release 17. Although previously the bacterial division of Ensembl had focused on a small number of selected clades, the division now contains all bacterial genomes that have been completely sequenced, annotated and submitted to the International Nucleotide Sequence databases (European Nucleotide Archive, GenBank and the DNA Database of Japan) (22), a total of 9089 genomes in the latest release. Additional information is incorporated from the UniProtKB, InterPro, information about operons from RegulonDB (23) and about reaction catalysts from Microme (http://www.micromedb.eu). To ensure that data within this expanded set remain discoverable, two new species selection mechanisms have been introduced into the portal, one using autocomplete and the other providing a taxonomically structured interface (illustrated in Figure 1). The latter also enables the restriction of (sequence and text) search to user-defined taxonomic segments. Additionally, the Ensembl Perl API has been extended with a new lookup module, allowing users to discover genomes matching their specifications (e.g. full or partial name-match, taxonomic identifier, nucleotide sequence accession) programmatically. Within the browser, an improved representation of transcripts and translations, capable of providing a correct representation of bacterial features (i.e. polycistronic transcripts and alternative translational initiation) has been introduced.
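The kinds of criteria supported by such a lookup (full or partial name match, taxonomic identifier, sequence accession) can be illustrated with a small in-memory sketch. This is not the Ensembl Perl API: the `GenomeRecord` class, the `find_genomes` function and the two catalogue entries are hypothetical stand-ins chosen only to show how the criteria combine.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class GenomeRecord:
    """Minimal, hypothetical stand-in for a genome entry."""
    name: str
    taxonomy_id: int
    accessions: Tuple[str, ...]


def find_genomes(
    records: Iterable[GenomeRecord],
    name_contains: Optional[str] = None,
    taxonomy_id: Optional[int] = None,
    accession: Optional[str] = None,
) -> List[GenomeRecord]:
    """Return the records that satisfy every criterion that was supplied."""
    matches = []
    for record in records:
        # Partial, case-insensitive name match.
        if name_contains is not None and name_contains.lower() not in record.name.lower():
            continue
        # Exact taxonomic identifier match.
        if taxonomy_id is not None and record.taxonomy_id != taxonomy_id:
            continue
        # Exact nucleotide sequence accession match.
        if accession is not None and accession not in record.accessions:
            continue
        matches.append(record)
    return matches


# Illustrative records only; a real catalogue would hold thousands of genomes.
CATALOGUE = [
    GenomeRecord("Escherichia coli str. K-12 substr. MG1655", 511145, ("U00096",)),
    GenomeRecord("Bacillus subtilis subsp. subtilis str. 168", 224308, ("AL009126",)),
]

for genome in find_genomes(CATALOGUE, name_contains="escherichia"):
    print(genome.name, genome.taxonomy_id)
```

With over 9000 genomes, a linear scan like this would normally be replaced by indexed queries; the sketch is only meant to show the selection logic.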

IMPROVED TOOLS FOR DATA ACCESS
A number of improvements to the Ensembl infrastructure have been made during the past year, including the introduction of a scrollable browser and a new RESTful API (a language-agnostic supplement to the existing Perl API). The range of data formats provided (for appropriate data types) via file transfer protocol has been expanded to include General Feature Format (GFF) and Variant Call Format (VCF). A new fast sequence search, based on a back-end provided by the European Nucleotide Archive, has been introduced for all species alongside a Basic Local Alignment Search Tool (BLAST) server. A feature allowing portions of gene trees to be highlighted based on the existence of common annotation has been introduced. Support has been introduced for annotations comprising structured assemblies of ontology terms (e.g. for complex phenotypic description), and a new browser has been implemented for ontological terms, which depicts the ancestry of annotated terms and provides links through to BioMart to allow users to retrieve gene sets annotated with any term in the display. Finally, automatic display of remote files is now supported for any data file using any known synonym to identify the reference sequence on which the data are to be visualized.
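Because a RESTful API is language-agnostic, it can be called from any environment that can issue HTTP requests. The sketch below builds and issues a lookup-by-identifier request using only the Python standard library. The base URL, the `lookup/id` path, the `content-type` query parameter and the example identifier are assumptions that do not come from this article and should be checked against the current API documentation.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumed base URL; the article does not state one.
BASE_URL = "https://rest.ensembl.org"


def lookup_url(stable_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL for a lookup-by-identifier request that returns JSON."""
    query = urlencode({"content-type": "application/json"})
    return f"{base_url}/lookup/id/{quote(stable_id, safe='')}?{query}"


def lookup(stable_id: str) -> dict:
    """Fetch and decode the record for one identifier (needs network access)."""
    with urlopen(lookup_url(stable_id), timeout=30) as response:
        return json.load(response)


# Example identifier chosen for illustration (an Arabidopsis thaliana gene ID).
print(lookup_url("AT3G52430"))
```

Keeping URL construction separate from the network call makes the request easy to inspect and to reproduce with other HTTP clients.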

COMPARATIVE ANALYSIS
Extensive comparative analyses are performed between the sequences in Ensembl Genomes. Analyses include pairwise alignments between DNA sequences, using the tools LASTZ (24) and (for more diverged genomes) translated BLAT (25) combined with the use of the chain/net algorithm of Kent et al. (26). The number of these comparisons has increased and we now have 118 pairwise alignments. In Ensembl Plants, pairwise alignments are provided for rice against every other genome, A. thaliana against every other genome (except barley) and 14 other pairwise comparisons. In Ensembl Metazoa, comparisons are provided from Drosophila melanogaster to 11 other drosophilid species and 4 mosquitoes, for all pairwise combinations of A. gambiae, A. darlingi, Aedes aegypti and Culex quinquefasciatus, from C. elegans to 8 other nematodes and from Brugia malayi to Loa loa. In Ensembl Fungi, all-against-all alignments are available in the Aspergillus, Hypocreales, Pucciniales and Pyrenophora clades and for Saccharomyces cerevisiae against Ashbya gossypii. In Ensembl Protists, DNA alignments are provided for each of three Phytophthora species against each other. No DNA-based comparisons are currently provided for bacterial species.
Protein alignments are used to reconstruct evolutionary trees for related genes using the Ensembl Compara Gene Trees pipeline (27). These are run for each eukaryotic domain and additionally for a representative selection of species from across the taxonomic space to identify widely conserved families and deep homologies between different evolutionary branches. In the current release, the pan-taxonomic database was constructed from the genomes of 12 chordates (11 vertebrates, plus Ciona intestinalis), 15 non-chordate metazoans, 7 plants, 7 fungi, 8 protists, 98 bacteria and 25 archaea. Genomes are chosen for inclusion according to a variety of criteria, including mutual taxonomic distance, number of recorded publications, inclusion in previous editions of the pan-taxonomic Compara and overlap with the reference proteome sets defined by UniProtKB. In total, 79 005 gene trees have been constructed for a total of 1 070 325 proteins. Their distribution among the different taxonomic domains is shown in Figure 2. Bacterial proteins (from all included genomes) have additionally been grouped into families using the HAMAP (28) and Panther (29) resources.
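The figures quoted above can be combined in a short calculation to give a sense of scale for the pan-taxonomic analysis. The snippet uses only numbers stated in the text; the derived values (total genome count, mean proteins per tree, bacterial share) are simple arithmetic and are not reported in the article itself.

```python
# Genome counts for the pan-taxonomic database, as quoted in the text.
PAN_TAXONOMIC_GENOMES = {
    "chordates": 12,
    "non-chordate metazoans": 15,
    "plants": 7,
    "fungi": 7,
    "protists": 8,
    "bacteria": 98,
    "archaea": 25,
}
GENE_TREES = 79_005
PROTEINS = 1_070_325

total_genomes = sum(PAN_TAXONOMIC_GENOMES.values())
mean_proteins_per_tree = PROTEINS / GENE_TREES
bacterial_share = PAN_TAXONOMIC_GENOMES["bacteria"] / total_genomes

print(f"genomes in pan-taxonomic set: {total_genomes}")            # 172
print(f"mean proteins per gene tree: {mean_proteins_per_tree:.1f}")  # 13.5
print(f"bacterial share of genomes: {bacterial_share:.1%}")         # 57.0%
```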

CEREALS: SERIOUSLY BIG GENOMES
The genomes of several economically important crop species have not yet been completely sequenced owing to their large size and highly repetitive DNA. However, during the last year, early versions of the diploid barley genome (5 Gb) and the hexaploid bread wheat genome have been published. In the barley assembly, 400 Mb of sequence has been located at chromosome level using markers from extant physical and genetic maps. A total of 24 211 high-confidence protein-coding genes have been called, of which 64% are in anchored locations. The N50 is only 1405 bp, but the N50 of gene-containing scaffolds is much higher (8.4 kb). Despite the fragmented nature of the genome, barley is represented conventionally in Ensembl Plants, with data shown at all levels from the karyotype through to comparative analysis and variation. In the absence of high-level scaffolding, approximate colocation of contigs to marker sequences located on the physical map is used to provide an approximation of the order and orientation of contigs at each chromosomal locus. Additionally, unanchored contigs have been grouped together in a synthetic 'chromosome' (consisting of the actual contigs with arbitrary gaps between them) to better fit the data model (and critically, to improve analysis times), and contigs of <200 nucleotides have been excluded from the database. In all other respects, the genome can be accessed in the same way as any of the better-assembled genomes in the resource. A typical view of the barley genome in Ensembl Plants is depicted in Figure 3.
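Two ideas in this paragraph lend themselves to a short illustration: the N50 statistic, and the grouping of unanchored contigs into a synthetic 'chromosome'. The sketch below uses the standard definition of N50 (the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half the assembly size). The layout function is a guess at how such a construct could be built: the 200-nucleotide cut-off comes from the text, but the gap size of 100, the function names and the toy contigs are assumptions, not the procedure actually used in Ensembl Plants.

```python
from typing import Iterable, List, Tuple


def n50(lengths: Iterable[int]) -> int:
    """Length of the contig at which the cumulative total reaches half the assembly."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # unreachable for positive lengths; kept for safety


def layout_synthetic_chromosome(
    contigs: Iterable[Tuple[str, int]],
    gap: int = 100,         # arbitrary spacer between contigs (assumed value)
    min_length: int = 200,  # contigs shorter than this are excluded (from the text)
) -> List[Tuple[str, int, int]]:
    """Assign 1-based inclusive coordinates to contigs laid end to end with gaps."""
    placements = []
    position = 1
    for name, length in contigs:
        if length < min_length:
            continue
        placements.append((name, position, position + length - 1))
        position += length + gap
    return placements


# Toy data for illustration only.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
print(layout_synthetic_chromosome([("c1", 500), ("c2", 150), ("c3", 300)]))
```

Placing the contigs on a single coordinate system means downstream code handles one large sequence region rather than a very large number of tiny ones, which is consistent with the analysis-time motivation given above.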
The wheat genome assemblies published in late 2012 (30) are even more fragmentary, with many contigs, even those in genic regions, less extensive than the genes themselves; thus, creation of accurate complete gene models is difficult. Therefore, in Ensembl Plants, we have presented the data in the form of alignments of the genomic contigs onto two better-assembled reference genomes, Brachypodium and barley. This enables the wheat sequences, and the homeologous variants (between the three wheat genomes) that have been identified, to be located in the context of full-length gene models predicted in these closely related species. Additionally, a set of 1.3 million wheat ESTs has been mapped onto the Brachypodium and barley genomes, and a sequence search facility is provided against the wheat EST set. This search returns an alignment of the query sequence against the top-matching ESTs and additionally shows where those ESTs align to the genomes of these two related species.

PERSPECTIVE AND PRIORITIES
Over the next 2 years, we anticipate that increasingly complete versions of the hexaploid bread wheat genome will emerge from the efforts of the International Wheat Genome Sequencing Consortium (http://www.wheatgenome.org/), allowing for a transition from the current limited representation of the genome to a more complete representation. As such, wheat will be the first polyploid genome to be fully represented in the Ensembl system. Although the size and repetitive nature of the genome are a challenge in terms of assembly and annotation, the Ensembl database schema and interface are flexible and should accommodate data from polyploid species with only minor modification. Each of the three genomes will be separately analyzed in the Ensembl comparative analysis pipelines, placing each genome separately within each gene tree and identifying homeologous regions of DNA sequence that will be displayed in an integrated stacked visualization. The existing representation of homeologous variants will be extended to show their functional consequence on gene models.
A more significant challenge lies in data discovery, as the number of available genomes and data sets continues to rise: how can users discover whether information that might be of interest to them exists in the system? We anticipate an increased need for data analysis tools not just as an end in themselves but also as a route to data access. Users will not necessarily start their analyses knowing which genomes or genes they wish to work on; instead, they might wish to ask questions of the complete data set to identify genomes that differ in terms of their gene content or genes whose presence/absence/copy number difference differentiates two genomes (for example, to determine the difference between pathogenic and nonpathogenic strains of related organisms or between competent and non-competent vectors). Likewise, variant features will also be studied in the context of their presence or absence in certain individuals/populations. Supporting these use cases will require rapid gene classification (including common occurrences of novel gene families identified in multiple species) and a high-performance data warehouse to support analysis of the data and help users identify the features of interest within the total data set. Another increasingly important use case is likely to be the dynamic display of data on demand from archived analyses (e.g. sequence alignments, variant calls) selected on the basis of associated experimental metadata. Developing such tools will be a priority for Ensembl Genomes over the coming years.
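The presence/absence/copy-number comparison described above can be sketched as a comparison of per-family gene counts between two genomes. The function name, the family labels and the counts below are hypothetical and chosen purely for illustration; a production system would run such a query against a data warehouse rather than in-memory dictionaries.

```python
from typing import Dict, Mapping, Tuple


def differing_families(
    copies_a: Mapping[str, int],
    copies_b: Mapping[str, int],
) -> Dict[str, Tuple[int, int]]:
    """Return {family: (copies in A, copies in B)} for families whose counts differ.

    A family missing from one genome is treated as having zero copies there,
    so presence/absence differences are reported alongside copy-number ones.
    """
    families = sorted(set(copies_a) | set(copies_b))
    return {
        family: (copies_a.get(family, 0), copies_b.get(family, 0))
        for family in families
        if copies_a.get(family, 0) != copies_b.get(family, 0)
    }


# Hypothetical gene-family copy numbers for two related strains.
pathogenic = {"family_A": 1, "family_B": 3, "family_C": 1}
non_pathogenic = {"family_A": 1, "family_B": 1}

print(differing_families(pathogenic, non_pathogenic))
```

Here `family_B` differs in copy number and `family_C` is present in only one strain, while the shared single-copy `family_A` is filtered out.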