PortEco: a resource for exploring bacterial biology through high-throughput data and analysis tools

PortEco (http://porteco.org) aims to collect, curate and provide data and analysis tools to support basic biological research in Escherichia coli (and eventually other bacterial systems). PortEco is implemented as a 'virtual' model organism database that provides a single unified interface to the user, while integrating information from a variety of sources. The main focus of PortEco is to enable broad use of the growing number of high-throughput experiments available for E. coli, and to leverage community annotation through the EcoliWiki and GONUTS systems. Currently, PortEco includes curated data from hundreds of genome-wide RNA expression studies, from high-throughput phenotyping of single-gene knockouts under hundreds of annotated conditions, from chromatin immunoprecipitation experiments for tens of different DNA-binding factors and from ribosome profiling experiments that yield insights into protein expression. Conditions have been annotated with a consistent vocabulary, and data have been consistently normalized to enable users to find, compare and interpret relevant experiments. PortEco includes tools for data analysis, including clustering, enrichment analysis and exploration via genome browsers. PortEco search and data analysis tools are extensively linked to the curated gene, metabolic pathway and regulation content at its sister site, EcoCyc.


INTRODUCTION
The central role of Escherichia coli research in the history of molecular genetics, systems biology and synthetic biology makes the data generated from E. coli important not only for this model organism, but also for bacteria in general, including environmental sequencing and human microbiome studies. High-throughput molecular biology technologies are transforming biological research, making it possible to probe the detailed systems responses of organisms to perturbations in their genetics or environment. A large number of such data sets have been, and continue to be, collected for E. coli, one of the best-studied bacterial model organisms. PortEco (http://porteco.org) is a data resource that provides access to data and tools to allow users to efficiently find and integrate information from more than half a century of basic research on laboratory E. coli, its phages, plasmids and mobile genetic elements.
PortEco's mission is to support bacterial research by facilitating access to the massive (and continually growing) volume of experimental data for E. coli, and eventually other bacterial model systems. Making these data truly accessible requires both data handling (collection, consistent and updated processing, and curation, e.g. creation of accurate data descriptions) and databases and intuitive software that let users find and analyze existing data to help pose or answer novel research questions. PortEco is designed to be a 'central point of access' for such data, but it does not seek to reinvent the wheel. EcoCyc (1) already provides curated, review-level data for E. coli genes, metabolic pathways and, in collaboration with RegulonDB (2), operons and gene regulatory interactions. PortEco, by contrast, focuses on high-throughput experimental data and analysis tools, described in detail below, and also covers genetics data and information about E. coli plasmids and phage. Through EcoliWiki (3) and GONUTS (4), PortEco also collects community input in a variety of areas. As researchers need to quickly navigate between all of these data sources, PortEco and EcoCyc have extensive reciprocal links, and the PortEco integrated search simultaneously searches PortEco and EcoCyc, as well as other, more specialized, data resources. Together, these resources create a more complete and powerful solution for the needs of researchers using E. coli as a model system for microbiology and molecular biology, for biotechnology, or as a platform for systems and synthetic biology.

INTEGRATED SEARCH
The PortEco search is designed to be a 'one-stop' search for information about E. coli. Searches are 'comprehensive', including not only PortEco data sources, but also other databases with E. coli information. On the technical side, the searches of all the different data sources are carried out simultaneously via web services, and the results page is updated continually, using AJAX (asynchronous JavaScript and XML), as search results arrive from each source. Currently, PortEco searches 16 different data sources (Table 1), and new resources that support web services-based queries can easily be added. By default, the search displays all results from each resource that are associated with the search term. However, the PortEco search is also 'context-sensitive': it automatically detects whether the user has entered a gene name or synonym, and if so filters and formats the results into a 'gene view'. The gene view displays only those results obtained for the specific gene, and performs additional queries for more detailed information about that gene (Table 2). Users can still view the 'full results' even for a gene query, by clicking on the 'view full results' button at the top of the gene view.
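The pattern of querying all sources at once and handling each result set as it arrives can be sketched as follows. This is a minimal illustration and not the PortEco implementation: the source names are taken loosely from the text, while the delays, the `query_source` function and the `on_result` callback are hypothetical stand-ins for real web-service calls and page updates.

```python
import asyncio

# Hypothetical stand-ins for remote web services; the delays are illustrative only.
SOURCES = {"EcoCyc": 0.03, "EcoliWiki": 0.01, "UniProt": 0.02}


async def query_source(name, delay, term):
    """Simulate one web-service call; a real client would issue an HTTP request here."""
    await asyncio.sleep(delay)
    return name, [f"{name} hit for {term}"]


async def integrated_search(term, on_result):
    """Query every source concurrently and report each result set as soon as it arrives."""
    tasks = [asyncio.create_task(query_source(name, delay, term))
             for name, delay in SOURCES.items()]
    results = {}
    for finished in asyncio.as_completed(tasks):
        name, hits = await finished
        results[name] = hits
        on_result(name, hits)  # e.g. refresh one panel of the results page
    return results


arrival_order = []
results = asyncio.run(
    integrated_search("lacZ", lambda name, hits: arrival_order.append(name))
)
```

Because no source waits on any other, a slow resource delays only its own panel and not the whole results page.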

PORTECO DATA: COLLECTION, PROCESSING AND CURATION
PortEco is collecting, processing and curating data from experiments in E. coli. These data types currently include the following:
• Genome-scale mRNA expression data
• Alleles and phenotype data for E. coli mutant strains from both curated articles and genome-scale growth experiments
• Genomic features of E. coli plasmids and phage
• Genome-scale protein-DNA interactions from chromatin immunoprecipitation (ChIP) experiments and genomic SELEX experiments
• Genome-scale ribosome profiling data
• An interactive, community-editable E. coli strain genealogy
• Gene Ontology (GO) annotations of gene functions (in collaboration with EcoCyc)
• Gene family trees and orthologs of E. coli genes in representative species
• A corpus of E. coli scientific literature

mRNA expression data
At PortEco, we collect publicly available microarray data, with the vast majority of these data being taken from ArrayExpress (16) or the Gene Expression Omnibus [GEO, (17)]. To allow results from different laboratories and different experiment sets to be compared with one another, raw data are processed and normalized using a standard procedure before being made available at PortEco. The processing pipeline includes associating each probe on a particular microarray platform with the correct genomic coordinates [by remapping probes to the current genome sequence, as many probes were designed before the current version of the sequence (18)], and associating those coordinates with the correct gene name and, where synonyms exist, an extensive list of synonyms. This allows data to be retrieved regardless of what gene identifier might have been used when the data were first deposited. Each experiment (microarray) is manually curated: information about growth and treatment conditions is collected, along with names and genotypes of the strains used.
The descriptions of experimental conditions accompanying publicly deposited microarray data are often abbreviated and sometimes incomplete. In such cases, we turn to the associated publications and citations therein to track down experimental details. When necessary, we contact authors for further information.
In a similar fashion, we try to obtain complete genotypes for the strain(s) used and, whenever possible, determine strain lineages. Details about strain constructions and lineages are entered on strain pages at EcoliWiki (for example, <http://ecoliwiki.net/colipedia/index.php/Category:Strain:BW25113> contains information about BW25113, the strain background for the Keio knockout collection). Another part of the curation process assigns each experiment (microarray) to an experimental condition category; this allows users to search for microarrays that may be related using those categories as queries. We are collaborating with the RegulonDB (2) and COLOMBOS (19) groups to establish a common set of condition terms. At PortEco, we currently have data associated with 193 publications that have been published over the past 12 years. These data are normalized, converted to log ratios if necessary (for example, single-channel Affymetrix data are converted to ratio-style measurements by using either a control array as the denominator, or a probe's average intensity in the data set as the denominator), and then clustered.
We note that GenExpDB (http://genexpdb.ou.edu/) has some functionality similar to our expression site, and also contains expression data for E. coli imported from GEO. For a given gene or genes entered into the GenExpDB search box, a heatmap can be retrieved for those genes' expression across all conditions for which they have data available. However, GenExpDB does not provide the ability to cluster data for arbitrary genes across an arbitrary set of conditions, nor does it annotate the conditions with a consistent set of controlled vocabulary terms, instead relying on the metadata imported from GEO. In addition, it does not provide a means by which to select the most significantly expressed genes from any given condition or set of conditions. These functionalities are all currently available from PortEco (see below).
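The conversion of single-channel intensities to ratio-style values mentioned above can be illustrated with a short sketch. It is not the PortEco pipeline: the base-2 logarithm, the arithmetic mean used for the data-set average, and the function and variable names are all assumptions made for illustration.

```python
import math


def to_log_ratios(intensities, control=None):
    """Convert single-channel intensities to log2 ratios (illustrative sketch).

    intensities: dict mapping array name -> dict mapping probe -> positive intensity.
    control: name of the array used as denominator; if None, each probe's mean
             intensity across all arrays in the data set is used instead.
    """
    arrays = list(intensities)
    probes = list(intensities[arrays[0]])
    if control is not None:
        denom = dict(intensities[control])
    else:
        # Assumption: arithmetic mean of the probe across the data set.
        denom = {p: sum(intensities[a][p] for a in arrays) / len(arrays)
                 for p in probes}
    return {a: {p: math.log2(intensities[a][p] / denom[p]) for p in probes}
            for a in arrays}


# Hypothetical two-array data set.
data = {
    "ctrl": {"lacZ": 100.0, "araB": 50.0},
    "treated": {"lacZ": 400.0, "araB": 25.0},
}
vs_control = to_log_ratios(data, control="ctrl")
vs_average = to_log_ratios(data)
```

With a control array as denominator, the control itself becomes a column of zeros; with the data-set average as denominator, every array carries information relative to the typical level of each probe.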
With the advent of high-throughput sequencing, researchers can now not only determine in unprecedented detail which parts of the genome are actually transcribed, but also quantify the level at which they are transcribed over a linear range spanning 5 orders of magnitude, at least two more orders than is possible with microarrays (20). While there are few RNA-Seq data sets currently available for E. coli, these data sets are expected to be generated with increasing frequency, as fewer experiments are performed using microarray technology. At PortEco, we are developing standard pipelines that take the raw read data (in FASTQ format), map these data to the latest version of the genome and then determine expression values for each gene (in RPKM, reads per kilobase per million mapped reads) using the latest genome annotation. As the genome sequence is updated, and as the primary annotation of the genome changes (for example, with newly described transcripts), we will be able to reprocess all data sets using the same pipeline, to provide consistent and comparable results across all RNA-Seq data sets.
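For readers unfamiliar with the unit, RPKM (reads per kilobase per million mapped reads) is the read count for a gene scaled by gene length and sequencing depth. The sketch below shows the arithmetic only. The function name and inputs are hypothetical, and taking the total as the sum of the per-gene counts when no total is supplied is an assumption, as a pipeline may instead count all mapped reads.

```python
def rpkm(read_counts, gene_lengths_bp, total_mapped_reads=None):
    """Compute RPKM per gene: count * 1e9 / (gene length in bp * total mapped reads).

    read_counts: dict gene -> number of reads mapped to that gene.
    gene_lengths_bp: dict gene -> gene length in base pairs.
    total_mapped_reads: sequencing depth; defaults (as an assumption) to the
                        sum of the per-gene counts.
    """
    if total_mapped_reads is None:
        total_mapped_reads = sum(read_counts.values())
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return {gene: count * 1e9 / (gene_lengths_bp[gene] * total_mapped_reads)
            for gene, count in read_counts.items()}


# Hypothetical counts: geneB has three times the reads but is three times as long.
counts = {"geneA": 500, "geneB": 1500}
lengths = {"geneA": 1000, "geneB": 3000}
expression = rpkm(counts, lengths)
```

In this example both genes receive the same RPKM value, because the threefold difference in read count is exactly offset by the threefold difference in length.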

Alleles and phenotypes
EcoliWiki gene pages contain >16 000 entries for alleles or phenotypes for E. coli genes. These alleles and phenotypes are a combination of alleles imported from the records of the E. coli Genetic Stock Center (21) and information from manual curation of the E. coli genetics literature. As part of EcoliWiki, these pages are available for community curation.
Nichols et al. performed large-scale determination of growth phenotypes for 3979 mutants under 324 conditions representing 114 distinct stresses (22). This data set provides a rich source of functional insights from comparison of phenotypic profiles between genes and conditions. PortEco provides two systems for browsing data from this study. The original data browser allows users to query and browse the authors' fitness data and correlations between strains or conditions. This phenotypic profiles data browser, which was linked in the article, is one of the most heavily accessed components of PortEco. Integration with EcoliWiki allows the search to recognize records by the current gene names and synonyms. In a second-generation data browser, we have adapted the GeneXplorer system (23) used for expression data to allow users to recluster and analyze subsets of the large-scale growth phenotypes. This system provides the significant phenotypes section displayed by the PortEco search.
The PortEco GBrowse and JBrowse genome browsers allow users to view these data in the context of curated genomic features. We provide browsers for multiple E. coli strains, plasmids and bacteriophage. Default tracks are generated from RefSeq and GenBank records, but additional tracks provide alternative annotations, such as operons from RegulonDB, locations of cloned inserts, known deletions and other manually curated content.

E. coli strain genealogies
In EcoliWiki, PortEco provides community-editable information about >280 strains. Strain information includes genotypes, references, construction details and sources for obtaining the strain. Strains are arranged in genealogies based on their construction. We currently include all of the strains in the genealogies for E. coli K-12 and E. coli B described by Bachmann (39) and Daegelen et al. (40), respectively. PortEco also supports pathway-genome databases for several E. coli strains, allowing comparison of these strains using BioCyc tools.

GO annotations of gene function
PortEco and EcoCyc collaborate to maintain and update the annotation of E. coli gene function for the GO consortium (41,42). We regularly aggregate and deposit an up-to-date gene annotation file that is downloadable from either PortEco or the GO consortium Web site. This file is constructed by combining annotations from UniProt with the professionally curated GO annotations from EcoCyc and community annotations from EcoliWiki and GONUTS, the latter of which provides a community GO annotation system for any protein in UniProt.

Orthologs and gene family trees
E. coli gene families and phylogenetic trees are generated and curated in collaboration with the PANTHER database (8). Currently, 2657 genes (64% of protein-coding genes) have been placed in phylogenetic trees. 'Strict' orthologs (i.e. genes related by vertical descent from a common ancestor) are computed from these trees in 81 other organisms (listed at http://pantherdb.org/panther/summaryStats.jsp). Hidden Markov models are created for both families and subfamilies, to allow searching for related genes in other genomes. These Hidden Markov models are run regularly on the UniProt database as part of the InterPro project (15), so users can navigate to comprehensive lists of related genes.

E. coli literature
EcoliWiki contains wiki pages for >25 000 publications. These pages allow community-editable addition of notes and discussion, links to other PortEco content and data tables for data mining, such as the track information tables described above. Articles covered in EcoliWiki are used to automatically update the literature corpus for full-text indexing by the PortEco instance of Textpresso (43), which has been modified to offer a more user-friendly interface and a web service that supplies relevant articles to the integrated PortEco search.

EcoliHouse: database access to gene information
EcoliHouse is a database warehouse containing multiple E. coli databases. EcoliHouse serves two purposes within PortEco. First, it is a publicly queryable MySQL database that allows scientists to issue SQL queries across multiple E. coli databases. Second, it is the database to which the PortEco web-based multigene query system sends queries to access the EcoCyc and EcoGene databases. The databases currently present within EcoliHouse are EcoCyc, EcoGene, Eco2Dbase, the UniProt complete proteome for E. coli K-12, the RefSeq E. coli K-12 MG1655 genome entry, the GenBank E. coli K-12 MG1655 genome entry and several E. coli ChIP-chip data sets. See http://biowarehouse.ai.sri.com/EcoliHouseOverview.html for a listing of the current databases within EcoliHouse, EcoliHouse access instructions and example queries.

HIGH-THROUGHPUT DATA ANALYSIS WORKFLOWS
PortEco is designed to facilitate retrieval and analysis of high-throughput data sets that have been generated for E. coli (Figure 1). There are three starting points for accessing E. coli data in PortEco: (i) search for a specific gene, (ii) search for a specific set of experimental conditions (for either gene expression or growth phenotype data) and (iii) search for a specific set of experiments to view in a genome browser. PortEco uses the GeneXplorer tool (23) for display of gene expression and knockout growth phenotype data, which in PortEco is now seamlessly integrated with analysis tools from the PANTHER and EcoCyc Web sites. PortEco currently uses GBrowse (36) as a genome browser, though JBrowse (44) is available on a testing site and will be fully released in the near future.

Search for a specific gene
Searching for a gene name, synonym or accession launches the PortEco gene search results view (see 'Integrated Search' above). From here, users can click on the genome browser link to view the genomic context and select ChIP, ribosome profiling and RNA-seq tracks to add to the view. Users will see a thumbnail of conditions where mRNA expression of that gene is up- or downregulated, and another thumbnail of conditions where the knockout of that gene has increased or decreased growth rate. Clicking on the link to analyze all data (for either expression or growth phenotype) will launch the Samples and Conditions view of the GeneXplorer tool, allowing the user to (i) browse the conditions that have the most significantly increased or decreased expression or growth, and (ii) select subsets of conditions for clustering. This allows users to find genes that are correlated with the gene of interest specifically under those conditions where the gene of interest shows a significant expression change or phenotype. Focusing on specific conditions helps to avoid spurious correlations driven by the majority of conditions where there is little or no effect on the expression or knockout phenotype of most genes. Note that because this point of entry provides the ability to retrieve data from many unrelated experiments, log ratio data are not necessarily as meaningful as they are when analyzing a coherent data set from a single publication. Thus, all data are transformed into Z-scores, which indicate how many standard deviations above or below the mean of that experiment a particular gene's expression or phenotype value lies.
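The Z-score transformation described above can be written in a few lines. This is a sketch of the general technique rather than PortEco's code: the function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption.

```python
import statistics


def z_scores(values_by_gene):
    """Transform one experiment's values into Z-scores.

    values_by_gene: dict gene -> expression or phenotype value in a single experiment.
    Returns dict gene -> number of standard deviations above (+) or below (-)
    the mean of that experiment.
    """
    values = list(values_by_gene.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("all values are identical; Z-scores are undefined")
    return {gene: (value - mean) / sd for gene, value in values_by_gene.items()}


# Hypothetical log-ratio values from one microarray.
experiment = {"geneA": 1.0, "geneB": 2.0, "geneC": 3.0}
scores = z_scores(experiment)
```

Because each experiment is scaled by its own mean and spread, values from unrelated experiments become comparable on a common scale.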
The Samples and Conditions view displays a histogram with the Z-scores for that gene's expression or phenotype data and a list of the experiments where the Z-score for the gene is above a user-selected threshold. Once conditions of interest have been selected, the data for all genes in those conditions can be clustered, and a GeneXplorer window then shows global and zoomed 'heatmap' views for the clustered data. Within the zoomed view, users can see gene names, product descriptions and links to resources for more information. At this point, users have a number of options. For any particular gene they can get a list of other genes with the most highly correlated and anticorrelated expression patterns or phenotypic profiles across the selected conditions. Subclusters of genes and data can be selected and further analyzed in a number of ways, including finding overrepresented pathways/processes, viewing in the EcoCyc 'cellular overview' tool or sending to the EcoCyc 'groups' tool (1).
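Finding the genes most correlated and anticorrelated with a gene of interest across a selected set of conditions can be sketched as below. The text does not state which correlation measure is used, so the Pearson coefficient here is an assumption, as are the function names and the toy profiles.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def rank_by_correlation(profiles, query_gene, conditions):
    """Rank all other genes by correlation with query_gene over the chosen conditions.

    profiles: dict gene -> dict condition -> Z-score.
    Returns a list of (gene, correlation), most correlated first and most
    anticorrelated last.
    """
    query = [profiles[query_gene][c] for c in conditions]
    scored = [(gene, pearson(query, [values[c] for c in conditions]))
              for gene, values in profiles.items() if gene != query_gene]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical Z-score profiles over three selected conditions.
profiles = {
    "geneA": {"c1": 1.0, "c2": 2.0, "c3": 3.0},
    "geneB": {"c1": 2.0, "c2": 4.0, "c3": 6.0},
    "geneC": {"c1": 3.0, "c2": 2.0, "c3": 1.0},
}
ranked = rank_by_correlation(profiles, "geneA", ["c1", "c2", "c3"])
```

Restricting `conditions` to those where the gene of interest responds is what keeps the ranking from being dominated by conditions in which nothing changes.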

Search for a specific set of experimental conditions
Using the 'cluster my genes' tool, users can browse the available experiments for selection. As described above, the experiments have been classified manually by the type of experimental conditions, the strain(s) used, the specific mutant (if applicable) and the publication. Users can select data sets by any of these criteria, and optionally enter a subset of genes (all genes are considered by default). They can then (i) retrieve a list of genes that are significantly up or down in the selected experiments (based on Z-scores relative to all experiments in the database, as described above), (ii) analyze those conditions for enriched biological pathways/processes or (iii) cluster the patterns for different genes under the selected conditions. Selected genes and data sets are then retrieved and clustered, and displayed using GeneXplorer. Clusters can be further analyzed as described above.
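The text does not specify which statistical test underlies the pathway/process enrichment analysis. One common choice for this kind of question is a one-sided hypergeometric test, sketched below purely as an assumption. It should not be read as the method used by PortEco, PANTHER or EcoCyc, and the function name and example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_genome, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric P-value for pathway over-representation.

    n_genome:   number of genes in the background set.
    n_pathway:  number of background genes annotated to the pathway.
    n_selected: number of genes in the user's selected list.
    n_overlap:  number of selected genes annotated to the pathway.
    Returns P(overlap >= n_overlap) when n_selected genes are drawn at random.
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_genome, n_selected)
    tail = sum(comb(n_pathway, k) * comb(n_genome - n_pathway, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Hypothetical toy numbers: 10 genes, 4 in the pathway, 3 selected, 2 of them in the pathway.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
```

In practice many pathways are tested at once, so the resulting P-values would normally also need a multiple-testing correction.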

Search for a specific set of experiments to view in a genome browser
Figure 2 illustrates the use of EcoliWiki to manage and personalize views of track collections for high-throughput data. The curation of track data in EcoliWiki publication pages allows us to generate interactive tables of available data sets. These list the author and publication, the type of experiment, a brief description and the strains used. Entering a search term will dynamically filter the table to include only those entries matching the term (e.g. by entering 'ribosome profiling' the table will be reduced to only those types of experiments). The user can then select the data sets to launch in a genome browser. In addition to the global listing of data tracks, users can create their own custom views of subsets of tracks based on querying the global set of browser tracks from high-throughput data.

PREPUBLICATION SERVICES
In addition to comparing their data sets with publicly available data sets, users can use PortEco tools to create password-protected private views of their data. Private views of data that can be visualized as genome browser tracks, such as genome-scale protein-DNA interactions, ribosome profiling or alternative genome annotations, can be created using the custom tracks capabilities of GBrowse (36,45). This allows users to view their data in the context of other work and existing annotations. GBrowse allows users to do this without even having to tell PortEco about it. Reviewers can be provided with access to these private views before publication. In addition, by working with PortEco, authors can have these temporary custom tracks moved into the permanent collection, so that stable URLs can be included in manuscripts and the data can be opened to the public on publication. For example, Myers et al. (29) were able to provide links for ChIP-chip and ChIP-seq data sets, while Liu et al. (34) used the PortEco browser for ribosome profiling data mapped against both the E. coli K-12 and bacteriophage lambda genomes.
In other cases, the data of interest are tabular, and we can create custom web-based tools to analyze them and then provide public access. We have built a framework for quickly constructing access-controlled custom views of tabular data. Unlike tabular data in Excel or Google Spreadsheets, these views can easily leverage PortEco so that tables can be searched using synonyms for accessions or gene names in the user data sets, and links from the tables to PortEco or EcoliWiki can be built in more easily than if authors built and maintained their own web interfaces for supplemental data. This approach was used to provide data browsers for the Nichols et al. (22) phenotypic profile data and the analysis of the stress-induced mutagenesis network by Al-Mamun et al. (46). As with browser tracks, we can provide URLs to the public view of the data to be included in publications. These capabilities allow a greater subset of the research community to use published data in ways that will increase the citation of the articles including these links.

CONCLUSION
PortEco has been designed to leverage and integrate with the wealth of bioinformatics data resources that include information related to E. coli. Leverage and integration are also key to how PortEco combines and extends available open-source software. Our two wiki projects leverage the broader expertise of the research community and illustrate how MediaWiki can be used to quickly build community resources for different kinds of information. In this way PortEco provides important content for use by researchers using E. coli as a model system, and illustrates a virtual model organism database approach to building a data resource.