Microarray technology has become an integral part of biomedical research, and a growing number of datasets are becoming available through public repositories. However, re-use of these datasets is severely hindered by unstructured, missing or incorrect biological sample information, as well as by the wide variety of preprocessing methods in use. The inSilicoDb R/Bioconductor package is a command-line front-end to the InSilico DB, a web-based database currently containing 86 104 expert-curated human Affymetrix expression profiles compiled from 1937 GEO repository series. The use of this package builds on the Bioconductor project's focus on reproducibility by enabling a clear workflow in which not only the analysis but also the retrieval of verified data is supported.
Availability: inSilicoDb is available as part of the Bioconductor project. There is a companion web interface that can be used for browsing available datasets before importing them into R/Bioconductor (http://insilico.ulb.ac.be).
1 THE INSILICO DATABASE: BUILDING ON GEO
With >500 000 genomic profiles freely available in NCBI GEO (Edgar et al., 2002), there is a huge amount of genome-wide information available, which could contain the clues necessary to treat fatal diseases such as cancer. However, accessing these datasets requires complex and intensive computational steps. Additionally, manual parsing and compilation of experimental attributes and values are tedious and error-prone (Baggerly and Coombes, 2009). Also, the large number of normalization and preprocessing methods in use makes the comparison of different existing studies difficult, or even impossible (Sims, 2009).
The inSilico project has uniformly compiled a large number of human Affymetrix gene expression studies (Colletta et al., in press) from publicly available datasets in GEO, amounting to ca. 1 billion gene expression measurements. Through the inSilicoDb package, the InSilico DB content is made available for enhanced programmatic access. inSilicoDb enables large-scale genome-wide analysis through automated scripting by seamless integration with the R/Bioconductor platform for visualization and analysis of genome-wide datasets (Gentleman et al., 2004).
2 USING THE INSILICODB PACKAGE
Other software packages to retrieve gene expression datasets in R/Bioconductor exist (Davis and Meltzer, 2007). However, the information about the samples is in a raw form requiring a manual curation step in transit between a data repository (e.g. GEO) and a data analysis platform (e.g. R/Bioconductor). In contrast, inSilicoDb streamlines this process by providing data manually verified by volunteers from the scientific community. Contributed content is entered through a spreadsheet-based online collaborative biocuration platform. This approach avoids the complexity of defining and using formal ontologies (Shah et al., 2009), but requires trust in the contributions made by the community.
Accessing these data from the InSilico DB is straightforward with the inSilicoDb package.
The getDataset function queries the InSilico DB for a given series identifier and platform identifier and returns an expression set.
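The retrieval step described above might look like the following minimal sketch. It assumes that getDataset takes the series identifier first and the platform identifier second, as the description suggests; the helpers is_geo_id and fetch_series are hypothetical wrappers written for this illustration and are not part of inSilicoDb, and the identifiers are only examples. The query is skipped when the package is not installed.

```r
# Hypothetical helper (not part of inSilicoDb): check that an identifier
# looks like a GEO accession with the given prefix, e.g. "GSE" or "GPL".
is_geo_id <- function(id, prefix) {
  grepl(paste0("^", prefix, "[0-9]+$"), id)
}

# Hypothetical wrapper: validate the identifiers, then delegate to getDataset().
fetch_series <- function(gse, gpl) {
  stopifnot(is_geo_id(gse, "GSE"), is_geo_id(gpl, "GPL"))
  if (!requireNamespace("inSilicoDb", quietly = TRUE)) {
    message("inSilicoDb is not installed; skipping the query")
    return(invisible(NULL))
  }
  # Assumed call shape: series identifier first, platform identifier second.
  tryCatch(
    inSilicoDb::getDataset(gse, gpl),
    error = function(e) {
      message("Query failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Illustrative identifiers; substitute a series/platform pair listed in InSilico DB.
eset <- fetch_series("GSE4635", "GPL96")
```

Validating the identifiers locally is a design choice of this sketch: it turns a malformed accession into an immediate error rather than a failed remote query.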
By default, the numerical data downloaded are identical to the data originally provided by GEO. However, combining different studies requires consistent preprocessing, and therefore all studies for which raw data exist are also available re-normalized with the fRMA algorithm (McCall et al., 2010).
The unit of interest is usually the gene, but there is no consensus on how to combine microarray probe measurements mapping to the same gene. The inSilicoDb package offers the option to retrieve either probe set-level or gene-level measurements. For the probe-to-gene mapping, the most recent versions of the R/Bioconductor platform annotation packages are used (
The following example illustrates the use of the
Both normalization and gene/probe options are precomputed for all datasets and are therefore as fast and easy to retrieve as the original data.
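To make the probe-to-gene step concrete, here is a minimal base R sketch of two common summarization rules: averaging all probes of a gene, and keeping the probe with the highest variance across samples. This is not the package's implementation, whose summarization rule is not stated here; the expression values and the probe-to-gene mapping are made up for illustration.

```r
# Toy probe-level matrix: 5 probe sets x 3 samples (made-up values).
probe_expr <- matrix(
  c( 1,  2,  3,
     3,  4,  5,
    10, 10, 10,
     2,  8,  5,
     7,  7,  8),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("probe", 1:5), paste0("sample", 1:3))
)

# Hypothetical probe-to-gene mapping (in practice taken from an annotation package).
probe_to_gene <- c(probe1 = "GENE_A", probe2 = "GENE_A",
                   probe3 = "GENE_B", probe4 = "GENE_B",
                   probe5 = "GENE_C")
genes <- probe_to_gene[rownames(probe_expr)]

# Rule 1: average all probes that map to the same gene.
gene_sum  <- rowsum(probe_expr, group = genes)
counts    <- table(genes)[rownames(gene_sum)]
gene_mean <- gene_sum / as.vector(counts)

# Rule 2: keep, for each gene, the probe with the highest variance across samples.
probe_var <- apply(probe_expr, 1, var)
keep <- vapply(split(names(probe_var), genes),
               function(p) p[which.max(probe_var[p])],
               character(1))
gene_maxvar <- probe_expr[keep, , drop = FALSE]
rownames(gene_maxvar) <- names(keep)

print(gene_mean)
print(gene_maxvar)
```

The two rules can give different gene-level values (for GENE_B here, the mean mixes a flat probe with a variable one, while the max-variance rule keeps only the variable probe), which is one reason the choice matters when studies are combined.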
3 AN EXAMPLE
Once an expression set is retrieved, all available Bioconductor packages can be applied for further analysis. Executing the code below results in the clustered heatmap shown in Figure 1:
More information and examples to perform large-scale analyses can be found in the accompanying vignette of this package.
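As a rough illustration of this kind of downstream analysis, the following sketch draws a clustered heatmap with base R. It uses a simulated expression matrix rather than a dataset retrieved from the InSilico DB, so its output does not correspond to Figure 1; the group labels, gene names and effect size are assumptions made for the example. With a retrieved expression set, the matrix would typically be extracted with exprs() from the Biobase package instead of being simulated.

```r
set.seed(1)

# Simulated expression matrix: 30 genes x 10 samples in two groups (made-up data).
n_genes     <- 30
n_per_group <- 5
group <- rep(c("A", "B"), each = n_per_group)
expr <- matrix(
  rnorm(n_genes * 2 * n_per_group),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0(group, "_", rep(seq_len(n_per_group), 2)))
)

# Make the first 10 genes differ between the groups.
expr[1:10, group == "B"] <- expr[1:10, group == "B"] + 3

# Keep the 10 most variable genes for display.
gene_var <- apply(expr, 1, var)
top <- names(sort(gene_var, decreasing = TRUE))[1:10]

# Draw to a null device so the sketch also runs without a display.
pdf(NULL)
hm <- heatmap(
  expr[top, ],
  scale = "row",                                   # centre and scale each gene
  ColSideColors = ifelse(group == "A", "grey70", "black")
)
invisible(dev.off())
```

heatmap() clusters both rows and columns hierarchically and returns the resulting row and column orderings, which can be reused to report which samples group together.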
The inSilicoDb R/Bioconductor package provides an efficient means of performing large-scale genomic analysis on the growing number of human Affymetrix gene expression profiles using automated scripting. The accompanying web interface (InSilico DB) allows searching and browsing of curated datasets that can then be automatically retrieved, adding a means for reproducible data sourcing to the reproducible analysis platform R/Bioconductor.
Funding: Brussels-Capital Region, Innoviris in part.
Conflict of Interest: none declared.