Abstract

Microarray technology has become an integral part of biomedical research, and a growing number of datasets are available through public repositories. However, re-use of these datasets is severely hindered by unstructured, missing or incorrect biological sample information, as well as by the wide variety of preprocessing methods in use. The inSilicoDb R/Bioconductor package is a command-line front-end to the InSilico DB, a web-based database currently containing 86 104 expert-curated human Affymetrix expression profiles compiled from 1937 GEO repository series. The use of this package builds on the Bioconductor project's focus on reproducibility by enabling a clear workflow that supports not only analysis but also the retrieval of verified data.

Availability: inSilicoDb is available as part of the Bioconductor project. There is a companion web interface that can be used for browsing available datasets before importing them into R/Bioconductor (http://insilico.ulb.ac.be).

Contact: jtaminau@vub.ac.be

1 THE INSILICO DATABASE: BUILDING ON GEO

With >500 000 genomic profiles freely available in NCBI GEO (Edgar et al., 2002), there is a huge amount of genome-wide information available, which could contain the clues necessary to treat fatal diseases such as cancer. However, accessing these datasets requires complex and intensive computational steps. Additionally, manual parsing and compilation of experimental attributes and values is tedious and error-prone (Baggerly and Coombes, 2009). Moreover, the large number of normalization and preprocessing methods in use makes the comparison of existing studies difficult, or even impossible (Sims, 2009).

The inSilico project has uniformly compiled a large number of human Affymetrix gene expression studies (Colletta et al., in press) from publicly available datasets in GEO (ca. 1 billion gene expression measurements). Through the inSilicoDb package, the InSilico DB content is made available for programmatic access. inSilicoDb enables large-scale genome-wide analysis through automated scripting by integrating with the R/Bioconductor platform for visualization and analysis of genome-wide datasets (Gentleman et al., 2004).

2 USING THE INSILICODB PACKAGE

Other software packages to retrieve gene expression datasets in R/Bioconductor exist (Davis and Meltzer, 2007). However, the information about the samples is in a raw form requiring a manual curation step in transit between a data repository (e.g. GEO) and a data analysis platform (e.g. R/Bioconductor). In contrast, inSilicoDb streamlines this process by providing data manually verified by volunteers from the scientific community. Contributed content is entered through a spreadsheet-based online collaborative biocuration platform. This approach avoids the complexity of defining and using formal ontologies (Shah et al., 2009), but requires trust in the contributions made by the community.

Accessing these data from the InSilico DB is simple and straightforward with the inSilicoDb package:

[Code listing not available: example call to getDataset]

The getDataset function queries the InSilico DB for a given series and platform identifier and returns an ExpressionSet object, a standard R/Bioconductor data structure. In GEO, a series is composed of samples assayed on one or more platforms, each platform containing tens of thousands of gene measurement probe sets. In InSilico DB, a series is represented by multiple samples × genes matrices. Two auxiliary functions allow flexible management of studies with multiple platforms: getDatasets, which retrieves all gene expression matrices for a given series, and getPlatforms, which retrieves all of its platforms:

[Code listing not available: example calls to getDatasets and getPlatforms]
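The relationship between a series, its platforms and its expression matrices can be sketched in a few lines. The block below is a language-neutral illustration in plain Python, not the package's R implementation: the helper names `platforms_of` and `matrices_by_platform`, the record layout and all identifiers are hypothetical and only mirror the roles the text assigns to getPlatforms and getDatasets.

```python
from collections import defaultdict


def platforms_of(samples):
    """List the distinct platforms of a series, in order of first appearance."""
    return list(dict.fromkeys(sample["platform"] for sample in samples))


def matrices_by_platform(samples):
    """Group the samples of one series into one matrix per platform.

    Returns {platform: {sample_id: {feature_id: value}}}.
    """
    matrices = defaultdict(dict)
    for sample in samples:
        matrices[sample["platform"]][sample["id"]] = sample["values"]
    return dict(matrices)


# Toy series with made-up sample, platform and probe set identifiers.
series = [
    {"id": "sample_1", "platform": "platform_A",
     "values": {"probe_a": 5.1, "probe_b": 6.4}},
    {"id": "sample_2", "platform": "platform_A",
     "values": {"probe_a": 7.3, "probe_b": 6.9}},
    {"id": "sample_3", "platform": "platform_B",
     "values": {"probe_x": 4.2}},
]

platforms = platforms_of(series)
matrices = matrices_by_platform(series)
for platform in platforms:
    print(platform, sorted(matrices[platform]))
```

Keeping one matrix per platform avoids mixing probe sets that are defined on different arrays, which is presumably why a multi-platform series yields several matrices rather than one.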

By default, the numerical data downloaded are identical to the data originally provided by GEO. However, combining different studies requires consistent preprocessing; therefore, all studies for which raw data exist are also available re-normalized with the fRMA algorithm (McCall et al., 2010).

The unit of interest is usually the gene, but there is no consensus on how to combine microarray probe measurements that map to the same gene. The inSilicoDb package offers the option to retrieve either probe set- or gene-level measurements. For the probe-to-gene mapping, the most recent versions of the R/Bioconductor platform annotation packages are used (hgu133a.db, hgu133plus2.db, …). The maximum probe set value is retained when multiple probe sets map to the same gene.
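The maximum rule can be made concrete with a small sketch. It is written in plain Python for illustration and is not the package's code; it assumes that the maximum is taken separately for each sample, which the text does not specify. The function name `collapse_to_genes`, the probe set identifiers and the gene symbols are hypothetical.

```python
from collections import defaultdict


def collapse_to_genes(probe_values, probe_to_gene):
    """Collapse probe set rows to gene rows by per-sample maximum.

    probe_values: dict mapping probe set ID -> list of values, one per sample.
    probe_to_gene: dict mapping probe set ID -> gene symbol.
    Probe sets without a gene annotation are dropped.
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_values.items():
        gene = probe_to_gene.get(probe_id)
        if gene is not None:
            grouped[gene].append(values)
    # zip(*rows) walks the samples column by column;
    # max keeps the largest probe set value for that sample.
    return {
        gene: [max(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Toy data: three probe sets measured in two samples (made-up identifiers).
probe_values = {
    "probe_a": [5.1, 7.3],
    "probe_b": [6.4, 6.9],
    "probe_c": [8.0, 8.2],
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}

gene_values = collapse_to_genes(probe_values, probe_to_gene)
print(gene_values)
```

Here GENE1 takes its first-sample value from probe_b and its second-sample value from probe_a. An alternative reading of the rule, keeping the single probe set with the highest overall signal, would return one intact probe set row per gene instead.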

The following example illustrates the use of the genes parameter.

[Code listing not available: example call using the genes parameter]

Both normalization and gene/probe options are precomputed for all datasets and are therefore as fast and easy to retrieve as the original data.

3 AN EXAMPLE

Once an expression set is retrieved, all available Bioconductor packages can be applied for further analysis. Executing the code below results in the clustered heatmap shown in Figure 1:

[Code listing not available: analysis producing the heatmap in Figure 1]

Fig. 1. Heatmap of discriminating genes of dataset GSE4635 at adjusted P<10%. The samples cluster by ‘Smoker’ status (smokers: current and non-smokers: never).
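The caption's threshold of adjusted P<10% implies a multiple-testing correction, but the text does not name the method. As a sketch only, the block below assumes the Benjamini-Hochberg adjustment (an assumption, not taken from the source) and shows how genes would be selected at that threshold. It is plain Python with made-up gene names and P-values, not the R code behind Figure 1.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical per-gene raw P-values (ASSUMPTION: illustrative numbers only).
raw_p = {"GENE1": 0.001, "GENE2": 0.02, "GENE3": 0.04, "GENE4": 0.5}

genes = list(raw_p)
adjusted_p = dict(zip(genes, benjamini_hochberg([raw_p[g] for g in genes])))

# Keep genes whose adjusted P-value is below 10%.
selected = [gene for gene in genes if adjusted_p[gene] < 0.10]
print(selected)
```

The selected genes would then form the rows of the expression matrix passed to a clustering and heatmap routine.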

More information and examples of large-scale analyses can be found in the vignette accompanying this package.

4 CONCLUSION

The inSilicoDb R/Bioconductor package provides an efficient means of performing large-scale genomic analysis on the large and growing number of human Affymetrix gene expression profiles using automated scripting. The accompanying web interface (InSilico DB) allows searching and browsing of curated datasets that can then be retrieved automatically, adding reproducible data sourcing to the reproducible analysis platform R/Bioconductor.

Funding: Brussels-Capital Region, Innoviris in part.

Conflict of Interest: none declared.

REFERENCES

Baggerly K.A., Coombes K.R. (2009) Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology. Ann. Appl. Stat., 3, 1309-1334.

Davis S., Meltzer P.S. (2007) GEOquery: a bridge between the gene expression omnibus (GEO) and bioconductor. Bioinformatics, 23, 1846-1847.

Edgar R., et al. (2002) Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207-210.

Gentleman R.C., et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, R80.

McCall M.N., et al. (2010) Frozen robust multiarray analysis (fRMA). Biostatistics, 11, 242-253.

Shah N.H., et al. (2009) Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics, 10 (Suppl. 2), S1.

Sims A.H. (2009) Bioinformatics and breast cancer: what can high-throughput genomic approaches actually tell us? J. Clin. Pathol., 62, 879-885.

Author notes

Associate Editor: Janet Kelso
