Motivation: The goal of present -omics sciences is to understand biological systems as a whole in terms of interactions of the individual cellular components. One of the main building blocks in this field of study is proteomics where tandem mass spectrometry (LC-MS/MS) in combination with isotopic labelling techniques provides a common way to obtain a direct insight into regulation at the protein level. Methods to identify and quantify the peptides contained in a sample are well established, and their output usually results in lists of identified proteins and calculated relative abundance values. The next step is to move ahead from these abstract lists and apply statistical inference methods to compare measurements, to identify genes that are significantly up- or down-regulated, or to detect clusters of proteins with similar expression profiles.
Results: We introduce the Rich Internet Application (RIA) Qupe, which provides comprehensive data management and analysis functions for LC-MS/MS experiments. Starting with the import of mass spectra data, the system guides the experimenter through the process of protein identification by database search, the calculation of protein abundance ratios and, in particular, the statistical evaluation of the quantification results, including multivariate analysis methods such as analysis of variance or hierarchical cluster analysis. A data model has been developed to store these results, and a well-defined programming interface facilitates the integration of novel approaches. A compute cluster is utilized to distribute computationally intensive calculations, and a web service allows information to be exchanged with other -omics software applications. To demonstrate that Qupe represents a step forward in quantitative proteomics analysis, an application study on Corynebacterium glutamicum has been carried out.
Availability and Implementation: Qupe is implemented in Java utilizing Hibernate, Echo2, R and the Spring framework. We encourage the usage of the RIA in the sense of the ‘software as a service’ concept, maintained on our servers and accessible at the following location: http://qupe.cebitec.uni-bielefeld.de
Supplementary information: Supplementary data are available at Bioinformatics online.
Present -omics sciences try to understand biological systems as a whole by scrutinizing the individual components and their interactions. In this field of study, often referred to as systems biology, proteomics is one of the main building blocks. While a few years ago, 2D gel electrophoresis in combination with single-stage mass spectrometry had been the standard technique to yield information about the proteome in a cell (Hufnagel and Rabus, 2006), recent methods such as liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS) provide the possibility to characterize hundreds of peptides in a single sample. A common way to compare the abundance of proteins under two or more conditions is the combination of mass spectrometry with isotopic labelling techniques (Mueller et al., 2008; Ong et al., 2002; Wolters et al., 2001; Zhu et al., 2002), which enables us to obtain a direct insight into regulation at the protein level. Starting from the data recorded by a mass spectrometer instrument, a typical experiment's workflow involves: (i) a database search to identify proteins contained in a sample; (ii) the calculation of peptide abundance ratios; and (iii) a following evaluation of the results.
The standard method to identify proteins or peptides, respectively, compares the recorded mass spectra with theoretical fragmentation patterns derived from sequence databases, using search engines such as Mascot (TM) (Perkins et al., 1999), Sequest (TM) (Yates et al., 1995), OMSSA (Geer et al., 2004), ProbID (Zhang et al., 2002) or X!Tandem (Craig and Beavis, 2004). An integral element of this ‘qualitative’ part of the workflow is the validation of the reported peptides and proteins. A common strategy for this validation is based on the utilization of decoy databases and the calculation of false discovery rates (FDRs) (Elias and Gygi, 2007; Peng et al., 2003).
A variety of software applications aim to guide users through this process of peptide identification and validation, and to provide a standardized way of data management. In general, either specific flat file formats or databases are utilized to store and retrieve, e.g. mass spectra data, reported proteins or documentation of the experimental setups. As experiments are often conducted within larger communities and therefore need to be shared between a number of participants, user management and data access control are vital components of these systems. Examples of such applications are CPAS (Rauch et al., 2006), MASPECTRAS (Hartler et al., 2007), Proteios (ProSE) (Gärdén et al., 2005; Häkkinen et al., 2009) or the command-line-based Trans-Proteomics pipeline (TPP) (Keller et al., 2002; Nesvizhskii et al., 2003) and the OpenMS/TOPP framework (Kohlbacher et al., 2007; Sturm et al., 2008). As a recommended standard for proteomics data, the Proteomics Standards Initiative (PSI) at the Human Proteome Organisation (HUPO) (Orchard et al., 2003) specified the MIAPE reporting guidelines (Taylor et al., 2007), the minimum information about a proteomics experiment.
Common experimental strategies for relative quantification are based on the incorporation of stable isotopes. In 2003, RelEx (MacCoss et al., 2003) and ASAPRatio (Li et al., 2003) were introduced to calculate relative abundance ratios from samples that are metabolically labelled using, e.g. heavy stable nitrogen isotopes. ProRata (Pan et al., 2006) and Census (Park et al., 2008), the successor of RelEx, are further examples of quantification tools, whereas other labelling approaches encompass ICAT (Gygi et al., 1999) or SILAC (Ong et al., 2002). In general, these tools are standalone software applications that have solely been designed for the process of quantification. The recently introduced MaxQuant (Cox and Mann, 2008) supports the SILAC approach, and is the first tool that additionally integrates protein identification using the Mascot (TM) search engine.
While the aforementioned applications allow the identification and quantification of an organism's proteome, their end product is usually a list of calculated abundance ratios or expression values for the identified proteins. As a typical experimental setup includes more than one condition, the resulting values need to be combined to form, e.g. a data matrix (Kumar and Mann, 2009). At this point of the analysis, proteomics researchers are largely left to their own devices, since the existing software solutions listed above lack support for advanced data analysis. Moreover, in many of the workflows, it is often not yet clear what the best analysis methodology is, whether the goal is to identify up- or down-regulated proteins, to compare studies with varying conditions, to detect protein clusters with similar expression profiles or to fuse information with external databases such as KEGG (Kanehisa and Goto, 2000). Software such as spreadsheet programs or statistical programming languages, albeit generally usable for this purpose, demands a high level of background knowledge and training, or does not adapt to the complexity of proteomics data. In addition, data and associated metadata are not kept together in a single place.
A software application that provides a comprehensive set of statistical methods for various -omics data sources is the tool DAnTE (Polpitiya et al., 2008). This application, however, relies on the import of measurements in the form of the aforementioned data matrix or spreadsheet data, and neither integrates peptide or protein identification and quantification nor implements data management functions to organize experiments or projects. Ramos et al. (2008) follow a different approach with their protein information and property explorer (PIPE), which does not aim at statistical evaluation but at a functional analysis of identified peptides. The application allows for server-side data storage and provides, for example, functionality to associate Gene Ontology (Ashburner et al., 2000) information with identified proteins.
We have developed Qupe with two aims. First, we wanted to design a software package that integrates all aspects of the mass spectrometry-based proteome analysis workflow discussed above, from identification to multivariate statistical analysis. Second, we wanted to move forward in bringing algorithms closer to the biologists and developed Qupe as a so-called Rich Internet Application (RIA). As such, it addresses the limitations in ‘the richness of the application interfaces, media and content’ (Allaire, 2002, p. 1) of classical web applications and offers an interface that behaves similarly to standalone software applications running on a user's desktop. Qupe is independent of any operating system, and no installation on individual workstations is needed. Hence, data stored in the system, such as mass spectra or analysis results, may be accessed from any computer connected to the Internet.
2 IMPLEMENTATION AND METHODS
Qupe is based on the Spring framework (Interface21, 2008; Johnson, 2003). It is compliant with the Java Platform Enterprise Edition (Java EE) specification, and thereby portable across all compatible application servers. Following the three-tier architecture model, the system is separated into a data access, a logic and a presentation layer (Fig. 1). Data stored in the system are protected by a number of security measures. First, Qupe incorporates a generalized project management system (GPMS). On this level, security is based on discrete grants on databases and associated tables. The system has already been used successfully in other software packages hosting hundreds of international -omics projects (Dondrup et al., 2009; Neuweger et al., 2008). A second level of application-based security has been implemented utilizing access control list (ACL) directives on selected database objects. In addition, Qupe uses HTTP over Secure Sockets Layer (SSL) to secure all web communications.
2.1 Data access layer
Our data model closely follows the suggestions made by the PSI at the HUPO (Orchard et al., 2003). Storage of mass spectra data follows the open source format mzData (Orchard et al., 2004) developed by the PSI. Further aspects of the data model, which are realized in accordance with the PSI recommendations, concern the stored data about reported peptides and proteins, which are nowadays described in the recently introduced analysisXML (Proteomics Informatics Standards Group, 2008). Particular emphasis was placed on the storage of analysis results such as calculated abundance ratios, visualizations or the output of statistical tests. To cope with future requirements for the data model and facilitate the addition of further attributes or classes, the development followed the model driven architecture (MDA) approach (Object Management Group, 2008) using the model designer O2DBI (B. Linke, unpublished data). The implementation utilizes the Hibernate library (Red Hat Middleware, 2008).
2.2 Logic layer
Qupe includes several analysis functions for datasets such as those resulting from time series experiments. Furthermore, a well-defined application programming interface (API) eases the development of new functions, supporting the retrieval and processing of data as well as the storage and visualization of analysis results such as new datasets or graphics. The API supports the integration of routines written in R (R Development Core Team, 2008; Urbanek, 2008), allowing developers to draw on a wealth of established data analysis methods. A Sun Grid Engine/DRMAA binding (Sun Microsystems, 2009) has been incorporated, which enables computationally intensive tasks to benefit from the advantages of a distributed computing solution.
2.3 Presentation layer
A graphical user interface, implemented using the Echo2 web framework (NextApp, Inc., 2008), allows interaction with the system through a standard web browser. In addition, Qupe provides a web service interface based on SOAP and the web service description language (WSDL; Gudgin et al., 2008), which can be utilized by other applications to exchange analysis results, for example, to retrieve complete datasets of calculated abundance ratios.
In the following, important aspects and parts of Qupe are described in detail. We propose a workflow to quantitatively analyse isotopically labelled data from LC-MS/MS experiments as depicted in Figure 2.
Project and experiment setup: the web browser-based application provides extensive capabilities to group and integrate all data relevant to a particular experiment. This comprises a description of the experimental setup as well as mass spectra data and analysis results. Database access is first secured by a GPMS, and second, fine-grained privileges may be assigned to individual experiments and projects.
Experimental setup description: Qupe supports the description of experimental setups to allow for future retrieval of information about an experiment such as the treatment of individual samples. For this purpose, a number of predefined worksteps are provided that may be enhanced with additional details. Several worksteps may then be combined to describe the complete workflow of an experiment. A sample workstep would, for example, describe the cultivation of organisms including parameters such as optical density or growth medium.
Data acquisition: import/preprocessing of mass spectra: Qupe currently allows the import of mass spectra data in the open source formats mzXML (Pedrioli et al., 2004) and mzData (Orchard et al., 2004). The system primarily targets the analysis of LC-MS/MS data, but has also been designed to handle other types of data. For example, a proprietary format by Bruker (Bruker Daltonics, Billerica, MA, USA) for single-stage mass spectrometry data recorded by a MALDI-TOF instrument is already supported. Imported mass spectra can be visualized (Fig. 3A), and currently implemented tools support the preprocessing of MS/MS spectra, for example, to filter out mass spectra having a total ion current value below a certain threshold.
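The preprocessing filter mentioned above can be illustrated with a minimal Python sketch. It is not Qupe's implementation (Qupe is written in Java); the `Spectrum` class, the function names and the threshold value are hypothetical and chosen only for illustration. The total ion current (TIC) of a spectrum is taken here as the sum of its peak intensities.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Spectrum:
    """Hypothetical container for one MS/MS spectrum."""
    scan_id: int
    peaks: List[Tuple[float, float]]  # (m/z, intensity) pairs


def total_ion_current(spectrum: Spectrum) -> float:
    """Sum of all peak intensities in the spectrum."""
    return sum(intensity for _, intensity in spectrum.peaks)


def filter_by_tic(spectra: List[Spectrum], min_tic: float) -> List[Spectrum]:
    """Keep only spectra whose total ion current reaches the threshold."""
    return [s for s in spectra if total_ion_current(s) >= min_tic]


# Illustrative data; the intensities and the threshold are made up.
spectra = [
    Spectrum(1, [(300.2, 50.0), (450.7, 25.0)]),
    Spectrum(2, [(210.1, 900.0), (512.3, 400.0), (733.4, 150.0)]),
    Spectrum(3, [(615.9, 10.0)]),
]
kept = filter_by_tic(spectra, min_tic=100.0)
print([s.scan_id for s in kept])
```

In practice, a suitable threshold depends on the instrument and acquisition settings, so it would be exposed as a user-defined parameter.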
Description of treatment and samples: for an experiment, one or more types of treatment, such as temperature or concentration of a substance, may be defined and further divided into levels, e.g. 10°C and 20°C for the type temperature. To support the user in finding an appropriate terminology, the ontology lookup service of the EBI may be queried (Côté et al., 2006; Martens et al., 2005). Individual samples (datasets) of an experiment can then be assigned to the defined levels and handled accordingly in further analysis. For example, if samples were taken at distinct time intervals, the abundance ratios calculated from them will be grouped in separate datasets that can then be compared with each other using statistical inference methods.
PMF/MIS search or import: peptide mass fingerprinting or MS/MS ion search can be carried out by an integrated Mascot (TM) search engine (Perkins et al., 1999). Searches of the same set of mass spectra may be batch processed, for example, by means of the definition of ranges for peptide tolerance values or by querying several databases at once. Additionally, Qupe supports the import of DTASelect-filter files (Tabb et al., 2002), so that further analysis can be based, for example, on Sequest (TM) (Yates et al., 1995) results.
Annotation/evaluation of search results: to ensure that further analysis rests on a solid ground of verified peptide or protein identifications, it is necessary to assess the reported hits produced by database search tools. In Qupe, this can be based upon the calculation of FDRs as suggested by Reidegeld et al. (2008). The precondition for this is that concatenated decoy databases (Elias and Gygi, 2007; Peng et al., 2003) have been employed. In the first instance, all peptide or protein hits that were either imported or reported by the integrated Mascot (TM) search engine are stored in the database. Based on user-defined parameters such as the exclusion of specific charge states, a certain FDR threshold or, alternatively, a minimal score value, reported hits are filtered to obtain the set of proteins and peptides that will be included in further analysis.
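To make the FDR-based filtering concrete, the following Python sketch derives a score cutoff from a search against a concatenated target-decoy database. It does not reproduce the procedure of Reidegeld et al. (2008) or Qupe's code; the function name, the input layout and the estimator used here (number of decoy hits divided by number of target hits above the cutoff) are assumptions made for illustration. Scores are assumed to be distinct and higher scores to indicate better matches.

```python
from typing import List, Optional, Tuple


def score_threshold_for_fdr(
    hits: List[Tuple[float, bool]], max_fdr: float
) -> Optional[float]:
    """Return the lowest score cutoff whose accepted hits have an
    estimated FDR <= max_fdr, or None if no cutoff qualifies.

    hits: (score, is_decoy) pairs from a concatenated target-decoy search.
    The FDR estimate decoys/targets is one common choice (an assumption
    here); other estimators exist.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    targets = 0
    decoys = 0
    best_cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        if fdr <= max_fdr:
            best_cutoff = score  # accepting all hits down to here is still fine
    return best_cutoff


# Illustrative hits (score, is_decoy); the values are made up.
hits = [
    (95.0, False), (90.0, False), (85.0, False), (80.0, True),
    (75.0, False), (70.0, False), (60.0, True), (55.0, False),
]
cutoff = score_threshold_for_fdr(hits, max_fdr=0.25)
accepted = [
    (score, is_decoy)
    for score, is_decoy in hits
    if cutoff is not None and score >= cutoff and not is_decoy
]
print(cutoff, len(accepted))
```

Additional user-defined criteria, such as excluding specific charge states, would be applied as separate filters before or after this step.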
Quantification: isotopic labelling techniques allow the measurement of relative abundances of several hundreds of proteins or peptides. Qupe supports the import of ProRata quantification results, and provides its own implementations of quantification algorithms (see Supplementary Material for a description of an algorithm integrated in Qupe).
Integration of external information: to extend the knowledge about identified proteins, information from external resources such as UniProt (UniProt Consortium, 2008) or the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000) can be integrated. This comprises COG or KOG (clusters of orthologous groups of proteins) functional categories (Tatusov et al., 2003), or EC numbers and pathway information. If protein identifiers have been derived from the GenDB annotation system (Meyer et al., 2003), a mapping onto regions via BRIDGE (Goesmann et al., 2003) is also available. This information can then be used, for example, to calculate the distribution of COG categories. Another function integrated in Qupe maps identified proteins and their calculated abundance ratios onto KEGG pathways (Fig. 3B).
Statistical tests, multivariate analysis and data mining: in many proteomics workflows, it has not yet been established which statistical methods are suitable for the analysis of quantitative data. Qupe provides a number of analysis functions and guides an experimenter through the process of the statistical evaluation of abundance ratios. Currently, the software builds on established and well-known statistical methods, while it additionally eases the development of novel approaches utilizing a well-defined API. The one-sample t-test, the analysis of variance and the non-parametric Kruskal–Wallis test have been adapted to quantitative proteomics data. To account for the multiple testing situation and to give control of the family-wise error rate, resulting P-values can be adjusted using, e.g. the methods of Bonferroni or Holm. Other functions that Qupe offers for data analysis are principal component analysis (PCA) and hierarchical clustering algorithms using Ward's method, complete and average linkage, and Euclidean as well as correlation-based distances. The PCA is used to analyse covariances, and may thereby reveal the intrinsic dimensionality of the data, while the hierarchical cluster analysis seeks to identify groups of co-regulated proteins. According to the defined type(s) of treatment and their levels (see the step ‘Description of treatment and samples’), proteins that are similarly expressed with respect to a distance function are grouped into clusters. Using colour codes for the calculated ratios, results of such an analysis can be evaluated in the form of a heatmap as shown in Figure 3C. A further aim of cluster analysis is to find an optimal number of clusters. For this purpose, Qupe provides cluster indices such as Calinski–Harabasz (Calinski and Harabasz, 1974), Index-I (Maulik and Bandyopadhyay, 2002) or Davies–Bouldin (Davies and Bouldin, 1979) (Fig. 3D).
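The two P-value adjustments named above can be sketched in a few lines of Python. This is an illustrative sketch only, not Qupe's code (which builds on Java and R); the function names and the example P-values are hypothetical. Bonferroni multiplies each P-value by the number of tests m, whereas Holm's step-down method multiplies the i-th smallest P-value by (m - i + 1) and enforces monotonicity, which makes it at least as powerful while still controlling the family-wise error rate.

```python
from typing import List


def bonferroni(p_values: List[float]) -> List[float]:
    """Bonferroni adjustment: multiply each P-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values: List[float]) -> List[float]:
    """Holm step-down adjustment, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest P-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Illustrative P-values, e.g. one per protein; the values are made up.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in bonferroni(raw)])
print([round(p, 4) for p in holm(raw)])
```

A protein would then be called significantly regulated if its adjusted P-value falls below the chosen significance level.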
4 APPLICATION STUDY
In this application study, we want to demonstrate the capabilities of the RIA Qupe with the analysis of a MudPIT experiment conducted at the University of Bochum. Proteins from the Gram-positive bacterium Corynebacterium glutamicum were scrutinized under hyperosmotic conditions, a stress stimulus to which this biotechnologically relevant organism may be exposed during fermentation. Utilizing the stable isotope labelling approach, bacteria were cultivated in media containing either 14N or 15N. Samples were taken before the osmotic shock, which was induced by adding sodium chloride, and after 15, 60 and 180 min. Each sample was analysed in an 8-step MudPIT experiment. Using Xcalibur, mass spectra were recorded on an LTQ XL Orbitrap (Thermo Fisher Scientific Inc., Waltham, MA, USA). Further details of this analysis are published elsewhere (B. Fränzel et al., submitted for publication).
The resulting 38 datasets were converted into the open source format mzXML with the tool ‘ReAdW’ (Keller et al., 2002; Nesvizhskii et al., 2003). Using the web interface of the software, these datasets were then imported into Qupe running on a server at Bielefeld University. For this purpose, a new experiment with appropriate read and write permissions for the participating experimenters was created to hold all (further) information and data. Subsequently, spectra were preprocessed to filter out those with low overall intensities or insufficient numbers of peaks, and afterwards submitted to the Mascot (TM) search engine. The composite target-decoy database of C. glutamicum was derived from the corresponding GenDB genome annotation project (Kalinowski et al., 2003). Afterwards, FDRs were calculated and used to filter the observed peptide hits. The automatic annotation tool retained 7258 peptide hits for further analysis, which in summary corresponded to 715 identified proteins. Information about the identified proteins was enriched by querying external resources for COG class names or EC numbers, finding, for example, >13% of all identified proteins in the functional category ‘Translation, ribosomal structure and biogenesis’. Before peptide quantification took place, the experimental factor ‘time’ was set up and the imported samples were assigned to the four different values 0, 15, 60 and 180 min according to the timespan after shock. A univariate analysis of variance with the factor ‘time’ revealed 39 proteins as significantly differentially regulated across the four distinct timepoints. These include some temperature shock proteins, a putative transcriptional regulator and a phosphoglycerate dehydrogenase. The hierarchical cluster analysis seeks to identify groups or clusters, respectively, of co-regulated proteins. A result of such an analysis can be a heatmap as shown in Figure 3C, or a division of all proteins into a number of clusters.
Utilizing the cluster index ‘Calinski–Harabasz’, the optimal number of groups in our application study was found, for example, at 13 clusters for Euclidean distances and the average linkage method (Fig. 3D).
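As an illustration of how such a cluster index ranks alternative partitions, the following Python sketch computes the Calinski–Harabasz index, i.e. the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom (k - 1 and n - k for k clusters and n observations). It is a sketch under stated assumptions rather than Qupe's implementation: the function name and the toy data are hypothetical, squared Euclidean distances are used, and the case of zero within-cluster dispersion is not handled. Higher values indicate more compact, better separated clusters, so the number of clusters with the highest index would be preferred.

```python
from typing import Dict, Hashable, List, Sequence


def calinski_harabasz(
    points: Sequence[Sequence[float]], labels: Sequence[Hashable]
) -> float:
    """Calinski-Harabasz index for a given partition of the points."""
    n = len(points)
    clusters: Dict[Hashable, List[Sequence[float]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")

    def centroid(rows: Sequence[Sequence[float]]) -> List[float]:
        return [sum(column) / len(rows) for column in zip(*rows)]

    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * squared_distance(center, overall)
        within += sum(squared_distance(p, center) for p in members)
    return (between / (k - 1)) / (within / (n - k))


# Toy expression profiles (two values per protein); the data are made up.
profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = calinski_harabasz(profiles, [0, 0, 0, 1, 1, 1])
poor = calinski_harabasz(profiles, [0, 1, 0, 1, 0, 1])
print(good > poor)
```

To select a number of clusters, the index would be evaluated for each candidate cut of the dendrogram and the cut with the maximum value chosen.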
A variety of desktop and web applications that aim at a set of functionality similar to that of Qupe are already available. In terms of data management, this includes MASPECTRAS (Hartler et al., 2007), a web application that supports the import of the results from several search engines and provides peptide validation and quantification based on ASAPRatio. A unique feature of the system is an integrated algorithm to map identified peptides to proteins. This accounts for the problem that a single peptide is often shared by a group of proteins. Proteios (ProSE) (Gärdén et al., 2005; Häkkinen et al., 2009) is another web application that offers a set of features comparable to that of MASPECTRAS concerning data management, documentation of experimental processes and search engine integration. Similar to Qupe, it furthermore provides a programming interface that allows for further extensions of the system, and integrates a web service for database access. Another example of such systems is CPAS (Rauch et al., 2006), which again features comprehensive data management functionalities, and a pipeline for protein identification and validation including the search engines X!Tandem, Mascot and Sequest. A further, detailed discussion and comparison of desktop and web applications including the TPP (Keller et al., 2002; Nesvizhskii et al., 2003) can be found, for example, in Nesvizhskii et al. (2007), Mueller et al. (2008) and Hartler et al. (2007).
In direct comparison, it has to be considered that Proteios and MASPECTRAS in particular support more data formats, and furthermore integrate additional search engines. However, while these applications focus on data management and the identification and evaluation of proteins from mass spectrometry data, Qupe goes one step further, and explores new frontiers of data analysis with the adaptation of multivariate statistical methods to quantitative proteomics data. Qupe is highly extensible and eases the integration of additional formats or tools as well as the development of novel methodologies. A well-defined API not only provides access to data stored in the system, but also unifies both configuration and execution of analysis functions and presentation of the results. We have already shown this extensibility through the integration of MALDI-TOF data and peptide mass fingerprinting. Furthermore, Qupe gives the opportunity to retrieve the data analysed within the system using a SOAP/WSDL-based web service. The service has already been used to couple Qupe to ProMeTra, a web application to map expression values on biological pathways (Neuweger et al., 2009).
We have designed and implemented the RIA Qupe with the first aim of providing a software package that supports the complete workflow of a proteomics experiment based on tandem mass spectrometry and stable isotopic labelling of proteins. This includes standardized data management, data integration, documentation of experimental processes and, in particular, guidance on applicable analysis methods. With the presented range of methods for statistical evaluation, experimenters may draw reliable and meaningful conclusions from their data. Utilizing comprehensive approaches such as cluster analysis algorithms, experimenters may identify co-regulated proteins, and thereby gain new insights into the mechanisms of protein biosynthesis. As a second aim, we wanted to bring algorithms closer to the biologists, and developed the software as a so-called RIA. Qupe is accessible from any place where an Internet connection is available. This enables sharing of information and data not only between different departments such as a laboratory and an office but also between different universities or institutions. Following the concept of software as a service, no local installation or maintenance is required, while data integrity and security are preserved.
The range of functions of Qupe will be extended in the near future. For instance, other quantification algorithms will be supported, and new data format specifications will be taken into account, covering the recently released mzML (Mass Spectrometry Standards Working Group, 2008) and the analysisXML data format (Proteomics Informatics Standards Group, 2008).
The authors wish to thank the BRF system administrators for expert technical support. We would especially like to thank the workgroups of D. Becher (Greifswald University) and A. Poetsch (Bochum University) who kindly provided datasets and material.
Funding: BMBF in the frame of the QuantPro initiative (grant 0313812, to S.P.A. and S.L.); International Graduate School in Bioinformatics and Genome Research (to H.N.).
Conflict of Interest: none declared.