ms-data-core-api: an open-source, metadata-oriented library for computational proteomics

Summary: The ms-data-core-api is a free, open-source library for developing computational proteomics tools and pipelines. The Application Programming Interface, written in Java, enables rapid tool creation by providing a robust, pluggable programming interface and a common data model. The data model is based on controlled vocabularies/ontologies and captures the whole range of data types included in common proteomics experimental workflows, going from spectra to peptide/protein identifications to quantitative results. The library contains readers for three of the most widely used Proteomics Standards Initiative standard file formats: mzML, mzIdentML, and mzTab. In addition to mzML, it also supports other common mass spectra data formats: dta, ms2, mgf, pkl, apl (text-based), mzXML and mzData (XML-based). It can also be used to read PRIDE XML, the original format used by the PRIDE database, one of the world-leading proteomics resources. Finally, we present a set of algorithms and tools whose implementation illustrates the simplicity of developing applications using the library. Availability and implementation: The software is freely available at https://github.com/PRIDE-Utilities/ms-data-core-api. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: juan@ebi.ac.uk


Design and Implementation (ms-data-core-api)
The data object module is an abstraction layer between the data and the data representation (Figure 1). It is implemented using plain Java objects, which are the core objects used to handle information across different input formats. Different formats often do not have a unified way of representing the same information, and they may also cover different aspects of the experimental details. For instance, spectrum-related metadata are formatted differently in the mzML format (Martens, et al., 2011) and in PRIDE XML, and peptide and protein identifications are also represented differently in PRIDE XML, mzIdentML (Jones, et al., 2012) and mzTab (Griss, et al., 2014). The data model represents a standardized view of the information in the underlying data sources. While reading, the raw content from the data source is first converted into objects of the data module. This data transformation naturally depends on the input data source, and a utility API (Application Programming Interface) is provided to facilitate the extraction of the information from the original data source.

Figure 1: Data Object Model
The current Data Object Model supports file formats containing spectrum, identification and quantitation information (Table 1). Specifically, it supports three major PSI (Proteomics Standards Initiative) data standard formats: mzML, mzIdentML and mzTab. Every file format is read using a file-specific reader and translated to the Data Object Model using Transformers. The Data Object Model consists of different classes representing the main data types in proteomics studies, such as chromatogram, spectrum, peptide and protein. A novel cache system was implemented in order to improve the performance and reduce the memory usage of the library. This cache system is especially useful for GUI (Graphical User Interface) components that require concurrent operations on the same data. Finally, a set of controllers that extend a general DataAccessController interface enables data retrieval from the Data Object Model.

Data Model
The Data Object Model comprises a set of Java classes that model the most relevant information included in an MS proteomics experiment, going from the sample preparation to the final experimental results (identification and quantification). All the classes in the Data Object Model extend the ParamGroup class to store and handle the associated metadata, e.g. terms related to protein scores are stored as controlled vocabulary (CV) terms in the ParamGroup class.
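The inheritance pattern described above can be sketched as follows. This is a minimal illustration only: the classes CvParamSketch, ParamGroupSketch and ProteinSketch are simplified stand-ins and do not reproduce the library's actual classes or method signatures, and the accession EXAMPLE:0000001 is a hypothetical placeholder rather than a real CV term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Hypothetical stand-in for a controlled vocabulary (CV) term with a value.
final class CvParamSketch {
    final String cvLabel;    // name of the CV the term comes from
    final String accession;  // term accession within that CV
    final String name;       // human-readable term name
    final String value;      // value attached to the term

    CvParamSketch(String cvLabel, String accession, String name, String value) {
        this.cvLabel = Objects.requireNonNull(cvLabel);
        this.accession = Objects.requireNonNull(accession);
        this.name = Objects.requireNonNull(name);
        this.value = Objects.requireNonNull(value);
    }
}

// Hypothetical stand-in for a metadata container shared by all model classes.
class ParamGroupSketch {
    private final List<CvParamSketch> cvParams = new ArrayList<>();

    void addCvParam(CvParamSketch param) {
        cvParams.add(param);
    }

    List<CvParamSketch> getCvParams() {
        return Collections.unmodifiableList(cvParams);
    }

    // Looks up the value of the first CV term with the given accession.
    Optional<String> valueOf(String accession) {
        return cvParams.stream()
                .filter(p -> p.accession.equals(accession))
                .map(p -> p.value)
                .findFirst();
    }
}

// A model class inherits the metadata handling instead of re-implementing it.
class ProteinSketch extends ParamGroupSketch {
    final String accession;

    ProteinSketch(String accession) {
        this.accession = accession;
    }
}
```

Because every model class inherits the same metadata container, code that reads scores or other annotations does not need to know which input format the object came from.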

Cache Design
The general idea of caching is to reuse content to avoid repeating expensive operations. In the ms-data-core-api, this is a clear requirement since the average size of the output files in proteomics experiments keeps increasing. Therefore, loading the entire file in memory is no longer feasible in many cases. On the other hand, requesting the data content directly from the source file is often limited by the speed of the file's storage medium. The processing cost of selecting a value from a file (e.g. a spectrum, peptide or protein) is fairly high compared to the cost of having the value stored in memory. It is therefore sensible to implement caching strategies that keep frequently used values in the application instead of retrieving them from the storage medium every time. Most frameworks and tools nowadays have integrated caching mechanisms.
Latency and hit rate are the two primary factors to consider when designing a cache. We aimed at achieving a balance between the two while designing the cache for the ms-data-core-api. Therefore, we offer two levels of caching. For ad-hoc user interactions (such as the selection of a protein or a peptide), we use the Least Recently Used (LRU) caching algorithm. The main argument behind this algorithm is that when users select a protein/peptide, they are likely to do further investigation on the selected entity.
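The LRU policy mentioned above can be illustrated with the standard java.util.LinkedHashMap in access-order mode. This is a minimal sketch of the general technique, not the library's own cache implementation; the class name LruCacheSketch and the capacity parameter are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: the least recently accessed entry is evicted first.
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        // accessOrder = true: iteration order follows recency of access,
        // so the "eldest" entry is the least recently used one.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the eldest entry.
        return size() > maxEntries;
    }
}
```

With a capacity of two, inserting P1 and P2, reading P1 and then inserting P3 evicts P2, because P1 was used more recently. Note that LinkedHashMap is not thread-safe, so concurrent use would require external synchronization.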
For complex interactions (such as the generation of a file-wide plot), another level of caching is designed to store the references between entities and their file system offsets. These cache entries are always available, which avoids a full file scan.
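The idea of caching entity-to-offset references can be sketched as below. The example assumes a hypothetical line-based file in which every record starts with a line of the form "ID=<identifier>" followed by one payload line; this format, the class name OffsetIndexSketch and its methods are illustrative assumptions and are not taken from the library.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

final class OffsetIndexSketch {
    private final Path file;
    // Record identifier -> byte offset of the record's "ID=" line.
    private final Map<String, Long> offsets = new HashMap<>();

    // Builds the index with a single sequential pass over the file.
    OffsetIndexSketch(Path file) {
        this.file = file;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long lineStart = 0L;
            String line;
            while ((line = raf.readLine()) != null) {
                if (line.startsWith("ID=")) {
                    offsets.put(line.substring(3).trim(), lineStart);
                }
                lineStart = raf.getFilePointer();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Jumps straight to the cached offset instead of rescanning the file.
    String readRecord(String id) {
        Long offset = offsets.get(id);
        if (offset == null) {
            return null;
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            raf.readLine();        // skip the "ID=" line
            return raf.readLine(); // payload line of the record
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the example: writes the given text to a temporary file.
    static Path writeTempFile(String content) {
        try {
            Path path = Files.createTempFile("offset-index", ".txt");
            Files.write(path, content.getBytes(StandardCharsets.US_ASCII));
            return path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only the identifiers and offsets are held in memory, so the index stays small even when the records themselves are large.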
While designing cache strategies, one also needs to balance memory consumption against performance. A critical factor when using caching in Java is the size of the cache: when the cache grows too large, the Java Garbage Collector has to run more often, which consumes time. This can lead to a gradual degradation of performance, or the application may even crash if it exceeds the memory limit.
The ms-data-core-api controls the balance between memory consumption and fast access to the data using two-level HashMaps (Figure 3). Most of the objects, such as spectra, peptides and proteins, can be kept in memory for fast access. For this purpose, most of the data structures in the API have a key-value representation. A global map is then used to group all the cache structures. A maximum cache size is defined for each structure in order to avoid out-of-memory problems. Some values, such as precursor ion mass and precursor ion charge, are also stored in the cache since they are frequently accessed by third-party tools. The cache maps store not only complete objects but also the relationships between some frequently accessed data structures and their properties. This is one key feature of the library: it allows fast access to the data without the need to load the complete data structure in memory. For example, the cache modification map ensures fast access to the modifications of each PSM without the need to retrieve all the PSMs from the identification file.
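A two-level layout of this kind can be sketched as a global map whose values are per-category, size-bounded maps. The sketch below is an assumption-laden illustration: the class TwoLevelCacheSketch, the Category names and the per-category limits are hypothetical, and it uses an EnumMap and bounded LinkedHashMaps in place of the plain HashMaps mentioned in the text.

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class TwoLevelCacheSketch {
    // Hypothetical cache categories; the library's own categories may differ.
    enum Category { SPECTRUM, PEPTIDE, PROTEIN, PRECURSOR_CHARGE, PSM_MODIFICATIONS }

    // First level: one entry per category. Second level: key -> cached value.
    private final Map<Category, Map<Object, Object>> caches = new EnumMap<>(Category.class);

    // Registers a category with its own maximum number of entries.
    void register(Category category, final int maxEntries) {
        caches.put(category, new LinkedHashMap<Object, Object>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Object, Object> eldest) {
                return size() > maxEntries; // bound memory use per category
            }
        });
    }

    void store(Category category, Object key, Object value) {
        inner(category).put(key, value);
    }

    Object retrieve(Category category, Object key) {
        return inner(category).get(key);
    }

    int size(Category category) {
        return inner(category).size();
    }

    private Map<Object, Object> inner(Category category) {
        Map<Object, Object> map = caches.get(category);
        if (map == null) {
            throw new IllegalStateException("No cache registered for " + category);
        }
        return map;
    }
}
```

Keeping a separate limit per category lets small, frequently read values (such as precursor charges) use a generous limit while large objects (such as full spectra) use a tight one.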
The ms-data-core-api depends on a series of libraries and native readers developed by the PRIDE team and other members of the PSI community. Some of these libraries were improved in terms of functionality and performance. Specifically, the libraries for XML-based formats, such as jmzML (Cote, et al., 2010) and jmzIdentML (Reisinger, et al., 2012), were optimized for performance, making it possible for the first time to load large mzIdentML files together with the corresponding mass spectra files (Reisinger, et al., 2012). The benchmark was run using different mzIdentML files during the submission process to PRIDE Archive.

PRIDE Utilities: Computational proteomics functionalities
One of the dependencies of ms-data-core-api is the PRIDE Utilities library (https://github.com/PRIDE-Utilities/pride-utilities). The PRIDE Utilities module contains a series of algorithms that extend the functionality of the ms-data-core-api. The amino acid mass table, pK values and hydrophobicity indexes are some of the values defined in PRIDE Utilities. The module also contains the mappings between different controlled vocabulary (CV) or ontology terms that denote the same concept, e.g. the 'b ion' annotation can be expressed using the PRIDE ontology term 'PRIDE:0000194' or the PSI-MS CV term 'MS:1001224'. In this way, the module homogenizes the terms and concepts used in metadata annotations. For instance, the library contains the definitions of the well-established search engines and processing software. It also contains Java-based functions for string validation and complex math functions.
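The term mapping described above can be illustrated with a lookup table that assigns every known accession to one shared concept. The class CvTermMappingSketch and its methods are hypothetical and do not reproduce the PRIDE Utilities API; only the two accessions for the 'b ion' concept come from the text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class CvTermMappingSketch {
    // Accession from any CV or ontology -> shared concept name.
    private final Map<String, String> accessionToConcept = new HashMap<>();

    // Declares that all the given accessions denote the same concept.
    void register(String concept, String... accessions) {
        for (String accession : accessions) {
            accessionToConcept.put(accession, concept);
        }
    }

    Optional<String> conceptOf(String accession) {
        return Optional.ofNullable(accessionToConcept.get(accession));
    }

    // Two accessions are equivalent when both map to the same known concept.
    boolean sameConcept(String first, String second) {
        Optional<String> concept = conceptOf(first);
        return concept.isPresent() && concept.equals(conceptOf(second));
    }
}
```

Registering "b ion" with the accessions PRIDE:0000194 and MS:1001224 makes sameConcept return true for that pair, so downstream code can compare annotations regardless of which vocabulary the input file used.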

Isoelectric point algorithm:
The value of the isoelectric point (pI) can be used as a filtering technique to validate peptide identifications (Perez-Riverol, et al., 2012). In the PRIDE Utilities library, the theoretical isoelectric point of proteins and peptides is calculated using the method published by Bjellqvist and coworkers (Bjellqvist, et al., 1993). The pI is calculated using the pK values of the amino acids. These values were defined by examining polypeptide migration between pH 4.5 and 7.3 in an immobilized pH gradient gel environment with 9.2 M and 9.8 M urea. The authors reported a standard deviation of 0.2 units for the entire pH range. A previous comparison showed that the algorithm works for all the fractions and over a wide pH range (Perez-Riverol, et al., 2012). In the future, other implementations could be plugged into the library (Perez-Riverol, et al., 2012).
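The general principle of deriving a pI from pK values can be sketched as a search for the pH at which the net charge is zero. The code below is a generic Henderson-Hasselbalch illustration under stated assumptions: the pK values are supplied by the caller, the Bjellqvist pK table and any sequence-position-specific handling are not reproduced, and the class and method names are hypothetical.

```java
final class IsoelectricPointSketch {

    // Net charge at a given pH: positively charged groups contribute
    // 1 / (1 + 10^(pH - pK)), negatively charged groups 1 / (1 + 10^(pK - pH)).
    static double netCharge(double pH, double[] positivePk, double[] negativePk) {
        double charge = 0.0;
        for (double pk : positivePk) {
            charge += 1.0 / (1.0 + Math.pow(10.0, pH - pk));
        }
        for (double pk : negativePk) {
            charge -= 1.0 / (1.0 + Math.pow(10.0, pk - pH));
        }
        return charge;
    }

    // Bisection over pH 0..14; the net charge decreases monotonically with pH.
    // Assumes at least one positive and one negative group are supplied.
    static double isoelectricPoint(double[] positivePk, double[] negativePk) {
        double low = 0.0;
        double high = 14.0;
        while (high - low > 1e-4) {
            double mid = (low + high) / 2.0;
            if (netCharge(mid, positivePk, negativePk) > 0.0) {
                low = mid;  // still positively charged: pI lies at higher pH
            } else {
                high = mid; // neutral or negative: pI lies at lower pH
            }
        }
        return (low + high) / 2.0;
    }
}
```

For a molecule with one basic group (illustrative pK 9.0) and one acidic group (illustrative pK 2.0), the two contributions cancel by symmetry midway between the pK values, so the search converges to a pI of about 5.5.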

Gravy Index algorithm:
The GRAVY (Grand Average of Hydropathy) value for a peptide or protein is calculated as the sum of the hydropathy values (Kyte and Doolittle, 1982) of all the amino acids, divided by the number of residues in the sequence. The GRAVY index can be used to measure the hydrophobicity of a specific peptide/protein in the sample (Ramos, et al., 2011; Ramos, et al., 2008). The values used for the different amino acids are the following:
Amino acid    Value
Ala           1.800
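The calculation described above can be sketched as follows. The hydropathy values in the table are the standard published Kyte and Doolittle (1982) scale for the 20 common amino acids, keyed here by one-letter code; the class and method names are hypothetical and the code is not taken from the PRIDE Utilities implementation.

```java
import java.util.HashMap;
import java.util.Map;

final class GravySketch {
    // Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
    private static final Map<Character, Double> HYDROPATHY = new HashMap<>();
    static {
        HYDROPATHY.put('A', 1.8);  HYDROPATHY.put('R', -4.5);
        HYDROPATHY.put('N', -3.5); HYDROPATHY.put('D', -3.5);
        HYDROPATHY.put('C', 2.5);  HYDROPATHY.put('Q', -3.5);
        HYDROPATHY.put('E', -3.5); HYDROPATHY.put('G', -0.4);
        HYDROPATHY.put('H', -3.2); HYDROPATHY.put('I', 4.5);
        HYDROPATHY.put('L', 3.8);  HYDROPATHY.put('K', -3.9);
        HYDROPATHY.put('M', 1.9);  HYDROPATHY.put('F', 2.8);
        HYDROPATHY.put('P', -1.6); HYDROPATHY.put('S', -0.8);
        HYDROPATHY.put('T', -0.7); HYDROPATHY.put('W', -0.9);
        HYDROPATHY.put('Y', -1.3); HYDROPATHY.put('V', 4.2);
    }

    // GRAVY = sum of residue hydropathy values / number of residues.
    static double gravy(String sequence) {
        if (sequence == null || sequence.isEmpty()) {
            throw new IllegalArgumentException("Sequence must not be empty");
        }
        double sum = 0.0;
        for (char residue : sequence.toUpperCase().toCharArray()) {
            Double value = HYDROPATHY.get(residue);
            if (value == null) {
                throw new IllegalArgumentException("Unknown residue: " + residue);
            }
            sum += value;
        }
        return sum / sequence.length();
    }
}
```

For the dipeptide "AR", the sum is 1.8 + (-4.5) = -2.7 and the GRAVY value is -2.7 / 2 = -1.35; positive values indicate a more hydrophobic sequence.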

Exporting to mzTab
The ms-data-core-api library includes a set of exporting options from PRIDE XML and mzIdentML files to the mzTab format. The exporting options allow the annotation of mzTab files using the mapping terms from the pride-utilities library and annotate the missing metadata in the original files with default information (such as searched database, software and protein modifications terms).
The present version also includes a set of filters to select the high-quality data from mzIdentML files. These filters can be applied when the user is interested only in high-quality peptide and protein identifications, rather than in exporting all the results available in the mzIdentML files to mzTab. The current version of the filter includes the following rules:
• If there is no "protein detection protocol" element in mzIdentML (e.g. no protein ambiguity groups are provided), the filtering process cannot be done at the protein level directly. In this case:
  § If there is no threshold available at the spectrum identification protocol, the spectra are filtered using the rank information. Only spectra with rank=1 pass the filter.
  § If there is a threshold available at the spectrum identification protocol, the spectra are filtered using the provided threshold information.
  § Only the proteins whose PSMs remain after the filtering will be kept in the exported mzTab file.
• If there is a "protein detection protocol" element in mzIdentML, the proteins and protein groups are filtered according to the threshold information first.
  o After that, the filtering by threshold at the peptide level is applied, because in the worst-case scenario it will remove only proteins without spectra evidence that pass the filter. Previously, "NoPeptideFilter" was used to avoid inconsistencies with the protein filter. However, it was observed that some spectra evidence that did not pass the threshold was included, because the threshold was provided but was incorrectly annotated in the file as "NoThresholdAvailable". This option minimizes the inclusion of spectra that do not pass the threshold.
• If there is no threshold information available at the protein or peptide level:
  o The spectra are filtered using the PSM rank information. Only PSMs with rank=1 pass the filter.
  o Only the proteins whose PSMs remain after the filtering process will be kept in the exported mzTab file.
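The filter rules above can be read as a small decision procedure. The sketch below encodes one possible reading of those rules as an ordered list of steps; the class MzTabFilterPlanSketch, the Step names and the three boolean inputs are hypothetical, and cases the text does not spell out (for example, a protein detection protocol without a usable protein threshold) are resolved here by assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class MzTabFilterPlanSketch {
    enum Step {
        PROTEIN_THRESHOLD,                 // filter proteins and groups by threshold
        PSM_THRESHOLD,                     // filter PSMs by the provided threshold
        PSM_RANK_ONE,                      // keep only PSMs with rank=1
        KEEP_PROTEINS_WITH_REMAINING_PSMS  // drop proteins left without PSMs
    }

    static List<Step> plan(boolean hasProteinDetectionProtocol,
                           boolean hasProteinThreshold,
                           boolean hasPsmThreshold) {
        List<Step> steps = new ArrayList<>();
        if (hasProteinDetectionProtocol && hasProteinThreshold) {
            // Protein-level filtering first, then the peptide-level threshold.
            steps.add(Step.PROTEIN_THRESHOLD);
            if (hasPsmThreshold) {
                steps.add(Step.PSM_THRESHOLD);
                steps.add(Step.KEEP_PROTEINS_WITH_REMAINING_PSMS);
            }
        } else if (hasPsmThreshold) {
            // No protein-level filtering possible: rely on the PSM threshold.
            steps.add(Step.PSM_THRESHOLD);
            steps.add(Step.KEEP_PROTEINS_WITH_REMAINING_PSMS);
        } else {
            // No threshold information at all: fall back to the PSM rank.
            steps.add(Step.PSM_RANK_ONE);
            steps.add(Step.KEEP_PROTEINS_WITH_REMAINING_PSMS);
        }
        return steps;
    }
}
```

Expressing the rules as a plan keeps the decision logic separate from the code that actually removes PSMs and proteins, which makes each rule easy to check in isolation.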

Description
The jmzReader library is a collection of Java APIs to parse the most commonly used MS peak list formats. All parsers are optimized to be used in conjunction with mzIdentML. Based on a custom-built class that efficiently parses text files line by line, all parsers can handle arbitrarily large files with minimal memory, allowing easy and efficient processing of peak list files using the Java programming language. mzIdentML files do not contain spectra data but refer to external peak list files. All peak list parsers support the methods used by mzIdentML to reference external spectra and implement a common interface. Thus, when developing software for mzIdentML, programmers no longer have to support multiple peak list file formats but only this one interface.
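The benefit of a common interface can be sketched as follows. The interface PeakListSourceSketch and the class MgfSourceSketch are hypothetical and do not reproduce the jmzReader API; the example only relies on the BEGIN IONS / END IONS block delimiters of the MGF format. For brevity it holds the whole text in memory, so it illustrates the shared interface and not the memory-efficient parsing described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical common view over any peak list format.
interface PeakListSourceSketch {
    int getSpectraCount();

    // Returns the raw text of the spectrum at the given 1-based position.
    String getSpectrumByIndex(int index);
}

// One format-specific implementation; other formats would implement
// the same interface, so client code never depends on the file format.
final class MgfSourceSketch implements PeakListSourceSketch {
    private final List<String> spectra = new ArrayList<>();

    MgfSourceSketch(String mgfText) {
        StringBuilder current = null;
        for (String line : mgfText.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.equals("BEGIN IONS")) {
                current = new StringBuilder();       // start of a spectrum block
            } else if (trimmed.equals("END IONS")) {
                if (current != null) {
                    spectra.add(current.toString().trim());
                }
                current = null;                      // end of the block
            } else if (current != null) {
                current.append(trimmed).append('\n'); // line inside a block
            }
        }
    }

    @Override
    public int getSpectraCount() {
        return spectra.size();
    }

    @Override
    public String getSpectrumByIndex(int index) {
        return spectra.get(index - 1);
    }
}
```

Client code written against PeakListSourceSketch can resolve a spectrum reference by position without knowing whether the underlying file is MGF or another format.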

License
Apache 2

Algorithms and tools built on top of the ms-data-core-api
A set of different algorithms and tools has been developed on top of the ms-data-core-api. In this section we will describe some of these libraries and tools.

PRIDE Inspector Toolsuite
The new version of the PRIDE Inspector tool (https://github.com/PRIDE-Toolsuite/pride-inspector) makes use of the ms-data-core-api as the main library for data source handling and representation. The main goal of this tool is to visualize the ProteomeXchange 'complete' submissions in PRIDE (Figure 4). The new PRIDE Inspector 2 supports the visualization of spectra, chromatograms, protein groups, proteins, PSMs and the corresponding metadata (scores, thresholds, quantitative values, etc.).

PRIDE submission pipeline
The PRIDE database (http://www.ebi.ac.uk/pride/archive/) (Vizcaino, et al., 2013) makes use of the ms-data-core-api during the submission process of ProteomeXchange 'complete' submissions. The 'complete' submissions are based on the file formats mzIdentML and PRIDE XML. Additionally, they contain a set of mass spectra files associated with the identification files, which are handled during the submission process. All the properties related to each assay, such as sample details, instruments, number of identified proteins and number of unique peptides, among others, are retrieved from the files using the ms-data-core-api (Figure 5).

HI-bone
HI-bone (Perez-Riverol, et al., 2013) is an approach for scoring MS/MS identifications based on the high mass accuracy matching of precursor ions, the identification of a high intensity b1 fragment ion, and partial sequence tags from phenylthiocarbamoyl-derivatized peptides. This derivatization process boosts the b1 fragment ion signal, which turns it into a powerful feature for peptide identification. The ms-data-core-api is used to retrieve the information of each spectrum and to store the results.

Protein Inference Algorithms (PIA)
The Protein Inference Algorithm (PIA) suite (https://github.com/mpc-bioinformatics/pia), written in Java, includes a fully parametrisable web interface (using Java Server Faces), which combines PSMs from different experiments and/or search engines, and reports consistent and comparable results (Figure 6). None of the parameters for the protein inference process (e.g. filtering or scoring) are fixed, as they were in prior approaches. Instead, they are kept as flexible as possible to enable any adjustments needed by the user. The library was built on top of the ms-data-core-api, demonstrating that developers only need to focus on the new algorithms and tools, without dealing with the details of common data structures and file handling. The PIA set of algorithms can be applied to all the ms-data-core-api supported formats containing peptide identification data (PRIDE XML, mzIdentML and mzTab).