Multi-omics Visualization Platform: An extensible Galaxy plug-in for multi-omics data visualization and exploration

Abstract

Background
Proteogenomics integrates genomics, transcriptomics, and mass spectrometry (MS)-based proteomics data to identify novel protein sequences arising from gene and transcript sequence variants. Proteogenomic data analysis requires integration of disparate 'omic software tools, as well as customized tools to view and interpret results. The flexible Galaxy platform has proven valuable for proteogenomic data analysis. Here, we describe a novel Multi-omics Visualization Platform (MVP) for organizing, visualizing, and exploring proteogenomic results, adding a critically needed tool for data exploration and interpretation.

Findings
MVP is built as an HTML Galaxy plug-in, primarily based on JavaScript. Via the Galaxy API, MVP uses SQLite databases as input: a custom data type (mzSQLite) containing MS-based peptide identification information, a variant annotation table, and a coding sequence table. Users can interactively filter identified peptides based on sequence and data quality metrics, view annotated peptide MS data, and visualize protein-level information, along with genomic coordinates. Peptides that pass the user-defined thresholds can be sent back to Galaxy via the API for further analysis; processed data and visualizations can also be saved and shared. MVP leverages the Integrative Genomics Viewer JavaScript framework, enabling interactive visualization of peptides and corresponding transcript and genomic coding information within the MVP interface.

Conclusions
MVP provides a powerful, extensible platform for automated, interactive visualization of proteogenomic results within the Galaxy environment, adding a unique and critically needed tool for empowering exploration and interpretation of results. The platform is extensible, providing a basis for further development of new functionalities for proteogenomic data visualization.


1.
Although the manuscript is focused on MVP, the input files are heavily reliant on the previous steps that may involve a number of proteogenomics tools (Figure 1). It is not an issue for an experienced proteomics researcher, but it could be confusing for someone who is less familiar with the field. Thus I think a brief description of the process would be very helpful, including which tool generates each input file.
>This is an excellent suggestion. In response, we have added a paragraph immediately preceding Figure 1 in the text which describes available resources (workflows and documentation) for the upstream workflows which generate data that ultimately is used as inputs to MVP. Training documentation is available on the Galaxy Training Network online resource for these workflows. We have also added the links to these training resources in the Accessibility section.

2.
For a better understanding of the data structure in the variant_annotation table, I would suggest adding one example for each type of variant (single amino acid change, deletion, insertion etc.).
>Thank you for this suggestion. We have expanded Table 1, now including examples of frameshifts due to single nucleotide insertion/deletion, short indels and structural variants (fusion), and how these are depicted in the variant annotation table. We have also clarified that MVP reads only the "name TEXT" and "cigar TEXT" fields for displaying variant sequences. The other fields are descriptive information included in the table for completeness.

3.
In Table 2, the authors highlight the capability of the annotation table to represent gene fusion cases. I suggest that they include examples in the table to make the function more intuitive.
>Thank you for this suggestion. We have expanded Table 2, showing an example depiction of a fusion variant (the same fusion detailed in the variant_annotation examples in Table 1). The fusion is depicted as two entries in the feature_cds_map table, with one entry for each of the two chromosomal regions fused together to form the fusion gene and expressed gene products.

4.
Please check all the links in the manuscript. Several of them do not seem to be working (page 26, for example).
>We have checked all links in the manuscript, including those listed in References cited. They are all working at the time of this submission.

Reviewer #2: The authors of the paper have put together an excellent visualization tool in Galaxy to visualize and analyze proteogenomics data.
The tool's source code is available from GitHub and the visualization plugin is also available from usegalaxy.eu, which shows that the authors are trying to make the tool open and accessible to many users. I have tested the tool on usegalaxy.eu, and it lives up to the functionalities mentioned in the manuscript.
>We thank the reviewer for the positive comments on our described work.
However, I think the manuscript is not justifying the purpose and functionality of the tool correctly, so I advise the following suggestions to improve the manuscript.
1. Update legend for figure 1 mentioning the process shown in the figure or if the process is not essential then update figure just showing inputs.
>This is an excellent suggestion, and in response we have added a paragraph in the text describing the processes shown in Figure 1. Specifically, we have added text which details the upstream process for generating the protein FASTA sequence database from RNA-Seq data and also the process for matching MS/MS data to peptide sequences using this database. These are important upstream workflows which ultimately generate results which provide inputs for MVP. Importantly, we have developed accessible workflows and training documentation enabling interested researchers to generate the results which are used as inputs for MVP. We provide readers information in the Accessibility section on accessing these workflows and documentation through the online Galaxy Training Network.
2. As it is a technical paper, provide a database schema for input(s).
>We thank the reviewer for this suggestion. Including the schema will provide a more complete description of the inputs to MVP. We have added Additional File 2, which provides Entity Relationship diagrams for the schema for the databases and tables which act as input to MVP: the mzSQLite database, the variant_annotation table and the feature_cds_map table.
3. variant_annotation and feature_cds_map are sqlite tables in the databases with same name which could be confusing, change the text accordingly to make it clear.
>In the Operation section of the text, we made revisions to clarify that the variant_annotation and feature_cds_map are separate SQLite data tables, which are not contained in the same database. These are generated within the Galaxy History and used as data inputs to MVP. Because these are separate data tables, naming of columns etc. that may overlap between the tables does not create issues in MVP.
4. Page 9, line 1 "For each exon coding a protein,.." do you mean "For each coding exons of a protein"? Please clarify.
>We have changed the text as suggested by the reviewer to read: "For each coding exon of a translated protein…"
5. Page 10, line 7, the authors mention that plugin interacts with other input data types if present, does the plugin read these data files directly or needed to load manually, if it loads automatically is it based on naming convention? If possible provide an example in functionality section.
>In the Operation section, where we describe Figure 2 and the invocation of MVP from the Galaxy History, we have added text clarifying that MVP automatically accesses compatible inputs such as the variant_annotation and feature_cds_map tables through functionalities within MVP which read these specific inputs. We point the readers to the Functionality section of the manuscript where we further describe these functions and how these features automatically access these inputs from the active Galaxy history (see description of Figure 5, parts C and D).
6. Page 12, line 4, provide a direct link to sample dataset.
>Rather than provide links in this section of the text to the demonstration datasets and workflows, we point the reader to the Accessibility section of the manuscript where instructions and links are provided to access data and software.
>We have made revisions as suggested. Lorikeet is first introduced in the manuscript in the Operation section of the Findings, where a reference is provided with more information on the tool, and the purpose for its use is described. In response to comment #10 below, we have revised the description of Lorikeet in the Functionality section to give a more detailed description of its features.
10. Page 16, line 2 introduces the Lorikeet viewer. It seems like an informative tool, but what it is is not described, so please briefly describe what the user is looking at in Figure 4C.
>This is an excellent suggestion, and we have provided more detail in the text which describes the data viewed in Figure 4C.
11. Figure 5 legend, describe what other colors like grey amino acids either side and, different shades of orange (not in this figure, but I noticed in the demo).
>We have added additional text describing color schemes in the protein map (5A) and the zoomed-in visualization of specific peptide sequences (5B).
12. Page 19, line 1, part D in figure 4 is not present; do you mean figure 5?
>This has been corrected. We did mean to refer to part D of Figure 5.
13. Page 21, line 6, Mentioned Java, do you mean JavaScript?
>This has been corrected, and now reads "JavaScript".
14. Page 21, line 12 Author mentioned "our recently developed tool" Please provide a list and/or reference.
>We have revised and clarified: we are describing a published Galaxy tool which further assesses the functional impact of sequence variants, described in Reference 18.

Findings
Proteogenomics has emerged as a powerful approach to characterizing expressed protein products within a wide variety of studies [1][2][3][4][5]. Proteogenomics, a multi-omic approach, involves the integration of genomic and/or transcriptomic data with mass spectrometry (MS)-based proteomics data. Typically, a proteogenomics-based study starts with a sample (e.g. cells grown in culture, tissue sample etc.) which is analyzed using both next generation sequencing technologies (usually RNA-Seq) and MS-based proteomics. Once assembled from RNA-Seq data, the transcriptome sequence is translated in silico to generate a database of potentially expressed proteins encoded by the RNA. This protein sequence database contains both proteins of known sequences contained in reference databases, as well as novel protein sequences which are derived from the transcriptome sequence via comparison to reference genome sequence. These novel sequences may include variants arising from single-amino acid substitutions, short insertions/deletions, RNA processing events (truncations, splice variants) or even translation from unexpected genomic regions [2].
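As a minimal sketch of the in silico translation step, the following Python snippet translates a coding sequence with the standard genetic code and shows how a single-nucleotide change can produce a single-amino acid substitution. The function name `translate` and the toy sequences are illustrative assumptions; this is not the code used by the upstream Galaxy workflows, which also handle reading frames, strands and splice structure.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 0, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy example (hypothetical sequences): the reference codon AAT encodes
# asparagine (N); the variant codon AGT encodes serine (S).
reference = translate("ATGAATTAA")
variant = translate("ATGAGTTAA")
print(reference, variant)
```

Comparing the translated variant sequence against the reference translation is what flags a candidate novel sequence for inclusion in the protein database.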
Parallel to the RNA-Seq analysis, tandem mass spectrometry (MS/MS) data is collected from the same sample by fragmenting peptides derived from proteolytic digestion of extracted proteins.
Each MS/MS spectrum contains sequence-specific information on detected peptides. Sequence database searching software [6] is used to match MS/MS spectra to peptide sequences within the RNA-Seq derived protein sequence database, providing direct evidence of expression of not only reference protein sequences, but also novel sequences. Proteogenomics provides a powerful approach to collect direct evidence of expression of novel protein sequences specific to a sample of interest, which may not necessarily be present in reference sequence databases. The value of proteogenomics has been shown in studies of cancer and disease [3][4][5] as well as a means to annotate genomes [7].
As with other multi-omic approaches, proteogenomics presents some unique informatics challenges [8]. For one, data from different 'omic technologies (e.g. RNA-Seq and MS-based proteomics) must be processed using multiple domain-specific software. Once MS/MS spectra are matched to peptide sequences, further processing is necessary to ensure quality of the matches as well as to confirm novelty of any sequences identified which don't match to known reference sequences. Finally, novel sequences must be further visualized and characterized, assessing confidence based on quality of supporting transcript sequence information and exploring the nature of the novel sequence when mapped to its genomic coding region [9].
Galaxy [10] has proven a highly capable platform for meeting the requirements of multi-omic informatics, including proteogenomics, as described by us and others [11][12][13][14][15]. Its amenability to integration of disparate software in a unified, user-friendly environment, along with a variety of useful features including complex workflow creation, provenance tracking and reproducibility, addresses the challenges of proteogenomics. As part of our work developing Galaxy for proteomics (Galaxy-P [16]), we have focused on putting in place a number of tools for the various steps necessary for proteogenomics, from raw data processing and sequence database generation [9,11,12,17] to tools for interpreting the potential impact of identified sequence variants [18] and mechanisms of regulation indicated by RNA-protein response [19]. Others have also contributed to this growing community of proteogenomic researchers utilizing Galaxy to address their data analysis and informatics needs [11][12][13][14][15].
However, despite this community-driven effort to develop Galaxy for proteogenomics, there are still a few missing pieces critical for complete analysis of this type of multi-omics data. Currently, there is a significant lack of tools that could filter the results from upstream proteogenomic workflows, enabling further exploration of novel sequences, including visualization of these sequences along with supporting transcript and genomic mapping information. Such a tool is critical to allowing researchers to gain understanding of variants identified, and select those of most interest for further study. Although stand-alone software options exist for viewing proteogenomics results [20,21]

We have also developed accessible workflows and training material for generating upstream results which ultimately provide the inputs necessary for MVP operation, as shown in the workflows depicted in Figure 1. These include workflows for generating protein FASTA databases from RNA-Seq data, as well as matching of MS/MS data to peptide sequences via sequence database searching. Instructions for accessing these resources are described below in the Accessibility section.

It should be noted that this table can also represent structural variants that are common in some cancers [27], where the variant protein maps to exons that are found on different chromosomes and/or different strands from each other. These differences would be annotated in the appropriate columns within the feature_cds_map table.
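To illustrate how one row per coding region can describe a fusion, the following Python sketch stores a made-up fusion protein as two rows in an in-memory SQLite table and maps a peptide spanning the junction back to genomic intervals. The column names (`chrom_start`, `cds_start`, and so on), the 0-based half-open coordinate convention, the protein name `fusion_1` and all coordinates are assumptions chosen for illustration; the actual feature_cds_map schema is given in Additional File 2.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table holding a hypothetical two-region fusion protein."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE feature_cds_map (
               name TEXT, chrom TEXT, chrom_start INTEGER, chrom_end INTEGER,
               strand TEXT, cds_start INTEGER, cds_end INTEGER)"""
    )
    rows = [
        # One row per coding region; cds_* are nucleotide offsets in the CDS.
        ("fusion_1", "chr1", 1000, 1030, "+", 0, 30),
        ("fusion_1", "chr5", 5000, 5030, "-", 30, 60),
    ]
    conn.executemany("INSERT INTO feature_cds_map VALUES (?,?,?,?,?,?,?)", rows)
    return conn


def peptide_to_genomic(conn, protein, aa_start, aa_length):
    """Map a peptide (0-based residue offset and length) to genomic intervals."""
    nt_start = aa_start * 3
    nt_end = (aa_start + aa_length) * 3
    cursor = conn.execute(
        "SELECT chrom, chrom_start, chrom_end, strand, cds_start, cds_end "
        "FROM feature_cds_map "
        "WHERE name = ? AND cds_start < ? AND cds_end > ? "
        "ORDER BY cds_start",
        (protein, nt_end, nt_start),
    )
    regions = []
    for chrom, g_start, g_end, strand, cds_start, cds_end in cursor:
        # Offsets of the overlapping part, relative to this coding region.
        lo = max(nt_start, cds_start) - cds_start
        hi = min(nt_end, cds_end) - cds_start
        if strand == "+":
            regions.append((chrom, g_start + lo, g_start + hi, strand))
        else:
            # On the minus strand the CDS runs from chrom_end downwards.
            regions.append((chrom, g_end - hi, g_end - lo, strand))
    return regions


conn = build_demo_db()
# A four-residue peptide starting at residue 8 spans the fusion junction.
print(peptide_to_genomic(conn, "fusion_1", 8, 4))
```

Because each region carries its own chromosome and strand, a peptide crossing the junction simply yields two intervals, one per row.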
The MVP plugin is invoked from the mz.sqlite datatype generated within a Galaxy workflow.

Lorikeet renders a plot of peptide fragment ions and annotation from the PSM data generated from the sequence database search, offering users the ability to zoom and select or de-select specific annotation information for the peptide. This allows users to visually explore data quality for PSMs of interest, including those putatively matching novel sequences [9].
Lorikeet functionality is described in more detail below in the Functionality section.
MVP also leverages the Integrative Genomics Viewer JavaScript framework (IGVjs) [30]. Using the genomic reference sequence information contained in the feature_cds_map file corresponding to identified peptide sequences, IGVjs can be automatically launched within the MVP interface.
IGVjs offers interactive viewing of peptides mapped against the reference genome, and can also add tracks for standard-format sequence files (e.g. BAM, ProBAM [31], BED) if present in the active Galaxy History, interacting through the Galaxy API. IGVjs provides users a flexible tool for viewing all levels of information for an identified peptide sequence, from genomic mapping to the supporting transcript sequencing information.
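The selection of compatible track files from a History can be sketched as follows. This Python example assumes that the history contents are available as a list of dictionaries with `id`, `name`, `extension`, `state` and `deleted` keys, as returned by Galaxy's history contents API; the helper name `find_track_datasets`, the extension strings and the example records are hypothetical, and MVP itself performs this step in JavaScript.

```python
# Extensions treated as loadable genome-browser tracks (assumed spellings).
TRACK_EXTENSIONS = {"bam", "probam", "bed"}


def find_track_datasets(history_contents, extensions=TRACK_EXTENSIONS):
    """Return usable datasets whose format can be shown as a track."""
    tracks = []
    for item in history_contents:
        # Skip deleted datasets and those that are not ready for use.
        if item.get("deleted") or item.get("state") != "ok":
            continue
        extension = (item.get("extension") or "").lower()
        if extension in extensions:
            tracks.append(
                {"id": item["id"], "name": item["name"], "format": extension}
            )
    return tracks


# Hypothetical history contents, shaped like an API response.
example_history = [
    {"id": "a1", "name": "peptides.mz.sqlite", "extension": "mz.sqlite",
     "state": "ok", "deleted": False},
    {"id": "b2", "name": "aligned_reads.bam", "extension": "bam",
     "state": "ok", "deleted": False},
    {"id": "c3", "name": "old_regions.bed", "extension": "bed",
     "state": "ok", "deleted": True},
    {"id": "d4", "name": "psm.probam", "extension": "probam",
     "state": "running", "deleted": False},
]

for track in find_track_datasets(example_history):
    print(track["name"], track["format"])
```

Selecting by datatype rather than by file name means users do not need to follow a naming convention for their History items.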
It is important to note that the outputs generated by MVP processing can be used as an input for further analysis within a Galaxy history. For example, selected peptide sequences (e.g. novel sequences verified within MVP) can be sent back to the active History via the Galaxy API where they can be further processed using Galaxy tools as desired by the user. Annotated MS/MS spectra for PSMs of interest visualized within the Lorikeet viewer can also be downloaded to the

Functionality
In order to demonstrate functionality of MVP, we have chosen a previously published dataset containing MS-based proteomic and RNA-Seq data generated from a mouse cell sample [32]. This dataset provides representative multi-omic data mimicking other contemporary proteogenomic studies, and a means to illustrate how MVP enables data exploration steps commonly pursued by researchers. The tour of MVP functionality presented here works from input data produced within a Galaxy workflow. We have made workflows available to generate input data needed for a user to explore the functionality of MVP, along with documentation describing their operation (see Accessibility section below for instructions and links to access this data).
We begin with a view of the MVP user interface, launched as a plugin from an mzSQLite data input within the active Galaxy History (See    Figure 4). One of these peptides (sequence DGDLENPVLYSGAVK) has been selected in this list, and the button "PSMs for Selected Peptides" clicked to display the two PSMs that matched to this sequence, along  Figure 4). Double-clicking on one of these PSMs opens the Lorikeet MS/MS viewer (labeled C in Figure 4). Lorikeet [29] renders MS/MS spectra, providing a visualization of the annotated spectra which led to a PSM using the upstream sequence database searching software. Figure 4C shows an example PSM, where the blue and red colored m/z peak values correspond to amino acid fragments which would be predicted to derive from the peptide sequence identified by this PSM. The higher the number of peaks within the spectrum matching predicted amino acid fragment m/z values, the higher the quality and confidence of the PSM and identified peptide sequence. Lorikeet is interactive, capable of magnifying spectral regions of interest, selecting desired predicted fragment types to display, and adjusting data parameters (such as mass accuracy of acquired data) which are commonly used in assessing PSM quality. Within MVP, this tool provides a necessary function for users to view PSMs of interest, and is particularly useful for assessing the accuracy of matches to variant peptide sequences in proteogenomic applications, which require extra scrutiny compared to matches to reference peptides [9].
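The idea of matching observed peaks to predicted fragment m/z values can be sketched in Python as below. The snippet computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses and counts observed peaks that fall within a mass tolerance. It is an illustrative sketch, not Lorikeet's implementation: the function names, the 0.01 m/z tolerance and the example peak list are assumptions, and modifications and higher charge states are ignored.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z lists for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)           # N-terminal fragment
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


def count_matched_peaks(observed_mz, peptide, tolerance=0.01):
    """Count observed peaks within the tolerance of any predicted fragment m/z."""
    b_ions, y_ions = fragment_ions(peptide)
    predicted = b_ions + y_ions
    return sum(
        1 for mz in observed_mz
        if any(abs(mz - p) <= tolerance for p in predicted)
    )


b_ions, y_ions = fragment_ions("DGDLENPVLYSGAVK")
# Hypothetical peak list: two peaks near predicted ions, one unrelated peak.
observed = [b_ions[0], y_ions[0] + 0.005, 50.0]
print(count_matched_peaks(observed, "DGDLENPVLYSGAVK"))
```

Tightening or loosening the tolerance corresponds to the mass accuracy parameter that users adjust when judging PSM quality.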
Once the quality of a given PSM has been adequately assessed, a common user need is viewing the peptide sequence in the context of its aligned protein sequence. MVP provides this functionality via the Peptide-Protein Viewer button (available in the Peptide-Protein Viewer pane, labeled B in Figure 4). This provides a listing of all proteins within the FASTA database used for generating PSMs which contain the selected peptide sequence. For example, Figure 5 shows the Peptide-Protein Viewer for DGDLENPVLYSGAVK (peptide sequence from

Figure 6 shows the IGV viewer, with several tracks of information loaded from the active Galaxy History for investigating the genomic region coding for the peptide DGDLENPVLYSGAVK shown in Figure 5 above. This display shows genomic, transcriptomic and proteomic information related to this peptide sequence.
C) Track C shows the identified peptide mapped to the genomic coordinates shown above. The arrows indicate the direction of translation against the genomic coding sequence.
D) This track summarizes the transcript sequencing reads assembled from the RNA-Seq data. This allows the user to assess the quality of supporting transcript information that led to the generation of the peptide sequence that was matched to the MS/MS data. The assembled transcript sequence read data was loaded from the active Galaxy History, contained in a standard-format .bam file for assembled transcript sequencing data.
The peptide identified here contains a single amino acid variant at the serine in position 11 from the N-terminal end of the peptide. Ordinarily, this peptide contains an asparagine at this position, as indicated by the codon AAT in the reference DNA sequence track (Track A in Figure 6). The assembled transcript data indicates a single-nucleotide mutation within this codon, showing a C nucleotide substitution within several of the assembled reads on the negative strand (Track D in Figure 6).

We have provided a guided description of the main features offered by the MVP tool. Although many powerful features are already in place to meet the requirements of proteogenomic data analysis, MVP has been developed as an extensible framework with much potential for continued enhancement and new functionalities. Tools are already implemented in Galaxy for peptide-level quantification using label-free intensity-based measurements [34,35], which could be added to the information available for PSMs, enabling users to assess quality of abundance measurements and potentially filter for PSMs showing differential abundance across experimental conditions.
The HTML5, JavaScript, and CSS-based architecture of MVP provides the ability to interact with RESTful web services offered by complementary tools and databases, as well as with the Galaxy API. We envision extending functionalities in MVP, offering users the ability to query knowledge bases [36,37] to explore known disease associations, interaction networks and biochemical pathways of proteins of interest. MVP also has the potential to display visualizations returned from queries against these knowledge bases. Validated peptides of interest can also easily be sent back to the Galaxy History for further analysis, for example using available tools for assessing functional impact of sequence variants identified via proteogenomics [18].

Implementation:
The MVP plugin is built on HTML5, CSS and JavaScript. contains all the input files necessary for full operation of MVP.

Performance:
The application's performance, as perceived by the end-user, is dependent on the server infrastructure that Galaxy is hosted on, and the end-user's local machine used in accessing the Galaxy web application. The MVP application relies on the existing Galaxy API framework.
Therefore, the application will benefit from the existing Galaxy server infrastructure without any configuration needed from the application. API response from Galaxy to the MVP application will scale with the performance of the supporting server.
Though the underlying database (mzSQLite data type) is a simple SQLite3 database, care has been taken to optimize performance. During database construction, multiple indexes are generated for every

Availability of Supporting Data
Other data supporting this work, including snapshots of our code, are available in the GigaScience repository, GigaDB [41].

Additional Material
Additional File 1 is included as a .pdf formatted file. Visualization is shown both in the MVP Peptide-Protein Viewer and in the IGVjs viewer.
Additional File 2 is included as a .pdf formatted file.
Title: Schema of databases and tables acting as input to MVP.

The Editor
GigaScience

To the Editor:
Please find enclosed a revised manuscript titled "Multi-omics Visualization Platform: An extensible Galaxy plug-in for multi-omics data visualization and exploration" which we are submitting as a Technical Note as part of the Galaxy Series for consideration in GigaScience.
We have revised the manuscript according to the reviewer comments. We believe these revisions have strengthened the manuscript. We have also provided a point-by-point description of our revisions to the reviewer comments.
We have no competing interests to declare. All authors have reviewed and approved the submission. The content of the manuscript has not been published or submitted for publication elsewhere.
Thank you for considering our manuscript.
Sincerely,
Tim Griffin, PhD
Professor of Biochemistry, Molecular Biology and Biophysics