Pathway Commons: 2019 Update

Pathway Commons (https://www.pathwaycommons.org) is an integrated resource of publicly available information about biological pathways, including biochemical reactions, assembly of biomolecular complexes, transport and catalysis events, and physical interactions involving proteins, DNA, RNA, and small molecules (e.g., metabolites and drug compounds). Data is collected from multiple providers in standard formats, including the Biological Pathway Exchange (BioPAX) language and the Proteomics Standards Initiative Molecular Interactions format, and then integrated. Pathway Commons provides biologists with (1) tools to search this comprehensive resource, (2) a download site offering integrated bulk sets of pathway data (e.g., tables of interactions and gene sets), (3) reusable software libraries for working with pathway information in several programming languages (Java, R, Python, and JavaScript), and (4) a web service for programmatically querying the entire dataset. Visualization of pathways is supported using the Systems Biology Graphical Notation (SBGN). Pathway Commons currently contains data from 22 databases with 4,794 detailed human biochemical processes (i.e., pathways) and ∼2.3 million interactions. To enhance the usability of this large resource for end-users, we develop and maintain interactive web applications and training materials that enable pathway exploration and advanced analysis.


INTRODUCTION
Pathway information that describes interactions between molecules in biological processes can help in solving research problems, such as the interpretation of genomics data (1), the generation of hypotheses surrounding disease mechanisms (2,3), and the design of rational therapeutics (4) and treatment decision strategies (5). In a recent translational study using pathway resources, genome-wide observations of increased methylation in pediatric brain cancer were linked to an upstream methyltransferase, which could be targeted pharmacologically and has served as the basis of clinical trials (6).
The number of available pathway and interaction resources has more than tripled, from 190 in 2006 to 702 in 2018 (7) (www.pathguide.org), increasing the need for integration. Unfortunately, this knowledge is fragmented across diverse data representation schemes and software, which makes pathway information from multiple sources difficult to combine and use.
Pathway Commons (PC) is a resource that aggregates data from publicly available biological pathway and molecular interaction databases and provides it from a single access point on the web (8). In this way, PC facilitates integration and exchange of molecular-level descriptions of metabolic and signaling pathways, molecular and genetic interactions and gene regulation networks. Data is collected from providers in the Biological Pathway Exchange (BioPAX) Level 3 (9) and the Proteomics Standards Initiative Molecular Interaction (PSI-MI) formats (10), and stored uniformly in BioPAX format. Use of the BioPAX ontology and format enables PC to capture, in a uniform and consistent way, details concerning genes, macromolecules (proteins) and small molecules and their involvement in different types of physical interactions, such as biochemical reactions, catalysis, post-translational protein modifications, complex assembly, and transport. PSI-MI data captures molecular interactions from small and large scale experiments. These descriptions are richly annotated with links to citations, experimental evidence, and external database information, for instance, protein sequence annotation. PC aims to add value to curated source databases by normalizing, integrating and exporting data in ways that simplify usage.
PC has been used to analyze transcriptomics, proteomics, and metabolomics data in a large number of projects across diseases to further our understanding of human biology in health and disease (4,11–18). Since our original report in 2011, significant advances have been made with regard to the breadth and volume of data available along with software tools to support data creation, validation, and accessibility in the wider research community. Here, we summarize developments made since our original report and discuss future efforts to enhance accessibility and provide scalable systems for knowledge capture in support of biomedical discovery.

PATHWAY AND INTERACTION DATA COVERAGE
PC currently integrates data from 22 public databases, up from the 9 in our initial report. This has more than tripled the number of pathways (from 1,477 to 4,794) and interactions (from 687,883 to ∼2.3 million) (19–22). PC focuses on collecting human pathway data since many data providers focus specifically on interactions occurring in human cells.

SOFTWARE INFRASTRUCTURE
The core software tools driving PC are cPath2 and Paxtools. cPath2 is an open-source database and web application for collecting, storing, and querying biological pathway data; it is a complete rewrite of the original cPath (23). cPath2 is built atop the Java Paxtools library (24), which provides an in-memory BioPAX object model and API along with rich and fast data querying, validation and format conversion utilities (25–27) (Figure 2). cPath2 includes built-in identifier mapping for linking between identical interactors and to external resources, as well as an application programming interface (API) that functions as a web service for searching and retrieving pathway data sets. The web service is implemented using a RESTful architecture and allows fine-grained data retrieval as JSON-LD (to support easy access from web applications), BioPAX and other formats (see DATA FORMATS AND AVAILABILITY). It supports keyword-based search in addition to the advanced querying facilities made available by Paxtools (e.g. graph-based querying). For a detailed graphical representation of pathways, Pathway Commons uses the standard Systems Biology Graphical Notation (SBGN) (28), which is designed to reduce ambiguity in representations of biological maps, and its accompanying SBGN-ML format (29). The web service is a major access point for software developers and computational biologists to programmatically access PC data and can be used to build third-party software apps, such as the ones described below.
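As a rough illustration of how a client might address the RESTful web service, the following Python sketch only assembles a keyword-search URL and performs no network request. The base path follows the `pathwaycommons.org/pc2` location mentioned in this article; the `search` endpoint name and the `q`, `type`, `organism` and `format` parameter names are assumptions made for this example and should be checked against the current web service documentation.

```python
from urllib.parse import urlencode

# Assumed base URL of the PC web service (the article only mentions
# pathwaycommons.org/pc2/formats); verify before use.
PC2_BASE = "https://www.pathwaycommons.org/pc2"


def build_search_url(keywords, bio_type="Pathway", organism="9606", fmt="json"):
    """Assemble a keyword-search URL.

    The endpoint and parameter names are assumptions for illustration,
    not a confirmed description of the service.
    """
    params = {"q": keywords, "type": bio_type, "organism": organism, "format": fmt}
    return f"{PC2_BASE}/search?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL a client would request for a human pathway search.
    print(build_search_url("cell cycle"))
```

Keeping URL construction separate from the HTTP call makes the query easy to inspect or log before it is sent with any HTTP client.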

DATA FORMATS AND AVAILABILITY
Users can freely access PC data in three ways: by downloading data files (designed for computational biologists), through a web service (for software developers or computational biologists), or via a series of interactive web-based search tools. Pathway information downloads are made available in BioPAX format; Gene Matrix Transposed (GMT) format, which is used in gene set enrichment analyses (30,31); and Simple Interaction Format (SIF) and extended SIF with additional fields, which are useful for network analysis and visualization (pathwaycommons.org/pc2/formats; SUPPLEMENTARY DATA). GMT datasets are provided with HGNC or UniProt identifiers (32,33). Users can access a file containing the entire collection or files that only contain data provided by an individual database. Data updates are scheduled approximately biannually (current release as of February 2019 is Version 11) and previous versions are also available in an archive (pathwaycommons.org/archives).
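The two tabular download formats are simple enough to read without special libraries. The sketch below assumes the conventional layouts: a GMT line holds a gene-set name, a description and then the member genes, all tab-separated, and a SIF line holds a source, an interaction type and a target. The sample records and interaction-type labels are illustrative only and are not taken from a PC release.

```python
def parse_gmt(text):
    """Map each gene-set name to (description, set of gene identifiers)."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        name, description, *genes = fields
        gene_sets[name] = (description, {g for g in genes if g})
    return gene_sets


def parse_sif(text):
    """Return (source, interaction_type, target) triples.

    Extra columns, as found in extended SIF, are ignored here.
    """
    edges = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3:
            edges.append((fields[0], fields[1], fields[2]))
    return edges


# Illustrative records only (hypothetical content).
SAMPLE_GMT = "pathway_1\tillustrative description\tTP53\tCDKN1A\tMDM2\n"
SAMPLE_SIF = (
    "MDM2\tcontrols-state-change-of\tTP53\n"
    "TP53\tcontrols-expression-of\tCDKN1A\n"
)

if __name__ == "__main__":
    print(parse_gmt(SAMPLE_GMT))
    print(parse_sif(SAMPLE_SIF))
```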

SOFTWARE TOOLS
We have developed a number of tools using the core cPath2 and Paxtools PC infrastructure, including programming libraries as well as desktop and web-based applications for use by a broad audience.

Tools for querying and visualizing Pathway Commons data using BioPAX
In addition to the core Java-based Paxtools library, programming libraries in other languages commonly used by computational biologists, including R (34) and Python (35), have been developed by the PC team and the community. These packages enable users to access content in BioPAX and act as clients for the PC web service. ChiBE is a desktop application focused on network visualization of BioPAX data and the analysis of genomic data in a pathway context (25,36). Cytoscape (37) and CellDesigner (38)

Tools for visualizing and interacting with pathway diagrams online
A number of reusable tools have been built to enable users to interact with pathway figures online and to map data onto pathway diagrams (20,40). We have developed software to visualize and interact with network diagrams using the SBGN standard (28).
From there, figures can be exported as static images or included as part of a dynamic web application.

Analytical tools using the Pathway Commons data source
A number of analysis packages that make use of PC data have been developed (42–52). Here we briefly describe several tools developed by the PC team. NetBox is an algorithm that automates the data-driven definition of network modules on the basis of genomic or molecular alterations (53). CausalPath identifies potentially causal relations between (phospho)proteomic measurements based on known pathways (54,55).

WEB APPLICATIONS AND TRAINING MATERIALS
Pathway Commons maintains a number of web applications and training materials aimed at advancing pathway analysis in the research community.

PC web apps: search and visualization
The PC search app attempts to anticipate the context of user questions from their queries and returns relevant results (apps.pathwaycommons.org/search). The system recognizes specific search types (e.g. genes) that are typically part of user queries (e.g. "cell cycle arrest involving TP53 and CDKN1A"). In this case, the search results display additional information about each gene along with links to additional apps that use this gene-based information as input (below) (Figure 3A). A list of pathway search hits is displayed including information about the data source and its number of "participants". Pathway search hits link to an interactive viewer that displays the network, rendered using SBGN (see Data representations section) (Figure 3B). Clicking on any node in the visualization reveals a tooltip that contains more detailed information including type (e.g. "protein" or "Biochemical Reaction"), alternative names, supporting publications and links to other databases.
Depending on the nature of the input query, links to other apps will become available.
For instance, if one or more genes are recognized in a search query, they are used to seed an interactive network visualization, called the Interactions app (Figure 3C). When the query contains one recognized gene, this app displays an interaction "neighborhood" to answer the question "What interacts with my gene?"; when multiple genes are present, only direct interactions between those genes will be shown, answering the question "How are these genes connected?". These results are retrieved by performing either cPath2 neighborhood (one gene) or paths-between (multiple genes) web service queries. Users can filter individual interactions for specific interaction types. When the system recognizes many gene mentions, a link to the Enrichment app is enabled (Figure 3D). This app answers the question "In which
All network views in the above-described web apps are implemented using the Cytoscape.js graph visualization JavaScript library (41).
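The two behaviours of the Interactions app can be approximated locally on a list of SIF-style edges. This is a simplified sketch and not the cPath2 implementation, which runs graph queries over the BioPAX model: with one gene it keeps every edge touching that gene, and with several genes it keeps only edges whose two endpoints are both in the query. The function name and the sample edges are hypothetical.

```python
def select_edges(edges, genes):
    """Pick edges for display from (source, type, target) triples.

    One gene: the gene's direct neighborhood.
    Several genes: only direct interactions among the query genes.
    """
    query = set(genes)
    if len(query) == 1:
        return [e for e in edges if e[0] in query or e[2] in query]
    return [e for e in edges if e[0] in query and e[2] in query]


# Hypothetical edges for illustration only.
EXAMPLE_EDGES = [
    ("MDM2", "controls-state-change-of", "TP53"),
    ("TP53", "controls-expression-of", "CDKN1A"),
    ("CDKN1A", "in-complex-with", "CDK2"),
]

if __name__ == "__main__":
    print(select_edges(EXAMPLE_EDGES, ["TP53"]))            # neighborhood view
    print(select_edges(EXAMPLE_EDGES, ["TP53", "CDKN1A"]))  # connections view
```

Filtering by interaction type, as offered in the app, would be a further list comprehension over the second element of each triple.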

Training
A major goal of Pathway Commons is to support the analysis and interpretation of molecular and genomic profiling datasets. To support this, we developed PC Guide (pathwaycommons.org/guide), which aims to be an online textbook for pathway analysis approaches. A current focus is pathway enrichment analysis, which translates observed differences at the gene level due to state (e.g., healthy versus diseased samples) or experimental testing (e.g., control versus treated samples) into higher-level changes at the pathway level. The Workflows section guides users through a step-by-step, example-driven tutorial to create Enrichment Map visualizations in Cytoscape from the analysis of RNA-seq data using Gene Set Enrichment Analysis (GSEA) (30). The Primer section offers intuitive descriptions of analytical techniques (e.g. Fisher's Exact Test and GSEA) used in popular software packages and apps.
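To make the Fisher's Exact Test mentioned above concrete, the following sketch computes a one-sided enrichment p-value as the upper tail of the hypergeometric distribution: the probability of seeing at least the observed overlap between a query gene list and a pathway if the query genes were drawn at random from the gene universe. The function name and the numbers in the example are hypothetical, and this is not presented as the method used by any particular PC app.

```python
from math import comb


def enrichment_p_value(universe_size, pathway_size, query_size, overlap):
    """One-sided Fisher's exact test p-value, P(X >= overlap).

    X is the number of pathway genes among `query_size` genes drawn
    without replacement from a universe of `universe_size` genes,
    of which `pathway_size` belong to the pathway.
    """
    total = comb(universe_size, query_size)
    upper = min(pathway_size, query_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20 genes in the universe, 5 in the pathway,
    # 4 in the query list, 3 of which fall in the pathway.
    print(enrichment_p_value(20, 5, 4, 3))
```

In practice, the p-values obtained across many pathways are usually corrected for multiple testing before interpretation.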

CONCLUSION
The goal of Pathway Commons is to provide a comprehensive and user-friendly access point for researchers desiring pathway and molecular interaction information to support the analysis of biological data and the discovery of interesting relationships. Since our original report (8), the resource has expanded to include most of the widely used publicly available pathway datasets. In addition, we have increased accessibility through the development of web services, training resources and a diverse collection of end-user tools to explore and analyze the data. The PC Search web app aims to provide a unified and intelligent way to deliver relevant information and tools to users, inspired by recent additions to Google search functionality that 'understand' the query type to provide relevant search results (e.g. local movie times if you search for a movie name). We plan to extend the range of biological concepts recognized (e.g. drugs, metabolites, diseases) and collaborate with the community on the development of a unified and user-friendly federated search across network and pathway resources.
PC incorporates over 20 large pathway and molecular interaction resources, whereas over 700 such resources are known; unfortunately, the vast majority of these are no longer active or available. Further, even for the 22 databases currently integrated, much effort was required to work with data providers to create or tune BioPAX output to enable integration of the available data. For this reason, even though some 700 pathway-related databases have been created, few additional ones will be integrated. As new databases are created, they can now use PC software components, such as Paxtools, to make standard BioPAX-formatted output available. Another major barrier to pathway data access is that only a handful of pathway and molecular interaction resources that curate data from the literature remain actively funded, and they are only able to cover a relatively small part of the rapidly growing literature. To address this, the PC team is advancing text-mining technology to extract pathway information directly from the existing literature (60,66,67), and developing a curation support tool that empowers authors themselves to capture and share structured summaries of knowledge described in their articles. These efforts, when combined with continued expert curation, may meet the challenge of providing high-quality, computable pathway information that can be effectively searched and analyzed by the broader research community.