BioUML: an integrated environment for systems biology and collaborative analysis of biomedical data

ABSTRACT
BioUML (homepage: http://www.biouml.org, main public server: https://ict.biouml.org) is a web-based integrated environment (platform) for systems biology and the analysis of biomedical data generated by omics technologies. The BioUML vision is to provide a computational platform for building virtual cells, virtual physiological humans and virtual patients. BioUML spans a comprehensive range of capabilities, including access to biological databases, powerful tools for systems biology (visual modelling, simulation, parameter fitting and analyses), a genome browser, scripting (R, JavaScript) and a workflow engine. Due to integration with the Galaxy platform and R/Bioconductor, BioUML provides powerful possibilities for the analysis of omics data. The plug-in-based architecture allows new functionality to be added as plug-ins. To help users focus on a particular task or database, we have developed several predefined perspectives that display only those web interface elements that are needed for a specific task. To support collaborative work on scientific projects, there is a central authentication and authorization system (https://bio-store.org). The diagram editor enables several remote users to simultaneously edit diagrams.


INTRODUCTION
The BioUML project was started in 2002, and its main goal was to develop a general-purpose visual language for formal descriptions of the structure and function of biological systems: a Biological Universal Modelling Language (BioUML) (1). As a starting point, we used the graphic notation suggested by the GeneNet system (2,3). We have developed several diagram types that allow the user to construct a biological model step-by-step, with increasing levels of detail and formality. Through this approach, we have developed two databases: Biopath (4) and Cyclonet (5).
Subsequently, the international community created the Systems Biology Graphical Notation (SBGN), which standardizes the graphical notation used in maps of biological processes (6). The BioUML team was involved in this standardization process. Currently, BioUML completely supports the SBGN Process Description diagrams.
From the beginning, BioUML has supported a paradigm of visual modelling, where the user can create a diagram that completely and formally specifies the given biological model. BioUML then automatically generates program code that is used to simulate the model behaviour. The initial versions of the BioUML workbench generated MATLAB code and used the MATLAB ODE suite for simulation (7). The current version of BioUML generates highly optimized Java code and uses its own state-of-the-art simulation engines, which have been developed over the last 14 years. Thanks to the optimized simulation engines, BioUML is the only tool that can pass the SBML (Systems Biology Markup Language) semantic test suite (http://sbml.org/Facilities/Database/Submission/Details/108), a comprehensive set of tests to verify the correctness of simulation engines for complex biological systems (8).
The paradigm of systems biology implies not only computational modelling but also the so-called 'dry-wet-dry' cycle composed of theory and computational modelling, which proposes specific testable hypotheses about a biological system, followed by experimental validation and further refinement of the computational model or theory using the newly acquired quantitative description of cells or cell processes (9). Since the objective of such modelling is to describe, as fully as possible, the entire set of interactions in a biological system, high-throughput systems and genome-wide experimental techniques (so-called 'omics' data: transcriptomics, metabolomics, proteomics, etc.) are most suitable for validation of such system models. Such high-throughput techniques are mainly used to collect quantitative data for the construction and validation of models. Thus, the second main direction of BioUML development was the processing and analysis of omics data.
W226 Nucleic Acids Research, 2019, Vol. 47, Web Server issue
To process and analyse omics and other biomedical data, we integrated R/Bioconductor (10) and Galaxy (11) into the BioUML platform and developed 300+ methods for data analysis. To support the concept of reproducible research, the BioUML platform is equipped with a workflow engine that chains analysis methods into pipelines, which can repeat the same sequence of analytical steps on new input data. BioUML provides a powerful web editor to visually construct such workflows and an engine for executing them on a server or in a cloud.
For the visualization of omics data, the BioUML platform provides a diagram viewer/editor and a genome browser. Data from omics experiments (transcriptomics, proteomics, metabolomics) can be mapped on different biological pathways and visualized by highlighting corresponding nodes on the diagram.
Thus, BioUML supports two steps of the systems biology 'dry-wet-dry' cycle: modelling and the analysis of omics data. The main BioUML vision is to provide a computational platform for building virtual cells, virtual physiological humans and virtual patients.
Currently, scientists from different countries use the BioUML platform for the collaborative reproducible analysis of biomedical data, pathway visualization and genome browsing.

Meta-model
The meta-model is the core of the BioUML platform. It provides an abstract layer (compartmentalized attributed graph) for the comprehensive formal description of a wide range of biological systems and other complex systems. The content of biological pathway databases (e.g. Reactome, PantherDB), SBML models, biological pathways in BioPAX format (12), as well as workflows, can be expressed in terms of the meta-model. This formal description can be used both for a visual representation of the structure of biological systems and for automated code generation to simulate the model behaviour.
The meta-model describes a system as three interconnected parts: (i) graph structure: the system structure is described as a compartmentalized graph; (ii) database level: each graph element can contain a reference to an object in biological databases; and (iii) mathematical (executable) model: any graph element can be associated with an element of a mathematical model or an analysis method (e.g. for workflows).
The meta-model structure is problem-domain neutral, so it is used to describe biological models, as well as executable workflows, for the analysis of biomedical data.
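The three parts of the meta-model can be pictured with a small data structure. The following is only an illustrative sketch: the class names, field names and the `example-db:` identifiers are hypothetical and are not taken from the BioUML code base.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Node:
    """A graph element. All field names are illustrative assumptions."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    database_ref: Optional[str] = None  # (ii) reference to a database object
    model_role: Optional[str] = None    # (iii) associated model element


@dataclass
class Compartment(Node):
    """A node that contains other nodes, giving the graph its nesting."""
    children: List[Node] = field(default_factory=list)

    def walk(self) -> Iterator[Node]:
        """Yield this compartment and all nested elements, depth first."""
        yield self
        for child in self.children:
            if isinstance(child, Compartment):
                yield from child.walk()
            else:
                yield child


@dataclass
class Edge:
    """A directed, attributed link between two named nodes."""
    source: str
    target: str
    attributes: Dict[str, str] = field(default_factory=dict)


# (i) graph structure: a cell containing a cytosol with two species
glucose = Node("glucose", database_ref="example-db:glucose", model_role="variable")
g6p = Node("g6p", database_ref="example-db:g6p", model_role="variable")
cytosol = Compartment("cytosol", children=[glucose, g6p])
cell = Compartment("cell", children=[cytosol])
edges = [Edge("glucose", "g6p", {"type": "reaction"})]
```

Because the structure makes no reference to biology, the same three classes could describe a workflow, with nodes standing for analysis steps and edges for data flow.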

Plug-in based architecture
BioUML is based on a plug-in architecture (Open Services Gateway Initiative; OSGi) that allows the platform functionality to be extended by adding new plug-ins. The basic components of the plug-in-based architecture are as follows:
• A plug-in is the smallest unit that can be developed and delivered separately in the BioUML platform. Plug-ins are coded in Java. A typical plug-in consists of Java code in a JAR library, several read-only files and other resources, such as images, message catalogues and native code libraries. A plug-in is described in an XML manifest file called plugin.xml. The parsed contents of plug-in manifest files are made available programmatically through a plug-in registry API provided by the Eclipse runtime.
• Extension points are well-defined function points in the system where other plug-ins can contribute functionality.
• An extension is a specific contribution to an extension point. Plug-ins can define their own extension points so that other plug-ins can integrate tightly with them.
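The relationship between extension points and extensions can be sketched in a few lines. This is a language-neutral toy model of the pattern, not the Eclipse or OSGi registry API. The class, the method names and the point identifier `example.importers` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


class ExtensionRegistry:
    """Toy registry: extension points are named slots that plug-ins fill."""

    def __init__(self) -> None:
        self._points: Dict[str, List[Tuple[str, Callable]]] = {}

    def declare_point(self, point_id: str) -> None:
        """A plug-in announces a place where others may contribute."""
        self._points.setdefault(point_id, [])

    def contribute(self, point_id: str, plugin: str, extension: Callable) -> None:
        """A plug-in adds an extension to an already declared point."""
        if point_id not in self._points:
            raise KeyError(f"unknown extension point: {point_id}")
        self._points[point_id].append((plugin, extension))

    def extensions(self, point_id: str) -> List[Tuple[str, Callable]]:
        """Return a copy of the contributions registered for a point."""
        return list(self._points.get(point_id, []))


registry = ExtensionRegistry()
registry.declare_point("example.importers")  # hypothetical point id
registry.contribute("example.importers", "sbml-plugin", lambda text: "SBML:" + text)
registry.contribute("example.importers", "biopax-plugin", lambda text: "BioPAX:" + text)

# The host iterates over whatever happens to be installed.
results = [importer("model") for _, importer in registry.extensions("example.importers")]
```

The host code never names a concrete plug-in, which is what allows new functionality to be added without changing the platform core.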

USER INTERFACE
The BioUML user interface (Figure 1) and architecture were inspired by the Eclipse platform (https://www.eclipse.org/ide). The web interface of BioUML is implemented as a single-page application and comprises the following main parts:
• Repository pane: This pane contains three tabs: database navigation, user data and available methods.
  - Databases: On the top level, this contains a list of available biological databases. Each database has its own structure. Usually, it consists of:
    - Data: Collections of biological objects (genes, proteins, chemical substances, reactions, etc.).
    - Diagrams: Diagrams or models of biological pathways.
  - Data: This contains user data organized into projects. The user can create their own project, import omics and other data, analyse the data and invite colleagues for collaborative analyses of the corresponding data. All projects (the user's own and those to which the user was invited) are shown in the 'Collaboration' folder. A number of projects were created to demonstrate the main possibilities of the BioUML platform.
  - Galaxy: Methods of analysis that are available from the Galaxy platform installed on the same server or cloud.
  - Methods: Analysis methods implemented in Java within the BioUML platform. Currently, it contains 300+ methods grouped into 15 categories. Each method has a detailed description and its own page in the BioUML wiki (e.g. http://wiki.biouml.org/index.php/Cluster_analysis_by_K-means_(analysis)).
  - JavaScript: JavaScript API for access to analysis methods.
• The user can select a template if several templates are available. For example, for a diagram (model), the following templates are available: (i) reactions: a list of reactions from the model; (ii) parameters: a list of parameters and their values and measurement units; (iii) variables: a list of variables and their initial values and measurement units; (iv) ODE: a system of differential equations generated from the model; (v) overview: includes all of the information mentioned above.
• Search pane: This allows the user to specify criteria for searching information in any database selected in the repository. For this purpose, the content of the installed databases is indexed using Apache Lucene (http://lucene.apache.org).
• Perspectives: These help the user concentrate on a specific task or a specific database. We have developed several predefined perspectives that display only those web interface elements that are necessary for the specific task.

Systems biology
BioUML supports the main worldwide standards used in systems biology:
• SBML: The Systems Biology Markup Language (13) serves for the formal description of mathematical models. BioUML supports all versions of SBML, from l1v2 to the latest l3v2, including the extension packages 'fbc' (14) and 'comp' (15).
• SBGN: The Systems Biology Graphical Notation (6) is used for the visual description of model elements (complexes, compartments, molecule types, reactions, etc.). BioUML completely supports SBGN Process Description diagrams and uses them to visually represent SBML models. BioUML also supports the XML markup language SBGN-ML (https://github.com/sbgn/sbgn/wiki/SBGN_ML), which facilitates the exchange of SBGN diagrams between tools.
• Antimony: A human-readable text format that supports most of the SBML features (16). In BioUML, it is automatically processed into SBML diagrams in SBGN notation. BioUML supports import from and export to the Antimony format.
• SED-ML: The Simulation Experiment Description Markup Language (17) describes model simulation steps and facilitates the reproducibility of simulation experiments. In BioUML, it is translated into workflows, which allows for the analysis and simulation of mathematical models and bioinformatics data.
Many models, however, require some features that are missing from the above-mentioned standards. In these cases, the SBML standard provides extension mechanisms via the <notes> and <annotation> XML elements. Using these extensions, BioUML stores all additional information about the models (e.g. diagram view attributes and layout).
SBGN was developed independently from SBML, so it does not define visual syntaxes for events, functions, assignments and other mathematical elements. To solve this problem, we have extended the SBGN Process Description diagrams with additional glyphs that represent these elements in our own notation. Detailed information about the types of models and their visual representations can be found at http://wiki.biouml.org/index.php/Diagram_type.
Simulation engine: BioUML automatically generates program code that is used to simulate the behaviour of the analysed model. Currently, BioUML generates highly optimized Java code and uses its own state-of-the-art simulation engines. For each diagram, it provides a list of available engines. For example, a network of reactions can be simulated as a system of ODEs or as a Gillespie-type stochastic model. The selected simulation engine provides a list of available solvers. Available ODE solvers include JVODE, a Java port of the CVODE package developed at the Lawrence Livermore National Laboratory (18), which uses the multistep Adams-Moulton and backward differentiation formula (BDF) methods; the RADAU5 solver (19); and classic algorithms (Euler, Dormand-Prince (20)). The stochastic simulation engine provides the exact methods, Gillespie (21) and , as well as approximation methods.
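To make the distinction between the two formalisms concrete, the sketch below simulates the same reversible reaction A <-> B first as an ODE and then with the Gillespie direct method. It is a textbook-style illustration in Python with assumed rate constants. It does not reproduce the Java code that BioUML generates, and it uses the explicit Euler method rather than JVODE or RADAU5.

```python
import math
import random


def simulate_ode(a0, b0, k1, k2, t_end, dt=0.001):
    """Explicit Euler integration of dA/dt = -k1*A + k2*B for A <-> B."""
    a, b = float(a0), float(b0)
    for _ in range(int(round(t_end / dt))):
        flux = k1 * a - k2 * b            # net rate of A -> B
        a -= dt * flux
        b += dt * flux
    return a, b


def simulate_gillespie(a0, b0, k1, k2, t_end, rng):
    """Gillespie direct method for the same reactions, on integer counts."""
    a, b, t = int(a0), int(b0), 0.0
    while True:
        prop_forward = k1 * a             # propensity of A -> B
        prop_backward = k2 * b            # propensity of B -> A
        total = prop_forward + prop_backward
        if total == 0.0:
            break                         # no reaction can fire
        t += rng.expovariate(total)       # waiting time to the next event
        if t > t_end:
            break
        # Pick a reaction with probability proportional to its propensity.
        if rng.random() * total < prop_forward:
            a, b = a - 1, b + 1
        else:
            a, b = a + 1, b - 1
    return a, b


if __name__ == "__main__":
    rng = random.Random(1)
    print("ODE:      ", simulate_ode(100, 0, 1.0, 0.5, 5.0))
    print("Gillespie:", simulate_gillespie(100, 0, 1.0, 0.5, 5.0, rng))
```

The ODE run returns a single smooth trajectory, whereas each Gillespie run returns one random realization. Averaging many stochastic runs approaches the ODE result for this linear system.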
Diagram transformation into a simulatable state by the selected simulation engine is a prerequisite for simulation. Thus, a hierarchical diagram may be transformed to an ordinary 'flat' diagram with reactions and entities. An agentbased diagram may be partially flattened, where all subdiagrams of the same type may be transformed into one combined agent.
There are several other simpler preprocessors. For example, SBML constraints are transformed into discrete events, thereby halting simulation when the constraint is violated. Additionally, fast reactions are transformed into algebraic equations, and Boolean expressions are transformed into numeric expressions, etc.
Other simulation engines are:
• Hemodynamics: specifically tailored to solve partial differential equation (PDE) problems describing blood flow in arteries.
• Population: solves nonlinear mixed-effects (NLME) problems using an R library.
• Dynamic FBA: dynamically runs Flux Balance Analysis simultaneously with ODE simulation.

Modular modelling
In a modular approach, the investigated system is viewed as a set of interconnected subsystems. Each subsystem can be considered and simulated independently. Integration of these models (or modules) results in a more complex model of the whole system. Modules may leverage different mathematical formalisms and scales. They can be created, validated and improved independently and may be viewed as replaceable parts. Modules provide explicit interfaces through which they can be connected without exposing their inner structure to the user. We consider modules as mathematical models; their interfaces are variables and constant parameters. For example, the value of a variable in one model may be constant, while in another model it changes dynamically. Numerical calculations are performed in two ways:
(i) Flattening: A modular model may be transformed into a non-modular model by aggregating all elements of all modules with automatic resolving of established connections between variables (23).
(ii) Agent-based simulation: Each module is simulated independently with its own simulator and formalism. The scheduler coordinates their interactions by sending and receiving numerical values of the connected variables (24).
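The two calculation modes can be compared on a toy system of two modules. In the sketch below, a 'producer' module exports its variable x and a 'consumer' module receives it through an input port. The module equations, step sizes and all names are assumptions chosen for illustration and do not describe the BioUML scheduler.

```python
class Module:
    """A self-contained model with its own state and integration step."""

    def __init__(self, state, rhs, dt):
        self.state = dict(state)   # internal variables
        self.inputs = {}           # values received through the interface
        self.rhs = rhs             # function(state, inputs) -> derivatives
        self.dt = dt               # module-specific Euler step

    def advance(self, duration):
        """Integrate alone for `duration`, holding the inputs constant."""
        for _ in range(int(round(duration / self.dt))):
            for name, deriv in self.rhs(self.state, self.inputs).items():
                self.state[name] += self.dt * deriv


def run_agents(modules, connections, t_end, sync_dt):
    """Scheduler: pass connected values, then let every module run alone."""
    for _ in range(int(round(t_end / sync_dt))):
        for (src, var), (dst, port) in connections:
            modules[dst].inputs[port] = modules[src].state[var]
        for module in modules.values():
            module.advance(sync_dt)


def run_flat(t_end, dt):
    """Flattened equivalent: one model with all variables and equations."""
    x, y = 1.0, 0.0
    for _ in range(int(round(t_end / dt))):
        x, y = x + dt * (-1.0 * x), y + dt * (1.0 * x - 0.5 * y)
    return x, y


modules = {
    "producer": Module({"x": 1.0}, lambda s, i: {"x": -1.0 * s["x"]}, dt=0.001),
    "consumer": Module({"y": 0.0},
                       lambda s, i: {"y": 1.0 * i["x_in"] - 0.5 * s["y"]},
                       dt=0.005),
}
connections = [(("producer", "x"), ("consumer", "x_in"))]
run_agents(modules, connections, t_end=2.0, sync_dt=0.01)
flat_x, flat_y = run_flat(t_end=2.0, dt=0.001)
```

The agent-based run lets each module keep its own step size at the cost of a small coupling error, because inputs are only refreshed at synchronization points. The flattened run has no such error but forces a single solver on the whole system.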

Parameter estimation
BioUML provides several stochastic and deterministic global optimization methods (25), including a stochastic ranking evolution strategy (26), particle swarm optimization (27), cellular genetic algorithms (28) and others. We have achieved a significant acceleration of these methods using concurrent computing. Algorithms can use experimental data in time-course or steady state forms, with exact or relative values. BioUML also supports multi-experiment parameter estimation. A detailed comparison with other software can be found in (25).
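As an illustration of how such a method fits a model to time-course data, the sketch below applies a basic particle swarm optimizer to an exponential decay model. It is a generic textbook variant with assumed coefficients and synthetic data, not the BioUML implementation, and it runs serially rather than concurrently.

```python
import math
import random


def model(params, t):
    """Exponential decay A(t) = a0 * exp(-k * t)."""
    a0, k = params
    return a0 * math.exp(-k * t)


def cost(params, data):
    """Sum of squared residuals between the model and time-course data."""
    return sum((model(params, t) - y) ** 2 for t, y in data)


def particle_swarm(objective, bounds, n_particles=30, n_iter=200, seed=0):
    """Minimize `objective` inside box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    w, c1, c2 = 0.72, 1.49, 1.49   # inertia and attraction weights (assumed)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * len(bounds) for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Move the particle and clamp it to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


# Synthetic noise-free data generated with a0 = 10 and k = 0.8.
data = [(0.5 * n, 10.0 * math.exp(-0.8 * 0.5 * n)) for n in range(11)]
fit, fit_cost = particle_swarm(lambda p: cost(p, data), [(0.0, 20.0), (0.0, 5.0)])
```

Multi-experiment estimation would follow the same pattern, with the cost function summing residuals over several data sets that share parameters.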

Model analysis
We have implemented a number of methods for model analysis and reduction, including:
• Identifiability analysis infers how well the model parameters are approximated by the amount and quality of experimental data (29,30).
• Search for linear, monomolecular and pseudomonomolecular reactions (31).
• Quasi-steady state analysis (32).
• Sensitivity analysis of the model steady state (33).
• Metabolic control analysis quantifies how fluxes and species concentrations depend on the system parameters (34).
• Stoichiometric analysis derives linear relationships between flux rates and reactant concentration derivatives (31).
• Mass conservation analysis decomposes a stoichiometric matrix into the product of its linearly independent rows and a link matrix (35).
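The decomposition used in mass conservation analysis can be written out for a minimal case. The notation below (N for the stoichiometric matrix, N_R for its linearly independent rows, L for the link matrix) is the commonly used one and is assumed here, as the text does not fix symbols. The example network A <-> B is likewise chosen only for illustration.

```latex
% x: species concentrations, v: reaction rates, N: stoichiometric matrix
\frac{dx}{dt} = N v, \qquad N = L \, N_R

% Example: A <-> B with x = (A, B)^T and v = (v_1, v_2)^T
N = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}, \quad
N_R = \begin{pmatrix} -1 & 1 \end{pmatrix}, \quad
L = \begin{pmatrix} 1 \\ -1 \end{pmatrix}

% The dependent second row yields the conserved total
\frac{d}{dt}\,(A + B) = 0
```

Here only one row of N is independent, so a single differential equation needs to be integrated and B follows from the conserved sum A + B.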

BIOMEDICAL DATA ANALYSES
For processing and analysis of omics and other biomedical data, we have integrated the leading platforms in the respective fields, R/Bioconductor (36) and Galaxy, into the BioUML platform and developed 300+ of our own analysis methods (http://wiki.biouml.org/index.php/Category:Analyses).
• Integration with R. BioUML has bidirectional integration with R. R scripts can be used within BioUML in four ways: (i) The user can create, edit and execute R scripts in the BioUML document pane. The editor supports syntax highlighting; (ii) The 'Script' pane allows the user to input and execute R commands; (iii) R scripts can be building blocks of a BioUML workflow; and (iv) There are a number of Java analysis tools that provide a convenient interface to configure the analysis parameters, with subsequent generation of the corresponding R script. To execute an R script, the BioUML server calls R. Text output is shown in the 'Output' tab. Graphical results (plots, dendrograms, etc.) are shown on separate pages.
To gain access to the BioUML server from inside R, we have developed the rbiouml package (https://cran.r-project.org/package=rbiouml). The package contains functions to acquire data from the BioUML repository, import/export the data, start analyses and workflows and manage the execution queue.
• JavaScript API. JavaScript can be used in the same ways as R scripts (as a document, in the console, or as a building block of a workflow). The API provides functions to acquire data from the BioUML repository, import/export the data, and start analysis tools and workflows, and it gives detailed access to complex BioUML objects (e.g. models). In contrast with R scripts, JavaScript is executed inside the BioUML server.
• Integration with Galaxy. The Galaxy platform provides explicit descriptions (Galaxy tool XML files) of parameters for thousands of biological tools, mainly command line tools. BioUML can read these XML files and generate forms where the user can specify values for the corresponding parameters of the tools integrated in Galaxy. BioUML also extends the Galaxy tool configuration syntax to allow closer interaction between the Galaxy and BioUML systems (http://wiki.biouml.org/index.php/Creating_Galaxy_tool).
• Workflows. For reproducible research, analysis tools can be joined into workflows. BioUML provides a powerful editor to visually construct workflows and an engine that executes them on a server or cloud.
BioUML workflows can include the following component types:
• Analysis method: Method for analyses with specified inputs/outputs and parameters. It can be a BioUML method, Galaxy tool or Java wrapper for R functions.
• Analysis script: R script or JavaScript code, R methods.
• Analyses parameter: Subset of parameters that the user should specify to start the workflow.
• Analyses expression: Used to set and connect the input and output analysis parameters in the workflow.
• Cycle: Subset of workflow steps that will execute repeatedly. Cycles can iterate over the elements of a folder, over table columns, over ranges of integers and over arrays of elements.
See http://wiki.biouml.org/index.php/Workflow for more details.
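The interplay of these component types can be sketched as a tiny workflow runner. This is a conceptual illustration only: the function names, the step tuple layout and the sample data are hypothetical and do not reflect the BioUML workflow engine or its file format.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# (step name, analysis function, names of inputs, name of output)
Step = Tuple[str, Callable, Sequence[str], str]


def run_workflow(steps: List[Step], parameters: Dict[str, object]) -> Dict[str, object]:
    """Run steps in order; each output becomes available to later steps."""
    context = dict(parameters)          # workflow parameters set by the user
    for _name, func, inputs, output in steps:
        context[output] = func(*(context[key] for key in inputs))
    return context


def for_each(func: Callable) -> Callable:
    """Cycle component: apply one analysis to every element of a folder."""
    def run(folder: Dict[str, object]) -> Dict[str, object]:
        return {name: func(item) for name, item in folder.items()}
    return run


def normalize(values: List[float]) -> List[float]:
    """Example analysis method: scale values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


steps: List[Step] = [
    ("normalize", for_each(normalize), ["input_folder"], "normalized"),
    ("peak", for_each(max), ["normalized"], "peaks"),
]
folder = {"s1": [4.0, 2.0, 2.0], "s2": [1.0, 3.0]}
result = run_workflow(steps, {"input_folder": folder})
# result["peaks"] is {"s1": 0.5, "s2": 0.75}
```

Rerunning `run_workflow` with a different `input_folder` repeats exactly the same sequence of steps, which is the property that makes workflows useful for reproducible research.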

PATHWAY VISUALIZATION
The BioUML diagram editor/viewer can be used not only for visual modelling but also for the visualization of different biological pathways. For this purpose, the BioUML server contains the following databases: Reactome (37), PantherDB (38) and Biomodels (https://www.ebi.ac.uk/biomodels/). Users can load their own pathways in the following formats: BioPAX, Antimony, SBGN-ML, SBML and Cytoscape CX (39). BioUML utilizes several algorithms for the automatic layout of visual diagrams, including Hierarchical, Force-directed, Greedy and Grid layouts (40).
Data from omics experiments (transcriptomics, proteomics, metabolomics) can be mapped onto different biological pathways and visualized by highlighting corresponding nodes on the diagram (http://wiki.biouml.org/index.php/Expression_mapping).

Integrated genome browser
BioUML provides a fully integrated genome browser (41) that supports most of the features available in other modern genome browsers and includes a comprehensive set of tools for visualizing data processing results. It is used extensively to visualize information from the GTRD database (42).

Collaborative reproducible research
User data (tables, diagrams, etc.) in BioUML are organized into projects. The administrator (creator) of the project can invite other users to participate in the project and manage their permissions. The user registration and management of access rights are performed via a central authentication and authorization system (https://bio-store.org). All user actions in a project, including performed analyses and scripts, are tracked in the project journal.
BioUML provides a collaborative editing functionality. Numerical models, pathways and workflows can be simultaneously modified by several researchers, and changes are instantly reflected on the screens of all users, while an embedded chat function facilitates user coordination and collaboration. The system also supports revision control and the possibility to revert to previous versions.

From virtual cell to virtual patient
The BioUML vision is to provide a computational platform to build virtual cells, virtual physiological humans and virtual patients. We have created two databases on the BioUML server that demonstrate our work in this direction.
The Virtual cell database includes three projects:  (45). Only two submodels were completely finished--Cytokinesis and FtsZ Polymerization.
The Virtual human database includes a number of modular models that describe human physiology, including a classic model of blood circulation (46), a model of heart pumping and blood flow (47), a comprehensive model of blood flow through 55 of the largest arteries in the human body (48) and models with a focus on the regulation of blood volume (including kidney) (49,50).
Antihypertensive drugs: This is a database of pharmacokinetic (PK) and pharmacodynamic (PD) models of antihypertensive drugs from different drug groups, including aliskiren, losartan, amlodipine, enalapril, bisoprolol and hydrochlorothiazide.
Complex model: This database combines physiological models with PK/PD models to build so-called 'virtual patients'. These can be created in diverse forms using different parts of human physiological models, with different focuses on subsystems depending on the research objectives.
Virtual muscle (51): This is a detailed kinetic model describing both the facilitated and passive transport of metabolites between muscle tissues and blood vessels and metabolic processes in cellular compartments (cytosol and mitochondria). We have rebuilt this model as a modular model that became an example of a multilevel model, taking into account cellular compartments and tissue organization.

GTRD database
The GTRD database demonstrates how the BioUML platform can be used to build a web interface for access to a database. We have developed a special GTRD perspective (42,52) that provides browsing, information display, advanced search possibilities, and integration of the genome browser and information from the Ensembl database (gene structures, repeats, etc.) to visualize the GTRD data.

Workflows as a cookbook for the analysis of omics data
Each workflow can be considered as a ready recipe for the specific analysis of corresponding omics data. A scientist needs only to import data, select the appropriate recipe, specify input/output data and press the 'Run' button. The platform will automatically analyse the data. This was a key idea of the geneXplain platform (http://genexplain.com/genexplain-platform/) (53), which now provides hundreds of workflows for the analysis of different types of omics data (microarrays, transcriptomics, proteomics, metabolomics, etc.). The geneXplain platform is a branch of the BioUML tree, with the focus on commercial application. It includes such commercial databases as TRANSFAC® (transcription factors and their binding sites in a genome; 54), TRANSPATH® (signal transduction network in eukaryotic cells; 55) and HumanPSD® (disease biomarkers, drugs and clinical trials; 56). The geneXplain platform contains several of its own sophisticated methods for promoter and pathway analysis, such as Match™ (57) for the identification of transcription factor binding sites, CMA (Composite Module Analysis; 58) for the identification of composite regulatory modules in promoters and enhancers, tools for finding master regulators (59) in networks and other tools.
Recently, a new tool, Genome Enhancer (http://mygenome-enhancer.com/), has been developed based on the BioUML platform. Genome Enhancer is a tool for the fully automated analysis of multi-omics data. Depending on a user's data, the platform automatically generates a corresponding workflow, executes the full analysis and presents the results as a well-structured detailed research article.

DISCUSSION
The BioUML platform spans a comprehensive range of capabilities, including access to biological databases, powerful tools for systems biology (visual modelling, simulation, parameter fitting and analyses), a genome browser, scripts (R, JavaScript) and workflows for a diverse array of biomedical data analysis. There is a range of other software platforms that provide similar capabilities for data analysis and modelling with specific extensions for systems biology. The most prominent are:
• RStudio provides a web interface and R/Bioconductor provides hundreds of packages for biomedical data analysis.
• MATLAB has several packages for biomedical data analysis and systems biology, including SimBiology and IQM Tools (https://iqmtools.intiquan.com/, formerly Systems Biology Toolbox). Simulink provides a powerful tool for the visual development of modular models. For instance, the comprehensive complex model of the bacterial cell, Mycoplasma genitalium (44), was created using MATLAB.
• Jupyter notebook (60) is a widely used interactive computing environment that supports dozens of programming languages (Python, R, Julia and Scala).
In general, the BioUML platform has the following advantages:
• A state-of-the-art simulation engine that supports visual modelling using different approaches. As mentioned above, BioUML is the only platform that can pass the SBML semantic test suite, including hierarchical models.
• It provides capabilities for both steps of the systems biology 'dry-wet-dry' cycle: modelling and the analysis of omics data.
• The platform provides a perspective mechanism that helps users focus on the tasks and databases they are working with.