iCAVE: an open source tool for visualizing biomolecular networks in 3D, stereoscopic 3D and immersive 3D

Abstract Visualizations of biomolecular networks assist in systems-level data exploration in many cellular processes. Data generated from high-throughput experiments increasingly inform these networks, yet current tools do not adequately scale with the concomitant increase in their size and complexity. We present an open source software platform, interactome-CAVE (iCAVE), for visualizing large and complex biomolecular interaction networks in 3D. Users can explore networks (i) in 3D using a desktop, (ii) in stereoscopic 3D using 3D-vision glasses and a desktop, or (iii) in immersive 3D within a CAVE environment. iCAVE introduces 3D extensions of known 2D network layout, clustering, and edge-bundling algorithms, as well as new 3D network layout algorithms. Furthermore, users can simultaneously query several built-in databases within iCAVE for network generation or visualize their own networks (e.g., disease, drug, protein, metabolite). iCAVE has a modular structure that allows rapid development by addition of algorithms, datasets, or features without affecting other parts of the code. Overall, iCAVE is the first freely available open source tool that enables 3D (optionally stereoscopic or immersive) visualizations of complex, dense, or multi-layered biomolecular networks. While primarily designed for researchers utilizing biomolecular networks, iCAVE can assist researchers in any field.


Note that while a few network visualization tools incorporate 3D layouts [37][38][39], they are not immersive 3D, i.e. they do not interoperate with Virtual Reality (VR) technologies and are limited to 2D displays. For example, Arena 3D [37] mixes 3D and 2D properties by arranging data in multilayered graphs, with each layer drawn in 2D and representing a different data type. While the tool includes several layout and clustering algorithms for each layer, and has zoom and rotation features, it does not offer global layout and clustering algorithms that make full use of the third dimension [37]. 3DScapeCS [38] is a Cytoscape plugin written in Java, with built-in extensions of the classic 2D force-directed layouts. Users cannot add new layouts or functionalities, and it does not utilize 3D effects to improve comprehension (e.g. transparency or advanced shadow effects). BioLayoutExpress [39] (now Miru) is a stand-alone 3D application specifically for gene expression networks that offers three network layouts and a clustering method, but no edge bundling and only limited network topology statistics. Importantly, it is not freely available. In summary, 3D biomolecular network visualization is a nascent field. We need free open-source tools for biologists to visualize their networks, and for algorithm developers to add and test new methods that take advantage of the third dimension. Such a tool will also enable visualization designers to perform user studies to better understand the relative advantages of various 3D features. This is necessary, as how best to utilize features specific to 3D and how to take advantage of new 3D technologies are currently open research questions.
To the best of our knowledge, iCAVE is the first 3D, stereoscopic-3D and immersive-3D biomolecular network visualization tool that is open source, freely available and usable with commercial hardware/software. iCAVE introduces new built-in 3D algorithms for laying out nodes and their connections in 3D space and has built-in topology-based graph clustering algorithms. For example, it enables visual integration of multiple clusters or data types within the same graph as a multi-layered network (e.g. metabolomic, proteomic, genomic, GWAS-disease, protein-drug interactions). Users can also add their own layout or clustering algorithms. While not extensive, it includes a few built-in databases to assist in preliminary mapping of High-Throughput (HT) experimental data in the early discovery phase of network building. Customizable color, texture, size and layout options assist in displaying maximum information in a graph in an optimized manner. Users can easily select edge colors, weights and directions or bundle edges for simplified views. Data are input in a tab-delimited text file, while visual outputs can be saved as 2D snapshots or as movies configured with user-defined rotation, zoom and speeds. Additional reports on network statistics are provided in 2D. Overall, iCAVE enables network exploration in hypothesis-driven contexts in a manner that is flexible, collaborative and user friendly.

Results
Optional Stereoscopy. iCAVE users can turn 3D stereoscopy on or off during exploration. For example, consider rendering the 2D biomolecular network in Fig. 1A that represents a pathway affected by genomic alterations in glioblastoma [40]. Instead of the static 2D network in Fig. 1A, users can experience full 3D depth perception in the comfort of their own stereo-equipped computer (Fig. 1B), or inside a CAVE (Fig. 1D), using a simple 3D extension of a classical force-directed layout algorithm [41] (Fig. 1). Even users without a stereo-equipped computer can interact with the 3D network: they can use their mouse (in lieu of hand-held controls) to zoom in/out or rotate the network to a view without occlusions. Rotation and zoom enable viewing the network from different angles, such as the screenshot in Fig. 1C. User studies have shown that even simple 3D features like rotation help in identifying properties unique to complex networks [29]. Visualizations using stereoscopic 3D or immersive environments that enable inspection of a system from multiple perspectives have also been shown to make different properties of a system clearer [42]. Our case study supports this, as we observed a network feature that was not intuitive from the original 2D layout in Fig. 1A: nodes CBL and SPRY2 (marked with *) are connectors between two dense network regions (modules) (Fig. 1A-C). A targeted attack on these genes can split the network into two. We could not identify this in 2D. Such discoveries of network topological features, among others, give a richer, more intuitive and ultimately more insightful understanding of networks.
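A connector observation of this kind can also be checked computationally by deleting the candidate nodes and counting the connected components that remain. The following Python sketch does this on a hypothetical toy graph; the node names, the `toy` dictionary and the `count_components` helper are illustrative assumptions and are not taken from iCAVE or from the glioblastoma network.

```python
from collections import deque

def count_components(adj, removed=()):
    """Count connected components of an undirected graph after deleting `removed` nodes."""
    removed = set(removed)
    seen = set()
    components = 0
    for start in adj:
        if start in removed or start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first search over the surviving nodes
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in removed and nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
    return components

# Hypothetical toy network: two triangles joined only through connectors X and Y.
toy = {
    "A": {"B", "C", "X"}, "B": {"A", "C"}, "C": {"A", "B", "Y"},
    "X": {"A", "D"}, "Y": {"C", "F"},
    "D": {"X", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E", "Y"},
}
print(count_components(toy))               # intact network
print(count_components(toy, {"X", "Y"}))   # both connectors removed
```

In this toy case the network stays in one piece when only one connector is removed, and splits into two modules when both are removed.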
Addressing Large Networks. Important characteristics may be missed if users cannot interact with the complete network. In the simplest case, the nodes may form (i) dense sub-networks that are interconnected by a small number of connector nodes, which renders those connectors critical, or (ii) multiple networks (often one giant and a few smaller ones) where the smaller sub-networks may represent functional groups of importance, such as a critical enzyme complex. Hence, visualizing the complete network can be advantageous even if it is very large, to identify local patterns [43]. However, while the human brain has a remarkable capacity to visually identify patterns, enabling interpretation of data, visualizations of large networks may exhibit problems with display clutter, molecular positioning or perceptual tension, leading the user to misinterpret closely positioned molecules as related [44]. Such misinterpretations are inherent in the limitations of human visual perception, and have been well-studied in (Gestalt) psychology: people tend to organize visual elements into groups [45].
In 3D, elements that appear to form a pattern because of their visual positioning in one viewpoint can be interpreted correctly by rotating the image to a different viewpoint (e.g. Fig. 1). Furthermore, in networks that are denser or larger than that of Fig. 1, the potential 2D hairball effect can obscure important interactions. iCAVE users can simply navigate to a view without occlusions by moving their head, rotating the image, and zooming in or out, eliminating edge-crossings. To further address cluttering, iCAVE provides an edge-bundled display [46] option for visually bundling adjacent edges together, analogous to bundling electrical wires or cables. Bundling is extremely useful in identifying global patterns in very large networks and can suggest vulnerabilities as targets. Several layout algorithms built into iCAVE address the molecular positioning problem; depending on the topology of a network, one may work better than another. We suggest testing each to see which works best. We provide examples of how these features can help with exploring a network in the following sections.

New biological insights from networks with known 3D physical coordinates. Users can visualize physically constrained networks at multiple scales, from proteins (Fig. 2A) to the whole brain (Fig. 2B). Coupled with edge bundling, these can provide insights in hypothesis generation. For example, Fig. 2A represents a snapshot of the bacterial leucine transporter (LeuT) residue correlation network, where nodes represent the 3D coordinates of the alpha-carbon of a residue and edges represent the top 3,000 (Pearson) correlations between residue pairs from a Molecular Dynamics simulation (from Michael LeVine, personal communication). Remarkably, bundling the edges of this network reveals the highest-density correlation highways that travel through the substrate permeation core in the protein center, connecting extracellular and intracellular domains.
These highways enable users to identify specific residues that have dense correlations with the permeation core even if they are away from it, which is unexpected. These residues may have previously unidentified importance in protein structure and function and are therefore potential candidates for follow-up studies.
Automated Layouts utilize 3D for molecular positioning. Biomolecular networks tend to follow basic and reproducible organizing principles, and navigating the entire network provides a good initial understanding. The layout algorithm must address the complex problem of arranging the nodes to clearly convey the network topology, and at the same time be visually pleasant and user-friendly. iCAVE offers several network layout options to achieve these aims. Due to user familiarity, we extended variations of the force-directed layout to 3D: (i) the classical force-directed algorithm [41] treats the network as a physical system with edges analogous to springs and nodes to electrically charged particles that repel each other, and the final layout is established when the repulsive and attractive forces balance each other [47]; (ii) the Lin-log layout [48] is better suited for larger networks because it keeps highly connected nodes in close proximity with a minimal number of edge crossings; and (iii) the hybrid force-directed layout [49] partitions the graph into smaller units before applying the force-directed algorithm (see Methods). We further implemented two novel layout algorithms to take full advantage of immersive 3D.
Semantic levels layout algorithm segregates the network into separate layers (default 7) in the third dimension.
The layout of each layer is calculated with a 3D extension of the force-directed approach. Semantic layers layout can be especially useful for user-defined networks where the number of layers and node assignments to layers can correspond to different data types (e.g. a 2D projection in Fig. 4 and 3D video in Supplementary Video 3, with layer1: genes; layer2: diseases; layer3: drugs).
Hemispherical layout is a novel layout algorithm we have developed, that positions the network on the surface of a 3D hemisphere. The most connected node is positioned at the top center of the hemisphere. Then, the whole hemisphere surface is populated based on a decreasing rank-order of connectivity. The node positions are fixed and the edges are drawn on the hemisphere surface (e.g. see a 2D projection in Fig. 5C and 3D video in Supplementary Video 4).
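The placement rule described above (most connected node at the pole, remaining nodes filling the surface in decreasing order of connectivity) can be sketched in Python as follows. The spiral spacing, the tie-breaking by node name and the function name `hemispherical_layout` are assumptions made for illustration; the text does not specify how iCAVE distributes nodes below the pole.

```python
import math

def hemispherical_layout(adj, radius=1.0):
    """Place nodes on the upper hemisphere, highest-degree node at the pole.

    Assumption: nodes spiral down from the pole to the equator using the
    golden angle; only the pole placement and rank ordering come from the text.
    """
    ranked = sorted(adj, key=lambda n: (-len(adj[n]), str(n)))
    count = len(ranked)
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    positions = {}
    for rank, node in enumerate(ranked):
        # Height drops linearly from the pole (z = radius) to the equator (z = 0).
        z = radius * (1.0 - rank / max(count - 1, 1))
        ring = math.sqrt(max(radius * radius - z * z, 0.0))
        theta = rank * golden_angle
        positions[node] = (ring * math.cos(theta), ring * math.sin(theta), z)
    return positions

# Hypothetical star network: the hub should land on the pole.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
for name, xyz in hemispherical_layout(star).items():
    print(name, xyz)
```

Because the positions depend only on the degree ranking, they stay fixed once computed, which matches the description of fixed node positions with edges drawn over the surface.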
Each layout algorithm has unique strengths and we recommend that users test different options. Semantic layout is often ideal for hierarchical networks. Force-directed layout often captures the essence of large networks. Hemispherical layout leads to clean images with optional edge bundling (Fig. 5C and Supplementary Video 4).
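To make the spring-and-charge picture of the classical force-directed layout concrete, the sketch below runs a Fruchterman-Reingold style iteration with three coordinates per node. It is a simplified Python illustration and not iCAVE's implementation; the ideal spacing `k`, the cooling schedule, the iteration count and the function name `force_directed_3d` are all assumptions.

```python
import math
import random

def force_directed_3d(adj, iterations=200, seed=0):
    """Fruchterman-Reingold style layout with positions in 3D instead of 2D."""
    rng = random.Random(seed)
    nodes = list(adj)
    k = (1.0 / max(len(nodes), 1)) ** (1.0 / 3.0)  # assumed ideal spacing in a unit cube
    pos = {v: [rng.uniform(-0.5, 0.5) for _ in range(3)] for v in nodes}
    temperature = 0.1
    for _ in range(iterations):
        disp = {v: [0.0, 0.0, 0.0] for v in nodes}
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                delta = [a - b for a, b in zip(pos[u], pos[v])]
                dist = max(math.sqrt(sum(d * d for d in delta)), 1e-9)
                # Every pair repels (k^2 / d); connected pairs also attract (d^2 / k).
                force = k * k / dist
                if v in adj[u]:
                    force -= dist * dist / k
                for axis in range(3):
                    step = delta[axis] / dist * force
                    disp[u][axis] += step
                    disp[v][axis] -= step
        for v in nodes:
            length = max(math.sqrt(sum(d * d for d in disp[v])), 1e-9)
            scale = min(length, temperature) / length  # cap each move by the temperature
            for axis in range(3):
                pos[v][axis] += disp[v][axis] * scale
        temperature *= 0.98  # cool down so the layout settles
    return {v: tuple(p) for v, p in pos.items()}

# Hypothetical two-node network: the pair settles close to the ideal spacing k.
pair = force_directed_3d({"a": {"b"}, "b": {"a"}})
print(math.dist(pair["a"], pair["b"]))
```

The all-pairs repulsion loop is quadratic in the number of nodes, so a sketch like this is only practical for small graphs.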

Statistics on Network Topological Properties.
Most real-world networks exhibit substantial and non-trivial features, where connections are neither purely regular nor random. iCAVE automatically generates and reports network topology statistics and centrality measures both graphically and in tabular form. These include the number of nodes, the number of edges, network diameter, node-betweenness centrality, closeness centrality, neighborhood connectivity, shortest path, topological coefficient, and node degree distribution properties of the network.

COMBO Database for Simultaneous Query of Multiple Data Types.
Publicly available biomolecular interaction data are often contained in massive databases [19]. While not comprehensive, iCAVE combines data from multiple resources into a single COMBO repository to enable quick queries. These include protein-protein in-

Illustrative Examples.
Example 1. Visualizing the complete global network, even if it is very large, can enable visual identification of a pattern. For example, consider a large probabilistic causal network constructed from human omental adipose tissue in a morbidly obese patient cohort in Fig. 3A. The network consists of 7,601 nodes, 13,979 edges [54].
Nodes are the genes expressed in tissue; edges are derived from a Bayesian network reconstruction algorithm that leverages DNA variation for causality. Here, we highlight nodes that represent a signature of genes causally associated with inflammatory bowel disease (IBD) SNPs or disease pathways. Notice that within this global view of the massive network, there is a pattern of the IBD genes clustering together, which visually supports the hypothesis of functional relatedness.
pression, mid-level contains 'information flow bottlenecks' and connections with miRNA and distal regions, revealing ideal drug targets. Such multi-layered heterogenous information integration assists in differentiating intra-level interconnections as well as inter-level edge types and node labels. Note that nodes in each layer are also arranged in 3D using 3D force-directed layout.

Example 3.
Visualizing the global network of interactions while scaling or coloring a subset of the nodes based on their specific properties can enable hypothesis support. In this example, the visualization helps support the principle that functionally significant and highly conserved genes tend to be more central in physical protein-protein and regulatory networks [56]. Based on this hypothesis, Fig. 3C visualizes a network of tolerance to loss-of-function (LoF) mutations and evolutionary conservation, with nodes for LoF-tolerant (blue) and essential (red) genes easily distinguishable [56]. Node size is based on degree centrality of a gene properties that enabled help support this hypothesis [57].

Graph Clustering To Identify Network Motifs.
Clustering is critical in network exploration, as biomolecules that cluster together tend to be functionally related. iCAVE offers the following graph clustering algorithms:
Edge-Betweenness clustering (EBC). The edge betweenness (EB) of an edge is the number of shortest paths going through it. An edge with a high EB value connects multiple communities. At each step, the EBC algorithm removes the edge with the highest EB value, until it has optimized a modularity metric that measures how unlikely the in-cluster degree of a node is in comparison to a random edge. EBC [58] is an attractive algorithm since it does not require an a priori estimate of the number of clusters, unlike a majority of existing graph clustering algorithms.
Modularity clustering (MC) uses the first eigenvector of the modularity matrix to assign nodes to clusters [60].
While ideal for weighted networks, MC delivers intuitive layouts for networks that do not have weights as well.
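As an illustration of assigning nodes by the first eigenvector of the modularity matrix, the sketch below performs a single bisection of an unweighted graph: it builds B[i][j] = A[i][j] - k_i k_j / (2m), finds the leading eigenvector by shifted power iteration, and splits nodes by sign. This is a minimal pure-Python sketch under stated assumptions (one bisection only, unweighted edges, hypothetical function and node names), not the routine used inside iCAVE.

```python
def leading_eigenvector_split(adj, iterations=500):
    """Bisect an undirected graph by the sign of the leading eigenvector
    of the modularity matrix B[i][j] = A[i][j] - k_i * k_j / (2m)."""
    nodes = sorted(adj)
    n = len(nodes)
    degree = [len(adj[v]) for v in nodes]
    two_m = sum(degree)
    B = [[(1.0 if nodes[j] in adj[nodes[i]] else 0.0) - degree[i] * degree[j] / two_m
          for j in range(n)] for i in range(n)]
    # Shift the spectrum so power iteration converges to the most positive eigenvalue.
    shift = max(sum(abs(x) for x in row) for row in B)
    vec = [1.0 + 0.01 * i for i in range(n)]  # fixed, non-uniform starting vector
    for _ in range(iterations):
        new = [sum(B[i][j] * vec[j] for j in range(n)) + shift * vec[i] for i in range(n)]
        norm = sum(x * x for x in new) ** 0.5
        vec = [x / norm for x in new]
    group_a = {nodes[i] for i in range(n) if vec[i] >= 0}
    return group_a, set(nodes) - group_a

# Hypothetical network: two triangles joined by a single edge (a3-b1).
two_triangles = {
    "a1": {"a2", "a3"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2", "b1"},
    "b1": {"a3", "b2", "b3"}, "b2": {"b1", "b3"}, "b3": {"b1", "b2"},
}
print(leading_eigenvector_split(two_triangles))
```

A non-uniform starting vector is used because the all-ones vector is itself an eigenvector of the modularity matrix (eigenvalue 0), so starting from it would never leave that direction.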
Layout Options for Cluster Visualization. iCAVE can easily visualize the clusters generated by iCAVE or another tool. By default, each cluster is positioned in space with the force-directed layout [41], analogous to node positioning. Every cluster is embedded inside a transparent bubble, with members and their connections organized using the hemispherical layout. This arrangement provides a visual aesthetic, and (optional) edge bundling further clarifies the global topology (i.e. thicker bundles for high intra-cluster connectivity). Users can choose alternative layouts for cluster bubble positioning. Lin-log cluster layout is a variation of the force-directed model [41], where highly connected clusters are arranged in closer proximity.

Methods
OpenGL API for visualization. New algorithms are added as separate .cpp files and the corresponding header files are imported to the main program (vrnetview.cpp).
Label Creation. Since VRUI offers limited label creation options that render low quality and unreadable text, we developed texture mapping for high quality rendering. Supplementary Figure 1 illustrates VRUI vs. iCAVE labels.

Network Topological Properties
iCAVE automatically calculates the following network properties, rank-orders nodes based on these and represents their distribution both graphically and in tabular form:
Node degree property yields hubs. Generally, only a few biomolecules (hubs) have many network interactions [62,63]. Hubs are often central in mediating interactions among the less connected biomolecules [64,65].
Neighborhood connectivity metric assists in identifying modularity, where small interconnected subgraphs may potentially represent specific enzymes, structures or processes [66,67] and provide significant insights to perturbed disease mechanisms. For example, the degree of gene co-expression correlates strongly with the complexity of an embedded motif [68].
Network average and local clustering coefficients quantify connectivity of the whole network or a single node. Local clustering coefficient is the ratio between the number of edges that connect the neighbors of a node versus the maximum possible number of edges. The network average clustering coefficient is the average of the local clustering coefficients of all nodes [69]. Only nodes that belong to networks with >3 nodes are considered. The range of coefficient values varies from 0 (no interconnection), to 1 (perfect interconnection).
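The two coefficients defined above can be computed directly from an adjacency structure. The following Python sketch does so for a small hypothetical graph; assigning a coefficient of 0 to nodes with fewer than two neighbors is an assumption of this sketch, and the function names are illustrative.

```python
def local_clustering(adj, node):
    """Edges among the neighbors of `node` divided by the maximum possible number."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: undefined cases are reported as 0
    links = sum(1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

# Hypothetical graph: a triangle (a, b, c) with one pendant node d attached to c.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(local_clustering(g, "a"))  # both neighbors of a are linked
print(local_clustering(g, "c"))  # one of three possible neighbor pairs is linked
print(average_clustering(g))
```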
Network diameter is the length of shortest path between two farthest nodes. Unconnected nodes are not considered. Irregular networks usually have small diameters, while regular networks have large diameters.
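For an unweighted network, the diameter as defined here can be obtained by running a breadth-first search from every node and keeping the largest finite distance, which automatically ignores unconnected node pairs. The Python sketch below illustrates this; the function names and the example graph are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def network_diameter(adj):
    """Longest shortest path over all connected node pairs."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

# Hypothetical graph: a five-node path plus a separate two-node component.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"},
        "x": {"y"}, "y": {"x"}}
print(network_diameter(path))
```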
Betweenness centrality is a global metric on the importance of a node, which is equal to the number of shortest paths from all vertices to all others that pass through that node, calculating the load on a node [71].
Real-world scale-free networks usually involve short path lengths across the network, and a few nodes have high betweenness centrality. Connector or high-traffic biomolecules that are vulnerable to targeted attacks usually suggest potential non-hub drug targets [72-74].
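Betweenness centrality can be computed for unweighted networks with Brandes' algorithm, sketched below in Python. When several shortest paths connect a pair of nodes, this standard formulation credits each intermediate node with the fraction of those paths passing through it. Using this particular algorithm, and the function name, are assumptions of the sketch; the text does not state how iCAVE computes the metric.

```python
from collections import deque

def betweenness_centrality(adj):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalised)."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        order = []                       # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        sigma = {v: 0 for v in adj}      # number of shortest paths from source
        dist = {v: -1 for v in adj}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):        # accumulate dependencies back toward source
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}

# Hypothetical five-node path: the middle node carries the most shortest paths.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
print(betweenness_centrality(chain))
```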

Shared nearest neighbors: A similarity metric based on the sharing of nearest neighbors between any two nodes. Particularly useful in network topology-based motif, sub-graph or cluster identification.
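A shared-neighbor similarity can be sketched as the number of neighbors two nodes have in common, optionally normalised by the size of their combined neighborhood (Jaccard index). The normalisation and the function names below are assumptions for illustration; the text does not specify the exact form used in iCAVE.

```python
def shared_neighbors(adj, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(adj[u] & adj[v])

def jaccard_similarity(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

# Hypothetical graph in which a and b share the neighbors c and d.
net = {"a": {"c", "d"}, "b": {"c", "d", "e"}, "c": {"a", "b"}, "d": {"a", "b"}, "e": {"b"}}
print(shared_neighbors(net, "a", "b"))
print(jaccard_similarity(net, "a", "b"))
```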
Shortest paths: Quantifies the importance of a node within the network, calculated by the number of shortest paths going through the node. Purely random graphs exhibit a small average shortest path length (~ the logarithm of the number of nodes) along with a small clustering coefficient.

Layout Algorithms.
A graph G = (V, E) with V = {1, ..., n} represents a binary relation E over the node set V. iCAVE both extends classical layouts to 3D and offers novel algorithms. Based on the underlying topology, a user can choose the layout that best helps with data interpretation.

Lin-log layouts.
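We used the r-PolyLog [48] energy model to implement the node-repulsion and edge-repulsion LinLog models. For all r ∈ R with r > 0, the node-repulsion energy of a layout p, following the definition in [48], is:
$$U_{\mathrm{node}}(p) = \sum_{\{u,v\} \in E} \frac{1}{r}\,\lVert p(u) - p(v) \rVert^{r} \;-\; \sum_{\{u,v\} \subseteq V,\; u \neq v} \ln \lVert p(u) - p(v) \rVert$$
where p(u) is the position of node u. Edge-repulsion energy is:
$$U_{\mathrm{edge}}(p) = \sum_{\{u,v\} \in E} \frac{1}{r}\,\lVert p(u) - p(v) \rVert^{r} \;-\; \sum_{\{u,v\} \subseteq V,\; u \neq v} \deg(u)\,\deg(v)\,\ln \lVert p(u) - p(v) \rVert$$
where deg(u) is the number of edges incident to node u. At r=3, the 3-PolyLog model reduces to FR and at r=1 to the LinLog model. LinLog models group nodes according to cut density and the normalized cut; therefore, the layout leads to graph clustering.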
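To show how such an energy scores a given layout, the Python sketch below evaluates it for fixed 3D positions, assuming the r-PolyLog form of [48]: an attraction term ||p(u) - p(v)||^r / r summed over edges, minus a logarithmic repulsion term ln ||p(u) - p(v)|| summed over all node pairs, weighted by deg(u) deg(v) in the edge-repulsion variant. The function name, argument names and example coordinates are hypothetical, and the sketch only evaluates the energy; it does not minimise it.

```python
import math
from itertools import combinations

def polylog_energy(adj, pos, r=1.0, edge_repulsion=False):
    """Energy of layout `pos` under the assumed r-PolyLog model (lower is better)."""
    attraction = 0.0
    repulsion = 0.0
    for u, v in combinations(list(adj), 2):
        d = math.dist(pos[u], pos[v])
        if v in adj[u]:
            attraction += d ** r / r              # pulls connected nodes together
        weight = len(adj[u]) * len(adj[v]) if edge_repulsion else 1
        repulsion += weight * math.log(d)         # pushes every node pair apart
    return attraction - repulsion

# Hypothetical three-node path laid out along the x axis with unit spacing.
path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
coords = {"a": (0.0, 0.0, 0.0), "b": (1.0, 0.0, 0.0), "c": (2.0, 0.0, 0.0)}
print(polylog_energy(path3, coords, r=1.0))   # LinLog case
print(polylog_energy(path3, coords, r=3.0))   # FR-like case
```

Changing only `r` switches between the LinLog (r=1) and FR-like (r=3) attraction terms while the repulsion term stays the same.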

Semantic levels layout.
This layout is ideal for integrative analysis of multiple data resources (e.g. genotype, phenotype, drugs, proteins, metabolites). Initially, the FR algorithm is performed in 2D. Then, multiple equidistant levels (default = 7) are created in the z-dimension. Based on network topology, we consecutively assign the nodes to one of the layers. The iCAVE user interface allows manual adjustment of the number of layers and the distance between them. If layers are not predefined, we suggest experimenting with different options.
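The stacking step can be illustrated with a short Python sketch that takes an existing 2D layout and a layer assignment per node and lifts each node onto an equidistant z-level. The 2D coordinates, the node and layer names, and the function `semantic_levels` are hypothetical; how iCAVE derives layer assignments from topology is not reproduced here.

```python
def semantic_levels(xy, layer_of, layer_order, spacing=1.0):
    """Lift a 2D layout into 3D by stacking node groups on equidistant z-levels."""
    z_of_layer = {name: index * spacing for index, name in enumerate(layer_order)}
    return {node: (x, y, z_of_layer[layer_of[node]]) for node, (x, y) in xy.items()}

# Hypothetical user-defined network with three data types as layers.
xy = {"geneA": (0.0, 0.1), "diseaseB": (0.5, 0.5), "drugC": (1.0, 0.2)}
layer_of = {"geneA": "genes", "diseaseB": "diseases", "drugC": "drugs"}
layout3d = semantic_levels(xy, layer_of, ["genes", "diseases", "drugs"], spacing=2.0)
print(layout3d)
```

Changing `layer_order` or `spacing` corresponds to the manual adjustment of the number of layers and the distance between them described above.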

Fig. 3. iCAVE print-ready images of networks in
Chain Clustering of the same network based on its connectivity, available as one of the clustering options in iCAVE. Each cluster is represented inside a spherical bubble. While topology suggests that most similar metabolites cluster together, this is not always the case, as shown. In all panels, addition of metabolite labels is user-optional.
Visualizations of complex biomolecular interaction networks are critical in multiple systems. The volume of data represented in biomolecular interaction networks is growing at an unprecedented rate due to the increasing prevalence of high-throughput experimental techniques. While studies of systems consisting of thousands of biomolecules are now routine, currently available visualization tools do not scale well with large datasets. The problem is compounded by recent sequencing technologies that yield massive data. In order to achieve a better understanding of such complex processes, it is important to maximally integrate data across multiple dimensions, pushing the limits of current visualization tools. Clearly, there is a strong need for new complex, heterogeneous data visualization solutions.
In our manuscript, we present a new integrative visualization platform, interactome-CAVE (iCAVE) for visualizing large and complex networks in 3D. Users can explore networks (i) in 3D using a desktop; (ii) in stereoscopic 3D using 3D-vision glasses and a desktop; (iii) in immersive 3D within a CAVE-type environment. iCAVE incorporates several layout algorithms to automatically generate 3D visualizations that solve the scalability limitations of traditional representations. Built-in network topology analyses enable effective representations that maximize understanding of the underlying network structures of large, dense, layered or clustered networks. A user can perform simultaneous integrative visualizations of multiple database resources utilizing directionality, weight or other network properties with different layout, textures, colors or densities. Portable between desktops and CAVE environments, iCAVE provides a freely available resource for gaining novel insights from complex HT datasets.
iCAVE addresses an existing need in the user community and has already been employed in several studies (e.g. Khurana Cell Systems, 2016). Overall, we describe a novel and user-friendly complex network visualization software that can greatly empower investigators from diverse biomedical fields in gaining novel insights from massive, heterogeneous datasets, and is therefore of general interest to the broad readership of GigaScience.