OrthoVenn3: an integrated platform for exploring and visualizing orthologous data across genomes

Abstract Advancements in comparative genomics research have led to a growing interest in studying species evolution and genetic diversity. To facilitate this research, OrthoVenn3 has been developed as a powerful, web-based tool that enables users to efficiently identify and annotate orthologous clusters and infer phylogenetic relationships across a range of species. The latest upgrade of OrthoVenn includes several important new features: enhanced accuracy of orthologous cluster identification, improved visualization capabilities for numerous sets of data, and integrated phylogenetic analysis. Furthermore, OrthoVenn3 now provides gene family contraction and expansion analysis to help researchers better understand the evolutionary history of gene families, as well as collinearity analysis to detect conserved and variable genomic structures. With its intuitive user interface and robust functionality, OrthoVenn3 is a valuable resource for comparative genomics research. The tool is freely accessible at https://orthovenn3.bioinfotoolkits.net.


INTRODUCTION
Comparative genomics studies have emerged as a critical area of research in the life sciences due to the explosive growth of genome sequencing (1,2). Such studies can be performed on different aspects of the genome and yield multiple viewpoints on the organisms (3-5). To facilitate this type of analysis, OrthoVenn, an online whole-genome comparative analysis tool, was developed. It was first released in 2015 and updated in 2019, and both versions were published in the Nucleic Acids Research web server issue (6,7). OrthoVenn automates the identification and annotation of orthologous clusters and offers rich data visualization through occurrence tables, Venn diagrams, network diagrams, and so on, and it has drawn broad attention and widespread usage.
This paper introduces a new version of OrthoVenn with multiple updates that improve its functionality for comparative genomics research. First, we increased the data capacity to include more species and added gene annotation information. Second, we integrated OrthoFinder2 (8), a widely used method for identifying orthologous clusters, to enhance the accuracy of OrthoVenn3. Third, we incorporated UpSet, a tool for visualizing set intersections in a matrix layout (9), making it easier to identify unique and shared clusters among numerous species. UpSet is well suited to quantitative analysis when more than six sets of species are compared. In addition to these features, OrthoVenn3 wraps new tools for comprehensive comparative genomics analysis, including (i) phylogenetic analysis, allowing for the inference of evolutionary relationships among species based on their orthologous clusters, (ii) gene family contraction and expansion analysis, which provides insight into the gain or loss of gene families among different species and (iii) collinearity analysis, which helps to identify regions of genomic rearrangement and evolution.
In summary, OrthoVenn3 is an effective and user-friendly online tool for comparative genomics research, providing researchers with intuitive data visualization and diverse analysis capabilities. The tool requires protein sequences in FASTA format as input, with optional gene annotation information in BED format. OrthoVenn3 offers multiple outputs, including the UpSet table, occurrence table, phylogenetic tree, and collinearity graph, which provide users with various perspectives on their data. As an illustration, we conducted an interspecies comparative analysis on seven plant species to showcase the utility of OrthoVenn3. The results of this analysis are discussed in the 'CASE STUDY' section, highlighting the tool's ability to identify orthologous clusters, visualize their distribution across different species and infer evolutionary relationships. Overall, OrthoVenn3 is a comprehensive and versatile tool for comparative genomics research, potentially advancing our understanding of the evolutionary relationships and genetic diversity among diverse species.
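The input formats named above can be illustrated with a minimal Python sketch. It is not OrthoVenn3's own parser: the function names, the example records, and the assumption that the fourth BED column carries the same identifier as the FASTA header are all hypothetical and serve only to show how the two files might be checked against each other before upload.

```python
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {sequence_id: sequence} from FASTA-formatted text."""
    records: Dict[str, str] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def parse_bed(text: str) -> List[Tuple[str, int, int, str]]:
    """Return (chrom, start, end, name) tuples from the first four BED columns."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        rows.append((fields[0], int(fields[1]), int(fields[2]), fields[3]))
    return rows


# Hypothetical example data; real inputs would be read from files.
example_fasta = ">geneA hypothetical protein\nMKT\nLLV\n>geneB\nMSS\n"
example_bed = "chr1\t100\t400\tgeneA\nchr1\t900\t1500\tgeneB\n"

proteins = parse_fasta(example_fasta)
genes = parse_bed(example_bed)

# Assumption: BED names are expected to match FASTA identifiers.
missing = sorted(set(proteins) - {name for _, _, _, name in genes})
```

Here `missing` lists proteins that have no annotation row, which is one simple consistency check a user might run on their own data.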

DATA HUB UPDATES
OrthoVenn3 has expanded its built-in database, which is sourced from the Ensembl database (2022 version) (10), increasing the number of species from 540 to 733. This expansion provides access to a total of 11 960 346 protein sequences. As a new feature, OrthoVenn3 has also added gene annotation information for the protein sequences. To provide easy access to this information, protein sequence and gene annotation data for each species are stored separately in six built-in databases for vertebrates (164 species), metazoa (107 species), protists (90 species), fungi (139 species), plants (92 species) and bacteria (141 species). To simplify the search for species of interest, OrthoVenn3 has introduced a species search box, enabling users to search for and add species by name. Furthermore, to ensure the accuracy of species information, we intend to update it regularly, aligning our update frequency with that of the Ensembl database. Overall, these developments offer users improved functionality, greater accessibility, and more comprehensive coverage, thereby increasing the utility of OrthoVenn3 as a tool for comparative genomics research.

Orthologous cluster identification algorithm and visualization
Identifying orthologous clusters is critical for comparative genomic studies (11,12), as it enables the comparison of evolutionary relationships between genes across different species (13). To improve the accuracy and efficiency of this process, OrthoVenn3 has incorporated the OrthoFinder algorithm, which is the most balanced orthologous gene cluster identification algorithm according to the latest benchmark test results of Quest for Orthologs (14). However, the classic approach, the Venn diagram, cannot effectively visualize more than six intersecting sets. Hence, OrthoVenn3 has adopted UpSet tables (9) for data visualization, which allow intersections to be viewed among more than 30 sets. The data visualization methods employed by OrthoVenn3 are particularly effective for analyzing numerous species, providing users with a variety of options for exploring the results of orthologous gene cluster identification. The UpSet table presents the number of orthologous clusters in each species, as well as the number of shared orthologous gene clusters among species, in a clear and intuitive format (Figure 1A). By clicking on the nodes of the UpSet table, users can access Gene Ontology (GO) (32) term annotations for individual orthologous clusters, allowing for more detailed analysis of the results.
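The counts behind an UpSet table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OrthoVenn3 implementation: the cluster names, the species labels A to C, and the choice to count each cluster once under the exact combination of species it contains are all hypothetical.

```python
from collections import Counter

# Hypothetical input: cluster ID -> set of species with at least one member gene.
clusters = {
    "cluster1": {"A", "B", "C"},
    "cluster2": {"A", "B"},
    "cluster3": {"A", "B", "C"},
    "cluster4": {"C"},
    "cluster5": {"B", "C"},
}


def upset_counts(cluster_species):
    """Count clusters per exact species combination, largest count first."""
    combos = Counter(frozenset(s) for s in cluster_species.values())
    rows = ((tuple(sorted(combo)), n) for combo, n in combos.items())
    return sorted(rows, key=lambda item: (-item[1], item[0]))


def clusters_per_species(cluster_species):
    """Count how many clusters contain each species (the per-set totals)."""
    totals = Counter()
    for species in cluster_species.values():
        totals.update(species)
    return dict(totals)


intersection_rows = upset_counts(clusters)
set_sizes = clusters_per_species(clusters)
```

Each row of `intersection_rows` corresponds to one column of dots in the matrix layout, and `set_sizes` corresponds to the per-species totals. Because every cluster falls into exactly one combination, the table grows with the number of combinations actually observed rather than with every possible overlap, which is why this layout stays readable where a Venn diagram does not.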
OrthoVenn3 also provides users with the ability to select target species for visualization, as well as two visualization modes: UpSet and occurrence tables. The use of bar charts (Figure 1E) to display the number of genes, orthologous groups, and singletons in each species helps users to better interpret the analysis results. Additionally, the identification of single-copy gene clusters (Figure 1F) is supported, and users can access GO term annotations for these clusters by clicking on the relevant number. OrthoVenn3 further offers a Cluster ID and Protein ID search function (Figure 1C) that allows users to retrieve annotation information from the results of orthologous clusters. The Blast module allows for the upload of protein or nucleotide sequences for comparison with the output results (Figure 1D). Finally, users can select up to 6 species to draw classic Venn diagrams or Edwards diagrams (Figure 1B), and the heatmap module (Figure 1G) supports viewing the shared orthologous gene clusters between species.
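As a rough illustration of the occurrence table and of single-copy cluster detection, the following Python sketch counts genes per species in each cluster and keeps the clusters in which every selected species contributes exactly one gene. The data layout and all identifiers are hypothetical; the source does not describe how OrthoVenn3 stores cluster membership internally.

```python
from collections import Counter

# Hypothetical species selection and cluster membership (species, gene ID).
species = ["sp1", "sp2", "sp3"]
cluster_members = {
    "c1": [("sp1", "g1"), ("sp2", "g2"), ("sp3", "g3")],
    "c2": [("sp1", "g4"), ("sp1", "g5"), ("sp2", "g6"), ("sp3", "g7")],
    "c3": [("sp1", "g8"), ("sp2", "g9")],
}


def occurrence_row(members, species_order):
    """Return the gene count for each species, in the given column order."""
    counts = Counter(sp for sp, _ in members)
    return [counts.get(sp, 0) for sp in species_order]


# One row per cluster, one column per species.
occurrence_table = {
    cid: occurrence_row(members, species)
    for cid, members in cluster_members.items()
}

# A single-copy cluster has exactly one gene from every selected species.
single_copy = [
    cid for cid, row in occurrence_table.items() if all(n == 1 for n in row)
]
```

In this toy example only `c1` qualifies: `c2` has two genes from `sp1`, and `c3` is absent from `sp3`.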

Phylogenetic analysis function
Phylogenetic analysis is widely used in species classification, phylogenetic reconstruction, and inference of evolutionary history (15,16). In response to users' feedback, we have added a phylogenetic analysis function in the latest version of OrthoVenn3. We used FastTree (18) to construct phylogenetic trees based on conserved single-copy gene sequences. Single-copy gene clusters, which contain only one gene from each species, are considered independent evolutionary units among species (17). FastTree employs the maximum likelihood method (19) to analyze a large number of sequences at a faster speed, ensuring accuracy while significantly reducing analysis time. This makes it possible to construct phylogenetic trees with large datasets of multiple species and helps readers more accurately understand biodiversity's origin and evolutionary history.
OrthoVenn3 also implements a tree dendrogram feature that allows users to visualize phylogenetic trees with customizable styles that classify and label various developmental branches or nodes. Users can change the color of nodes and branches by clicking on the color block at the top of the diagram (see Figure 2A). The Bar-Color button changes the color of the statistical bar chart that displays the number of orthologous clusters for each species. Additionally, OrthoVenn3 supports specifying a species as the root node, which allows users to adjust the structural order of the phylogenetic tree. Users can export results in SVG and PNG formats with custom styles.
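One common way to prepare single-copy clusters for tree building is to concatenate their alignments into a single supermatrix with one row per species. The source does not state whether OrthoVenn3 does this, so the Python sketch below is an assumption for illustration only; it also assumes each cluster has already been aligned, and all function and species names are hypothetical.

```python
def concatenate_alignments(alignments, species_order):
    """Join per-cluster alignments into one sequence per species.

    alignments: list of {species: aligned_sequence}, one dict per
    single-copy cluster (assumed to be aligned beforehand).
    """
    parts = {sp: [] for sp in species_order}
    for aln in alignments:
        if set(aln) != set(species_order):
            raise ValueError("a single-copy cluster needs one sequence per species")
        if len({len(seq) for seq in aln.values()}) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        for sp in species_order:
            parts[sp].append(aln[sp])
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


def to_fasta(sequences):
    """Serialize {name: sequence} as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())


# Hypothetical toy alignments for two single-copy clusters.
example_alignments = [
    {"sp1": "MK-T", "sp2": "MKAT", "sp3": "MR-T"},
    {"sp1": "LL", "sp2": "LV", "sp3": "L-"},
]
supermatrix = concatenate_alignments(example_alignments, ["sp1", "sp2", "sp3"])
fasta_text = to_fasta(supermatrix)
```

A file written from `fasta_text` has equal-length rows, which is the form of input that alignment-based tree builders such as FastTree expect.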

Gene family contraction and expansion analysis function
Gene families are groups of genes with similar structures and functions that can undergo expansions or contractions during evolution, potentially contributing to species differences (20,21). As such, analyzing gene family expansions and contractions is an essential part of genomic research (22). OrthoVenn3 now includes this functionality, offering a rich and customizable visualization approach that is both intuitive and interactive, with a wide range of options for presenting results.
OrthoVenn3 provides visualizations that let users easily view changes in the contraction and expansion of gene families. Gene family size changes can be visualized using a pie chart that shows the number of contracted (purple) and expanded (blue) gene families, intuitively displaying the evolutionary history of gene families and differences between species (Figure 2B). Users can customize the color and size of the pie chart by clicking the button. To help users further understand the genetic mechanisms behind the phenotypic differences of species, OrthoVenn3 supports GO term annotation of gene families with contractions and expansions. Users can view the functional annotation information by clicking on the numbers of the contracted or expanded gene families. Through gene family contraction and expansion analysis, users can gain insights into the evolutionary relationships between different species and the evolutionary history of gene families.

Collinearity analysis function
Analyses of collinearity are crucial in evolutionary studies because they enable the detection and comparison of changes in genome structure and composition, including gene family expansions and transposon insertions (23). In our research, we utilized the MCScanX program (24) to identify collinearity between chromosomes of different species. This new feature makes it possible to compare the structure and composition of chromosomes across various species, which aids in understanding their evolutionary relationships and inferring chromosome rearrangements and evolution (25). Additionally, OrthoVenn3 provides two scaling models, the global scale and the in-species scale, to display collinearity. The global scale is suitable for displaying collinearity between chromosomes of similar length, as it draws chromosomes in proportion to their length across species, whereas the in-species scale draws chromosomes in proportion to their length within a single species. The in-species scale helps avoid incongruous visualization effects caused by differences in chromosome length among species, improving the quality of the visualization results (Figure 3A, B). OrthoVenn3 allows users to search and label multiple genes on chromosomes simultaneously, enabling users to view the collinearity relationships among genes and their distribution regions on chromosomes.
Nucleic Acids Research, 2023, Vol. 51, Web Server issue W401
OrthoVenn3 supports collinearity analysis not only for interspecies comparisons but also for orthologous clusters. This feature enables users to identify the gain and loss of genes within orthologous clusters and investigate the functional evolution of gene families (26). The collinearity analysis results are presented with orthologous genes shown in green by default, while collinear genes are highlighted in orange when users hover over them (Figure 3C).
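The difference between the global scale and the in-species scale described above can be made concrete with a small Python sketch. It reflects one reading of the description, not the tool's actual drawing code: the function names, the track width, and the toy chromosome lengths are hypothetical, and gaps between chromosomes are ignored.

```python
def global_scale(chrom_lengths, track_width):
    """Share one length-to-width factor across all species.

    The factor is set by the species with the largest total length, so a
    species with short chromosomes occupies only part of its track.
    """
    longest_total = max(sum(chroms.values()) for chroms in chrom_lengths.values())
    return {
        sp: {c: track_width * n / longest_total for c, n in chroms.items()}
        for sp, chroms in chrom_lengths.items()
    }


def in_species_scale(chrom_lengths, track_width):
    """Scale each species separately so that every track is filled."""
    widths = {}
    for sp, chroms in chrom_lengths.items():
        total = sum(chroms.values())
        widths[sp] = {c: track_width * n / total for c, n in chroms.items()}
    return widths


# Hypothetical chromosome lengths (arbitrary units) for two species.
lengths = {
    "spA": {"chr1": 300, "chr2": 100},
    "spB": {"chr1": 60, "chr2": 40},
}
global_widths = global_scale(lengths, track_width=1000)
in_species_widths = in_species_scale(lengths, track_width=1000)
```

Under the global scale, `spB` fills only a quarter of its track (150 + 100 out of 1000), whereas under the in-species scale both species span the full width. That illustrates why the second mode can give a more balanced picture when chromosome lengths differ strongly between species.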

CASE STUDY
To demonstrate the capabilities of OrthoVenn3, we conducted a comparative genomics study on Echinochloa crus-galli, focusing on its phylogenetic relationships and on gene families annotated with environmental adaptation and invasion, based on a recent study (27). Genomics data for E. crus-galli and six other plants were obtained from the Ensembl (2022 version) (10) database, and we employed the OrthoMCL algorithm for analysis. We used Diamond (28) with an e-value of 1e-2 and set the split time of A. thaliana-Z. mays to 150 million years ago, and that of O. sativa-Z. mays to 50 million years ago. The results can be found at https://orthovenn3.bioinfotoolkits.net/result/b2f35873861c470f9f299b415e585044/orthologous.
The results revealed a total of 36 907 gene families, comprising 9052 highly conserved orthologous clusters and 28 265 families identified in the E. crus-galli genome. Notably, some gene families are involved in critical biological processes, such as signal transduction and growth regulation. The phylogenetic tree constructed from the analysis indicated that E. crus-galli is most closely related to S. italica, with an estimated split time of approximately 21.44 million years ago (Figure 2A, B).
Through the analysis of phylogenetic relationships, we gain insight into the temporal and spatial relationships of Echinochloa crus-galli during evolution, and changes in gene families may lead to changes in function. In our investigation of environmental adaptation and invasion, we identified 559 expanded gene families and 50 contracted gene families. These gene families were further annotated with GO terms and were found to be associated with functions such as monooxygenase activity, transferase activity, and glutathione metabolism. Previous studies (29) have demonstrated that these gene families are commonly associated with detoxification and weed resistance to synthetic herbicides, which supports the results of OrthoVenn3 annotation. Furthermore, OrthoVenn3's collinearity analysis revealed that certain chromosomal regions of E. crus-galli have good collinearity with Z. mays, indicating that these may be ancestral chromosomal regions. Other regions may have undergone rearrangement during evolution, providing insight into the evolutionary relationship and history between species.

FUTURE DIRECTIONS
OrthoVenn3 is a versatile web-based tool for comparative genomics analysis that enables the analysis and visualization of genomics data in a single platform. However, traditional methods based on sequence similarity and topological structure may no longer meet the needs of orthologous cluster research (11,30). To overcome these limitations, the integration of OrthoVenn with deep learning technology is a promising direction for future research.
Deep learning is a machine learning method based on artificial neural networks that can automatically extract high-level features from data and achieve precise predictions through feature learning and model training (31). For the analysis of orthologous clusters, deep learning technology will be trained on sequence and structural features to achieve more accurate prediction and recognition of orthologous clusters. As deep learning technology can handle large-scale, complex, and high-dimensional data and perform transfer learning on different datasets, it has the potential to enhance the accuracy and efficiency of orthologous cluster identification and analysis.
Therefore, future versions of OrthoVenn are expected to leverage deep learning technology to become essential tools for comparative genomics research. By automatically extracting features from data, deep learning can provide more accurate predictions and enhance the analysis efficiency of orthologous clusters. This approach has great potential to improve our understanding of the evolution and function of genes across different species, leading to breakthroughs in biomedical research and biotechnology.

DATA AVAILABILITY
The tool is freely accessible at https://orthovenn3.bioinfotoolkits.net.