PLAZA 3.0: an access point for plant comparative genomics

Comparative sequence analysis has significantly altered our view on the complexity of genome organization and gene functions in different kingdoms. PLAZA 3.0 is designed to make comparative genomics data for plants available through a user-friendly web interface. Structural and functional annotation, gene families, protein domains, phylogenetic trees and detailed information about genome organization can easily be queried and visualized. Compared with the first version released in 2009, which featured nine organisms, the number of integrated genomes is more than four times higher, and now covers 37 plant species. The new species provide a wider phylogenetic range as well as a more in-depth sampling of specific clades, and genomes of additional crop species are present. The functional annotation has been expanded and now comprises data from Gene Ontology, MapMan, UniProtKB/Swiss-Prot, PlnTFDB and PlantTFDB. Furthermore, we improved the algorithms to transfer functional annotation from well-characterized plant genomes to other species. The additional data and new features make PLAZA 3.0 (http://bioinformatics.psb.ugent.be/plaza/) a versatile and comprehensible resource for users wanting to explore genome information to study different aspects of plant biology, both in model and non-model organisms.


INTRODUCTION
Since the introduction of next-generation sequencing technologies, the price for sequencing a new genome has dropped considerably. While in the past almost exclusively genomes from model organisms were sequenced, the decrease in costs has allowed numerous other plant species with agricultural, economic, environmental or evolutionary importance to be sequenced more recently (1). Although sequencing genomic DNA has become accessible to a wide range of researchers, many challenges related to the subsequent data analysis remain, especially for species with large genomes or lacking resources to facilitate genome analysis. The extraction of biological knowledge from a genome sequence, through the detection of similarities and differences with genomes of closely or more distantly related species, is an important concept. By using such comparative approaches, (i) knowledge can be transferred from model to non-model organisms (2), (ii) insights can be gained into the evolution of specific genes or entire metabolic and signaling pathways (3), (iii) genes of importance for niche-specific plant adaptations can be identified (4) and (iv) large-scale genomic events, such as whole-genome duplications (WGDs), can be unveiled (5). As the number of potential pairwise comparisons grows quadratically with the number of available genomes, such comparative analyses require considerable computational resources. Furthermore, the increase in data poses challenges for efficient storage and retrieval of data, as well as the visualization of data in an accessible and human-interpretable way. Therefore, integrating genomic data from multiple species to generate new biological insights through comparative genomics remains important and challenging.
To overcome these issues, several online comparative genomics platforms are available, each focusing on a specific set of organisms and features. Genome browsers give a detailed representation of the genomic sequence and associated features such as annotated genes, RNA-seq reads, chromatin modifications, etc. (6)(7)(8). While such platforms offer a detailed view of a single genome, comparative information is often limited and difficult to interpret in a multispecies context. Platforms focusing on gene families rely on grouping homologous (derived from a common ancestor) genes (9) and within a family detailed phylogenetic reconstructions are possible (10). Less common are tools that look at genes in their genomic context to study cross-species genome evolution and WGDs (11). Finally, comprehensive platforms were created (12)(13)(14)(15)(16) which, in contrast to genome browsers, integrate numerous types of information (e.g. gene families, phylogenetic trees and genomic homology) along with structural and functional annotation, providing a versatile starting point for numerous types of analyses, going from simple sequence retrieval over exploring genomic variation to tracing the effects of large-scale duplications.
In this manuscript, we present version 3.0 of PLAZA (http://bioinformatics.psb.ugent.be/plaza/), an online resource that offers comparative genomics data for 37 plant species (Supplementary Table S1) and allows users to browse the annotated genomes, gene families and phylogenetic trees. Furthermore, functional annotation has been transferred from model to non-model organisms using a novel approach, enabling the identification of specific genes or pathways across organisms. Genome organization can be explored through different visualization tools based on gene collinearity or synteny information. The PLAZA Workbench makes it possible for users to efficiently analyze multiple genes, stored in an experiment, while bulk downloads are available for expert users to perform customized large-scale analyses.

OVERVIEW AND ACCESS
PLAZA 3.0 has been divided into a monocot- and a dicot-centric section containing 16 and 31 species, respectively. This allows the total number of species included in one platform to remain small enough to perform fast searches, load pages quickly and provide responsive visualizations. The two databases share 10 organisms (16 + 31 - 10 = 37 species in total), which either serve as reference species to link between both sections or as outgroups. For each of the included species, the genome sequence and structural annotation have been included along with functional annotation such as Gene Ontology (GO) (17), MapMan (18) and InterPro protein domains (19). While PLAZA can simply act as a browser for such data, the true power of the platform emerges from additional data types generated on top of the original genome information. For instance, homologous genes are grouped together into gene families using BLAST (20) and TribeMCL (21), while subfamilies are identified using BLAST and OrthoMCL (22). For each (sub-)family, multiple sequence alignments are generated and stored that help to unveil conserved protein domains. Pre-computed approximately-maximum-likelihood phylogenetic trees generated using FastTree (23) allow users to explore orthologous and paralogous relations between genes in detail. Based on the phylogenetic trees and (sub-)families, high-quality functional annotations with experimental support from different model organisms (Arabidopsis thaliana, Solanum lycopersicum and Oryza sativa) are transferred to other species lacking functional annotation. Genome evolution can be visualized and studied through remaining collinear regions (regions with conserved gene content and order), which were pre-computed using i-ADHoRe 3.0 (24) and stored in the database.
On the PLAZA portal each data type has its own page, with an intuitive and consistent layout. The top of the page highlights the most general information, with more specific and detailed information further down the page. Numerous hyperlinks are present to allow users to go from one type of data to another (e.g. from a gene to its family or orthologs, or from a gene family to a phylogenetic tree). Every page also has its own toolbox, which provides links to additional analyses and detailed visualizations (Figure 1).
Expert users can download all sequences, gene families, orthology information and functional annotation data in bulk from an FTP server, while the PLAZA Workbench enables the efficient retrieval of sequence or functional information for a set of genes. The latter also allows performing additional analyses, such as GO enrichment, which can be used to unravel overrepresented GO categories in a set of genes for any plant species present in the system.

New species
Currently more than 55 sequenced plant genomes have been released (25), but their quality differs considerably between model organisms that have nearly completed sequences and other species which, so far, were sequenced at low coverage only. The latter are often presented as a collection of small contigs that cannot be assembled into larger scaffolds or ordered into linkage groups or chromosomes. While these low-coverage genomes can be of considerable value in specific studies, the fragmented nature of their sequences results in many partial gene models lacking start or stop codons. A recurring issue with such models is that they hinder the generation of multiple sequence alignments and thus can impair the construction of reliable phylogenetic trees. To avoid such complications, assembly statistics accompanying manuscripts from publicly available plant genomes were carefully examined. All genomes that did not meet our quality requirements, based on the N50 number (>500 kb), were excluded. Additionally, in some cases where genomes from closely related species, for instance of the same genus, were available, only the sequence with the highest quality was retained. An overview of the number of genes and species included in the different PLAZA versions is available in Figure 2 and Supplementary Table S1. Note that previous PLAZA releases (14,15,26) will remain available to the scientific community.
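The N50-based filter described above can be illustrated with a minimal sketch. It assumes the common definition of N50 (the largest length L such that sequences of length L or longer cover at least half of the assembly); the function names and the example lengths are hypothetical and are not part of the PLAZA build pipeline.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold or contig lengths (in bp)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches half of the
        # assembly size is the N50.
        if running >= half_total:
            return length


MIN_N50_BP = 500_000  # threshold quoted in the text (>500 kb)


def passes_quality_filter(lengths, min_n50=MIN_N50_BP):
    """True if the assembly's N50 exceeds the minimum (assumed strict '>')."""
    return n50(lengths) > min_n50
```

For example, an assembly made of two 600 kb scaffolds and one 100 kb scaffold has an N50 of 600 kb and would pass, whereas twenty 100 kb contigs would not.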
three new Brassicaceae species (Capsella rubella (38), Brassica rapa (39) and Thellungiella parvula (40)) are now included. Having a large sample of closely related species allows evolutionary biologists to study genomic adaptations to specific niches and how evolution has altered genes and gene families in a recent evolutionary timeframe. An additional distant outgroup species, Amborella trichopoda (41), was also included. Amborella is the last remaining member of the Amborellaceae, a sister clade to all other angiosperms, offering unique opportunities to study the diversification of flowering plants and their specific adaptations at the genomic level.  (43) and Hordeum vulgare (barley) (44). All cereals from previous versions remained, though the Oryza sativa ssp. japonica (rice) genome was updated to release 7 of MSU Rice Gene Models (45).

Improved functional annotation
In the previous versions of PLAZA, GO was used to assign Cellular Components, Molecular Functions and Biological Processes to genes, and InterPro domains (19) were included to indicate the functional regions of encoded proteins. Both types remain in PLAZA 3.0, and MapMan (18) has been included as an additional ontology to describe gene functions. MapMan was initially designed for Arabidopsis thaliana, but has recently been applied to other plants as well. Transcription factor families are also easier to identify in PLAZA 3.0 as PlnTFDB (46) and PlantTFDB (47) classifications have now been integrated.
As in earlier versions, experimentally confirmed GO annotation was transferred using a stringent, tree-based, orthology projection method (14). For each gene, all orthologs (genes derived from a common ancestor through speciation, considered to have the same function in different organisms) were identified based on a phylogenetic tree following a strict set of rules: (i) bootstrap values of the nodes considered needed to be 0.7 or higher and (ii) to avoid including co-orthologs from distantly related species, tree-based orthologs were limited to either dicots or monocots.
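The two rules above can be sketched on a toy tree. The sketch below is an assumption-laden illustration, not the PLAZA implementation: it calls a node a speciation when the species on the query side do not overlap with those on the other side (a species-overlap heuristic swapped in for illustration), it applies the 0.7 support cut-off from the text, and it models the dicot/monocot restriction with a hypothetical `allowed_species` argument. The `Node` class and all gene and species labels are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    support: float = 1.0                  # bootstrap-like support (0..1)
    children: List["Node"] = field(default_factory=list)
    gene: Optional[str] = None            # set on leaves only
    species: Optional[str] = None         # set on leaves only


def leaves(node: Node) -> List[Node]:
    """Return all leaf nodes below (or equal to) the given node."""
    if node.gene is not None:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def orthologs(root: Node, query: str, min_support: float = 0.7,
              allowed_species: Optional[Set[str]] = None) -> Set[str]:
    """Genes separated from the query by a well-supported speciation node."""
    result: Set[str] = set()

    def visit(node: Node) -> bool:
        # Returns True when the query gene lies inside this subtree.
        if node.gene is not None:
            return node.gene == query
        hits = [visit(child) for child in node.children]
        if not any(hits):
            return False
        query_side = node.children[hits.index(True)]
        own = {leaf.species for leaf in leaves(query_side)}
        others = [leaf for child in node.children
                  if child is not query_side for leaf in leaves(child)]
        # Species-overlap heuristic: shared species imply a duplication node.
        is_speciation = own.isdisjoint({leaf.species for leaf in others})
        if is_speciation and node.support >= min_support:
            for leaf in others:
                if allowed_species is None or leaf.species in allowed_species:
                    result.add(leaf.gene)
        return True

    visit(root)
    return result
```

Passing only dicot (or only monocot) species labels as `allowed_species` mimics rule (ii); lowering a node's support below 0.7 removes the orthologs that depend on that node, as in rule (i).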
To facilitate the projection of high-quality functional annotation data over greater phylogenetic distances, two new methods were implemented (see Supplementary Method 1 for details). First, the integrative orthology approach (iOrtho), where four different methods to detect orthologs (using a BLAST-, clustering-, tree- and collinearity-based approach) are combined into a single prediction (15), is now used to transfer functional annotation from species with experimental evidence (Arabidopsis, tomato and rice) to all other species. While this allows transfer over greater evolutionary distances, the use of multiple methods ensures that GO terms are only assigned to genes that are confirmed by multiple orthology inference approaches, avoiding potential overprediction. Second, we included a method based on homologous gene families, where enriched functional terms (i.e. GO terms that occur in a family significantly more often than in the whole database and cover at least 50% of the family members having primary GO annotations) are assigned to all other family members lacking this term. Figure 3 illustrates the fraction of genes that have a GO Biological Process label provided by the GO consortium, UniProtKB/Swiss-Prot or found using InterProScan (primary source, blue), found using PLAZA 3.0's GO projection (green) or that lack an annotation (gray). While the amount of primary annotation is similar to Gramene (release 41) (16) and PLAZA 2.5, the new GO projection is able to assign a Biological Process to considerably more genes. Especially for Zea mays (corn), there is a large improvement as the current method allows information to be transferred over large phylogenetic distances (i.e. from dicots to monocots).
On a gene page, the different sources of functional annotation are displayed and in cases where the annotation was transferred from another gene, the origin and projection method (homology-based, iOrtho or tree-based orthology) used are shown (Supplementary Figure S1). Users have the option to only consider primary labels (from the GO consortium, UniProtKB/Swiss-Prot and InterProScan), to additionally include orthology-based projected terms, or to take all GO annotations into account (primary, orthology-based and homology-based projection).
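The family-based rule can be made concrete with a short sketch. It assumes a one-sided hypergeometric test for "significantly more often than in the whole database", an uncorrected significance cut-off `alpha` of 0.05, and simple dictionary inputs; these choices, and every name in the code, are illustrative assumptions rather than the procedure documented in Supplementary Method 1. Only the 50% coverage rule is taken from the text.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def project_family_terms(family, annotations, background_size, term_counts,
                         min_coverage=0.5, alpha=0.05):
    """Assign enriched, well-covered GO terms to family members lacking them.

    family          -- list of gene identifiers in one gene family
    annotations     -- dict: gene -> set of primary GO terms
    background_size -- number of annotated genes in the whole database
    term_counts     -- dict: GO term -> number of annotated genes with it
    Returns a dict: gene -> set of newly projected terms.
    """
    annotated = [g for g in family if annotations.get(g)]
    if not annotated:
        return {}
    # Count how many annotated family members carry each term.
    in_family = {}
    for gene in annotated:
        for term in annotations[gene]:
            in_family[term] = in_family.get(term, 0) + 1
    selected = set()
    for term, k in in_family.items():
        coverage = k / len(annotated)  # fraction of annotated members
        p_value = hypergeom_sf(k, background_size, term_counts[term],
                               len(annotated))
        if coverage >= min_coverage and p_value < alpha:
            selected.add(term)
    projected = {}
    for gene in family:
        new_terms = selected - annotations.get(gene, set())
        if new_terms:
            projected[gene] = new_terms
    return projected
```

In a family of four genes where three annotated members all carry a rare term, the unannotated fourth member receives that term, while a common term carried by only one of the three is not projected.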
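The three evidence levels can be thought of as nested filters over annotation records. The sketch below assumes each record is a (GO term, source) pair with source labels "primary", "orthology" and "homology"; these labels and the function name are hypothetical and do not reflect the PLAZA database schema.

```python
PRIMARY = "primary"        # GO consortium, UniProtKB/Swiss-Prot, InterProScan
ORTHOLOGY = "orthology"    # tree-based or iOrtho projection
HOMOLOGY = "homology"      # gene-family-based projection

# Each level includes all sources of the previous one.
LEVELS = {
    "primary": {PRIMARY},
    "primary+orthology": {PRIMARY, ORTHOLOGY},
    "all": {PRIMARY, ORTHOLOGY, HOMOLOGY},
}


def select_terms(records, level="all"):
    """Return the sorted GO terms whose source is allowed at this level."""
    allowed = LEVELS[level]
    return sorted({term for term, source in records if source in allowed})
```

A gene with one primary, one orthology-projected and one homology-projected term would thus show one, two or three terms depending on the selected level.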
For the best annotated species (e.g. Arabidopsis thaliana), well-curated genes come with short descriptions provided by expert annotators. For species with less extensive annotation, such easily interpretable descriptions are rare or lacking completely. Therefore, AnnoMine was used to generate text descriptions (Supplementary Method 2). This tool performed, for all genes, sequence similarity searches against the UniProtKB/Swiss-Prot (48) database, which contains curated high-quality gene descriptions. Gene descriptions from BLAST hits, weighted by the BLASTP E-value, were processed by an integrative text-mining algorithm that, based on statistically overrepresented co-occurrences of words, assigned a description to the gene. The fraction of genes that have a description in five selected species is shown in Figure 3. For Arabidopsis thaliana extensive annotation efforts assigned descriptions to the majority of the genes and while these efforts have been transferred to closely related Brassicaceae species, for more distant species proper descriptions are often lacking. In contrast with other platforms and earlier PLAZA releases, now a large fraction of genes have an AnnoMine gene description (63% of the dicot and monocot protein-coding genes), including many genes from non-model plants. Although this text-mining procedure cannot replace expert annotators, it provides a valuable functional indication in the absence of a curated description.
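The idea of weighting hit descriptions by E-value can be sketched as follows. This is a deliberately simplified stand-in and not the AnnoMine algorithm: it weights each hit by -log10(E-value), sums these weights per word, and returns the hit description whose words have the highest average weight. The stop-word list, the weight cap and all function names are assumptions made for the example.

```python
import math
import re
from collections import defaultdict

# Hypothetical list of uninformative words to ignore.
STOPWORDS = {"protein", "putative", "probable", "like", "of", "the", "and"}


def hit_weight(evalue, cap=300.0):
    """Weight of a BLAST hit: -log10(E-value), clipped to [0, cap]."""
    if evalue <= 0:
        return cap
    return min(cap, max(0.0, -math.log10(evalue)))


def tokens(description):
    """Lower-cased informative words of a description."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in STOPWORDS}


def consensus_description(hits):
    """Pick the best-supported description from (description, evalue) hits."""
    word_score = defaultdict(float)
    for description, evalue in hits:
        weight = hit_weight(evalue)
        for word in tokens(description):
            word_score[word] += weight
    best, best_score = None, 0.0
    for description, _ in hits:
        words = tokens(description)
        if not words:
            continue
        # Average word support, so long descriptions are not favored.
        score = sum(word_score[w] for w in words) / len(words)
        if score > best_score:
            best, best_score = description, score
    return best
```

Words shared by several strong hits dominate the score, so a description supported by multiple significant hits is preferred over an isolated weak hit such as "Uncharacterized protein".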

Genome evolution
Collinearity, defined as conservation of gene content and order, has been used in PLAZA to determine homologous regions between genomes and duplicated regions within a genome. The latter are usually remnants of large-scale duplication events and various studies have revealed that traces of WGDs are present in all plant genomes sequenced to date (49). However, as gene loss and rearrangements accumulate after such an event, the detection of WGDs using collinearity becomes increasingly difficult as their ages increase (50). Therefore, in some cases, collinearity is a suboptimal measure to detect remnants of ancient duplications. To overcome this limitation, PLAZA 3.0 now also includes information on syntenic duplicates, which are paralogs from regions with conserved gene content regardless of the order (51). As such, an additional 125 266 and 55 277 genes were found to be putatively derived from WGDs that were not found by the default collinearity searches in the dicots and monocots versions, respectively (Supplementary Method 3).
Nucleic Acids Research, 2015, Vol. 43, Database issue D979
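The difference between collinearity (conserved content and order) and synteny (conserved content only) can be illustrated with a toy classifier. The sketch is not i-ADHoRe: it represents each region as a list of gene family identifiers in chromosomal order, measures conserved order with a longest common subsequence (in either orientation) and conserved content with the number of shared families. The `min_anchors` threshold of three and all names are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def classify_regions(region_a, region_b, min_anchors=3):
    """Label a pair of regions as 'collinear', 'syntenic' or 'unrelated'."""
    # Conserved content: gene families present in both regions.
    shared = len(set(region_a) & set(region_b))
    # Conserved order: common subsequence, allowing an inverted segment.
    ordered = max(lcs_length(region_a, region_b),
                  lcs_length(region_a, region_b[::-1]))
    if ordered >= min_anchors:
        return "collinear"
    if shared >= min_anchors:
        return "syntenic"
    return "unrelated"
```

Two regions that share four families in scrambled order are reported as syntenic but not collinear, which is the kind of ancient duplicate that an order-based search alone would miss.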

Technical improvements
While not directly visible to users, considerable changes have been made to build the PLAZA 3.0 platform and store the different data types. Structural changes to the database and the way data are stored now allow faster retrieval, also for complex queries comprising multiple data types. The result is that, despite the increase in data, many pages on the website load faster. For visualizations that summarize large amounts of data (like the Skyline plot, to browse for a locus or region collinearity in multiple species), these improvements resulted in a 2- to 3-fold speed-up.
Furthermore, third-party tools required to build PLAZA have been updated to their latest version or replaced by more modern alternatives. BLAST (20), used to find similarities between proteins prior to gene family delineation, has been upgraded from version 2.2.17 to 2.2.27+, OrthoMCL 1.4 (22) was changed to version 2.0 and InterProScan (52) version 4.6 was replaced with 5.44. In previous builds, two multiple sequence alignment algorithms were used, namely MUSCLE for the alignments shown on the website and ClustalW for calculations of KS (the number of synonymous substitutions per synonymous site) values. Now MUSCLE, which offers an excellent compromise between speed and accuracy, is used consistently. To further reduce the amount of computing power needed to build PLAZA 3.0, FastTree 2.1.7 (23) was selected to replace PhyML (53) for the construction of phylogenetic trees.
More noticeable for users is that all graphs, which previously were rendered using Flash, are now generated as SVG output by JavaScript. This has several advantages, such as (i) devices where Flash is not available are now able to display these graphs and (ii) SVGs can easily be downloaded and stored for future reference or used as high-resolution images for publications. This, in combination with a new fluid grid layout (where elements can move position if the necessary monitor width is not available, avoiding the need for horizontal scroll bars), provides excellent support for mobile devices, which are being used by a growing number of visitors. Finally, GenomeView (8), which used to be a Java applet started within the browser, has been updated and is now a web-started Java application that is considerably faster than previous versions.
PLAZA 3.0 offers an important update toward new publicly available plant genomes while technical improvements result in a web-based portal that loads faster and remains responsive despite the increase in data. A new layout provides a richer, more intuitive user experience while supporting additional devices. Furthermore, through the integration of additional functional classification systems as well as the implementation of new transfer methods, PLAZA 3.0 now offers comprehensive functional annotation for all species included.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.