GreenPhylDB v5: a comparative pangenomic database for plant genomes

Abstract Comparative genomics is the analysis of genomic relationships among different species and serves as a significant base for evolutionary and functional genomic studies. GreenPhylDB (https://www.greenphyl.org) is a database designed to facilitate the exploration of gene families and homologous relationships among plant genomes, including staple crops critically important for global food security. GreenPhylDB is available since 2007, after the release of the Arabidopsis thaliana and Oryza sativa genomes and has undergone multiple releases. With the number of plant genomes currently available, it becomes challenging to select a single reference for comparative genomics studies but there is still a lack of databases taking advantage several genomes by species for orthology detection. GreenPhylDBv5 introduces the concept of comparative pangenomics by harnessing multiple genome sequences by species. We created 19 pangenes and processed them with other species still relying on one genome. In total, 46 plant species were considered to build gene families and predict their homologous relationships through phylogenetic-based analyses. In addition, since the previous publication, we rejuvenated the website and included a new set of original tools including protein-domain combination, tree topologies searches and a section for users to store their own results in order to support community curation efforts.


INTRODUCTION
Plant comparative genomics resources usually compare reference genomes to compute homology sequences and en-able functional annotation transfer (1,2,3,4,5). However, with the growing number of whole genome sequences available within the same species, it has been shown than a single reference is not enough to capture its total genetic diversity (6). A pangenome, usually defined as the full gene repertoire within a species, can be partitioned into core genes that are shared by all individuals and dispensable genes that are present only in a subset of individuals (6,7,8). Characterizing them can have a great potential in plants for crop improvement (7,9,10) as candidate genes can potentially be missing in the genotype used to set up a reference genome. Pangenomic studies have recently been conducted in several crops, revealing significant differences with presence absence variations (PAVs) and/or copy number variations across genotypes (11,12,13,14,15,16,17,18). It became obvious that distinguishing core from dispensable genes is important as dispensable genes can be associated with useful trait diversity (10). Finally, PAVs can have an influence on orthology detection as specific genotype gene losses can lead to false negative results in interspecific comparisons or to pseudo-orthology (19).
Until now, comparative genomics databases have not fully taken advantage of these new datasets. Here, we present an updated version of the GreenPhylDB, a database that features multiple genomes for 19 species (e.g. rice, maize, banana, grape and cacao) as well as 27 other species with single reference genomes, for a total of 46 genomes. Publicly available genomes were processed to generate representative pangenes (i.e. a set of representative or consensus sequences) for species that were used in multi-species sequence clustering. Resulting gene families were functionally annotated and analysed with orthology detection methods.

Sequence retrieval and quality checks
We retrieved 132 publicly available datasets (coding DNA sequence and protein-coding genes) for 46 (Supplementary  Table S1) and assessed their gene annotation predictions using BUSCO Plants v3.0.2 (embryophita odb10) (20). We checked that the number of CDS was consistent with the number of proteins and that they shared the same locus tag name. When protein-coding genes were missing, we generated the sequences from GFF files originating from data providers. Finally, alternate splices were filtered, and the longest sequence was conserved.

Pangene construction
Out of the full dataset, 105 genomes were considered to produce 19 pangenes computed with the get homologues-est software v20092018 (21) ( Table 1), based on NCBI Blast-v2.2 and using the following program options (-M -F -t 0 -m cluster) as previously applied to the Brachypodium distachyon dataset (13). We processed each cluster the following way : i. For single-gene copy clusters (a single sequence per genome), protein sequences were aligned using MAFFT v7.313 (22) (parameters adjusted according to the number of sequences) and an automatic procedure to generate a consensus sequence was applied ( Figure 1A). For each position of the alignment, we kept the most frequent amino acid. In case of a tie, the amino acid of the genome with the highest BUSCO scores ('complete' then 'fragmented' and finally lowest 'missing' scores) was selected. Finally, if more gaps than amino acids were present, this position was removed from the sequence. ii. For multi-copy clusters (multiple sequences per genome), we applied the same procedure as for singlegene copy clusters but added a preliminary step to select a representative sequence by cluster. Multiple sequence alignments were used to generate a distance matrix using distmat (Jukes-Cantor correction method) from EMBOSS v6.6 ( Figure 1B). The matrix was required to define the distance for each sequence of all other genomes and the sequences with the smallest sum of distance was selected as representative of the considered genome ( Figure 1C). Then, the consensus step was applied. It is worth mentioning that get homologs-est generated sequence clusters of not too distantly related sequences. Large gene families can include several multi-copy clusters being grouped together at the sequence clustering step. iii. For genotype-specific clusters (paralogs in a single genome), we generated a distance matrix between all sequences and the sequence with the lowest average distance (min(d/sum(d))) of all sequences was putatively considered as the most representative sequence. Those sequences were added to the pangene. iv. Finally, singletons (cluster of one sequence) were searched for similarity using DIAMOND (23) with a default e-value on the protein-coding genes of all other species genomes to predict their putative prediction accuracy. Sequences with a minimum of one hit in at least two species were added to the pangene; otherwise sequences were excluded.
As a unique identifier was required for each pangene, we defined a nomenclature with a prefix composed of the [5-letter UniProt taxonomy database code] pan, followed by p (for protein) and an auto-increment of 6-digits (e.g. musac pan p029014 for Musa acuminata).

Sequence clustering and functional annotation
Pangenes and protein-coding genes of reference genomes (without pangenes) were searched all against all using DI-AMOND. We then performed a clustering using Tribe-MCL (24) (M = 1.2, 2, 3 and 5), defining 4 levels of stringency (from 1 to 4) to take into account potential subclassification and we obtained 9419, 18 805, 23 409 and 29 345 clusters, respectively.
We then scanned all sequences for protein domain signatures using InterProscan (25,26) and also crossed linked matches with UniProtKB-SwissProt entries (27). Cluster names resulting from curation from previous GreenPhylDB versions (2,28) were transferred when at least 51% of sequences were found clustered together as before (based on species in common between releases). In addition, for this release, we implemented an automatic method to name clusters based on the name of InterPro domains (family type only) that were found specific to clusters. In other words, when detected in at least 51% of the sequences composing an unannotated cluster, the name of the InterPro signature was assigned to it. In total, GreenPhylDB comprises 3538 clusters functionally characterized across the four levels.

Homology inference
The previous phylogenetic-based methodology that we applied in the previous version has been conserved but uses a larger set of genomes to update our automated pipeline. The pipeline uses MAFFT for the multiple alignment step. FastTree 2 (v2.1.11) (29) was preferred over PhyML (30) due to the size of the clusters. Gene rooting and orthologous scoring was computed with Rap-Green (31) using the viridiplantae species tree extracted from NCBI taxonomy and converted into PhyloXML (2). We successfully produced gene trees at level 1 for more than 99.8% of the clusters (n = 9413) which enabled us to predict ∼17.8 million of orthologs and ∼1.8 million of in-paralogs (or ultraparalogs) relationships. The pipeline was complemented by a Reciprocal Best Hits (RBH) method--computed between all pairs of genomes--that resulted in more than ∼12.1 million orthologous relationships.

USING GREENPHYL
With this updated version, the website has received a facelift. It now takes advantage of the bootstrap and D3.js frameworks to improve the user experience and to be more responsive. Alternatively, it can also be accessed programmatically using Resource Description Framework (RDF) as implemented in AgroLD, a knowledge-based system relying on semantic web technologies (32).

Gene family pages
All cluster (or gene family) pages present the same type of information divided into several tabs : 1. Gene family composition: a bar chart allows users to visualize at a glance the composition of the gene family by species (Figure 2A). Species are ordered taxonomically to easily detect possible variations between phyla. Each bar is clickable and produces a ta-ble with the list of sequences and associated crossreferences (i.e. InterPro, UniProt). Sequences can be exported in multiple formats and/or stored in a user list. 2. Gene family structure: sequences are clustered at four levels of clustering, from less stringent to more stringent, in most cases narrowing the number of sequences ( Figure 2B). 3. Protein domains: here, InterProscan was used to assess the domain conservation consistency and the specificity of the sequence clusters ( Figure 2C). For each cluster, we performed statistical analyses to determine whether InterPro signatures were specific and therefore not found in any other sequences of clusters of the same level. 4. Phylogenomic analyses: this section includes multiple sequence alignments that can be downloaded and visualised using MSAviewer (33) and gene trees can be explored with InTreeGreat (https://www.southgreen.fr/ content/intreegreat-tool) and PhyD3 (34) ( Figure 2D). Some gene trees can be very large, and the interface proposes an option to prune automatically the tree based on user choice for a range of species. 5. Homologous predictions: the interface enables users to display and refine all the homologies detected by the phylogenetic-based approach and Reciprocal Best Hits (RBH). It is possible to filter and select only a subset of species of interest.

Pangene pages
The new page type for pangene sequences is a central and unique concept in this version as they were used for the clustering and homology predictions instead of all individual sequences that compose it. When browsing these pages, users can quickly see which genes are present or missing by looking at the status: core or dispensable compartments.
Then, information related to the sequence composition is reported. Users can access information about pangene classification, the consensus sequence (except for singletons) with the multiple sequence alignments used to create it as well as related homology predictions ( Figure 3). In the case of multi-copy clusters, it is possible to see which sequence was selected as representative (.rep) or participant (.p) and also why they were selected by browsing the distance matrix.

New tools
The database can still be searched via keyword searches or by entering a query sequence for similarity search using DI-AMOND (which replaces BLAST for faster processing), enhanced by new tools to further explore those datasets.

Quick search.
A new interface has been designed to retrieve in a comprehensive and concise way all the information associated with gene family names, sequence annotation and annotations from InterPro and UniProt mappings.

IPR2genomes--InterPro domain search.
It is now possible to search sequences and associated clusters based on a combination of InterPro domain signatures. This can be partic- ularly helpful when searching for transcription factors for which sequences must contain some domains but not others (35) as the Markov Cluster Algorithm (MCL) may fail grouping them accurately. The interface allows the use of various operators (e.g. AND, OR, NOT, ONLY) to filter a set of sequences for all genomes. Results can be compared with the MCL automatic clustering to check consistencies or differences.
TreePattern --Tree topology search. A tree search can be done by filtering on gene tree topologies (31). Users can draw the topology with species or taxonomic groups as leaves or nodes of the tree and apply constraints such presence or absence of duplications. Resulting trees can be accessed individually (or exported in bulk results as CSV file) and defined patterns are highlighted. This feature is useful for identifying gene families with an expected evolutionary scenario due to gene duplications.

Manual curation and sharing of gene families
While automatic clustering is a relevant and efficient starting point, sometimes limitations (e.g. missing sequences, er-rors in gene annotations) are present and prevent access to ready-to-use datasets, justifying a deeper characterization that will eventually lead to a refinement of the automatic clustering. As a result, knowledge generated on individual gene families is often available only in publications and their supplementary information as PDFs. To encourage knowledge capture, we developed a section for advanced users to create and share their own gene families. Two methods are possible: users can either start from scratch and upload their data or use existing clusters and take advantage of multiples operations implemented in the 'MyList' features: such feature was indeed developed to facilitate intersecting, combining clusters. This new tool can be valuable during the review process by providing a unique identifier to referees--and eventually to users--to explore the structure and composition of the submitted gene family.

USE CASES
In this section, we describe three possible uses that are en- i. You want to analyse a gene family with focus on a specific species sampling. a. search the gene family by keyword(s), locus gene ids or sequences using dedicated search families (toolbox menu). b. browse the family structure and explore sub clusters at level 2, 3 or 4. c. browse the family composition of the cluster (or a specific sub cluster) to list all the sequences and pangenes selecting one or several bars in the diagram. Export the selected sequences of interest using proposed file formats. d. (optional) if the gene family sequences are characterized by several protein domains, check possible additional sequences in the database using IPR2genomes (toolbox menu). For individual protein domains, its specificity is indicated in each gene family page (if specific, no need to search in other gene families). e. (optional) In case of additional analyses (e.g. addition of sequences from a new sequences genomes), you can create a 'custom family' by uploading the sequences on the website and share the link in a manuscript, with collaborators or reviewers during the review process for a user-friendly exploration of the dataset. ii. You are interested in finding genes that are linked to a specific evolutionary scenario (e.g. duplicated genes in one species but not in another) a. retrieve full list of gene trees and related gene families using TreePattern (toolbox menu). b. browse examples gene families to see patterns (highlighted with dashed lines in the tree). c. click on the sequence name to access the family. d. browse the family composition of the cluster (or a specific sub cluster) to list all the sequences and pangenes selecting one or several bars in the diagram. Export the selected sequences of interest using proposed file formats iii. You have a candidate gene in rice, maize or banana (or any of the 19 pangenes) and want to retrieve the related sequences in other genomes of the same species and then find orthologs in other species (e.g. Arabidopsis). a. search by sequence or by locus ID to identify the pangene ID. b. retrieve the pangene composition to get all members and check the status (core or dispensable). c. (optional) check the multiple gene alignment to see level of divergence. d. go to the gene family and explore (or download) the gene tree. e. retrieve predicted orthologs (by phylogeny and/or Reciprocal Best Hits). Alternatively, use the homologous sequence search directly (toolbox menu).

CONCLUSION
This new version of GreenPhylDB provides a unique way to scale up plant comparative genomics studies across multiple plants species by leveraging pangenomic datasets. This release paves the way to the transition from reference-based genomics to pangenome-based systems and tools. In this context, the website includes new powerful search interfaces to explore the content of the gene family collection. Advanced users can also deposit the results of their expert gene family curation for further use and reference. GreenPhylDB is an important resource to understand the genetic basis of genome diversity among plant species and has the potential to accelerate gene discovery to support crop improvement.

DATA AVAILABILITY
All datasets produced by our automatic analyses are accessible via GreenPhylDB user interfaces or can be downloaded at https://www.greenphyl.org/cgi-bin/ downloads.cgi.