A tutorial of diverse genome analysis tools found in the CoGe web-platform using Plasmodium spp. as a model

Abstract Integrated platforms for storage, management, analysis and sharing of large quantities of omics data have become fundamental to comparative genomics. CoGe (https://genomevolution.org/coge/) is an online platform designed to manage and study genomic data, enabling both data- and hypothesis-driven comparative genomics. CoGe’s tools and resources can be used to organize and analyse both publicly available and private genomic data from any species. Here, we demonstrate the capabilities of CoGe through three example workflows using 17 Plasmodium genomes as a model. Plasmodium genomes present unique challenges for comparative genomics due to their rapidly evolving and highly variable genomic AT/GC content. These example workflows are intended to serve as templates to help guide researchers who would like to use CoGe to examine diverse aspects of genome evolution. In the first workflow, trends in genome composition and amino acid usage are explored. In the second, changes in genome structure and the distribution of synonymous (Ks) and non-synonymous (Kn) substitution values are evaluated across species with different levels of evolutionary relatedness. In the third workflow, microsyntenic analyses of multigene families’ genomic organization are conducted using two Plasmodium-specific gene families—serine repeat antigen, and cytoadherence-linked asexual gene—as models. In general, these example workflows show how to achieve quick, reproducible and shareable results using the CoGe platform. We were able to replicate previously published results, as well as leverage CoGe’s tools and resources to gain additional insight into various aspects of Plasmodium genome evolution. Our results highlight the usefulness of the CoGe platform, particularly in understanding complex features of genome evolution. Database URL: https://genomevolution.org/coge/


Introduction
During the last decade, 'omics' data generation and collection has grown exponentially, and these data contain valuable information about most groups in the tree of life (1)(2)(3). 'Omics' data are generated in laboratories around the world, and generally require multiple tools and databases to analyse and host increasingly large amounts of information. The difficulty of navigating this plethora of publicly available data can hinder collaborative efforts. Hence, platforms capable of leveraging large quantities of omics data, tools for its exploration and analysis and resources to facilitate reproducible and collaborative research are essential in comparative genomics. CoGe (https://genomevolution.org/coge/) is one of several platforms developed to fill this niche. The CoGe platform combines a variety of interconnected data management, analysis and visualization tools to facilitate exploratory and hypothesis-driven research of complex omics data. Though applicable to any biological group, here we showcase the types of comparative analyses that can be performed with CoGe's tools and services by using Plasmodium genomes as a model.
Advances in high-throughput technologies and a desire to better understand parasites of the genus Plasmodium, the causal agents of malaria in humans, led to a dramatic increase in publicly available information for the genus (4). Plasmodium genomes are characterized by a combination of gene loss and the acquisition of species- or lineage-specific genes, many of which mediate host-parasite interactions (5). All Plasmodium species have a complex life cycle involving a vertebrate host and a mosquito vector. The genomes of Plasmodium parasites are small (between 17 and 28 Mb) in comparison to those of their vertebrate (1 Gb for birds; 2-3 Gb for mammals) and mosquito (230-284 Mb) hosts. Plasmodium parasites also have shared genomic characteristics (e.g. chromosome number, an apicoplast and a mitochondrion) (6). Moreover, in comparison to other groups (e.g. plant genomes), their structural organization and gene content are largely conserved across species. Nonetheless, despite these conserved features, Plasmodium species exhibit significant genomic sequence evolution and different Plasmodium clades have highly dissimilar DNA GC content. Overall, these characteristics make Plasmodium parasites a unique group for comparative genomic studies.
Arguably the two most important repositories for malaria research are NCBI/Genbank (7) and PlasmoDB (8). However, these platforms are somewhat limited in the ways that they allow users to interact with their data. Here, we have imported all available Plasmodium genomes and annotations into CoGe and made them publicly available. By making these genomes publicly available within the platform, genomic analyses beyond the scope of this tutorial can be developed in situ by interested researchers. All evolutionary and genomic analyses presented here were performed using CoGe's tools and services, with links to regenerate them. Three model workflows are presented to showcase the usefulness of CoGe in different aspects of comparative genomics: (i) an assessment of compositional bias and amino acid usage, (ii) an evaluation of the frequency and location of chromosomal rearrangements through whole genome syntenic analyses, and synonymous and non-synonymous substitution trends between genomes and (iii) an exploration into the microsyntenic genomic structural differences in genus-specific multigene families.

System requirements
CoGe is an open-access analysis platform that only requires a web browser (Chrome or Firefox are recommended) and a connection to the Internet. For full operability, Flash, JavaScript, popups and cookies need to be allowed.

Genome data used in these tutorials
Representative genomes from the four major Plasmodium clades (simian, rodents, Laverania subgenus and birds/reptiles) were obtained from NCBI/Genbank (7), PlasmoDB (8) and GeneDB (9). Reference genome sequences and annotations were imported and made publicly available within the CoGe platform for usage and analysis. All publicly available Plasmodium genomes used in this study were organized into a CoGe Notebook (https://genomevolution.org/coge/NotebookView.pl?lid=2155). Notebooks provide the means to manage collections of genomic, functional genomic (e.g. transcriptomic) and diversity (e.g. SNP) data. Additionally, a summary table with a list of species, CoGe's genome IDs, their respective genome links in CoGe, their associated publication or bioproject and their in-text reference has been provided for all species referenced in the three workflows below (Supplementary File S1). In addition to using already loaded datasets, researchers may also add their own genomes or related data into CoGe. User-loaded data can be kept private, shared with collaborators or made fully public.
Describing CoGe's capabilities with example workflows
In the three workflows below, we have outlined step-by-step instructions for addressing key aspects of genome evolution, as well as a brief discussion of the insights gained from each analysis. Links to regenerate all analyses are provided in Supplementary File S2. Although all of the analyses and data may be used anonymously, researchers who log into CoGe get additional features such as automatic tracking of analyses (with links to regenerate them), the ability to add new data (which can be made public or kept private) and access to restricted data (when permission is granted).
Workflow 1: assessment of genome compositional bias and amino acid usage
Genome nucleotide composition (i.e. GC content) has been shown to significantly affect codon and amino acid usage patterns in eukaryotes (10)(11)(12). Furthermore, GC content variations also affect chromosome length (13), gene conversion rates (14) and protein expression (15). One of the most noticeable characteristics of Plasmodium parasites is their variable range of genomic compositions [e.g. Plasmodium falciparum (18.44%) (16) and Plasmodium vivax (44.87%) (17)]. Thus, the degree to which changes in genome composition affect amino acid usage can be explored in detail by using Plasmodium parasites as models. Three CoGe tools (GenomeList, GenomeInfo and CodeOn) were used to characterize genome composition and its effect on amino acid usage for 17 Plasmodium species. A diagram of the steps followed in this example workflow is included in Figure 1. Genomic attributes for each species are shown in Figure 2, organized by their phylogeny (18
Nonetheless, to our knowledge, no other study has thoroughly compared GC content variation or done so in as many species as the ones included here. We simultaneously assessed inter- and intra-clade variations of GC content in both the entire codon and specifically at the third nucleotide position. In our Plasmodium model, GC content in the entire codon and at the third nucleotide position was strongly GC biased in GC-rich genomes and strongly AT biased in AT-rich genomes (Figure 2). Nonetheless, we identified species where GC content at the third nucleotide position was less GC biased than coding GC content (e.g. Plasmodium malariae and Plasmodium ovale curtisi). These differences were only evident by performing simultaneous multispecies comparisons. Though small, they may suggest unique long-term evolutionary trends of P. malariae and P. ovale curtisi with respect to other primate-infecting Plasmodium species from the simian clade.
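The two quantities compared above, GC content over the whole coding sequence and GC content at the third codon position, can be illustrated with a short sketch. This is not how CoGe computes these values internally; the function names and the toy sequence are hypothetical, and the sketch assumes an in-frame coding sequence containing only A, C, G and T.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G or C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def gc3_fraction(cds: str) -> float:
    """GC fraction at third codon positions of an in-frame CDS."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    third_positions = cds[2:usable:3]  # every third base, starting at index 2
    return gc_fraction(third_positions)


# Toy coding sequence (hypothetical): ATG AAA TTT GGC TAA
toy_cds = "ATGAAATTTGGCTAA"
print(f"coding GC = {gc_fraction(toy_cds):.3f}")
print(f"GC3       = {gc3_fraction(toy_cds):.3f}")
```

Applying both functions to every annotated CDS of a genome and averaging the results would give per-species values comparable in spirit to those summarized in Figure 2.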
CodeOn clearly showed a change in amino acid usage trends across species with different coding GC content (Figure 3). Amino acids at the ends of the GC composition spectrum (those coded by codons that are GC-rich or GC-poor) had the biggest change in usage across species, while amino acids in the middle of the spectrum (~50% GC-rich) showed little to no preference (Figure 3). Despite similarities in amino acid usage, differences in the way these amino acids are coded (codon usage bias) have been reported, even in comparisons between closely related species (Plasmodium vivax vs. Plasmodium knowlesi) (28).
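The idea of placing amino acids along a GC composition spectrum can be sketched as follows. This is a minimal illustration under stated assumptions, not CodeOn's implementation: it uses the standard genetic code, hypothetical function names, and a toy input, and it ranks each amino acid by the mean GC fraction of its codons alongside its observed usage.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def amino_acid_usage(cds_list):
    """Relative usage of each amino acid across in-frame CDSs (stops excluded)."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            aa = CODON_TABLE.get(cds[i:i + 3])  # None for ambiguous codons
            if aa and aa != "*":
                counts[aa] += 1
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def mean_codon_gc():
    """Mean GC fraction of the codons that encode each amino acid."""
    sums, sizes = Counter(), Counter()
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            sums[aa] += sum(base in "GC" for base in codon) / 3
            sizes[aa] += 1
    return {aa: sums[aa] / sizes[aa] for aa in sizes}


usage = amino_acid_usage(["ATGAAAAAATAA"])  # toy CDS: M K K stop
spectrum = mean_codon_gc()
for aa in sorted(usage, key=spectrum.get):
    print(aa, f"codon GC = {spectrum[aa]:.2f}", f"usage = {usage[aa]:.2f}")
```

Under this ranking, lysine (codons AAA and AAG) sits at the AT-rich end and glycine (GGN) at the GC-rich end, which is the kind of ordering against which usage shifts between AT-rich and GC-rich genomes can be read.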

Workflow 2: whole genome comparisons, synonymous (Ks) and non-synonymous (Kn) substitutions
Genome organizational changes have significant implications in coordinated gene expression (29), genome-specific specialization (30) and the discovery of orthologous genes (31). CoGe provides powerful tools for exploring changes in genome structure and sequence evolution across multiple species, and for making inferences on the evolutionary mechanisms and forces behind them.

SynMap
SynMap (32,33) was used to identify large-scale changes in genome organization amongst Plasmodium species (Supplementary File S2). Specifically, whole genome pairwise comparisons were performed using default SynMap parameters across species pairs with different levels of evolutionary relatedness (i.e. sister taxa, closely related species and distantly related species). Briefly, SynMap (i) identifies putative syntenic gene pairs using a sequence comparison algorithm (LAST by default), (ii) identifies and filters tandem duplicates using a program called blast2raw and (iii) uses DAGChainer to find collinear series of homologous genes or sequences and identify syntenic pairs. SynMap uses CodeML (34) to calculate the non-synonymous (Kn) and synonymous (Ks) substitution rates for all syntenic gene pairs identified in each pairwise comparison, which can then be used to draw further evolutionary conclusions such as the age of duplication events and the type of selection acting. Briefly, the CodeML workflow in CoGe is to (i) identify syntenic gene pairs, (ii) extract the DNA coding sequences (CDS), (iii) translate them to protein sequences, (iv) perform a global alignment of the protein sequences using the Needleman-Wunsch algorithm (https://pypi.python.org/pypi/nwalign/) and the BLOSUM62 scoring matrix, (v) back-translate the protein alignment to a codon alignment and (vi) feed the codon alignment into CodeML for Kn and Ks estimation. This workflow is detailed in the documentation for SynMap (https://goo.gl/L2XVZE). A diagram of the steps followed in this example workflow is included in Figure 4.
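Steps (iii) to (v) of the CodeML workflow described above can be sketched in a few lines. This is an illustrative sketch rather than CoGe's code: all function names are hypothetical, and to stay self-contained it replaces the BLOSUM62 matrix and the nwalign package with a simple match/mismatch/gap score. The resulting codon alignment is what would be passed to CodeML in step (vi), which is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(cds):
    """Split an in-frame CDS into codons, ignoring a trailing partial codon."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def translate(cds):
    """Step (iii): translate a CDS (A, C, G, T only) to protein."""
    return "".join(CODON_TABLE[c] for c in codons(cds.upper()))


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Step (iv): global alignment. ASSUMPTION: simple scores, not BLOSUM62."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def back_translate(aligned_protein, cds):
    """Step (v): thread the original codons onto the gapped protein."""
    remaining = iter(codons(cds.upper()))
    return "".join("---" if aa == "-" else next(remaining)
                   for aa in aligned_protein)


def codon_alignment(cds_a, cds_b):
    prot_a, prot_b = needleman_wunsch(translate(cds_a), translate(cds_b))
    return back_translate(prot_a, cds_a), back_translate(prot_b, cds_b)


# Toy gene pair (hypothetical): MKGF vs. MKF
for row in codon_alignment("ATGAAAGGTTTT", "ATGAAATTT"):
    print(row)
```

Aligning at the protein level and then back-translating keeps gaps in multiples of three, so every alignment column still corresponds to a whole codon, which is what codon-based substitution models require.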
Species-specific substitution trends were characterized in closely to distantly related species (Figures 2 and 5) as a means to assess relations between genome organization and evolution at the nucleotide level. Synonymous (Ks) and non-synonymous (Kn) substitution rates were calculated between syntenic gene pairs using CodeML. Although intra-clade variation was observed in both the simian clade and Laverania subgenus, in general, Ks and Kn values varied slightly more amongst parasites of the subgenus Laverania than in their simian clade counterparts (Figure 5a). The distribution of Ks values between P. vivax and P. cynomolgi [~3.25-3.77 Mya (18)] and between P. vivax and P. knowlesi [5.42-6.43 Mya for Southern Asian parasites (18)] suggests that there are no considerable changes in mutation rates between these species at a genome-wide level. In contrast, differences in Ks and Kn values were more prevalent in comparisons between species of the Laverania subgenus, perhaps as a result of slightly older intra-clade divergence times [~5.28-5.93 Mya for P. falciparum/P. reichenowi (18) and ~7-9 Mya for P. falciparum/P. gaboni (19,35)].
Links to regenerate these screen captures are provided within the step-by-step instructions found in the text.

Syntenic path assembly
Although some model species have fully assembled genomes, for most groups in the tree of life only incomplete genome assemblies are available. CoGe's syntenic path assembly (SPA) tool can help overcome some of the challenges posed by incomplete assemblies by ordering and orienting contigs from an incomplete assembly based on synteny to a reference genome (Supplementary File S3). Here, we reoriented and reorganized a complete (P. falciparum) and an incomplete (Plasmodium inui) genome assembly with SPA, using a P. vivax genome as a reference. In addition, SPA can help make whole genome synteny of assembled genomes easier to visualize (Supplementary File S3). A diagram of the steps followed in this example workflow is included in Figure 4.
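The general idea of ordering and orienting contigs by synteny to a reference can be sketched as follows. This is a simplified illustration, not the algorithm implemented in CoGe's SPA: the function name, the input format (one tuple per syntenic anchor) and the placement rule (majority reference chromosome, median reference position, orientation from the first and last anchors) are all assumptions made for the example.

```python
from statistics import median


def syntenic_path(anchors):
    """Order and orient contigs along a reference.

    anchors: iterable of (contig, contig_pos, ref_chrom, ref_pos) tuples,
    one per syntenic gene pair (hypothetical input format).
    Returns a list of (ref_chrom, contig, strand) in reference order.
    """
    hits_by_contig = {}
    for contig, contig_pos, ref_chrom, ref_pos in anchors:
        hits_by_contig.setdefault(contig, []).append((contig_pos, ref_chrom, ref_pos))

    placed = []
    for contig, hits in hits_by_contig.items():
        # Assign the contig to the reference chromosome holding most anchors.
        chroms = [chrom for _, chrom, _ in hits]
        best = max(sorted(set(chroms)), key=chroms.count)
        # Anchors on that chromosome, sorted by position along the contig.
        points = sorted((cpos, rpos) for cpos, chrom, rpos in hits if chrom == best)
        # If reference positions decrease along the contig, flip the contig.
        strand = "-" if points[0][1] > points[-1][1] else "+"
        placed.append((best, median(rpos for _, rpos in points), contig, strand))

    placed.sort()  # by reference chromosome, then median reference position
    return [(chrom, contig, strand) for chrom, _, contig, strand in placed]


# Toy anchors (hypothetical): c2 maps in reverse orientation after c1 on chr1.
toy = [("c2", 10, "chr1", 900), ("c2", 50, "chr1", 800),
       ("c1", 5, "chr1", 100), ("c1", 40, "chr1", 200),
       ("c3", 7, "chr2", 50)]
for row in syntenic_path(toy):
    print(row)
```

Because every contig is forced into the reference order, a procedure of this kind cannot by itself distinguish a true rearrangement from an arbitrary contig orientation in the draft assembly.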

SPA results
When comparing broad-scale genome organization between two complete assemblies (P. vivax vs. P. falciparum), reorientation with SPA aids in the interpretation of putative structural changes (e.g. identifying genome inversions). Alternatively, the organization of non-assembled contigs (P. vivax vs. P. inui) using SPA can be useful in identifying evolutionarily complex regions (e.g. highly repetitive regions). It is, however, important to note that SPA can mask genuine structural changes, because it imposes the order of the reference genome.

GEvo
Detailed microsynteny analyses of the regions identified by whole genome syntenic comparisons can aid in the identification of genome-specific characteristics or in finding discrepancies between assemblies. CoGe's GEvo tool can be used to analyse and visualize local genomic organization and genomic features for microsynteny (differences in local genome organization are inferred from the collinear arrangement of homologous genes). This tool can be accessed via SynMap, by zooming in and selecting a gene pair of interest, or by searching specific Gene IDs in GEvo. Here, the P. vivax (Salvador-1 and P01) strains were compared with P. cynomolgi using SynMap, and identified breakpoints were further evaluated using GEvo. A diagram of the steps followed in this example workflow is included in Figure 6.

GEvo results
Synteny between the P. vivax (Salvador-1) genome and the P. cynomolgi genome was maintained with the exception of two previously reported (22) inversion events on Chromosomes 3 (~20 000 bp) and 6 (~50 000 bp). SynMap comparisons of P. vivax (P01) to P. cynomolgi revealed that P. vivax (P01) (https://genomevolution.org/r/lquj) lacked these inversion events (https://genomevolution.org/r/lj12) (Figure 7a). A microsynteny assessment of the breakpoint regions using GEvo showed syntenic regions of inverted genomic order on both Chromosome 3 (https://genomevolution.org/r/pho0) (Figure 7b) and Chromosome 6 (https://genomevolution.org/r/phqb) in the P. vivax (Salvador-1) genome. Nonetheless, in both cases proximal regions of low sequence quality were observed only for P. vivax Salvador-1 (Figure 7b). Such regions are often filled with 'N' in the genomic assembly and are colored orange in GEvo. Given the improvements in sequencing and assembly technologies used in the P01 strain (36) with respect to those used on the Salvador-1 strain (17), it is likely that these regions represent assembly errors in the Salvador-1 genome.
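Outside of GEvo, stretches of 'N' near a suspected breakpoint can be located with a short script. The sketch below is only an illustration: the function names, the toy sequence and the coordinates are hypothetical, and coordinates are 0-based and half-open.

```python
import re


def n_runs(sequence, min_length=1):
    """Return (start, end) coordinates of runs of 'N' (0-based, half-open)."""
    return [(m.start(), m.end())
            for m in re.finditer(r"[Nn]+", sequence)
            if m.end() - m.start() >= min_length]


def runs_near(runs, position, flank):
    """Keep the runs overlapping the window [position - flank, position + flank)."""
    return [(start, end) for start, end in runs
            if start < position + flank and end > position - flank]


# Toy assembly fragment (hypothetical) with two gaps.
fragment = "ACGTNNNNACGTNAC"
gaps = n_runs(fragment)
print(gaps)
print(runs_near(gaps, position=10, flank=3))
```

A gap lying within a few kilobases of an inversion breakpoint does not prove a misassembly, but it flags the region as one where the assembly had little sequence support.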

Workflow 3: finding multigene family members
Whole genome duplication and gene gain/loss events are prominent mechanisms for gene content variation (37). Evolutionary comparisons of gene content have been used to describe lineage-specific events [e.g. the degradation of metabolic pathways (38)], gene turnover rates between closely related species (39) and to study the role of duplications on evolutionary adaptation and innovation (40). CoGe's tools can be used to explore these unique patterns in gene content evolution.

SynFind
SynFind (41) can identify the location of regions syntenic to a query gene, the syntenic depth (number of times a region is syntenic to target genome regions) and the number of genes in each syntenic region. Briefly, SynFind identifies homologous gene pairs using LAST (42) or LASTZ (43) to detect sequence similarity. Next, the researcher selects a window of genes upstream and downstream of the query gene, within which a minimum number of genes must be found to define a region as syntenic. The final syntenic score is based on the number of genes found within the window, provided that number passes the minimum gene threshold. In addition, a researcher may choose a scoring function in which matches within a window must be collinear or simply present (density). These results can then be utilized to generate genome-wide lists of syntenic gene sets or be sent to GEvo for microsyntenic analysis. In Plasmodium spp., differences in gene content are often associated with changes in multigene family size and organization observable at the inter- and intra-specific levels (22,44,45). Here, we used two Plasmodium-specific multigene families [serine repeat antigen (SERA) (45) and cytoadherence-linked asexual gene (CLAG) (46)] as models for the analysis of multigene family evolution and gene content change. Plasmodium falciparum SERA-5 (PlasmoDB ID: PF3D7_0207600), a putative vaccine candidate (47), and P. falciparum CLAG-9 (PlasmoDB ID: PF3D7_0935800), a gene implicated in cytoadherence of infected erythrocytes (48) and solute transport (46), were used as family-specific gene queries. A diagram of the steps followed in this example workflow is included in Figure 8.
'Organism'). Change the SynFind general parameters (i.e. comparison algorithm) or synteny finding parameters (i.e. gene window size, minimum number of genes and maximum syntenic depth) before starting the analysis if needed.
5. Click on 'Run SynFind' to start the analysis (https://genomevolution.org/r/ohlf).
Results can be selected and exported from the 'Download' menu.
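The window-based scoring described for SynFind can be illustrated with a small sketch. This is not SynFind's implementation: the function names, the input format and the choice to count the query gene itself as part of the window are assumptions. The density score counts window genes with a homolog in the candidate region, while the collinear score counts only the longest run of homologs that preserves gene order in either orientation.

```python
from bisect import bisect_left


def longest_increasing(values):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for value in values:
        i = bisect_left(tails, value)
        if i == len(tails):
            tails.append(value)
        else:
            tails[i] = value
    return len(tails)


def synteny_score(query_order, query_gene, target_hits,
                  window=3, min_genes=2, collinear=False):
    """Score one candidate target region against a query gene's neighbourhood.

    query_order: gene names in chromosomal order in the query genome.
    target_hits: query gene name -> position of its homolog in the candidate
                 target region (hypothetical input format).
    Returns 0 when fewer than min_genes support the region.
    """
    center = query_order.index(query_gene)
    neighbours = query_order[max(0, center - window):center + window + 1]
    positions = [target_hits[g] for g in neighbours if g in target_hits]
    if collinear:
        forward = longest_increasing(positions)
        reverse = longest_increasing([-p for p in positions])
        score = max(forward, reverse)
    else:
        score = len(positions)  # density: homologs present, order ignored
    return score if score >= min_genes else 0


# Toy neighbourhood (hypothetical): the query gene g4 has no homolog in the
# candidate region, but four of its neighbours do.
order = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
hits = {"g2": 10, "g3": 11, "g5": 9, "g6": 13}
print(synteny_score(order, "g4", hits, window=2))
print(synteny_score(order, "g4", hits, window=2, collinear=True))
```

In this toy case the region is still scored as syntenic even though the query gene itself is absent, which corresponds to the notion of a regional proxy.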

SynFind results
SynFind identified a unique number of syntelogs (syntenic gene copies) and regional proxies (syntenic regions missing the query gene and thus potential evidence of duplication followed by loss) when PfCLAG-9 (https://genomevolution.org/r/ohll) or PfSERA-5 (https://genomevolution.org/r/ohlf) were queried (Supplementary File S4). At least one PfCLAG-9 syntelog or regional proxy was found for all analysed Plasmodium species. In contrast, multiple PfSERA-5 syntelogs or regional proxies were identified in each species, with some exceptions. These analyses show the distinctive evolutionary patterns of both families, with many SERA paralogs having conserved synteny while CLAG paralogs do not.

CoGeBLAST
CoGeBLAST uses the BLAST suite of search algorithms (49) or LASTZ (43) to query any set of genomes in CoGe and further extends the base functionality of BLAST by incorporating useful genome visualizations into the search results. CoGeBLAST's visualization can be used to identify patterns of gene organization (e.g. the organization of Plasmodium multigene families SERA and CLAG). In addition, CoGeBLAST results can be sent to GEvo for microsynteny analysis, enabling closer examination of local genome organization near query genes, as well as extracting the sequences of genes with significant BLAST hits for additional downstream analyses (e.g. inferring phylogenetic relationships). CoGeBLAST was used to perform sequence similarity searches across Plasmodium genomes and further explore differences in gene content. A diagram of the steps followed in this example workflow is included in Figure 9.
All genomes with names matching the search term will

Conclusions
The data presented herein are intended to serve as a demonstration of how CoGe's tools and services can be used to assess genome-wide evolutionary patterns, further characterize sequenced genomes and perform different types of comparative genomic analyses. It should be noted that only a fraction of the tools and services available on CoGe have been covered here. Tools related to exploration of complex evolutionary patterns (e.g. codon change matrices) and features that allow group collaboration and data sharing have not been discussed. Furthermore, though we described CoGe tools using publicly available Plasmodium data, the instructions, tools and resources shown here are applicable in studies investigating any number of genomes from any species. It is important to note that all analyses made in CoGe are reproducible, with links given to regenerate each analysis. Although CoGe is open for public and anonymous use for all publicly available genomes, for researchers who choose to create an account, CoGe will automatically track each analysis the researcher performs and list it in their User page (Supplementary File S6). In addition, having a CoGe user account lets researchers add their own data, keep them private and share them with collaborators. For computationally savvy researchers, CoGe also has a REST application programming interface that allows researchers to write programs to retrieve data, run analyses and integrate CoGe's features into their programs. As more genomic data are generated, open computational platforms such as CoGe let researchers easily manage and analyse their data without the need to stand up the entire computational infrastructure required to support large-scale analyses.
Figure 10. GEvo analysis using CoGeBLAST's output. Independent analyses are shown for the SERA (https://genomevolution.org/r/pee1) and the CLAG multigene families (https://genomevolution.org/r/z36c). Wedges formed between adjacent genomes show regions of sequence similarity in four Plasmodium species, with collinear sets used to identify syntenic blocks. The red arrow on top shows the location of the CLAG-9 and SERA-5 paralogs in P. vivax (Salvador-1). Note that SERA-5 exists in a tandem gene cluster, which results in many overlapping regions of sequence similarity matching each member of the cluster. Links to regenerate these analyses are in Supplementary File S2.