ANISEED 2019: 4D exploration of genetic data for an extended range of tunicates

Abstract
ANISEED (https://www.aniseed.cnrs.fr) is the main model organism database for the worldwide community of scientists working on tunicates, the vertebrate sister-group. Information provided for each species includes functionally-annotated gene and transcript models with orthology relationships within tunicates, and with echinoderms, cephalochordates and vertebrates. Beyond genes, the system describes other genetic elements, including repeated elements and cis-regulatory modules. Gene expression profiles for several thousand genes are formalized in both wild-type and experimentally-manipulated conditions, using formal anatomical ontologies. These data can be explored through three complementary types of browsers, each offering a different viewpoint. A developmental browser summarizes the information in a gene- or territory-centric manner. Advanced genomic browsers integrate the genetic features surrounding genes or gene sets within a species. A Genomicus synteny browser explores the conservation of local gene order across deuterostomes. This new release covers an extended taxonomic range of 14 species, including for the first time a non-ascidian species, the appendicularian Oikopleura dioica. Functional annotations, provided for each species, were enhanced through a combination of manual curation of gene models and the development of an improved orthology detection pipeline. Finally, gene expression profiles and anatomical territories can be explored in 4D online through the newly developed Morphonet morphogenetic browser.


INTRODUCTION
Tunicates are marine invertebrates with a key phylogenetic position as the sister group of the vertebrates (1,2). Three major groups of tunicates have been classically described. The sessile ascidians form the largest group with several thousand species listed. Two additional groups of tunicates have a pelagic life-style and rapid molecular evolution rates, the thaliaceans and the appendicularians. Their phylogenetic position with respect to ascidians has long remained debated. Molecular phylogenies suggest that the fast-evolving appendicularians are the sister group of all other tunicates, and that thaliaceans form a monophyletic group nested within ascidians (3,4). Tunicate studies have led to important discoveries in a variety of scientific fields. They illuminated the origin of vertebrate features, including the neural crest (5) or the secondary heart field (6,7). The simplicity of ascidian embryos makes them ideal to decipher the regulatory networks controlling embryonic development (8)(9)(10) and their evolution within the taxon (11)(12)(13)(14). Colonial ascidians have striking regenerative capacities, including Whole Body Regeneration from a small number of vascular cells (15)(16)(17). Some tunicates also have an important function in marine ecosystems (18) or can be damaging invasive species (19). Finally, they can be used to study the response of the marine fauna to global climate change (20) or to monitor pollution (21). Unlocking the potential of tunicate research across many fields requires the development of a suitable computational framework to centralize molecular, taxonomic and ecological information.
ANISEED is the main model organism database for the worldwide community of scientists working on tunicates, the sister-group of vertebrates (22)(23)(24). Established 15 years ago, the system has grown to become a fundamental resource for this community of around eighty labs worldwide, mostly located in Europe, Japan and the USA. On average in 2018, 170 000 pages were visited each month by roughly 1700 unique visitors, coming from all main international ascidian labs.
The ANISEED 2017 release (24) covered 10 species and integrated for each species: (i) a taxonomy page with suitable links to external taxonomic, ecological and molecular resources; (ii) a main knowledge base, the 'Developmental browser' structured around extended functional, gene expression and anatomical ontologies and interactive gene phylogenies as a comparative framework to study the developmental programs of different species; (iii) a multispecies genomic browser to visualize the position of genetic features along chromosomes; (iv) a Genomicus synteny browser (25) to analyse the evolution of gene order across tunicate and other chordate genomes. Care was taken during the development of ANISEED that the tool remains generic and adaptable with minimal effort to any developmental model organism.
During the preparation of ANISEED 2019, we added three additional solitary or colonial ascidian species with recently sequenced genomes (Molgula occulta, Corella inflata and Botrylloides leachii) and extended the system for the first time to a non-ascidian species, the appendicularian Oikopleura dioica. We significantly improved the functional annotation of genes, through the manual curation of gene model sets in some species and the refinement of our orthology assignment procedure, which now detects vertebrate orthologs for a majority of genes from all ascidian species, including the main ascidian model species, Ciona robusta (formerly referred to as Ciona intestinalis type A). We enriched the genomics datasets related to the control of gene expression in existing and new species. Finally, we interfaced the developmental browser of ANISEED 2019 to the MORPHONET morphogenetic browser, allowing 4D exploration of gene expression profiles.

Extension of the taxonomic range covered
In addition to the ten ANISEED 2017 species, three new ascidian species, for which genome and gene models were recently made available, were added to the portal. The solitary stolidobranch Molgula occulta is so closely related to M. oculata that hybrids between these two species can be produced, yet it is one of the few ascidian species that gives rise to tail-less larvae (26). The solitary Corellidae phlebobranch Corella inflata is a distant relative of Cionidae (Ciona species) and Ascidiidae (Phallusia species), and can be efficiently electroporated (B. Davidson, personal communication). The third species is a colonial stolidobranch species, Botrylloides leachii, closely related to Botryllus schlosseri, but with a much smaller genome size (27). Its regenerative potential is such that it is capable of whole body regeneration (WBR), including the germline, from a tiny piece of vascular tissue (28).
ANISEED 2019 now also covers for the first time a second tunicate group: the Appendicularia, which retain a tadpole morphology throughout their short adult life (29). A high-quality genome assembly was recently generated from a Japanese isolate of Oikopleura dioica. It was annotated using the ANISEED annotation pipeline and can be explored through a dedicated genome browser and a section of the developmental browser. Aplousobranchs and thaliaceans are currently not represented as no sequenced genome of sufficient quality has been reported for these groups (Figure 1).

Improved functional annotation pipeline
To improve functional annotations, we first curated the gene model sets retrieved from the various genome projects. The main improvement was achieved for Ciona robusta (formerly referred to as Ciona intestinalis type A), for which we completed the KH2012 gene model set with 1247 NCBI models for genes that had been missed in the KH set. In Phallusia mammillata, 724 inaccurate transcripts for 672 gene models were suppressed and the strand of 81 transcripts was reversed. Besides coding genes, repeat elements were manually reannotated.
Analysis of the quality of the results of our previous orthology assignment pipeline (24) indicated that orthologs of genes whose conserved domain extended over less than 40% of the protein sequence were frequently missed, as a result of the default threshold of the SiLiX software (30) used to build clusters of homologous proteins. This limitation was particularly problematic for a major class of developmental regulators, the transcription factors, whose conserved region is often limited to a short DNA-binding domain. To circumvent this issue, we adopted an iterative clustering procedure, starting with high-stringency SiLiX clustering and using progressively lower clustering stringency. Briefly, all genes from our 13 ascidian species, two echinoderms (Acanthaster planci, Strongylocentrotus purpuratus), two cephalochordates (Branchiostoma lanceolatum, Branchiostoma belcheri) and six vertebrates (Homo sapiens, Mus musculus, Gallus gallus, Pelodiscus sinensis, Latimeria chalumnae, Callorhinchus milii) were clustered at high stringency. Genes assigned to a family composed of genes from at least one echinoderm, six ascidians and four vertebrates were set aside. All other genes were clustered again at a reduced stringency; those assigned to a family with at least the same composition as above were set aside, and the remaining genes were clustered at a further reduced stringency. This sequential procedure progressively built families from increasingly divergent genes. Ten stringency steps were used by tuning two SiLiX parameters used to filter BLAST hits: -ident and -overlap (minimum % identity and minimum % overlap between proteins, respectively; see supplementary methods). This new approach successfully increased the number of detected orthologs in each orthology class between ascidian species and with vertebrates, as illustrated in Figure 2 for Ciona robusta. Detection of one-to-many and many-to-many relationships was particularly improved.
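As an illustration only, the iterative procedure described above can be sketched as follows. This is a minimal sketch, not the ANISEED pipeline itself: `cluster_fn` is a hypothetical wrapper standing in for one SiLiX run at given -ident and -overlap thresholds, the `Gene` record and the clade labels are assumptions made for the example, and the actual ten stringency steps are given only in the supplementary methods, so any thresholds passed in are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Gene:
    gene_id: str
    species: str
    clade: str  # "ascidian", "echinoderm", "cephalochordate" or "vertebrate"


# Minimum number of distinct species per clade for a family to be set aside
# (composition rule taken from the text).
MIN_SPECIES = {"echinoderm": 1, "ascidian": 6, "vertebrate": 4}

# Hypothetical signature: (genes, min_identity, min_overlap) -> list of families.
ClusterFn = Callable[[Set[Gene], float, float], List[Set[Gene]]]


def is_complete(family: Iterable[Gene]) -> bool:
    """True if the family spans enough species in each required clade."""
    species_by_clade = {}
    for gene in family:
        species_by_clade.setdefault(gene.clade, set()).add(gene.species)
    return all(
        len(species_by_clade.get(clade, ())) >= minimum
        for clade, minimum in MIN_SPECIES.items()
    )


def iterative_clustering(
    genes: Iterable[Gene],
    cluster_fn: ClusterFn,
    stringency_steps: Sequence[Tuple[float, float]],
) -> Tuple[List[Set[Gene]], Set[Gene]]:
    """Cluster at decreasing stringency, setting aside complete families.

    stringency_steps is ordered from most to least stringent; each item is a
    (min_identity, min_overlap) pair (placeholder values, not the paper's).
    Returns the accepted families and the genes left unassigned.
    """
    remaining = set(genes)
    accepted: List[Set[Gene]] = []
    for min_identity, min_overlap in stringency_steps:
        if not remaining:
            break
        for family in cluster_fn(set(remaining), min_identity, min_overlap):
            if is_complete(family):
                accepted.append(set(family))
                remaining -= set(family)
    return accepted, remaining
```

Injecting the clustering step as a function keeps the set-aside logic independent of how SiLiX is actually invoked, which is not described in this section.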
Detection of human orthologs of C. robusta transcription factors was also strongly improved, as was the detection of TF orthologs in Phallusia, Halocynthia and Molgula (Supplementary Figure S1). Comparison to a manually-curated set of orthology relationships between C. robusta and Homo sapiens transcription factors (see supplementary methods) revealed a very high selectivity of the 2019 ANISEED orthology pipeline (88% of orthology relationships detected by the 2019 pipeline match the ground truth) as well as a 33% improvement in the number of detected orthology relationships between the 2017 and 2019 pipelines (Supplementary Figure S2).
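For readers who wish to reproduce this kind of benchmark on their own data, the two figures quoted above correspond to simple ratios. The sketch below assumes that selectivity means the fraction of detected relationships present in the curated set, and that the improvement is a relative increase in counts; the function names and the representation of a relationship as a (C. robusta gene, human gene) pair are assumptions for illustration.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (C. robusta gene ID, human gene ID); IDs are hypothetical


def selectivity(detected: Iterable[Pair], curated: Iterable[Pair]) -> float:
    """Fraction of detected orthology relationships found in the curated set."""
    detected_set = set(detected)
    if not detected_set:
        raise ValueError("no detected relationships to evaluate")
    return len(detected_set & set(curated)) / len(detected_set)


def relative_gain(n_previous: int, n_current: int) -> float:
    """Relative increase in the number of detected relationships."""
    if n_previous <= 0:
        raise ValueError("n_previous must be positive")
    return (n_current - n_previous) / n_previous
```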
As in the previous release, interactive phylogenetic trees are presented for each cluster. In addition, a specific tab in each gene card now lists for each gene its different classes of orthologs (one-to-one, one-to-many and many-to-many) in each of the 23 deuterostome reference species from which the clustering was built, with direct links to the gene card of the relevant database. The system's Genomicus synteny browser was also updated with these new relationships.
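The three orthology classes listed in the gene cards can be illustrated with a short sketch. It assumes that orthology is given as a list of gene pairs between two species and that each orthology group is complete (every gene of the group in one species is paired with every gene of the group in the other), as is the case for relationships read from a gene tree; the function name and the gene identifiers used in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (gene in species A, gene in species B)


def classify_orthology(pairs: Iterable[Pair]) -> Dict[Pair, str]:
    """Label each orthologous pair as one-to-one, one-to-many or many-to-many."""
    unique_pairs = set(pairs)
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for gene_a, gene_b in unique_pairs:
        a_to_b[gene_a].add(gene_b)
        b_to_a[gene_b].add(gene_a)

    classes = {}
    for gene_a, gene_b in unique_pairs:
        n_partners_of_a = len(a_to_b[gene_a])
        n_partners_of_b = len(b_to_a[gene_b])
        if n_partners_of_a == 1 and n_partners_of_b == 1:
            classes[(gene_a, gene_b)] = "one-to-one"
        elif n_partners_of_a == 1 or n_partners_of_b == 1:
            classes[(gene_a, gene_b)] = "one-to-many"
        else:
            classes[(gene_a, gene_b)] = "many-to-many"
    return classes


example = classify_orthology([
    ("ciona_1", "human_1"),                          # single copy in both species
    ("ciona_2", "human_2"), ("ciona_2", "human_3"),  # duplicated in human only
    ("ciona_3", "human_4"), ("ciona_3", "human_5"),  # duplicated in both species
    ("ciona_4", "human_4"), ("ciona_4", "human_5"),
])
```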
As in the previous release, functional gene annotation included conserved InterPro domains, the three most-related human genes, and Gene Ontology annotations. The latter were inherited from GO annotations of IPR domains, best human BLAST hits and orthologs as previously (24). In addition, this release now also provides annotations from a dedicated tunicate-specific GO Slim developed in the previous version, which are also mined by the 'Genes (by GO term)' search tool.

Extension of the genomics and gene expression datasets
ANISEED 2017 genomics datasets included staged RNA-seq for C. robusta, P. mammillata and Halocynthia roretzi, ChIP-seq for the H3K4me3 promoter mark in C. robusta and P. mammillata, and SELEX-seq-based in silico transcription factor binding site prediction for C. robusta and P. mammillata. The major improvement in this release was the inclusion of a novel type of information, genome-wide chromatin accessibility status using ATAC-seq (31). In addition, we refined the TF binding site predictions and extended them to Halocynthia. Finally, we extended RNA-seq datasets to whole body regeneration experiments in Botrylloides leachii.
Chromatin accessibility. ATAC-seq data are provided as a public track hub (32) for C. robusta and P. mammillata. In each species, the hub presents the normalized coverage values for ATAC-seq experiments carried out in WT embryos at the blastula (64-cell), early gastrula (112-cell), late gastrula and mid neurula stages as described in (33). Additional tracks are presented for WT P. mammillata embryos (16-cell, 32-cell) and for experimentally-perturbed 64- and 112-cell embryos from both species, in which the Wnt/β-catenin pathway was activated by inhibition of the GSK3 kinase.
In silico prediction of conserved functional transcription factor binding sites. In the previous release, local scores corresponding to SELEX-seq-based in silico predictions of the binding of 129 C. robusta transcription factors (34) and 84 P. mammillata orthologs were presented as public hubs of the WashU browsers of these species. We updated this dataset with the improved orthology detection pipeline, which increased the number of Phallusia orthologs to 107, and extended it to the 88 H. roretzi orthologs of the Ciona TFs. This dataset allows the visual identification of candidate binding sites, but the continuous nature of the score does not allow putative binding sites to be identified programmatically. We therefore completed this dataset by extracting the summits corresponding to the center of peaks, associating to each of these summits the top score of the peak, and only keeping the top 10% of these summits to enrich for functional medium- to high-affinity binding sites. These candidate binding sites were highly predictive of functional binding sites. We extracted from the C. robusta cis-regulatory analysis section of ANISEED 320 experimentally identified TF binding sites, for which the structural TF class of the binding factor was known. 274 of these binding sites (85%) matched one of the top 10% SELEX peaks of the expected structural class. Figure 3 illustrates on a well-characterized enhancer, the Otx a-element (35), that the combination of this dataset with chromatin accessibility maps is a powerful tool to identify the cis-regulatory logic driving development.
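The conversion of a continuous score track into discrete candidate sites can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the text does not specify how peaks are delimited, so a peak is taken here to be a maximal run of consecutive positions whose score exceeds a hypothetical `baseline`, the summit is placed at the center of that run, and the 10% cut-off is applied to the list of summits passed in (presumably one list per transcription factor).

```python
from typing import List, Sequence, Tuple

Summit = Tuple[int, float]  # (position of the peak center, top score of the peak)


def call_summits(scores: Sequence[float], baseline: float = 0.0) -> List[Summit]:
    """Return one summit per run of consecutive positions scoring above baseline."""
    values = list(scores)
    summits: List[Summit] = []
    start = None
    # A trailing sentinel equal to the baseline closes a peak that ends the track.
    for position, score in enumerate(values + [baseline]):
        if score > baseline:
            if start is None:
                start = position
        elif start is not None:
            end = position - 1
            center = (start + end) // 2
            summits.append((center, max(values[start:end + 1])))
            start = None
    return summits


def keep_top_percent(summits: Sequence[Summit], percent: int = 10) -> List[Summit]:
    """Keep the highest-scoring summits (rounded up), returned in positional order."""
    if not summits:
        return []
    n_keep = max(1, -(-len(summits) * percent // 100))  # integer ceiling
    ranked = sorted(summits, key=lambda summit: summit[1], reverse=True)
    return sorted(ranked[:n_keep])
```

Integer arithmetic is used for the cut-off to avoid floating-point rounding when computing how many summits fall in the top 10%.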
A transcriptomic analysis of whole-body regeneration. In addition to the transcriptional dynamics of embryonic development in solitary Ciona, Phallusia and Halocynthia species, ANISEED 2019 now also includes a WashU public RNA-seq track hub showing the dynamics of B. leachii gene expression across 5 stages of whole-body regeneration, from a minuscule piece of vascular tissue to a fully-grown adult colony.
Manually-curated expression data. ANISEED combines large-scale genomics information with smaller-scale experiments extracted from the literature and manually curated. Manual curation continued over the past 2 years. The main improvements were a marked extension of the P. mammillata expression section (Supplementary Figure S3) and the manual curation of C. robusta in situ hybridization expression datasets that had initially been entered programmatically. This curation led to the removal of over four thousand expression profiles annotated 'no expression' or 'whole embryo' that conflicted either with the supporting evidence picture or with higher-confidence datasets.

New functionalities
Gene set extraction. The 'Gene set' view in the WashU genome browser offers the possibility to display several non-contiguous gene loci, including predefined lengths of 5′ and 3′ flanking sequences, in the same window. To further support this functionality, queries in ANISEED now support the extraction of lists of gene IDs that can be pasted into the 'Add a new Gene set' field of the WashU 'Gene & region set' App. To illustrate the process, Supplementary Figure S4 shows the expression by RNA-seq of 60 B. leachii genes annotated as Notch binding (GO:0005112). This overview identifies at a glance six genes with dynamic expression during whole-body regeneration, highlighting the potential of the approach to rapidly select members of a gene family with interesting expression, epigenetic profile, or presence of expected TF binding sites, depending on the scientific question addressed.
Online 4D visualization of gene expression profiles through MORPHONET. ANISEED 2019 stores over 20 000 expression profiles by in situ hybridization. For some regulatory genes, >150 expression patterns have been collected from the literature, sometimes with discrepancies between experiments and authors. To facilitate the exploration of this dataset, we interfaced ANISEED to the Morphonet online morphogenetic browser (36). Each gene card includes a specific tab, which opens a Morphonet session to visualize the gene's expression pattern in 4D (Figure 4). Importantly, the Morphonet visualization summarizes all available WT expression patterns at a given stage, the density of the label of a given cell increasing with the proportion of experiments showing expression in this cell.
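The summary underlying this visualization can be illustrated with a short sketch. It assumes that each WT experiment at a given stage is represented as the set of cells in which expression was reported, and that label density is simply proportional to the fraction of experiments reporting expression in a cell; the exact mapping used by Morphonet is not described here, and the function name and cell names in the example are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def label_density(experiments: Sequence[Iterable[str]]) -> Dict[str, float]:
    """Proportion of experiments reporting expression in each cell.

    experiments: one collection of cell names per experiment, all at the
    same developmental stage. Cells never reported are absent from the result.
    """
    n_experiments = len(experiments)
    if n_experiments == 0:
        return {}
    # set() ensures a cell is counted at most once per experiment.
    counts = Counter(cell for cells in experiments for cell in set(cells))
    return {cell: count / n_experiments for cell, count in counts.items()}


# Four hypothetical experiments for one gene at one stage.
densities = label_density([
    {"a6.5", "b6.5"},
    {"a6.5"},
    {"a6.5", "b6.5"},
    {"a6.5", "B6.4"},
])
```

In this toy example a cell reported by all four experiments gets the maximal density, while a cell reported by a single experiment, which may reflect a discrepancy between studies, is shown only faintly.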

Compatibility with FAIR guidelines and principles
Over the years, we have paid particular attention to offering findable, accessible, interoperable and reusable data, in agreement with the FAIR guidelines. To fulfill this aim, ANISEED uses established international standards, when available, at all levels of its conception (Chado database schema; Gene Ontology, Sequence Ontology, ChEBI ontology; MISFISHIE, MINSEQE minimal information standards; InterPro database of protein families). In addition, necessary tunicate-specific ontologies and guidelines, such as the anatomical ontologies, the Tunicate GO slim or the guidelines for nomenclature of genetic elements, are developed by the ANISEED biocurator team in collaboration with the relevant communities (37,38) and, when applicable, formatted according to the OBO Flat File Format.
All public data can be freely accessed and mined through web interfaces. In addition, an API (https://www.aniseed.cnrs.fr/api) and an extensive download section of files with standardized formats (https://www.aniseed.cnrs.fr/aniseed/download/download_data) are provided. ANISEED uses and provides standard formats for sharing data (for example JSON, Fasta, GFF3, GAF and NHX phylogenetic trees). Finally, all genomic elements in the database are retrievable by their unique identifier.
Tools are distributed under the GNU General Public License v3 (https://www.aniseed.cnrs.fr/aniseed/default/license). We are happy to share the code, currently deposited in a local Git server, with all interested scientists and to provide support for its installation. While ANISEED was initially developed for ascidians, a class of animals with stereotyped development, it can be used with minimal adaptation for any other taxon for which an OBO anatomical ontology is available.

[Figure 3, caption fragment] The bottom panel is an enlarged view of the Otx a-element (REG00000010), a short enhancer activated by ETS and GATA4/5/6 factors (35) through two ETS sites (blue boxes) and three GATA sites (green boxes). 'Score' indicates a continuous scoring of predicted affinity. 'Summits' associate the highest score of each peak to its summit base.

Nucleic Acids Research, 2020, Vol. 48, Database issue D673