A chromosome-scale genome assembly of the false clownfish, Amphiprion ocellaris

Abstract The false clownfish Amphiprion ocellaris is a popular fish species and an emerging model organism for studying the ecology, evolution, adaptation, and developmental biology of reef fishes. Despite this, high-quality genomic resources for this species are scarce, hindering advanced genomic analyses. Leveraging the power of PacBio long-read sequencing and Hi-C chromosome conformation capture techniques, we constructed a high-quality chromosome-scale genome assembly for the clownfish A. ocellaris. The initial genome assembly comprised of 1,551 contigs of 861.42 Mb, with an N50 of 863.85 kb. Hi-C scaffolding of the genome resulted in 24 chromosomes containing 856.61 Mb. The genome was annotated with 26,797 protein-coding genes and had 96.62% completeness of conserved actinopterygian genes, making this genome the most complete and high quality among published anemonefish genomes. Transcriptomic analysis identified tissue-specific gene expression patterns, with the brain and optic lobe having the largest number of expressed genes. Further, comparative genomic analysis revealed 91 genome elements conserved only in A. ocellaris and its sister species Amphiprion percula, and not in other anemonefish species. These elements are close to genes that are involved in various nervous system functions and exhibited distinct expression patterns in brain tissue, potentially highlighting the genetic toolkits involved in lineage-specific divergence and behaviors of the clownfish branch. Overall, our study provides the highest quality A. ocellaris genome assembly and annotation to date, whilst also providing a valuable resource for understanding the ecology and evolution of reef fishes.


Introduction
The false clownfish Amphiprion ocellaris is one of 28 anemonefishes (from the subfamily Amphiprioninae in the family Pomacentridae) among thousands of tropical marine fish species (Roux et al. 2020). Yet, together with its sister species, the orange clownfish Amphiprion percula, it is one of the most recognizable fish, especially among the nonscientific community, following the Disney movie "Finding Nemo" (Militz and Foale 2017). Even before the release of this film more than 15 years ago, the visual appeal and ability to complete their life cycle in captivity made clownfish a highly desired species in the marine aquarium trade (Rhyne et al. 2017;Militz et al. 2018). For biologists, on the other hand, anemonefishes offer a unique opportunity to answer complex research questions about symbiosis, social dynamics, sex change, speciation, and phenotypic plasticity (Roux et al. 2020).
Until now, genome assemblies of at least 10 anemonefish species including A. ocellaris have been published (Marcionetti et al. 2018(Marcionetti et al. , 2019Tan et al. 2018;Lehmann et al. 2019). Yet, except for A. percula, these genomes are mainly based on Illumina short-read technology and are therefore highly fragmented, resulting in multiple gaps and misassemblies. However, third-generation sequencing platforms such as Pacific Biosciences, produce longer reads (5-60 kb) that enhance the continuous assembly of genome sequences (van Dijk et al. 2018;Logsdon et al. 2020). This makes it possible to assemble complex regions of genomes, thus improving our ability to decipher genomic structures (such as chromosome rearrangements) and long-range regulatory analysis (Rhie et al. 2021). For example, 29% of N-gaps in the human reference genome (GRCh38) could be filled with PacBio long reads (Shi et al. 2016). In the case of A. ocellaris, the inclusion of Nanopore long reads together with Illumina data led to a 94% decrease in the number of scaffolds, an 18 times increase in scaffold N50 (401.72 kb), and a 16% improvement in genome completeness (Tan et al. 2018). The PacBio long-read assembly of the A. percula genome further emphasized the power of long-read technology, with an initial contig assembly N50 of 1.86 Mb, further anchored into the chromosome-scale assembly (scaffold N50 of 38.4 Mb) (Lehmann et al. 2019).
Here, we constructed a high-quality chromosome-scale genome assembly and gene annotation for the false clownfish A. ocellaris. Using a combination of PacBio and Hi-C sequencing, we produced a de novo assembly comprised of 1,551 contigs with an N50 length of 863,854 bp that were successfully anchored into 24 chromosomes of 856,612,077 bp. We annotated 26,797 proteincoding gene models with the proportion of conserved actinopterygian genes reaching 96.62%, making the quality and completeness of our genome better than previously published anemonefish genomes. A comparative genomic approach identified genomic elements conserved only in the A. ocellaris/A. percula branch but not in other anemonefishes, many of which are associated with genes involved in nervous system functioning and were differentially expressed in brain tissues. Ultimately, our work adds to the growing body of high-quality fish genomes critical to study genetic, ecological, evolutionary, and developmental aspects of marine fishes in general.

Specimen collection and nucleic acid sequencing
Three adult A. ocellaris clownfish (1 female and 2 males) were collected from 5 m depth in Motobu, Okinawa (26 71 0 29.83 00 N, 127 91 0 57.51 00 E) on 2020 March 25. Fish were kept under natural conditions at the OIST Marine Science Station in a 270-l (60 Â 90 Â 50 cm) tank until 2020 May 19. Individuals were euthanized following the guidelines for animal use issued by the Animal Resources Section of OIST Graduate University. Tissues for genome sequencing were snap frozen in liquid nitrogen and then stored at À80 C until further processing. Genomic DNA was extracted from a male clownfish using a Qiagen tissue genomic DNA extraction kit (Hilden, Germany) and sequenced at Macrogen (Tokyo, Japan). For genome assembly, we sequenced genomic DNA from the brain tissue of the same male fish using 2 different platforms: PacBio Sequel II and Illumina NovaSeq6000 (Supplementary Table 1). For long-read sequencing, 8 mg of genomic DNA was used to generate a 20 kb SMRTbell library according to the manufacturer's instructions (Pacific Biosciences, CA, USA). Briefly, a 10-ml SMRTbell library was prepared using a SMRTbell Express Template Prep Kit 2.0 and the resulting templates were bound to DNA polymerases with a Sequel II Binding Kit 2.0 and Internal Control Kit 1.0. Sequencing on the PacBio Sequel II platform was performed using a Sequel II Sequencing Kit 2.0 and a SMRT cells 8M Tray. SMRT cells using 15 h movies were captured. For short-read sequencing, a library was prepared from 1 mg of genomic DNA and a TruSeq DNA PCR-free Sample Preparation Kit (Illumina, CA, USA). Paired-end (151 bp per read) sequencing was conducted using a NovaSeq6000 platform (Illumina, CA, USA).
Hi-C reads were also sequenced to capture chromatin conformation for chromosome assembly. Liver (>100 mg) tissue from another male fish was snap frozen and stored at À80 C (Supplementary Table 1). The tissue was sliced into small pieces using a razor blade (to increase the surface area for efficient cross-linking), resuspended in 15 ml of 1% formaldehyde solution, and then incubated at room temperature for 20 min with periodic mixing. Glycine powder was added to the solution for a final concentration of 125 mM followed by a 15-min incubation at room temperature with periodic mixing. Samples were spun down at 1,000 g for 1 min, the supernatant was removed, and the tissue was rinsed with Milli-Q water. Tissues were then ground into a fine powder using a liquid nitrogen-chilled mortar and pestle. Powdered samples were collected and stored at À80 C. Chromatin isolation, library preparation, and Hi-C sequencing was performed by Phase Genomics (WA, USA). Following the manufacturer's instructions, a Proximo Hi-C 2.0 Kit (Phase Genomics, WA, USA) was used to prepare the proximity ligation library and process it into an Illumina-compatible sequencing library. Hi-C reads were sequenced on an Illumina NovaSeq6000 platform to generate 150 bp paired-end reads.
Tissues for transcriptome sequencing were dissected from 2 individuals (1 male and 1 female) and stored in RNAlater stabilization solution (Sigma Life Science, MO, USA) at À80 C. Transcriptome library preparation and sequencing were performed by Macrogen (Tokyo, Japan). Briefly, mRNA was extracted from brain optic lobe, caudal fin, eye, gill, gonads (from male and female fish), intestine, kidney, liver, the rest of the brain, skin (from orange and white bands), and stomach tissues using a Qiagen RNeasy Mini Kit (Hilden, Germany). Only high-quality RNA samples with an RNA integration number >7.0 were used for library construction. Libraries were prepared with 1 mg of total RNA for each sample using a TruSeq Stranded mRNA Sample Prep Kit (Illumina, CA, USA). Paired-end sequencing (151 bp) was conducted on a NovaSeq6000 machine.
Chromosome-scale genome assembly of A. ocellaris Prior to de novo assembly genome size was estimated using Jellyfish v2.3.0 (Marc¸ais and Kingsford 2011) with k-mer ¼ 17 and default parameters, and GenomeScope v1.0 (Vurture et al. 2017) with default parameters. Quality-trimmed Illumina short reads obtained from Trimmomatic v0.39 (Bolger et al. 2014) using the parameter set "ILLUMINACLIP: TruSeq3-PE.fa: 2:30:10:8: keepBothReads LEADING: 3 TRAILING: 3 MINLEN: 36" were used as input for Jellyfish. Genomic contigs were assembled using the FALCON software version as of 2020 September 28. For chromosome-scale assembly, initial contigs obtained from FALCON-phase were scaffolded with Phase Genomics' Proximo algorithm based on Hi-C chromatin contact maps. In brief, the processed Hi-C sequencing reads were aligned to the Falcon assembly with BWA-MEM (Li 2013) using the -5 SP and -t 8 options. PCR duplicates were flagged with SAMBLASTER v0.1.26 (Faust and Hall 2014) and subsequently removed from all following analyses. Nonprimary and secondary alignments were filtered using SAMtools v1.10 (Li et al. 2009) with the -F 2304 flag. FALCON-Phase (Kronenberg et al. 2018) was then used to correct phase switching errors in the scaffolds obtained from FALCON-Unzip (Chin et al. 2016).
A genome-wide contact frequency matrix was built from the aligned Hi-C read pairs and normalized by the number of DPNII restriction sites (GATC) on the scaffolds, as previously described (Bickhart et al. 2017). A total of 40,000 individual Proximo runs were performed to optimize chromosome construction. Juicebox v1.13.01 (Durand et al. 2016) was used to correct scaffolding errors and FALCON-Phase was again used to correct phase switching errors detectable at the chromosome level but not at the scaffold level. Local base accuracy in the long read-based draft assembly was improved with Illumina short reads using Pilon v1.23 (Walker et al. 2014). Quality-trimmed Illumina short reads obtained from Trimmomatic v0.39 (Bolger et al. 2014) using the same parameter set described above were aligned to the proximo-assembled chromosome-scale genome with Bowtie2 v2.4.1 (Langmead and Salzberg 2012) using the default settings. SAM files were converted to BAM files with SAMtools v1.10 (Li et al. 2009) and then used as input for Pilon. Error correction was completed using 5 iterations of Pilon.

Assembly and annotation of the mitochondrial genome
The mitochondrial genome of A. ocellaris was assembled using Norgal v1.0.0 (Al-Nakeeb et al. 2017) with quality trimmed Illumina genomic reads. MitoAnnotator v3.67 (Sato et al. 2018) was used to annotate the organelle genes. Annotated genes in this study were compared with previously published A. ocellaris genes using BLASTn v2.10.0 (Altschul et al. 1990) with e-value 10 À4 as a threshold to predict homology. Only the longest isoform of each gene model was used for the homology search.

Analysis of gene expression
Transcriptomic reads from each tissue were processed with Trimmomatic v0.39 (Bolger et al. 2014) using the parameter set mentioned above and mapped to the genome using HISAT2 v2.2.1 (Kim et al. 2019). SAM files were then converted to BAM files using SAMtools v1.10 (Li et al. 2009). Expression levels were quantified and TPM (transcripts per million) was normalized with StringTie v2.1.4 (Pertea et al. 2016). Tissue-specify index (s) was calculated for each gene using the R package tispec v0.99 (Condon 2020), with the relationship between s and TPM expression values visualized on a 2D histogram with ggplot2 v3.3.5 (Wickham 2009). TPM expression values per tissue were visualized in an Upset plot with the UpSetR v1.4.0 package (Conway et al. 2017).
Phylogenetic trees were first constructed based on maximumlikelihood criteria using the 2 datasets described above. The MPI version of RAxML v8.2.9 (raxmlHPC-MPI-AVX) (Stamatakis 2014) was executed using a LG substitution matrix, heterogeneity model GAMMA, and 1,000 bootstrap inferences. Next, a subset of proteins for each species that has a complete match to the Actinopterygii-lineage (actinopterygii_odb10) identified by BUSCO v4.1.4 (Simão et al. 2015) were selected, concatenated, and used to construct new maximum-likelihood and Bayesian trees. Bayesian tree reconstructions were conducted under the CAT-GTR model as implemented in PhyloBayes MPI v1.8 (Lartillot et al. 2013). Two independent chains were run for at least 5,000 cycles and sampled every 10 trees. The first 2,000 trees were removed as burn-in. Chain convergence was evaluated so that the maximum and average differences observed at the end of each run were < 0.01 in all cases. Trees were visualized and rerooted using iTOL v6.4 (Letunic and Bork 2021). Branch supports in the phylogenetic trees were evaluated with the standard bootstrap values from RAxML and PhyloBayes for maximum-likelihood and Bayesian trees, respectively. Site concordance factors (i.e. the proportion of alignment sites that support each branch) were also evaluated using IQ-TREE v2.1.3 (Minh et al. 2020).

Interspecies synteny
Patterns of synteny (i.e. the degree to which genes remain on corresponding chromosomes) and collinearity (i.e. in corresponding order) across all anemonefish genomes were investigated using the MCScanX toolkit (Wang et al. 2012). Briefly, an all-vs-all BLASTp search (using the parameters "-evalue 10 À10 -max_tar-get_seqs 5") was first performed to identify gene pairs among species. Synteny blocks between 2 species were then calculated using the following parameters: "-k 50 -g -1 -s 10 -e 1e-05 -u 10,000 -m 25 -b 2." This approach identified collinear blocks that had at least 10 genes with an alignment significance <10 À5 in a maximum range of 10,000 nucleotides between genes. Results were visualized using SynVisio (Bandi and Gutwin 2020). The divergence time between 2 species were obtained from the TimeTree database (Kumar et al. 2017).

Identification of conserved genomic elements
For whole-genome alignment analysis, genome sequences and gene annotations of the previously selected 11 anemonefish species and A. polyacanthus were used. Repeat elements were identified using RepeatModeler v2.0.1 (Flynn et al. 2020) and RepeatMasker v4.1.1 (Tempel 2012) as described above and then soft-masked using BEDTools v2.30.0 (Quinlan 2014). Repeatmasked genome sequences and phylogenetic trees constructed with RAxML v8.2.9 (Stamatakis 2014) were used as input for whole genome alignment with Cactus multiple genome aligner v1.3.0 (Armstrong et al. 2020). Resulting HAL databases were converted to MAF format using hal2maf v2.1 (Hickey et al. 2013) with the A. ocellaris genome as a reference. MAFFILTER v1.3.1 (Dutheil et al. 2014) was then used to exclude repetitive regions and short alignments (<100 bp). RPHAST v1.6.11 (Hubisz et al. 2011) was used to identify conserved genomic elements in the A. ocellaris/A. percula branch from the alignment. Adjusted P-values of significant conservation for genomic elements were computed using the Benjamini and Hochberg method with the p.adjust function implemented in the R package stats v4.1.0 (R Core Team 2013). Genomic elements with an adjusted P-value <0.05 were considered as significantly conserved. Genes close to these genomic elements were identified with the closestBed function of BEDTools v2.30.0 (Quinlan 2014). Conserved elements were visualized using Circos v0.69-8 (Krzywinski et al. 2009). Expression values of genes close to conserved elements in the A. ocellaris/A. percula branch were visualized with a heatmap using the heatmap.2 function in gplots v3.1.1 (Warnes 2015).

Results and discussion
Chromosome-scale genome assembly of A. ocellaris To construct high-quality chromosomes of A. ocellaris, we first generated 12,376,320 PacBio long-reads (average read length 10,239 bp) and 672,631,646 Illumina short reads (read length 151 bp) from brain tissue of an adult A. ocellaris individual (Supplementary Table 1). Prior to the de novo draft genome assembly, we investigated the global properties of the genome with Illumina short reads using Jellyfish v2.3.0 (Marc¸ais and Kingsford 2011) and GenomeScope v1.0 (Vurture et al. 2017 To improve the quality of the chromosome assembly, we performed iterative error-correction on the 24 chromosome sequences with Illumina short reads using Pilon v1.23 (Walker et al. 2014). At the 5th iterative run, 97.94% of the reads were aligned to the 24 chromosome sequences. Finally, we obtained 24 chromosomes, with a length ranging from 21,987,767 to 43,941,765 bp, totaling 856,612,077 bp (Fig. 1). Overall GC content of the A. ocellaris genome was 39.58%. The mean base-level coverage of the assembled chromosomes was 103.89Â. Completeness of the genome assembly was assessed with BUSCO v4.1.4 (Simão et al. 2015) using the Actinopterygii-lineage dataset (actinopterygii_odb10). The overall BUSCO score was 97.01% (complete and single-copy BUSCOs: 96.21%; complete and duplicated BUSCOs: 0.8%; fragmented BUSCOs: 0.52%; missing BUSCOs: 2.47%) (Table 1).
Finally, we compared our chromosome-scale assembly with 3 other A. ocellaris draft genomes that have been previously published ( Tan Table 2). In addition to large differences in size, which ranged from 744,831,443 to 880,704,246 bp, many misassembly events such as relocations, translocations, and inversions were observed in these other A. ocellaris genomes. This is likely due to the limitations of the short-read sequencing technologies upon which these assemblies were constructed.

Prediction of A. ocellaris gene models
Repetitive elements in the A. ocellaris genome were examined by 2 approaches: (1) pattern matching using previously cataloged repetitive elements and (2) de novo. First the vertebrata repeat library from DFAM (Storer et al. 2021) was queried against the A. ocellaris genome sequences using RepeatMasker v4.1.1 (Tempel 2012). We then identified 2,301 de novo repetitive elements using RepeatModeler v2.0.1 (Flynn et al. 2020) and again searched for them in the A. ocellaris genome using RepeatMasker. A large fraction of the genome consisted of DNA transposons (24.11%), longinterspersed nuclear elements (7.67%), long-terminal repeats (LTRs, 3.77%), and rolling-circle transposons (1.55%) ( Fig. 2;  Supplementary Table 3). In total, 44.7% (382,912,159 bp) of the whole genome was identified as repetitive elements. This is similar to the repeat content estimated from the unassembled short reads (42.62%) using GenomeScope v1.0 (Vurture et al. 2017) as described above. It should be noted though, that the sum of occupied percentages in the genome per repeat group is larger than the actual percentage (44.7%) in the genome due to nested and overlapping repetitive elements (Fig. 2).
We then annotated the genome using BRAKER v2.1.6 (Brůna et al. 2021) with mRNA and protein evidence. This evidence consisted of mapped transcriptomic reads sequenced from 13 tissues (Supplementary Table 1), protein datasets from UniProtKB/Swiss-Prot (UniProt Consortium 2021), and the proteomes of 9 fish species. BRAKER predicted 26,433 gene models that were supported by either mRNA or protein hints, and 12,333 gene models with no evidence support. To account for the incompleteness of the evidence provided here and the gene annotation algorithm, we further added 364 nonsupport genes that have homology to the Swiss-Prot protein database and/or Pfam domains to the final gene models. This led to 26,797 final gene models from which 26,498 genes (98.88%) had significant homology to the NCBI nr database (bit-score ! 50) and 21,230 genes (79.23%) had at least 1 associated GO term. The completeness of our gene annotation was assessed using BUSCO v4.1.4 (Simão et al. 2015). We obtained 96.62% of completeness using the Actinopterygii-lineage dataset (complete and single-copy BUSCOs: 95.52%; complete and duplicated BUSCOs: 1.1%; fragmented BUSCOs: 1.04%; missing BUSCOs: 2.34%) ( Table 1). This is higher than all other A. ocellaris and anemonefish gene annotations (Marcionetti et al. 2018(Marcionetti et al. , 2019Tan et al. 2018;Lehmann et al. 2019), in both the overall completeness and duplicated ratio, thus suggesting that the genome we present here is currently the best anemonefish genome annotation. Furthermore, our gene models include the majority (93.97-97.63%) of gene models reported in previously published A. ocellaris genomes (Tan et al. 2018;Marcionetti et al. 2019). However, gene models from these studies include fewer gene models from this study (87.93-91.43%), again indicating, that our gene models are the most comprehensive published to date (Supplementary Table 4).

Analysis of gene expression patterns across tissues
Gene expression levels of A. ocellaris genes were quantified using 13 tissue transcriptomes (Supplementary Table 1). We investigated tissue-specificity of gene expression levels using the tau (s) index as it is the most robust metric for identifying tissuespecific genes (Kryuchkova-Mostacci and Robinson-Rechavi 2017). Given the range (0-1) of the s index, we obtained 1,237 (4.62%) absolutely specific genes (s ¼ 1; genes expressed only in 1 tissue), 5,302 (19.79%) highly specific genes (0.85 s < 1; genes highly expressed in a few tissues), and 3,431 (12.8%) housekeeping genes (s 0.2; genes expressed in nearly all tissues without biased expression) as defined by the R package tispec v0.99 (Condon 2020). Tissue-specificity of gene expression showed a negative correlation (Pearson's correlation coefficient between s and log 10 maximum TPM value per gene ¼ À0.46) with expression levels (Fig. 3a), which is consistent with previous observations that highly tissue-specific genes tend to have lower expression levels (Kryuchkova-Mostacci and Robinson-Rechavi 2017; Bentz et al. 2019).
Next, we checked gene expression patterns across tissues. After filtering for TPM ! 10, brain was the tissue with the highest number of expressed genes (n ¼ 13,283) followed by optic lobe (n ¼ 12,547) and eye (n ¼ 11,809) (Fig. 3b). This high number of genes expressed in the brain has also been reported in other vertebrates (Lein et al. 2007;Hawrylycz et al. 2012;Bentz et al. 2019), and is most likely due to the complex role the brain has as the bodies control center. Furthermore, we observed that 1,957 genes were expressed in all 13 tissues sequenced here, and only 438 and 255 genes were exclusively expressed in the brain and optic lobe, respectively (Fig. 3b). Considering the high quality and similar numbers of transcriptomic reads per tissue generated in this study (Supplementary Table 1), we are confident that these results represent the most accurate transcriptomic atlas for A. ocellaris to date.

Phylogenetic analysis
Comparative analyses investigating the diversity and abundance of A. ocellaris gene families relative to other anemonefishes were performed using OrthoFinder v2.5.2 (Emms and Kelly 2019) with A. polyacanthus as an outgroup species (Supplementary Table 5). Overall, most sequences (96.7%) could be assigned to one of 29,111 orthogroups, with the remainder identified as "unassigned genes" with no clear orthologs (Supplementary Table 6). Fifty  percentage of all proteins were in orthogroups consisting of 12 or more genes (G 50 ¼ 12) and were contained in the largest 10,672 orthogroups. Further, 15,899 orthogroups were shared amongst all the species examined here, and from these, 12,765 consisted entirely of single-copy genes (Supplementary Table 6). Interestingly, all trees ( Fig. 4; Supplementary Fig. 2) obtained here using maximum likelihood or Bayesian inference approaches had the same topology and shared some similarities with aspects of earlier work (Litsios and Salamin 2014;Litsios et al. 2014b;Marcionetti et al. 2019): a monophyletic A. polymnus and A. sebae group clustering with an Indian Ocean clade represented by A. bicinctus and A. nigripes, the skunk anemonefishes A. akallopisos and A. perideraion, an "ephippium complex" comprising A. frenatus and A. melanopus, and the monophyletic A. ocellaris/A. percula sister-species. Our tree diverges most dramatically from previous analyses in that P. biaculeatus was not positioned within the A. ocellaris/A. percula clade but became the root of all other anemonefishes with 100% bootstrap support ( Fig. 4; Supplementary Fig. 2). This topology has only been reported in 2 other studies that used mitochondrial genes to reconstruct anemonefish phylogenies (Santini and Polacco 2006;Nguyen et al. 2020). Pomacentrids (and anemonefishes in particular) have long been a challenge in systematics due to their high diversity and intraspecific variation (Tang et al. 2021), therefore discordances in our tree may also stem from insufficient information (i.e. only 11 out of the 28 described species were used). Specifically, the inclusion of Amphiprion latezonatus, the sister group of all Amphiprion except for the A. ocellaris/A. percula clade, could be essential to resolve the molecular phylogeny of anemonefishes (Santini and Polacco 2006;Litsios and Salamin 2014;Litsios et al. 2014b). This topology could also be the result of gene choice, as incongruences between trees based on mitochondrial and nuclear data has previously been observed (Litsios and Salamin 2014). Yet, here, we used an alignment matrix consisting   Fig. 4. Phylogenetic reconstruction of the Amphiprioninae species tree using a maximum-likelihood approach. Numbers on each branching node are the bootstrap support (%) and the site concordance factor (%). These values were calculated only for nonoutgroup species using the IQ-TREE algorithm.
of more than 12,000 single-copy genes (182,497 parsimony informative sites and 2.8% gaps) and still obtained this topology (Fig. 4). This was further confirmed using BUSCO genes ( Supplementary Fig. 2), predefined sets of reliable markers for phylogenetic inference (Waterhouse et al. 2018). We also observed weak support values using site concordance factors (i.e. the percentage of sites supporting a specific branch over 1,000 randomly sampled quartets) in some branches. For example, a support value of 45.89% was recovered at the branching node of P. biaculeatus despite having 100% bootstrap support ( Fig. 4; Supplementary Fig. 2), thus suggesting high uncertainty. Still, despite the incongruence observed here, our phylogenetic reconstructions are based on large-scale genomic evidence, whilst other studies have used only a few genes. Although we are confident that our trees have a good resolution and represent one of the most enriched phylogenies for anemonefishes in terms of supporting genomic loci and reduced stochastic error, we are nonetheless cautious in our interpretation of the phylogenetic delimitation of species presented here. Certainly, establishing a well-resolved phylogeny of anemonefish, particularly the early divergent species (i.e. P. biaculeatus, the A. ocellaris/A. percula clade, and A. latezonatus), is critically important to understanding the evolution, genomic underpinning of their lifestyle (e.g. symbiosis with sea anemones, complex social structure) and fascinating biological features (e.g. pigmentation, sex change, aging).

Whole-genome synteny of anemonefishes
Syntenic blocks are often used to evaluate micro-and macroscale patterns of evolutionary conservation and divergence among related species. Identifying conserved gene order at the chromosomal level among species furthers our understanding of the molecular processes that led to the evolution of chromosome structure across species (Wang et al. 2012;Liu et al. 2018). Thus, here we used MCScanX (Wang et al. 2012) to investigate whole genome synteny among all species present. Overall, synteny patterns were consistent with the phylogenetic tree, in that closely related species had a higher number of conserved blocks than distant species (Supplementary Table 7 and Supplementary Fig.  3), ultimately reflecting how gene gains or losses and sequence divergence increase proportionally with evolutionary time (Liu et al. 2018). Yet, since all species studied here are still closely related, shared synteny among species pairs is considerably high. As expected, synteny between A. ocellaris and A. percula was much higher than comparisons to other anemonefishes (Supplementary Table 7 and Fig. 3). This analysis identified 175 syntenic blocks of 19,872 genes (Supplementary Table 7) ranging from 11 to 1,010 gene pairs with 76.2% of these being collinear (i.e. conserved order).
Although studying pairwise collinear relationships among chromosomal regions allows for the elucidation of gene family evolution, the alignment of multiple regions is even more important as it can reveal complex chromosomal duplication and/or rearrangement relationships (Wang et al. 2012). Teleost fish genomes have been dynamically shaped by several forces (such as WGD and transposon activity). These in turn, led to various types of chromosome rearrangements either through differential loss of genes or formation of deletions, duplications, inversions, and translocations, which together contribute to reproductive isolation and therefore might promote the formation of a new species (Volff 2005). Some chromosomal regions are translocated to new positions whereas others are inversed ( Supplementary  Fig. 4). While the information shown here is merely an initial overview of the large-scale synteny of the false clownfish and other anemonefishes, it is still an important first step in obtaining evolutionary insights into the Amphiprioninae lineage.
Lineage-specific conserved genomic elements in the A. ocellaris/A. percula branch Conserved genomic elements are relatively unchanged sequences across species. They are often parts of essential proteins or regulatory units and can be related to characteristics of specific lineages (Volff 2005). To identify the signature of such conserved elements in the A. ocellaris genome, we first attempted to identify conserved elements in A. ocellaris but not in other anemonefish using the PHAST program (see Materials and Methods) (Hubisz et al. 2011). However, we were unable to identify genomic elements that were only conserved in this species, therefore we next sought to identify conserved elements shared by the 2 sister species of A. ocellaris and A. percula. We identified 91 conserved genome elements that showed significant conservation (adjusted P-value < 0.05).
To understand the possible role of these conserved elements, we investigated the function of 62 genes located around these elements (Supplementary Table 8). It is interesting to note that at least 21 out of these 62 genes could be involved in neurological functions. For example, the pcdh10 gene, encoding the protocadherin-10 protein, has been shown to be expressed in the olfactory and visual system of vertebrates as well as being involved in synapse and axon formation in the central nervous system (Mancini et al. 2020). Furthermore, a recent study also showed that mice lacking 1 copy of this gene have reduced social approach behavior (Schoch et al. 2017). Similarly, the asic2 gene is expressed in the central and peripheral nervous system of vertebrates and its encoded protein, acid-sensing ion channel 2, is vital for chemo-and mechano-sensing the environment (Cheng et al. 2018). The neuronal pentraxin 2 protein, encoded by the nptx2 gene, plays a role in the alteration of cellular activities for longterm neuroplasticity (Chapman et al. 2019). The tafa5 gene encodes a neurokine involved in behavior related to spatial memory in mice (Huang et al. 2021) and peripheral nociception in zebrafish (Jeong et al. 2020).
Additionally, these 62 genes also showed distinct expression patterns in A. ocellaris brain tissues (optic lobe and the rest of the brain) (Fig. 5). Mean TPM expression levels of these genes were 29.8 and 29.71 for the optic lobe and the other part of brain, respectively, whereas mean TPM for other tissues was 14.17, potentially indicating different, yet unknown roles of these genes in the brain. Although further investigation is required, this data suggest that neuronal genes located around specifically conserved elements in the A. ocellaris/A. percula branch could represent genomic signatures related to the distinct ecology of these 2 sister species. Certainly, anemonefish societies are highly species-specific (Litsios et al. 2012(Litsios et al. , 2014a. For example, while A. percula have a reduced range of movement, spending more time inside of their anemones and thus have a lower probability of social rank being usurped by outsiders, other species like Amphiprion clarkii have more opportunity for movement (due to their higher swimming abilities), increasing the probability of being taken over by outsiders so that the dominant individuals must display constant aggression to maintain control of their territory (Cleveland et al. 2011;Verde et al. 2015;Schmiege et al. 2017). Future research should endeavor to better characterize these differences and investigate whether they are linked to the genes we have identified here.

Conclusions
Here, we assembled the highly contiguous and complete chromosome-scale genome of the false clownfish A. ocellaris by de novo assembly using PacBio long reads and Hi-C chromatin conformation capture technologies. We annotated 26,797 protein-coding genes with 96.62% completeness of conserved actinopterygian genes, the highest level among anemonefish genomes available so far. We also identified tissue-specific gene expression patterns in A. ocellaris. Finally, we identified genomic elements conserved only in A. ocellaris/A. percula, which might underpin lineage-specific characteristics of these 2 species when compared to other anemonefishes. The high-quality of our genome and annotation will not only serve as a resource to better understand the genomic architecture of anemonefishes, but it will further strengthen the false clownfish as an emerging model organism for molecular, ecological, developmental, and environmental studies of reef fishes.

Data availability
The genomic and transcriptomic sequencing reads generated in this study have been deposited in NCBI GenBank database under the BioProject ID PRJNA787397. This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession number JAJUWX000000000. The genome annotation is available in Dryad repository (https://datadryad.org/stash/share/ UpzvIVKZOj21CcO38uwnPMgF1_ONpMM_LqX_S_0pSoE).
Supplemental material is available at G3 online.