Whole genome sequence of the deep-sea sponge Geodia barretti (Metazoa, Porifera, Demospongiae)

Abstract Sponges are among the earliest branching extant animals. As such, genetic data from this group are valuable for understanding the evolution of various traits and processes in other animals. However, like many marine organisms, they are notoriously difficult to sequence, and hence, genomic data are scarce. Here, we present the draft genome assembly for the North Atlantic deep-sea high microbial abundance species Geodia barretti Bowerbank 1858, from a single individual collected on the West Coast of Sweden. The nuclear genome assembly has 4,535 scaffolds, an N50 of 48,447 bp and a total length of 144 Mb; the mitochondrial genome is 17,996 bp long. BUSCO completeness was 71.5%. The genome was annotated using a combination of ab initio and evidence-based methods finding 31,884 protein-coding genes.


Introduction
Sponges (phylum Porifera) hold an evolutionarily important position as one of the earliest animal lineages (Redmond and McLysaght 2021;Schultz et al. 2023).Sponges do not produce organs but feature instances of true epithelial tissue, a hallmark of metazoans (Leys and Riesgo 2012).Their body features cells of varying complexity and fate (Musser et al. 2021), and only recently, major breakthroughs in sponge cell cultures were made (Conkling et al. 2019).Among the focal aspects of research on sponges are their remarkable microbiota and chemical diversities (Thomas et al. 2016;Calado et al. 2022).
In the past, these 2 aspects complicated access to their genomes, as DNA extracted from sponges is contaminated by microbial DNA and by compounds binding DNA (Marshall and Barrows 2004) and potentially interfering with sequencing.Workarounds like producing a genome from DNA of thousands of larvae which naturally have a lower abundance of microbial symbionts (Srivastava et al. 2010) or whole-genome amplification (WGA) from single cells (Ryu et al. 2016) both yielded highly fragmented assemblies.Recent sequencing strategies (long reads, synthetic long reads, and/or Hi-C) yield chromosome-level assemblies, as in the case of Ephydatia muelleri (Kenny et al. 2020) as well as Petrosia ficiformis and Chondrosia reniformis (McKenna et al. 2021).However, 13 years after the first sponge genome, there were only 12 sponge genomes available, of over 9,500 described species of sponges (Fig. 1; Table 1) (de Voogd et al. 2023).
Geodia barretti Bowerbank 1858 (Fig. 2a) is a widespread North Atlantic deep-sea demosponge found in depths of 30-2,000 m (Cárdenas et al. 2013) and, thus, would represent 1 of the few genomes of a deep-sea animal.As a high microbial abundance (HMA) sponge, G. barretti hosts an outstanding density and diversity of microbes (Fig. 2b) with an average of 3 × 10 10 microbes/cm 3 (Hoffmann et al. 2006;Leys et al. 2018) from over 400 prokaryotic amplicon sequence variants across 17 phyla (Radax et al. 2012;Steffen et al. 2022).The characterization as "HMA" (highly abundant and diverse microbiota) and "LMA" (lowly abundant and species-poor microbiota) sponges has been recognized since the 1970s and is a species specific attribute, but its significance to the organisms is not yet fully elucidated (Vacelet 1975;Hentschel et al. 2003).This microbiota partly accounts for its richness in natural products (Erngren et al. 2021;Steffen et al. 2021) with still many unknown bioactive metabolites.Finally, G. barretti is a key species of sponge grounds, deep-sea habitats characterized by mass accumulation of sponges (Klitgaard and Tendal 2004;Cárdenas et al. 2013).Sponge grounds are considered vulnerable marine ecosystems (VMEs) and as such G. barretti is part of the VME indicator species list (ICES 2019).Its physiology has been extensively studied to understand the many ecosystem services it provides (Cárdenas and Rapp 2013;Koutsouveli, Cárdenas, Santodomingo, et al. 2020;Maier et al. 2020;Rooks et al. 2020;Bart et al. 2021) as well as to investigate its resilience to human activities (Kutti et al. 2015;Colaço et al. 2022)

DNA extraction and sequencing
DNA extraction for whole-genome sequencing was impeded by rapidly degrading DNA and chemical contamination coisolated with the DNA.We obtained the best results by macerating flashfrozen tissue in 0.2 M EDTA pH 8.0 and straining the dissociated cells through a filter (40 µm, Nalgene) before extracting DNA using a traditional chloroform/isoamyl alcohol partitioning protocol (Dharamshi et al. 2022).The resulting DNA was assayed by NanoDrop (passing range: 260/280, 1.8-2.2;260/230, 2.0-2.2) and denaturing gradient gel electrophoresis (Fig. 2c), and suitable fractions were sequenced on PacBio (RSII and Sequel) and Illumina (HiSeqX) platforms, all performed by the SNP&SEQ Technology Platform, SciLifeLab Uppsala, Sweden.

RNAseq data preparation
Poly-A selected RNA-sequencing (RNAseq) data of UPSZMC 184975 and 6 other G. barretti individuals were used for identification of sponge contigs and gene annotation (Koutsouveli, Cárdenas, Santodomingo, et al. 2020).For identification of sponge contigs in the subsequent whole-genome assembly, the RNAseq data were de novo assembled using Trinity (Haas et al. 2013) with default parameters.For annotation, the same RNAseq data were reassembled with fastp (Chen et al. 2018), hisat2 (Kim et al. 2015), and StringTie (Pertea et al. 2015) with the genome assembly in order to avoid a high number of small unsupported genes and erroneous transcripts.

Assembly
PacBio data were assembled with Flye using the "-meta" flag (Kolmogorov et al. 2020) and polished with the Illumina short reads using one round of Pilon (Walker et al. 2014).
To remove contamination (i.e.nonsponge contigs/scaffolds) and only keep sponge contigs, 2 different strategies were combined: taxonomic identification of contigs and contig coverages with RNAseq data.First, taxonomic classification of the contigs was performed by both contigtax (https://github.com/NBISweden/contigtax, v0.5.9) and BlobTools (Laetsch and Blaxter 2017).In order to run the latter, the short reads were aligned to the assembly with BWA (Li and Durbin 2009); and using the assembled contigs, we performed both a blastn search (Altschul et al. 1990) against the nt database and a DIAMOND blastx search (Buchfink et al. 2015) against the nr database.The obtained BAM file and BLAST output files were used as input files for BlobTools.We kept all contigs annotated as "Eukaryota" by both contigax and BlobTools as a first noncontaminated set for the sponge assembly.Second, 3 of the de novo built transcriptomes were mapped on the full assembly using gmap (Wu et al. 2016).The contigs that on average for the 3 transcriptomes had at least 20% of their length mapped by transcripts were also added to the noncontaminated set of contigs.Coverage was calculated by mapping short and long reads to the genome with BWA and minimap2 (Li 2018), respectively.The coverage per position was extracted from the resulting BAM file using samtools depth (Li et al. 2009) and the average and median across all positions was calculated.

Annotation
All parts of the annotation workflow (annotation preprocessing, transcript assembly, ab initio training, and functional annotation) were performed with 4 Nextflow pipelines from https://github.com/NBISweden/pipelines-nextflow.For annotation preprocessing, all nucleotides were changed to uppercase to prevent interpretation as repeats.The Ns at start or end of contigs are trimmed out to avoid problems when submitting data to public archives.Repeats were masked using RepeatModeler package.Candidate repeats modeled by RepeatModeler were vetted against our protein set (minus transposons) to exclude any nucleotide motif stemming from low-complexity coding sequences.From the repeat library, identification of repeat sequences present in the genome was performed using RepeatMasker (Smit et al. 2013) and RepeatRunner (Smith et al. 2007).
The genome was annotated using the MAKER package (Holt and Yandell 2011), and both evidence-based (using the transcriptomes and gene sets) and ab initio approaches (optimize_augustus.pl)were performed in several (3) iterations until the number of false positive predictions clearly decreased.The annotation quality is given by the annotation edit distance (AED) provided in the gff file.The functional annotation was performed with an in-house pipeline (https://github.com/NBISweden/pipelines-nextflow/tree/master/subworkflows/functional_annotation) based on BLAST and InterPro Scan.tRNAs were annotated by tRNAscan, and only those with an AED < 1 are reported.The mtDNA was recovered in a single contig and annotated by lifting over annotations from the mtDNA of Geodia neptuni (AY320032) with Geneious v 8.1.9at a similarity threshold of 75%; ORFs were annotated using genetic code 4. The annotations were adapted to EMBL format using a gff conversion tool (Norling et al. 2018).

Comparative genomics
For placing this sponge genome in its context, all other available sponge genome assemblies known to us were downloaded from NCBI GenBank or their respective repositories (Table 1).After initial analysis, 6 assemblies were excluded.Six genomes of Aplysina aerophoba under BioProject PRJEB24804 appear uncharacteristically small (largest assembly is 3 Mb) and mainly consist of microbial symbiont sequences (D.Sipkema, pers.comm.).In addition, the alternate pseudohaplotype of P. ficiformis (GCA_947044245.1)was excluded as it is only 12% complete according to BUSCO, as was C. reniformis (GCA_947172445.1).For Amphimedon queenslandica, there are currently 3 genome assemblies available: the original first sponge genome "v1.0" (GCA_000090795.1)(Srivastava et al. 2010), superseded by a second version "v1.1" (GCA_000090795.2),followed by a third assembly from the same research group "UQ_AmQuee_3" (GCA_016292275.1)but from a different/unrelated specimen and sequencing project.Although there is no publication for this third genome assembly, it was included here since it seemed highly complete.
Genome completeness was assessed with BUSCO and meta-zoa_odb10 (Simão et al. 2015).For identification of biosynthetic gene clusters (BGCs), genomes were vetted by antiSMASH (bacterial version) using prodigal-m as gene finder with default parameters (Blin et al. 2021).All figures were created in R (R Core Team 2016) using the packages within tidyverse.

Species
due to 16S rRNA primer biases (Steinert et al. 2017;Steffen et al. 2022).In total, the mapped reads covered 66 phyla, showing the complexity of microbial communities residing within marine sponges.Similar results are observed with contigtax.At the super kingdom level, more than half of the contigs were annotated as "unclassified," a third as bacteria, and only 12% as Eukaryota.Among the contigs annotated as bacteria, the majority (30%) belong to the phylum Candidatus Poribacteria.Other major categories are Proteobacteria (19%) and Chloroflexi (14%), 2 phyla that have been identified as frequent and abundant symbionts in sponge microbiota (Thomas et al. 2016) including in G. barretti (Radax et al. 2012;Steinert et al. 2017;Steffen et al. 2022).Among the eukaryotic contigs, 72% are annotated as Porifera.For comparison, Supplementary Figs. 3 and 4 contain the final assembly evaluated with BlobTools showing a decrease in contribution of foreign sequences to the genome assembly.
The genome assembly has a length of 144,789,364 bp (144.7 Mb) across 4,535 contigs.There are 110 Ns in the genome, and it has a GC content of 49.3%.According to BUSCO (v.5.3.1 metazoa_odb10), the genome is 71.5% complete with 3.6% duplicates of single-copy orthologs.N50 length is 48,446 (contig size ranging from minimum 1,002 bp and median 22,605 bp to maximum 495,233 bp).For comparison, the haploid genome size was estimated to be 127 Mb based on a C-value of 0.13 pg measured by Feulgen image analysis densitometry (FIAD) (Ryan Gregory and Darren Kelly, pers.comm.).The excess sequences could be due to noncollapsed heterozygous regions in the sponge genome and/or incorporation of microbial symbiont sequences in the genome.As part of the genome assembly, we recovered the mitochondrial genome in 1 chromosome.The mtDNA was circular and had a length of 17,996 bp and the characteristic synteny of tetractinellid mtDNA (Plese et al. 2021): rnl-cox2-atp8-atp6-cox3-cob-atp9-nad4-nad6-nad3-nad4L-cox1-nad1-nad2-nad5-rrns.Beyond the sponge genome, the sequencing data are in fact metagenomic data, and we invite the use of it for exploration of the microbial and viral communities of G. barretti UPSZMC 184975, which was beyond the scope of our work.
In the assembly, RepeatMasker annotated 117,982 repeats with a total size of 26,945.75kb or 18.61% of the genome (mean 228.47 bp).RepeatRunner annotated 1,043 repeats with a total size of 631.47 kb or 0.44% of the genome (mean 605.43 bp).The difference in results is to be expected as the 2 programs are complementary.RepeatMasker identifies repeats based on similarity to known repeats, whereas RepeatRunner identifies highly divergent repeats.
The genome annotation contains 31,884 protein-coding genes.The BUSCO scores together with the number of annotated protein-coding genes suggest that protein-coding genes are well represented in the assembled sequence.There were 66,936 mRNAs (as there were several isoforms per gene) with an average of 7.2 exons per mRNA, an average exon length of 244 bp, and an average coding sequence length of 1,122 bp.Of those genes, 27,544 were functionally annotated, as were 59,664 of the mRNAs.The genome contained 156 tRNAs with an AED < 1.

Comparison with currently available sponge genomes
It is worthwhile noting that about half of the 12 previously published sponge genomes are not deposited in widely used databases such as NCBI GenBank or ENA but in other data repositories.Therefore, all currently valid download links are summarized in Table 1.To place our genome in its context, we summarized technical assembly metrics in Table 2 and biological metrics in Table 3 and visualize a subset in Fig. 3.
Assembly sizes in sponges range from 58 to 419 Mb (Table 2), which is in the range of genome sizes reported in the literature.Using FIAD and flow cytometry across a set of 75 sponge species, Tethya actinia and an unknown Dictyonellidae had the smallest genome with 39.1 Mb, while Mycale laevis 616.1-694.4Mb and Placospongia intermedia 528.1-782.4Mb had the largest genomes (Jeffery et al. 2013).In terms of difference between the size of genome assembly and genome size measured from cells, the assembly for Xestospongia testudinaria is almost 60% larger than the genome size estimated by flow cytometry (161.37 vs 258 Mb assembly).Indeed, both X. testudinaria and Stylissa carteri assemblies are hologenomes, and this excess sequence could indicate significant microbial contamination and/or high heterozygosity.
We selected a set of frequently used metrics to place the genome assembly of G. barretti in context with the other sponge genomes available (Table 2).The first set includes various metrics to express contiguity, i.e. the degree of fragmentation in the assembly.Ideally, the number of contigs should be the number of chromosomes (haploid), which is the case for the genomes of C. reniformis and P. ficiformis.The genome assembly of E. muelleri represents 23 chromosomes in 24 scaffolds but opted to also include a number of unplaced contigs, thus increasing the total number of contigs.The second set of metrics in Table 2 was produced by BUSCO, a tool approximating biological completeness of a genome assembly by assessing the presence or absence of near universal singlecopy genes (orthologs) (Simão et al. 2015).The numbers in this table may deviate from the values given in the original publications as there are pronounced differences between different versions of BUSCO and different reference gene sets.For comparability, we computed the metrics again, all with the same version of the program (v. 5.3.1, metazoa_odb_10, 2021-02-24).Overall completeness (C) frequently used to describe assemblies ranged from 48 to 89%.Generally, while higher values are better, some genes may truly be absent thus not allowing for a 100% completeness.Genes (BUSCOs) may also escape detection due to technical limitations, mainly due to gene prediction difficulties for highly derived lineages.The 2 chromosome-level assemblies for instance score "only" 71.1 and 82.3% complete.This "completeness" is the sum of single (S) and duplicate (D) BUSCOs identified.The number of duplicate BUSCOs is an important parameter as it can be indicative of whether diploid genomes were correctly and consistently collapsed to a haploid assembly, which can be an issue in organisms with high heterozygosity and/or when assembling long reads (Guiglielmoni et al. 2021).The highest number of duplicate single-copy orthologs (12.6%) was detected in E. muelleri assembly.
For most of the assemblies, DNA isolated from the sponges was sufficient.However, in case of single larvae (O.pearsei) and cells (S. carteri and X. testudinaria), WGAs were performed prior to sequencing.Whether to have sequencing reads derived from (1) a single individual as compared to (2) several individuals or (3) WGA DNA matters as differences due to individual variation and amplification errors can lead to more fragmented assemblies.However, these strategies are a trade-off to avoid overwhelming microbial contamination in the sequencing reads (Fig. 2b).One of the crucial aspects of sponge biology is their pervasive association with microbial symbionts (prokaryotes: bacteria, and archaea), especially in their sessile adult stage (Vacelet 1975;Leys et al. 2018).Recognizing this association has led to the classification of sponge species as HMA or LMA sponges.Typically, HMA also implies high microbial diversity and vice versa.Larvae also contain microbial symbionts, albeit to a greatly reduced extent (Björk et al. 2019).These microbial symbionts affect sponge genome sequencing in several ways.Extracting DNA from sponges inevitably leads to contamination with microbial DNA.This can be a challenge in the assembly process and lead to contamination and fragmentation in the resulting genome.Identification of microbial sequences can be difficult as there is a lack of both bona fide sponge sequences in databases as well as genomes of deep-sea microbes.At the same time, a locus with similarity to microbial sequences can also originate from horizontal gene transfer (HGT), which was previously shown in the A. queenslandica genome (Conaco et al. 2016).Conversely, bacterial genes coding for eukaryote-like proteins, present in sponge microsymbionts (Reynolds and Thomas 2016), could potentially be mistaken for true sponge genes.This means that in silico decontamination is challenging.Overall, the most successful contiguous assemblies have so far leveraged Hi-C (C.reniformis, P. ficiformis, and E. muelleri) and/or some form of (synthetic) long reads (G.barretti, O. minuta, and T. wilhelma).
Terpene BGCs have been lately discovered in several genomes of octocoral species (Burkhardt et al. 2022;Scesa et al. 2022) and sponges (Wilson et al. 2023).We therefore analyzed all 13 genomes with antiSMASH to identify possible BGCs: results ranged from 1-2 BGCs (A. queenslandica and C. reniformis) to 133 BGCs (X.testudinaria) (Table 3).While we did not determine whether these gene clusters originate from contamination or genuine cases of HGT, they highlight once more the close association of the sponges with their microbes and raise the possibility that BGCs are relatively widespread in sponge genomes.Interestingly, 1-10 BGCs were still found in chromosome-level assemblies (C.reniformis, P. ficiformis, and E. muelleri), which hints at genuine cases of BGC transferred to the sponge, deserved to be further studied.In the G. barretti genome, 21 BGCs were detected (Supplementary Table 1): 1 arylpolyene, 1 betalactone, 1 nonribosomal peptide synthetase (NRPS), 5 NRPS-like, and 13 terpene clusters.This is the third highest number only exceeded by L. baikalensis with 69 and X. testudinaria with 133 BGCs (Table 2).A full list of all BGCs is included in Supplementary Table 1.Overall, there seems to be no correlation between HMA/LMA status and the number of BGCs detected.It is worth noting that the majority of BGCs in L. baikalensis and X. testudinaria start at the first position of a short contig which indicates that the BGC is likely truncated and/or incomplete.The high numbers of BGCs for these 2 species might thus be an overestimation due to counting the same BGC multiple times.
To conclude, the G. barretti genome is the first genome for the Tetractinellida order, the second most speciose order of demosponges with over 1,100 species (de Voogd et al. 2023).All 4 deep-sea sponge genomes published so far are glass sponges (class Hexactinellida) (Francis et al. 2023;Santini et al. 2023;Schultz et al. 2023), with 3 of them published after submission of this study; the G. barretti genome is the first deep-sea genome of a demosponge.This genome will firmly establish the North Atlantic G. barretti as a prominent deep-sea sponge species for future studies, allowing the generation of new hypotheses about multicellularity, immunity, chemistry, and symbiont/cell recognition and interaction, which could be tested in vitro, thanks to the successful G. barretti cell line and CRISPR/Cas12a gene-editing system (Hesp et al. 2020(Hesp et al. , 2023)).

Data availability
The data associated with this study are deposited at the European Nucleotide Archive
Fig. 3. a) L90, number of contigs that cumulatively cover 90% of the assembly and b) the corresponding BUSCO completeness (C).The tail indicates the number of duplicate BUSCOs.c) The N50 compared to the number of Ns in the assemblies.

Table 2 .
Various genome assembly metrics.(G)50, size of the smallest contig that, with all larger contigs, sums up to over half of the assembly length; L90, number of contigs to span 90% of the genome assembly; BUSCO, total BUSCOs searched are 954; values given in the table are the percentages of the total BUSCOs that were identified respectively in each genome.C, complete BUSCOs; S, complete and single-copy BUSCOs; D, complete and duplicated BUSCOs; F, fragmented BUSCOs; M, missing BUSCOs.A. queenslandica 1 is GCF_000090795.2; A. queenslandica 2 is GCA_016292275.1.Species listed are in alphabetical order.In bold the genome reported herein. N

Table 3 .
Number of BGCs according to antiSMASH and biological classification (type) as "HMA" or "LMA" species.% was calculated from individual nucleotide counts.The numbers of genes/CDS and the percentage of repetitive sequences were taken from the respective publications.Species are listed in alphabetical order.In bold the genome reported herein. GC