The sponge microbiome project

Abstract

Marine sponges (phylum Porifera) are a diverse, phylogenetically deep-branching clade known for forming intimate partnerships with complex communities of microorganisms. To date, 16S rRNA gene sequencing studies have largely utilised different extraction and amplification methodologies to target the microbial communities of a limited number of sponge species, severely limiting comparative analyses of sponge microbial diversity and structure. Here, we provide an extensive and standardised dataset that will facilitate sponge microbiome comparisons across large spatial, temporal, and environmental scales. Samples from marine sponges (n = 3569 specimens), seawater (n = 370), marine sediments (n = 65) and other environments (n = 29) were collected from different locations across the globe. This dataset incorporates at least 268 different sponge species, including several yet unidentified taxa. The V4 region of the 16S rRNA gene was amplified and sequenced from extracted DNA using standardised procedures. Raw sequences (total of 1.1 billion sequences) were processed and clustered with (i) a standard protocol using QIIME closed-reference picking resulting in 39 543 operational taxonomic units (OTU) at 97% sequence identity, (ii) a de novo clustering using Mothur resulting in 518 246 OTUs, and (iii) a new high-resolution Deblur protocol resulting in 83 908 unique bacterial sequences. Abundance tables, representative sequences, taxonomic classifications, and metadata are provided. This dataset represents a comprehensive resource of sponge-associated microbial communities based on 16S rRNA gene sequences that can be used to address overarching hypotheses regarding host-associated prokaryotes, including host specificity, convergent evolution, environmental drivers of microbiome structure, and the sponge-associated rare biosphere.


Quality-filtered, demultiplexed fastq files were processed using the default closed-reference pipeline from QIIME v. 1.9.1 (QIIME, RRID:SCR_008249). Briefly, sequences were matched against the GreenGenes reference database (v. 13_8, clustered at 97% similarity). Sequences that failed to align (e.g. chimeras) were discarded, which resulted in a final number of 300,140,110 sequences. Taxonomy [...]

De-noising using Deblur:

Recently, sub-OTU methods that allow views of the data at single-nucleotide resolution have become available. One such method is Deblur [27], which is a denoising algorithm for identification of actual bacterial sequences present in a sample. Using an upper bound on the PCR and read-error rates, Deblur processes each sample independently and outputs the list of sequences and their frequencies in each sample, enabling single-nucleotide resolution. For creating the deblurred biom table, quality-filtered, demultiplexed fasta files were used as input to Deblur using a trim length of 100 and min-reads of 25 (removing sOTUs with < 25 reads total in all samples combined). Taxonomy [...]

The dataset covers 4033 samples with a total of 1,167,226,701 raw sequence reads. These sequence reads clustered into 39,543 OTUs using QIIME's closed-reference processing, 518,246 OTUs from de novo clustering using Mothur (not filtered for OTU abundances), and 83,908 sOTUs using Deblur (with a filtering of at least 25 reads total per sOTU). We recommend that data users [...]

This dataset can be utilised to assess a broad range of ecological questions pertaining to host-associated microbial communities generally or to sponge microbiology specifically. These include: i) the degree of host-specificity, ii) the existence of biogeographic or environmental patterns, iii) the relation of microbiomes to host phylogeny, iv) the variability of microbiomes within or between host species, v) symbiont co-occurrence patterns, as well as vi) assessing the existence of a core sponge microbiome. An example of this type of analysis is shown in Figure 3, [...]

The deblurred dataset has also been uploaded to an online server [19] that supplies both HTML and REST-API access for querying bacterial sequences and obtaining the observed prevalence and enriched metadata categories where the sequence is observed (Figure 4).
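To make the Deblur min-reads filter described above concrete, the following is a minimal sketch of the rule "remove sOTUs with fewer than 25 reads in all samples combined". It is not Deblur's own code: the function name `filter_min_reads`, the nested-dictionary table layout, and the toy counts are assumptions made purely for illustration.

```python
from typing import Dict

# Hypothetical toy table: sOTU identifier -> {sample_id: read count}.
# Identifiers and counts are illustrative only, not taken from the dataset.
Table = Dict[str, Dict[str, int]]


def filter_min_reads(table: Table, min_reads: int = 25) -> Table:
    """Keep sOTUs whose read total across all samples is >= min_reads."""
    return {
        seq: counts
        for seq, counts in table.items()
        if sum(counts.values()) >= min_reads
    }


toy = {
    "sOTU_A": {"sponge_1": 20, "sponge_2": 10},  # total 30, kept
    "sOTU_B": {"sponge_1": 3, "seawater_1": 4},  # total 7, removed
    "sOTU_C": {"sediment_1": 25},                # total 25, kept
}

kept = filter_min_reads(toy)
print(sorted(kept))  # ['sOTU_A', 'sOTU_C']
```

Because the threshold applies to the total over all samples rather than per sample, an sOTU that is rare in every individual sample can still be retained if it is widespread.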
The colour code is based on phylum-level assignments, and the phylum Thaumarchaeota has been shown in the same colour for the RDP and SILVA databases. The term "Thaumarchaeota" is used as a class in the Greengenes taxonomy, where it belongs to the phylum "Crenarchaeota". We therefore think it is appropriate to keep the colours different, as they represent different taxonomic assignments.
We also now briefly comment on the use of different databases as follows: "The inclusion of these taxonomies is helpful considering that they have substantial differences as recently discussed [25]. For example, Greengenes and RDP have the taxon Poribacteria, a prominent sponge-enriched phylum [26], which did not exist in the SILVA version used."

Response:
There was a mix-up with the labels. We have fixed this to "Total samples present" and changed the label of the second pie chart to "Total sample number distribution". We have also modified the figure legend to clarify the meaning of the two pie charts.
My understanding is that the authors only consider the presence or absence of a particular OTU in the enrichment analysis. If possible, I would like to see an additional function for enrichment analysis based on the relative abundance of a particular OTU, since relative abundance provides another angle to evaluate the importance of the bacterial OTU in the community. This probably needs to be done on a dataset with normalized sequencing depth (i.e., subsampled to 10,000 reads).
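As an illustration of the kind of relative-abundance test discussed in this comment, the sketch below computes a Kruskal-Wallis statistic for one OTU across two groups of samples (for example, samples with a given metadata value versus all others). It is not the webserver's implementation: the function names, the restriction to two groups, and the toy relative abundances are assumptions. With two groups the statistic has one degree of freedom, so the chi-square p-value can be obtained from `math.erfc` without external libraries.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_two_groups(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Kruskal-Wallis H and p-value for two groups (chi-square, df = 1)."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[: len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (rank_sum_a**2 / len(a) + rank_sum_b**2 / len(b)) - 3 * (n + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    tie_sum = sum(t**3 - t for t in (pooled.count(v) for v in set(pooled)))
    correction = 1 - tie_sum / (n**3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h = max(h / correction, 0.0)  # guard against tiny negative rounding error
    p = math.erfc(math.sqrt(h / 2))  # chi-square survival function for df = 1
    return h, p


# Hypothetical relative abundances of one OTU (fraction of reads per sample).
sponge = [0.12, 0.08, 0.15, 0.10]
seawater = [0.01, 0.00, 0.02, 0.03]
h_stat, p_value = kruskal_two_groups(sponge, seawater)
print(round(h_stat, 3), round(p_value, 4))
```

Relative abundances are only comparable across samples when sequencing depth has been normalised, which is why the comment suggests subsampling to a fixed number of reads before such a test.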

Response:
We thank the referee for this useful suggestion. A non-parametric (Kruskal-Wallis) relative abundance test has been added to the webserver analysis. All category/value pairs significantly enriched in either of the two tests are now listed in the output, together with the corresponding p-values. Figure 4 and the Database