SPIRE: a Searchable, Planetary-scale mIcrobiome REsource

Abstract Meta’omic data on microbial diversity and function accrue exponentially in public repositories, but derived information is often siloed according to data type, study or sampled microbial environment. Here we present SPIRE, a Searchable Planetary-scale mIcrobiome REsource that integrates various consistently processed metagenome-derived microbial data modalities across habitats, geography and phylogeny. SPIRE encompasses 99 146 metagenomic samples from 739 studies covering a wide array of microbial environments and augmented with manually-curated contextual data. Across a total metagenomic assembly of 16 Tbp, SPIRE comprises 35 billion predicted protein sequences and 1.16 million newly constructed metagenome-assembled genomes (MAGs) of medium or high quality. Beyond mapping to the high-quality genome reference provided by proGenomes3 (http://progenomes.embl.de), these novel MAGs form 92 134 novel species-level clusters, the majority of which are unclassified at species level using current tools. SPIRE enables taxonomic profiling of these species clusters via an updated, custom mOTUs database (https://motu-tool.org/) and includes several layers of functional annotation, as well as crosslinks to several (micro-)biological databases. The resource is accessible, searchable and browsable via http://spire.embl.de.


Introduction
Life on Earth is dominated by microbes: bacteria, archaea and small eukaryotes shape our world by driving biogeochemical cycles across ecosystems (1), they enable macroscopic life as plant and animal symbionts (2), and they represent by far the greatest biodiversity among known life (3). Yet most of this diversity remains biological 'dark matter' (4): although meta'omic techniques enable their study directly from sequencing data, the vast majority of microbes eludes laboratory cultivation and only a small fraction of the functional space encoded by microbial genes has been characterized (5,6). While sampling efforts have increased exponentially and generated petabytes of data in recent years (7), most major microbial habitats remain understudied to the extent that almost every newly sequenced metagenome adds 'novel' species (as inferred from metagenome-assembled genomes, MAGs) and thousands of 'novel' genes of unknown function to the census (8).
The bulk of metagenomic data is generated in individual studies to address specific research questions. Heterogeneity in sample preparation (9), sequencing protocols and bioinformatic processing workflows (10,11) complicates comparisons of findings across studies. Several initiatives have sought to integrate and consolidate datasets by re-processing them using consistent pipelines. For example, QIITA (12), MGnify (7) or the Microbe Atlas Project (https://microbeatlas.org/) host millions of amplicon samples, whereas other projects, such as curatedMetagenomicData (13), GMrepo (14) and the Ocean Microbiomics Database (15), focus on taxonomic and functional profiles of human-associated or ocean metagenomes. Large MAG catalogs for multiple biomes are hosted online as part of the DOE's IMG/M (16) and EBI's MGnify (7) resources. Moreover, the Genome Taxonomy Database (GTDB, 17) has advanced the field by consistently organizing both isolate genomes and quality-filtered MAGs into a common prokaryotic reference tree that guides standardized, phylogeny-informed taxonomies (18-20). The GTDB encompasses 85 205 species-level genome clusters across 181 phyla (as of release r214, April 2023), two thirds of which are represented only by MAGs, while also providing widely used tools for genome quality control (21) and taxonomic classification (22). Overall, existing resources focus either on providing large gene or genome catalogs, on functional and taxonomic profiling, or on harmonizing contextual data given heterogeneous data submission and annotation practices, and they are often restricted to individual microbial habitats or cordon data on different habitats off into distinct subsets.
Here we introduce SPIRE, a Searchable, Planetary-scale, Integrated mIcrobiome REsource to study microbial diversity and function at global habitat, geographical and phylogenetic scales. As detailed below, SPIRE version 1 encompasses 99 146 consistently processed whole-genome shotgun metagenomic samples from 739 distinct studies, integrated across environments and amended with manually curated contextual data, based on a newly developed lightweight 'microntology' of 92 terms describing microbial habitats and lifestyles. SPIRE combines 1.16 million newly constructed MAGs of medium or high quality (23) with the 907k high-quality reference genomes in proGenomes3 (24), clustered into 133 402 species-level genome clusters, 78 804 of which are unclassifiable at species level using current tools (22). Species clusters are profilable using mOTUs (25) via an updated custom database, and pre-computed taxonomic profiles across all 99k metagenomic samples will be released as part of the resource. SPIRE further comprises 35 billion metagenomically called open reading frames (ORFs) with various layers of functional annotation, linked to clusters in the Global Microbial Gene Catalogue (GMGC, 8). SPIRE provides consistent integration of these heterogeneous data modalities and is designed to interoperate with other (micro-)biological resources, such as proGenomes (24, https://progenomes.embl.de), the GMGC (8, https://gmgc.embl.de), eggNOG (26, http://eggnog6.embl.de) and metaMap (https://metamap.biobyte.de/), among others. The resource can be accessed, browsed, and searched via https://spire.embl.de.
Figure 1. Overview of sampled habitats in SPIRE, as a subset of annotated 'microntology' terms (see Table S1). Microntology terms are assigned using a 'multi-tag' system, meaning that individual samples can be annotated with multiple terms of varying granularity and redundantly within a flat hierarchy (e.g. a human fecal metagenome will be annotated as 'host-associated, animal host, mammalian host, human host, digestive tract, intestine', whereas a mangrove-associated sample carries tags from both the 'aquatic' and 'terrestrial' term space, while moreover possibly being annotated as 'host-associated, plant host'). Shown is the total number of samples annotated to a subset of microntology terms under this system.

Metagenome collection and dataset curation
The core dataset underlying SPIRE was defined using a semi-automatic process, combining three data sources: (i) samples in the European Nucleotide Archive (ENA) meeting the criteria 'library_source = METAGENOMIC AND library_strategy = WGS AND instrument_platform = ILLUMINA AND base_count >= 10^9 AND average read length >= 100' were selected from all projects where at least 20 samples satisfied the above criteria as of Sep 30th 2022; (ii) metagenomic samples available via the JGI's IMG/M resource (27) on Sep 30th 2019 (to comply with JGI data policies and embargo periods); (iii) manually selected 'allowlisted' studies of particular interest (e.g. providing data on exotic environments). For the resulting list, ENA project accessions were manually matched to publications where possible; in case of data submitted by the JGI, where each sample is associated with a distinct project accession, 'studies' were defined based on matched publications and as consistent groups based on sample metadata provided via IMG/M.
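The ENA selection rule in (i) can be illustrated with a minimal sketch. It assumes that run-level metadata has already been loaded into plain dictionaries; the keys `project`, `sample` and `average_read_length` are hypothetical names chosen for this illustration, and the code is not the pipeline actually used for SPIRE.

```python
from collections import defaultdict

# Thresholds taken from the selection criteria stated in the text.
MIN_BASES = 10**9
MIN_READ_LENGTH = 100
MIN_SAMPLES_PER_PROJECT = 20


def meets_criteria(rec):
    """Check one metadata record against the per-sample criteria.

    The dictionary keys are assumptions made for this sketch.
    """
    return (
        rec["library_source"] == "METAGENOMIC"
        and rec["library_strategy"] == "WGS"
        and rec["instrument_platform"] == "ILLUMINA"
        and rec["base_count"] >= MIN_BASES
        and rec["average_read_length"] >= MIN_READ_LENGTH
    )


def select_samples(records):
    """Return {project: [samples]} for projects with enough passing samples."""
    by_project = defaultdict(list)
    for rec in records:
        if meets_criteria(rec):
            by_project[rec["project"]].append(rec["sample"])
    return {
        project: samples
        for project, samples in by_project.items()
        if len(samples) >= MIN_SAMPLES_PER_PROJECT
    }
```

Note that the project-level cutoff is applied only after the per-sample filter, so a project with many shallow samples is excluded even if it is large.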
The metagenomic sample set was further filtered and curated by (i) removing amplicon and isolate genome sequencing datasets erroneously annotated as shotgun metagenomes; (ii) identifying and removing erroneously submitted datasets (e.g. where both mates in 'paired end' data were identical); (iii) identifying and removing duplicates (submitted under distinct project or sample accessions); (iv) removing samples from controlled experimental setups (e.g. laboratory mice, pathogen challenges or defined in vitro communities); (v) flagging special cases such as microcosms, paleobiological samples or pre-enriched samples; (vi) resolving misfits with the European Nucleotide Archive (ENA) and Sequence Read Archive (SRA) data model, e.g. if distinct biological samples were erroneously submitted under the same biosample accession, but distinct experiment or run accessions; (vii) identifying and combining technical replicates (distinct experiment accessions) for the same biological sample. For the resulting list, raw sequencing data was downloaded from the ENA.
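Two of these curation steps lend themselves to a short illustration: step (ii), spotting 'paired end' data whose mates are identical, and step (vii), grouping technical replicates by biological sample. The sketch below is one plausible way to express them and is not the procedure used by the authors; the function names, the `biosample` and `experiment` keys and the `n_check` cutoff are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import islice


def mates_identical(fwd_reads, rev_reads, n_check=1000):
    """Return True if the first n_check read pairs carry identical sequences.

    fwd_reads and rev_reads are iterables of sequence strings in file order.
    Checking only a prefix is an assumption made to keep the test cheap.
    """
    pairs = list(islice(zip(fwd_reads, rev_reads), n_check))
    return bool(pairs) and all(fwd == rev for fwd, rev in pairs)


def group_technical_replicates(runs):
    """Map each biosample accession to the sorted experiment accessions under it.

    Biosamples with more than one experiment are candidates for merging.
    """
    grouped = defaultdict(set)
    for run in runs:
        grouped[run["biosample"]].add(run["experiment"])
    return {biosample: sorted(exps) for biosample, exps in grouped.items()}
```

Such automated checks can only flag candidates; cases like step (vi), where distinct biological samples share one biosample accession, still require the manual inspection described above.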
Following these steps, the final dataset in SPIRE comprises 99 146 metagenomic samples across 739 distinct studies.

Curation of contextual data and overview of sampled environments
Contextual data for each metagenomic sample was sourced (i) via annotation fields in ENA, (ii) via IMG/M metadata tables where applicable and (iii) directly from matched publications. Information was consolidated into common fields (e.g. latitude and longitude data were manually harmonized across different submitted formats). All samples were manually annotated against a newly developed 'microntology' (see Table S1), a shallow and lightweight ontology of 92 terms to describe microbial habitats and lifestyles, crosslinked to terms in established resources such as the EnvO (28) or UBERON (29) ontologies. SPIRE sample annotation uses a 'multiple tag' system, meaning that each sample is described using a combination of concurrent tags, rather than one specific term in a (deep) hierarchy. This makes annotation more flexible while remaining compatible with established ontologies. As a result, for example, 68% of the ∼100k samples in SPIRE are annotated as 'host-associated' (66.5% as animal-associated, 56% as human-associated, 1.5% as plant-associated); 17% are aquatic samples (including 7.6% marine and 5.5% fresh water); 13.5% are terrestrial (including 6.4% soil samples); 10.3% are from anthropogenic or human-impacted environments (including 6.6% from built environments); see Figure 1 for details. Because samples can carry several tags, these percentages overlap and need not sum to 100%. Moreover, data included in SPIRE cover pole-to-pole latitudes, with samples from ∼200 countries and territories.
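The 'multiple tag' scheme can be pictured as a flat set of terms per sample. The following sketch uses made-up sample identifiers and a handful of tags quoted in this article; it only illustrates why per-term fractions overlap and how samples can be selected by tag combination, and it does not reflect how SPIRE stores annotations internally.

```python
from collections import Counter

# Toy annotations: each sample carries a flat set of concurrent tags.
# Sample identifiers are hypothetical.
samples = {
    "sample_A": {"host-associated", "animal host", "mammalian host",
                 "human host", "digestive tract", "intestine"},
    "sample_B": {"aquatic", "terrestrial", "host-associated", "plant host"},
    "sample_C": {"aquatic", "marine"},
}


def tag_counts(samples):
    """Count how many samples carry each tag."""
    counts = Counter()
    for tags in samples.values():
        counts.update(tags)
    return counts


def tag_fractions(samples):
    """Fraction of samples per tag; fractions overlap, so they can sum to > 1."""
    n_samples = len(samples)
    return {tag: n / n_samples for tag, n in tag_counts(samples).items()}


def with_all_tags(samples, required):
    """Identifiers of samples annotated with every tag in `required`."""
    required = set(required)
    return sorted(sid for sid, tags in samples.items() if required <= tags)
```

For instance, `with_all_tags(samples, {"aquatic", "terrestrial"})` selects only the mangrove-like `sample_B`, a query that a single-term deep hierarchy would not express directly.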
All SPIRE MAGs were taxonomically classified using GTDB-Tk v2.11 against release r207 (22), and a consensus taxonomy for species clusters at each taxonomic level was assigned based on a majority vote, with manual resolution of a few remaining conflicting labels. As noted above, 78 804 of the 133 402 species-level clusters could not be classified at species level (Figure 2). This large proportion of 'novel' species relative to the GTDB may in part be due to a conservative parametrization of the GTDB-Tk classifier (favoring specificity over sensitivity), but it indicates that SPIRE covers a vast range of previously uncharacterized and undescribed microbial diversity. Notably, 28 856 SPIRE clusters unclassified at species level contain more than a single genome.
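A per-rank majority vote of the kind described here might look as follows. This is a sketch under stated assumptions, not the authors' implementation: it requires a strict majority (more than half of the genomes in a cluster), treats an empty label as just another vote, and returns the ranks without a majority so that they can be resolved manually. The rank list and function name are chosen for illustration.

```python
from collections import Counter

RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def consensus_taxonomy(lineages):
    """Majority-vote consensus per rank for the genomes of one species cluster.

    lineages: non-empty list of per-genome lineages, each a list with one
    label per entry in RANKS. Returns (consensus, conflicts), where consensus
    maps rank -> winning label (or None) and conflicts lists the ranks
    lacking a strict majority.
    """
    consensus = {}
    conflicts = []
    for i, rank in enumerate(RANKS):
        votes = Counter(lineage[i] for lineage in lineages)
        label, n_votes = votes.most_common(1)[0]
        if 2 * n_votes > len(lineages):  # strict majority (assumed threshold)
            consensus[rank] = label
        else:
            consensus[rank] = None
            conflicts.append(rank)
    return consensus, conflicts
```

Because each rank is voted on independently, a production version would likely also verify that the winning labels form a consistent lineage from domain to species.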

Functional annotation
Detection of orthologs and inference of putative function for metagenomically called ORFs (see above) were performed using eggNOG-mapper v2 (44,45). ORFs were further annotated for putative roles in antibiotic resistance using DeepARG (46).

Database design
SPIRE relies on a MongoDB database as its foundation. Within this system, a repository of samples and MAGs and their attributes is stored. These data can be conveniently accessed through the web-based interface. Structured data, such as annotations of genes and genomes, are stored in a relational database management system to allow complex and time-efficient queries.
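The split between document-style records and relational annotation tables can be sketched as below. Everything in this example is hypothetical: the field names, the table layout and the use of an in-memory SQLite database merely stand in for whatever schema and relational system SPIRE actually uses, which the article does not specify.

```python
import sqlite3

# Document-style record for a MAG (a stand-in for an entry in a document store).
# All field names and values are illustrative assumptions.
mag_doc = {
    "genome_id": "mag_0001",
    "sample_id": "sample_A",
    "tags": ["aquatic", "marine"],
}


def build_annotation_db(rows):
    """Create an in-memory relational table of per-gene annotations.

    rows: iterable of (gene_id, genome_id, orthologous_group) tuples.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE gene_annotation ("
        "gene_id TEXT PRIMARY KEY, "
        "genome_id TEXT NOT NULL, "
        "orthologous_group TEXT)"
    )
    # An index on genome_id keeps per-genome lookups fast as the table grows.
    con.execute("CREATE INDEX idx_genome ON gene_annotation (genome_id)")
    con.executemany("INSERT INTO gene_annotation VALUES (?, ?, ?)", rows)
    con.commit()
    return con


def genes_per_group(con, genome_id):
    """Count annotated genes per orthologous group for one genome."""
    cur = con.execute(
        "SELECT orthologous_group, COUNT(*) FROM gene_annotation "
        "WHERE genome_id = ? GROUP BY orthologous_group "
        "ORDER BY orthologous_group",
        (genome_id,),
    )
    return cur.fetchall()
```

The `genome_id` field links the two sides: a record such as `mag_doc` identifies the genome, and aggregate questions about its genes are answered by the relational table.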

Website
SPIRE is accessible, browsable, searchable and downloadable via spire.embl.de. The main access modes are by habitat/sample (searching based on accessions or metadata tags), by taxon (based on clade names and species-level clusters) and by genome (individual genomes within clusters). These modes are inter-accessible (e.g. browsing from a sample to a specific taxon observed therein, for which then multiple genomes can be accessed) and at each level, link-outs to relevant independent or third party databases are provided. We invite user contributions, suggestions for improvements and bug reports under spire.embl.de/contribute.

Outlook
Given the exponential growth of publicly available metagenomic data, we anticipate biennial updates of the underlying data for SPIRE. We will continue to develop and update the processing pipeline to address rising computational demands and integrate novel or improved tools. Moreover, we will seek to extend the range of available functional annotations at gene and genome level, within the limits of computational scalability. Finally, and most importantly, we will continue to further integrate SPIRE with other resources such as proGenomes (24), eggNOG (26), the GMGC (8) and other ongoing efforts.

Discussion
SPIRE provides the largest sets of consistently processed metagenomes, newly generated MAGs and profilable microbial species clusters to date. Combined with a high degree of curation and integration of various data modalities (MAGs, contigs, genes, profiles, etc.), SPIRE is the most comprehensive resource available to study microbial diversity and function. Covering a broad range of habitats and geography, SPIRE enables true 'planetary-scale' analyses of microbiomes across various environments, including so far understudied ones. At the same time, SPIRE encompasses large amounts of 'novel', previously undescribed microbial diversity both at the gene and genome level. We are confident that SPIRE will enable and simplify a wide range of analyses for end users, ranging from the characterization of individual taxa or gene clusters of interest against a global data canvas, to truly 'planetary-scale' studies of microbial life across habitats and phylogeny.

Figure 2. Representation of taxonomic groups covered in SPIRE. Shown are the total number of species clusters (top) and total number of genomes (bottom) for the largest 25 bacterial and largest 15 archaeal phyla represented in SPIRE. Orange hues indicate clusters and genomes of isolates, as downloaded from proGenomes3 (progenomes.embl.de; 'isolate only'). Blue hues indicate clusters and genomes introduced in SPIRE ('MAGs only'). Green indicates species clusters that contain both isolate genomes and MAGs. See Supplementary Table S2 for taxonomic classifications of all species clusters included in SPIRE.