The global catalogue of microorganisms 10K type strain sequencing project: closing the genomic gaps for the validly published prokaryotic and fungi species

Abstract
Genomic information is essential for taxonomic, phylogenetic, and functional studies to comprehensively decipher the characteristics of microorganisms, to explore microbiomes through metagenomics, and to answer fundamental questions of nature and human life. However, large gaps remain in the available genomic sequencing information published for bacterial and archaeal species, and the gaps are even larger for fungal type strains. The Global Catalogue of Microorganisms (GCM) leads an internationally coordinated effort to sequence type strains and close gaps in the genomic maps of microorganisms. Hence, the GCM aims to promote research by deep-mining genomic data.


Introduction
Microorganisms are the most abundant organisms on Earth. The total diversity of prokaryotes may comprise up to 10^9 species [1]. For prokaryotic species, only new names published in the International Journal of Systematic and Evolutionary Microbiology (IJSEM) as an original article or in the "Validation Lists" are considered valid. As of the end of 2017, 15,081 valid prokaryotic species and subspecies were published compared to 12,981 at the end of December 2015 [2]. The number of publications increased by 1,088 in 2016 and by 1,012 in 2017. The most commonly accepted estimate of the number of existing fungal species is 2.27 million, as hypothesized by Hawksworth [3], while the number of species reported in the Dictionary of Fungi is only about 100,000.
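The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are our own, and "about 100,000" is treated as exactly 100,000, so the resulting fraction is approximate.

```python
# Counts of validly published prokaryotic species and subspecies, as quoted in the text.
names_end_2015 = 12_981
added_2016 = 1_088
added_2017 = 1_012
names_end_2017 = names_end_2015 + added_2016 + added_2017

# Fungal figures quoted in the text; the described count is approximate ("about 100,000").
estimated_fungal_species = 2_270_000
described_fungal_species = 100_000
described_fraction = described_fungal_species / estimated_fungal_species

print(names_end_2017)                # 15081, matching the end-of-2017 total
print(f"{described_fraction:.1%}")   # 4.4%
```

Under these figures, the yearly additions account exactly for the growth from 12,981 to 15,081 names, and fewer than 5% of the estimated fungal species have been described.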
The taxonomy of microorganisms, including their classification, identification, and nomenclature, has developed from morphological and metabolic characterization to incorporate numerical taxonomy based on phenetic analyses, chemotaxonomy, and finally polyphasic approaches that combine phenotypic, chemotaxonomic, genotypic, and genomic information. The IJSEM recently announced that since January 2018, authors of new taxa descriptions have been asked to provide genome sequence data with descriptions of novel taxa with their manuscript submissions [4]. As such, the taxonomy of microorganisms has entered the genomics era. A genomic "gold standard" for consistent microbial species definitions is urgently needed. To meet this end, a fundamental step is to sequence the type strains of validly published prokaryotic and fungal species.
In addition to the microorganisms that can be isolated and maintained in situ, the vast majority of microorganisms cannot yet be cultivated and thus are relatively poorly studied. Culture-independent approaches have been developed to investigate the compositions and functions of environmental and human microbiomes. However, accurate taxonomic and functional predictions based on metagenomic data depend on the availability of high-quality reference genomic data [5]. Therefore, sequencing the genomes of type strains of recognized microbial species will provide a taxonomic context for the analysis of metagenomic data, which commonly consist of short and incomplete sequences from complex environmental communities.
Microorganisms possess extensive genomic and metabolic diversity, which makes them ideal biotechnological tools. Decoding the full genomes of the type strains of various species in order to provide reference genomes will thus enable genes to be associated with functions, such as metabolic activity, virulence, antibiotic production or resistance, biomass deconstruction, cellulose degradation, agricultural nitrogen fixation, and the liberation of environmental phosphorus. Access to microbial genomic sequences will significantly contribute to future studies in microbial biology, ecology, and biochemistry, and these will, in turn, accelerate the discovery of new natural products and drugs [6].

The Current Status of the Strain Sequencing Project
Descriptions of prokaryotic species are based on living cultures, and one representative strain is designated as the nomenclatural "type." The IJSEM and the International Committee on Systematics of Prokaryotes require that the type strains of new species be deposited in at least two recognized collections in two countries. The type strains of 15,081 prokaryotic species are widely preserved in more than 130 culture collections. In mycology the trend is similar, although hitherto a fungal type specimen must be metabolically inactive.
Presently, the selection of strains for whole-genome sequencing is based predominantly on medical, ecological, or industrial importance, which often leads to bias in assessing phylogenetic relationships. There are several ongoing phylogenetic-based microbial sequencing projects. The Genomic Encyclopedia of Bacteria and Archaea (GEBA), led by the US Department of Energy Joint Genome Institute (US DOE JGI), has pioneered the partnership between culture collections and sequencing projects. The GEBA project published 1,003 whole-genome sequences of type strains in 2017 as the outcome of its first stage [7]. GEBA started the new stage of the project in 2015, which has a focus on the genomes of soil, plant-associated, and newly described type strains [8].
Similarly, the US DOE JGI, in collaboration with international research teams, conducted a 5-year project to sequence 1,000 fungal genomes from across the Fungal Tree of Life [9]. The overall plan is to fill in gaps in the Fungal Tree of Life by sequencing at least two reference genomes from each of the more than 500 recognized families of fungi.
Many type strains of microbial species remain unsequenced. Hence, the World Federation of Culture Collections (WFCC) and the World Data Centre for Microorganisms (WDCM) have initiated an international community-led project to sequence the full genomes of microbial type strains to support continued scientific discovery and biotechnological utilization. Considering the wide distribution of type strains, cooperation across the global culture collection community is essential for success.

Emerging Enhancements of Culture Collection in the Genomic Era
Efforts made by culture collection curators to explore the diversity of microorganisms and to harness their genes, properties, and products remain insufficient. While type collections are not always large or diverse, the genome sequencing efforts of the Global Catalogue of Microorganisms (GCM) will increase access to resources in smaller collections.
The WDCM is the data center of the WFCC and the Microbial Resources Centers Network. The WDCM is working on facilitating the application of cutting-edge information technology to improve the interoperability of microbial data, promote access and use of data, and coordinate international cooperation among culture collections, scientists, and other user communities. The first stage of the GCM project, started in 2012, focused on sharing strain catalogue data from culture collections [10]. The proposed type strain sequencing project is the continuation of the GCM project as its second stage, GCM 2.0.

Project Development and Current Progress
The project was first announced during the 14th International Culture Collections Conference, held in Singapore in July 2017 in conjunction with the International Union of Microbiological Societies (IUMS) conferences. Following that, in October 2017, a ceremony to launch the project was held in Beijing, China, attended by representatives from participating culture collections.
It is expected that the project will be completed within 5 years, including a pilot stage in the first year. GCM 2.0 includes two core subprojects: sequencing 10,000 bacterial and archaeal type strains and sequencing a selection of fungal type strains. It will also embrace several satellite projects on specific scientific targets. GCM 2.0 is coordinated by a steering committee and five interlinked working groups: Bacteria Selection, Fungi Selection, Standard Operational Procedures, Databases, and Intellectual Property Rights and Legal Issues.
The project has established standard operational procedures for DNA extraction, sample submission, sequencing, and data processing to ensure that all genetic resources, data, and metadata associated with type strains are appropriately obtained, recorded, and stored. A project proposed by the WDCM, "AWI 20170: Specification on Data Integration and Publication in Microbial Resource Centers," which would meet International Organization for Standardization standards, is under development. The raw data and annotation results generated from this project will be published on the GCM portal. Following the norms established for genome projects by the Bermuda Principles and the Fort Lauderdale agreement, the resulting genomic data will also be made freely available in public databases, including those maintained by the National Center for Biotechnology Information, the European Molecular Biology Laboratory, and the DNA Data Bank of Japan, once data analysis and annotation are complete and the data have been confirmed to meet a set of quality criteria.
All validly published bacterial and archaeal type strains, as well as selected reference fungal type strains that are frequently used for functional or phylogenetic studies, will be on the list of candidates for sequencing. Each strain has documentation issued by the providing culture collection, which ensures the purity and identity of the type strain. BGI-Shenzhen will support the microbial genome sequencing and assist with the data analysis for this project. Sampling work for the pilot stage has been initiated. The project has established a global network to collect approximately 800 candidate type strain samples from American, British, Belgian, Chinese, Dutch, Japanese, Korean, Portuguese, Russian, Swedish, and Thai collections. Although extracted DNA samples are much preferred, cultured cell samples of the strains are also acceptable. Importantly, under the terms of the Nagoya Protocol, GCM 2.0 will respect the access and benefit-sharing regulations of all countries.
The GCM type strain sequencing project encourages all culture collections to participate in this international collaborative project. Interested parties should be willing to provide DNA for type strains held in their collections. All microbiologists and institutions from related fields are welcome to submit subprojects for genomic data-related research questions. Brief proposals, including questions to be addressed and type strains to be sequenced and analyzed, should be emailed to Dr Juncai Ma at ma@im.ac.cn. Once a proposal is granted as a subproject, the scientist(s) will be asked to lead the full genome analysis and jointly publish the generated outcomes.

Conclusion
As a collaborative network of international culture collections, GCM 2.0 will contribute to a genome-based microbial taxonomic framework, establishing high-quality complete genome sequences as the new gold standard. The resulting knowledge and tools generated through this project will not only directly facilitate the identification of microorganisms but will also improve our ability to predict new gene complexes and their functions from microbial communities. Thus, our knowledge of the hitherto undiscovered microbial diversity will be expanded, which may lead to the sustainable utilization of microbial resources for human benefit.