Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2024

Abstract
The National Genomics Data Center (NGDC), which is a part of the China National Center for Bioinformation (CNCB), provides a family of database resources to support the global academic and industrial communities. With the rapid accumulation of multi-omics data at an unprecedented pace, CNCB-NGDC continuously expands and updates core database resources through big data archiving, integrative analysis and value-added curation. Importantly, NGDC collaborates closely with major international databases and initiatives to ensure seamless data exchange and interoperability. Over the past year, significant efforts have been dedicated to integrating diverse omics data, synthesizing expanding knowledge, developing new resources, and upgrading major existing resources. In particular, several database resources have been newly developed for the biodiversity of protists (P10K), bacteria (NTM-DB, MPA) and plants (PPGR, SoyOmics, PlantPan), as well as for disease/trait association (CROST, HervD Atlas, HALL, MACdb, BioKA, RePoS, PGG.SV, NAFLDkb). All the resources and services are publicly accessible at https://ngdc.cncb.ac.cn.


Introduction
The National Genomics Data Center (NGDC) is affiliated with the Beijing Institute of Genomics (BIG), Chinese Academy of Sciences (CAS), and the China National Center for Bioinformation (CNCB) (1). Established in 2019, CNCB-NGDC has collaborated with CAS institutions, viz., the Institute of Biophysics and the Shanghai Institute of Nutrition and Health, and has formed partnerships with other organizations (https://ngdc.cncb.ac.cn/partners). Over the last decades, advancements in high-throughput technologies have enabled researchers to simultaneously analyze multiple layers of biological information with unprecedented speed and accuracy. Large-scale high-throughput sequencing projects have been conducted globally to study the genetic basis of diseases and unravel complex biological processes (2,3). Projects like the 1000 Genomes Project (2), the Cancer Genome Atlas (3), and the UK BioBank (4) have contributed to the generation of extensive genomic datasets from diverse populations and disease cohorts. These datasets have provided invaluable resources for studying genetic variations, identifying disease-associated genes, and exploring molecular mechanisms underlying complex diseases. Moreover, single-cell sequencing technologies have emerged as powerful tools to study cellular heterogeneity (5), developmental processes (6), disease mechanisms (7), and complex biological systems (8) with unprecedented resolution (9). In particular, spatial transcriptomics techniques capture the spatial information of gene expression patterns and offer a deeper understanding of tissue architecture, cell-to-cell communication, and tumor heterogeneity (10). As a result, an immense amount of multi-omics data has been generated at an ever-increasing rate and scale, necessitating the development of resources that facilitate data synthesis, interoperability and sharing.
With the rapid growth of large-scale high-throughput sequencing projects globally, CNCB-NGDC serves as a central hub for the collection, integration and curation of diverse genomics datasets. In the past year, CNCB-NGDC has been dedicated to the development of new resources and the continuous updating of existing resources, aiming to provide open access to a family of resources for advancing life and health sciences globally (11-22). Importantly, several core database resources have been recommended by major publishers, which has greatly facilitated the efficient deposition and open sharing of biomedical data. Furthermore, CNCB-NGDC has established close collaborations with the International Nucleotide Sequence Database Collaboration (INSDC) (23) by mirroring the metadata and sequence data from NCBI SRA (Sequence Read Archive) (24). In this article, we provide a brief overview of new developments and recent updates in CNCB-NGDC, highlighting its core resources and services (Figure 1). Importantly, CNCB-NGDC databases are highly interconnected, forming a comprehensive network that allows users to seamlessly navigate between databases, access relevant information, and conduct comprehensive studies (Figure 2). All these resources and services play a crucial role in supporting research and are publicly available on the CNCB-NGDC homepage (https://ngdc.cncb.ac.cn).

Raw data & metadata
GenBase
GenBase (https://ngdc.cncb.ac.cn/genbase) is an open-access data repository dedicated to archiving, searching, and sharing nucleotide sequences. It accepts various data submissions, including mRNA, genomic DNA and ncRNA as well as small genomes such as those of organelles, viruses, plasmids and phages. GenBase provides a user-friendly bilingual submission portal with automatic validation and manual curation. Its standardized data structures and quality control procedures are compatible with those of GenBank (25), enabling seamless data exchange with the INSDC (23). GenBase incorporates all sequences from GenBank with daily updates, currently housing 265 969 760 nucleotide and 268 933 169 protein sequences. Meanwhile, it has received a total of 1103 direct submissions as of 14 August 2023, including 37 981 nucleotide sequences and 362 296 annotated protein sequences across 138 species. Of these, 34 477 nucleotide sequences (91%) and 340 491 annotated protein sequences (94%) have been released and are publicly accessible. In particular, GenBase has received and released 31 312 SARS-CoV-2 genome sequences with standardized annotations. In summary, GenBase is a critical resource for archiving and incorporating a large variety of nucleotide sequence data, offering free and public data services to support worldwide research activities.

OBIA
The Open Biomedical Imaging Archive (OBIA; https://ngdc.cncb.ac.cn/obia) serves as a repository for archiving biomedical images and associated clinical data (26). OBIA adopts five data objects (Collection, Individual, Study, Series, and Image) for data organization and accepts submissions of biomedical images from all over the world. To ensure data privacy, OBIA has established a standardized de-identification and quality control process and offers two types of data accessibility: open access and controlled access. As of August 2023, OBIA houses 937 individuals, 4136 studies, 24 701 series and 1 938 309 images covering 9 modalities and 30 anatomical sites. OBIA differentiates itself from other related databases by providing imaging data of various modalities, anatomical sites, and diseases in a common DICOM format. In addition, OBIA supports both metadata retrieval and image retrieval. Importantly, OBIA establishes internal links with NGDC's BioProject accessions and individual accessions in GSA-Human, enabling users to easily obtain not only biomedical images and clinical data but also multi-omics data.
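
The five data objects named above can be pictured as a nested hierarchy. The sketch below is only an illustration: the nesting order (Collection > Individual > Study > Series > Image) is an assumption inferred from the order in which the objects are listed and from the usual DICOM patient/study/series/image layout, and all field names and accession strings are hypothetical rather than taken from OBIA.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Image:
    accession: str  # hypothetical identifier, not an OBIA accession format


@dataclass
class Series:
    accession: str
    modality: str  # e.g. "CT" or "MR"; illustrative values only
    images: List[Image] = field(default_factory=list)


@dataclass
class Study:
    accession: str
    series: List[Series] = field(default_factory=list)


@dataclass
class Individual:
    accession: str
    studies: List[Study] = field(default_factory=list)


@dataclass
class Collection:
    accession: str
    access: str  # "open" or "controlled", the two access types in the text
    individuals: List[Individual] = field(default_factory=list)


def count_images(collection: Collection) -> int:
    """Walk the assumed hierarchy and count every image in a collection."""
    return sum(
        len(series.images)
        for individual in collection.individuals
        for study in individual.studies
        for series in study.series
    )
```

Under this assumed layout, summary figures such as the numbers of individuals, studies, series and images are counts taken at successive levels of the same tree.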

OPIA
The Open Plant Image Archive (OPIA; https://ngdc.cncb.ac.cn/opia/) is an open archive of plant images and phenotypic traits (i-traits) derived from high-throughput phenotyping platforms (27). Currently, OPIA houses 56 datasets across 11 plants, comprising a total of 566 225 images with 2 417 186 labeled instances. It also incorporates 56 image-based i-traits derived from 18 644 individual RGB images across 3 datasets. These i-traits are annotated using the Plant Phenotype and Trait Ontology (PPTO) and cross-linked with GWAS Atlas. Additionally, each dataset in OPIA is assigned an evaluation score that considers factors such as image data volume, image resolution, and the number of labeled instances. OPIA also provides useful tools for online image pre-processing and submission. Collectively, OPIA provides open access to valuable datasets and phenotypic traits across diverse plants and thus has great potential to facilitate artificial intelligence-assisted breeding research.

Single-cell omics
CROST
CROST (https://ngdc.cncb.ac.cn/crost) is a comprehensive repository of spatial transcriptomics. It contains 182 spatial transcriptomic datasets comprising 1033 high-quality samples from 5 technology platforms, 8 species and 56 diseases (28). A total of 48 043 tumor-related spatially variable genes (SVGs) are identified across these datasets. Additionally, it includes a standardized spatial transcriptome data processing pipeline, integrates deconvolution spatial transcriptomics data, and performs correlation, colocalization, intercellular communication and biological function annotation analyses. Moreover, CROST integrates transcriptomic, epigenomic, and genomic data to investigate tumor-associated SVGs, providing a comprehensive insight into their roles in cancer progression and prognosis. Furthermore, CROST provides two online tools: single-sample gene set enrichment analysis (ssGSEA) and SpatialAP, enabling users to annotate and analyze uploaded spatial transcriptomics data. Collectively, CROST offers fresh and comprehensive insights into tissue structure and serves as a foundation for understanding multiple biological mechanisms in diseases, particularly in tumor tissues.

SMDB
The SMDB (https://www.biosino.org/smdb) (29) is an essential database that facilitates the exploration and understanding of spatial transcriptomics (ST) data comprehensively and interactively. Its multimodal integration and customizable workspaces offer researchers a powerful and versatile platform to investigate the intricate relationship between spatial data and biological function. In 2D, SMDB enables segmenting slices and identifying gene expression boundaries. Researchers can analyze tissue composition using loaded images and molecular clusters. In 3D, researchers can filter spots based on their specific requirements and reconstruct morphological visualizations. SMDB also provides customizable workspaces that allow for interactive exploration. SMDB includes the pre-loaded Allen Mouse Brain Common Coordinate Framework (CCFv3) from the Allen Institute, which serves as a valuable reference for studying the mouse brain and provides researchers with quick access to relevant information.

disease associations curated from numerous publications (30). Currently, HervD Atlas collects 57 253 curated HERV-disease associations from 238 publications, covering 19 274 HERVs (including 18 535 HERV-Terms and 739 HERV-Elements) belonging to six types. The knowledgebase also encompasses 148 ontological diseases grouped into 14 categories and 605 affected or related genes. It features an interactive knowledge graph that visually represents the relationship networks of HERV-disease associations and corresponding genes, enabling researchers to access and explore data of interest efficiently. HervD Atlas serves as a valuable resource and powerful platform with comprehensive HERV-disease knowledge, facilitating our understanding of HERV-disease associations and the development of HERVs as novel diagnostic and therapeutic strategies.

HALL
HALL (Human Aging and Longevity Landscape; https://ngdc.cncb.ac.cn/hall/) is a dedicated database centering on the study of human aging and longevity (31). It offers a specialized and comprehensive collection of multi-dimensional datasets derived from various human cohorts. HALL integrates 170 cohorts from 23 countries/regions, including 1913 SNPs, 38 tissue/cell types and over 4 800 000 individuals, ranging from 1 to 119 years and with 59 cohorts including centenarians. HALL features a genome browser with 485 512 epigenomics probes, providing insights into age-related methylation changes. The transcriptome of 5261 age-variant genes has been curated, involving a total of 3188 human subjects across 13 tissues. HALL was built upon the foundation of the Aging Biomarker Consortium (ABC). Its comprehensive framework for monitoring age-related changes serves as a platform for developing new markers, diagnostic tools, and strategies to address aging and age-related conditions.

MACdb
MACdb (https://ngdc.cncb.ac.cn/macdb/) is a curated knowledgebase of metabolic associations between metabolites and cancers (32). In the current implementation, MACdb has integrated 40 710 cancer-metabolite associations, encompassing 267 traits from 17 categories of cancers with high incidence or mortality. These associations are derived through meticulous manual curation of 1127 studies published in 462 publications. MACdb provides user-friendly browsing functions that allow the exploration of associations across multiple dimensions, such as metabolite, trait, study, and publication. Additionally, it constructs a knowledge graph to present an overall landscape of the relationships among cancer, trait, and metabolite. Furthermore, MACdb offers the NameToCid tool, which maps metabolite names to PubChem CIDs, and Enrichment tools, which support enrichment analysis of metabolite associations with various cancer types and traits. MACdb represents an informative and practical resource for evaluating cancer-metabolite associations, with the potential to accelerate hypothesis generation and research on cancer metabolism.
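
How NameToCid is implemented is not described here. As a rough illustration of what mapping a metabolite name to PubChem CIDs involves, the sketch below builds a request URL for PubChem's public PUG REST name-to-CID endpoint and parses the JSON shape that endpoint returns. This is a stand-in, not MACdb's tool; the function names are hypothetical, and no network request is made.

```python
from urllib.parse import quote

# Base URL of PubChem's PUG REST service (a public PubChem API, not part of MACdb).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cid_url(name: str) -> str:
    """Build the PUG REST URL that resolves a compound name to CIDs (hypothetical helper)."""
    # Percent-encode the name so spaces and slashes are safe inside the URL path.
    return f"{PUG_REST}/compound/name/{quote(name.strip(), safe='')}/cids/JSON"


def parse_cids(payload: dict) -> list:
    """Extract the CID list from a decoded PUG REST JSON response.

    A successful response has the shape {"IdentifierList": {"CID": [...]}};
    anything else yields an empty list.
    """
    return payload.get("IdentifierList", {}).get("CID", [])
```

To resolve a name in practice, the URL would be fetched with any HTTP client and the decoded JSON passed to `parse_cids`; names that PubChem cannot resolve return an error payload, which this sketch treats as an empty result.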

NAFLDkb
NAFLDkb (https://www.biosino.org/nafldkb) is a specialized knowledge base and platform for computer-aided drug design against non-alcoholic fatty liver disease (NAFLD) (33). NAFLDkb incorporates multi-perspective information from public resources including source data, background knowledge and candidate library. The source data includes 40

BioKA
BioKA (https://ngdc.cncb.ac.cn/bioka) is a comprehensive disease/trait biomarker (34-37) knowledgebase for animals, including model and domestic animals as well as humans (38). We curate biomarkers and integrate various annotations, such as Gene Ontology terms (GOs), protein structures, protein-protein interaction networks, miRNA targets, metabolism details, expressions, variations, and homologous genes, into a single web platform. BioKA enables cross-species research and offers free public data services for browsing, retrieval, comparison, and downloading. Currently, BioKA houses 16 296 biomarkers associated with 951 mapped diseases/traits across 31 species from 4747 references. These include 11 925 gene/protein biomarkers, 1784 miRNA biomarkers, 1043 mutation biomarkers, 773 metabolic biomarkers, 357 circRNA biomarkers and 127 lncRNA biomarkers. Furthermore, BioKA constructs an interactive knowledge network of biomarkers that includes 7320 entities and 401 208 links across 10 species. Moreover, BioKA provides detailed information on 308 breeds/strains of 13 species and homologous annotations for 8784 biomarkers across 16 species, and offers three online application tools. In summary, BioKA advances human disease research, contributes to understanding animal diseases, and supports livestock breeding.

RePoS
RePoS (Recent Positive Selection; http://bigdata.ibp.ac.cn/RePoS/) is a newly developed database that integrates and presents recent positive selection signal data for both Chinese and worldwide populations. This database aims to enhance our understanding of genes and traits that have undergone positive selection during human evolution, providing insights into our history and diseases that continue to plague us today. RePoS investigates the multi-population selection footprints of genomic sequences using SDS (39) and iHS (40) data from sources such as NyuWa WGS (41,42), TOPMed (43), 1KGP (44) and UK10K (39), and elucidates phenotypic evolution associated with genomic signatures for both monogenic and polygenic traits. A total of 22.7 million non-redundant variants from five datasets were integrated. In summary, RePoS is designed to facilitate the study of human evolution and phenotype adaptation in global populations.
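
For readers unfamiliar with iHS, the sketch below shows the commonly used definition: the unstandardized score is the log ratio of the integrated haplotype homozygosity (iHH) of the ancestral allele to that of the derived allele, which is then standardized against variants of similar derived-allele frequency. This is a minimal illustration of that general definition, not the RePoS pipeline; the function names, the 0.05 bin width and the use of the population standard deviation are assumptions.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def unstandardized_ihs(ihh_ancestral: float, ihh_derived: float) -> float:
    """Log ratio of iHH on the ancestral versus the derived allele background."""
    return math.log(ihh_ancestral / ihh_derived)


def standardize_by_frequency(records, bin_width=0.05):
    """Standardize scores within derived-allele-frequency bins.

    records: list of (derived_allele_frequency, unstandardized_score) tuples.
    Returns one standardized score per record, in input order. Bins whose
    scores have zero spread (for example a single variant) yield 0.0.
    """
    bins = defaultdict(list)
    for freq, score in records:
        bins[int(freq / bin_width)].append(score)
    # Per-bin mean and population standard deviation (an assumption of this sketch).
    stats = {b: (mean(v), pstdev(v)) for b, v in bins.items()}
    standardized = []
    for freq, score in records:
        mu, sd = stats[int(freq / bin_width)]
        standardized.append((score - mu) / sd if sd > 0 else 0.0)
    return standardized
```

Large positive or negative standardized values mark variants whose haplotype backgrounds are unusually long relative to other variants at the same frequency, which is the kind of signal a selection scan looks for.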

TargetGene
TargetGene (https://ngdc.cncb.ac.cn/targetgene/) is a comprehensive resource of target genes for human genetic variants (45). It establishes connections between genetic variants and their target genes using multiple analytical tools, such as chromatin co-accessibility, 3D interaction, enhancer activities, and quantitative trait loci. The resource includes curated multi-omics data from single-cell and bulk levels, encompassing various human tissues, cell types, developmental stages, and over a thousand genome-wide association studies (GWAS) datasets. Currently, TargetGene comprises 23 838 target genes in 45 tissues and 539 cell types inferred for 574 279 trait-associated genetic variants from 1276 GWAS datasets for various diseases. TargetGene provides user-friendly web interfaces to help users systematically identify and prioritize trait-associated target genes. In summary, TargetGene serves as a valuable resource for understanding the genetic mechanisms behind complex diseases and identifying potential drug targets.

PGG.SV
PGG.SV (https://www.biosino.org/pggsv) is a pioneering database leveraging next-generation and third-generation whole-genome sequencing technologies (46). The current version of PGG.SV encompasses a vast dataset of 584 277 structural variations (SVs) from 6048 samples, including 1030 long-read sequenced genomes from 177 global populations. Notably, PGG.SV offers high-quality, fine-scale SVs mapped to both GRCh37 and GRCh38 human reference genomes. This includes previously underrepresented SVs that were difficult to detect using conventional sequencing and microarray data. The database features hierarchical estimates of SV prevalence across diverse geographical populations and offers valuable annotations of SV-related genes, putative functions, and clinical implications. Moreover, it provides an easy-to-navigate interface and offers robust visualization tools for genome-wide SV mapping.

Biodiversity
PlantPan
PlantPan (https://ngdc.cncb.ac.cn/plantpan/) is a comprehensive database containing pan-genome analysis results of 195 genomes from 11 plant species. PlantPan offers detailed insights across five categories: species, genes, gene clusters, genomic variances and genome synteny. PlantPan includes nine graph pan-genomes, 9 127 208 genes, 694 191 gene groups, 413 000 124 genomic variations, 1 616 089 genomic variation groups, 3 345 098 genome synteny and 177 827 genome synteny groups. Each gene group is assigned functional annotations, such as GO annotation, protein functional domains, 23 types of KEGG pathways, 58 types of transcription factors, organic and inorganic resistance, and homologous genes in other species. In summary, PlantPan serves as an invaluable resource for enhancing the utilization of plant pan-genomes in molecular breeding and evolutionary studies.

NTM-DB
NTM-DB (Non-Tuberculosis Mycobacteria Database; https://ngdc.cncb.ac.cn/ntmdb) is a public database that integrates the most comprehensive collection of genomic and bioinformatics resources for non-tuberculosis mycobacteria (NTM). It includes a total of 12 748 newly assembled whole genomes and 3335 GenBank/RefSeq assemblies, covering 177 out of 190 NTM species. Notably, NTM-DB incorporates 705 MLSTs (Multi-Locus Sequence Typing), consisting of 189 type strain genomes (representing 177 species and 12 subspecies) and 181 representative genomes. The database also encompasses 33 240 drug-resistance genes, 7152 drug susceptibility tests, and 74 315 virulence genes. Furthermore, NTM-DB offers an online analytical platform for genotyping, drug-resistance and virulence gene annotation, as well as pan-genomic and phylogenetic analyses. Together, NTM-DB is a comprehensive and innovative platform for the NTM research community, with the potential to assist clinicians in diagnosing and treating various NTM-related diseases.

SoyOmics
SoyOmics (https://ngdc.cncb.ac.cn/soyomics) is an integrated multi-omics database for soybean designed to provide a one-stop solution for big data mining (47). The current implementation features comprehensive integration of high-quality omics data, including assembled genomes, a graph pan-genome, phenotypic data of representative germplasms, transcriptomic and epigenomic data from different tissues, organs, and accessions, as well as knowledge of quantitative trait loci and genome-wide association studies (GWAS). In addition, several easy-to-use toolkits are provided for sequence alignment (BLAST), quick-start GWAS analysis (easyGWAS), gene expression pattern analysis (ExpPattern), haplotype analysis (HapSnap), genome position transformation (VersionMap), and sequence extraction (SeqFetch). More importantly, a module named SoyArray has been developed to compare divergent sites between two germplasms, which is helpful for parent selection in genetic or breeding studies. Taken together, SoyOmics is of great utility in facilitating deep mining, ranging from fundamental research to molecular breeding.

The P10K database
The P10K Database (https://ngdc.cncb.ac.cn/p10k/) is a data portal for the Protist 10 000 Genomes Project (P10K). This project was established to address the limited availability of published genomes for protist species, which play significant roles in the biosphere as diverse microscopic eukaryotic organisms separate from fungi, animals, and plants (48). The resulting P10K database serves as a comprehensive platform, compiling and disseminating genome sequences and annotations from various protist groups. Currently, the P10K database contains 2929 genomes and transcriptomes, including 1096 datasets newly sequenced by P10K and 1833 publicly available datasets. It covers approximately 45% of the protist orders, with a particular emphasis on ciliates, which account for nearly a thousand genomes/transcriptomes and represent 53% coverage. Overall, the P10K database serves as an invaluable genetic resource repository for protist research and aims to expand further by incorporating additional sequenced data and advanced analysis tools, benefiting protist studies worldwide.

MPA
MPA (Mycobacteriaceae Phenome Atlas; https://www.biosino.org/mpa/) is a standardized atlas for the Mycobacteriaceae phenome based on heterogeneous sources. MPA includes a total of 82 microbial phenotypic traits of 10 755 strains from 236 species and 18 subspecies in Mycobacteriaceae. These traits are further classified into five categories and 20 subcategories of polyphasic phenotypes, as well as three categories and eight subcategories of functional phenotypes. The phenotypes are searchable and comparable on the MPA website. The application of MPA may provide novel insights into the pathogenicity mechanisms and antimicrobial targets of Mycobacteriaceae.

PPGR
PPGR (Perennial Plant Genomes and Regulation database; https://ngdc.cncb.ac.cn/ppgr/) serves as a public database dedicated to the exploration of perennial plant genomics and gene regulation (49). This resource encompasses data derived from 60 plant species, featuring richly annotated genomic information, 836 million protein-protein and transcription factor-target interactions, along with 8975 transcriptome samples representing environmental conditions and genetic backgrounds. The primary focus of PPGR centers on genes regulating critical processes in perennial plants, such as wood production, dormancy, terpene biosynthesis, and leaf senescence. Data sources comprise experiments, literature mining, public databases, and genomic predictions. With its user-friendly suite of multi-omics tools, PPGR will contribute significantly to the broader plant science community, extending its benefits far beyond the study of woody perennial plants.

BioProject and BioSample
BioProject (https://ngdc.cncb.ac.cn/bioproject) and BioSample (https://ngdc.cncb.ac.cn/biosample) are two public repositories for biological research projects and samples, respectively. They gather descriptive metadata on biological projects and samples investigated in experiments and offer centralized access to all public projects and samples, along with cross-links to related data resources. As of August 2023, BioProject and BioSample have amassed a total of 13 487 biological projects and 1 244 954 biological samples submitted by 6438 users from 1549 organizations (Figure 3A). This represents a significant increase compared to the previous release in September, which had 7906 projects and 783 267 samples. Furthermore, this year, these two repositories have mirrored 709 261 projects and 34 622 211 samples from the INSDC data at NCBI.
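
The size of the increase reported above can be made concrete with a few lines of arithmetic. The sketch below uses only the project and sample counts quoted in this section; the helper name is hypothetical.

```python
def percent_increase(previous: int, current: int) -> float:
    """Relative growth from `previous` to `current`, expressed in percent."""
    return (current - previous) / previous * 100.0


# Counts quoted in the text: previous release versus August 2023.
project_growth = percent_increase(7906, 13487)
sample_growth = percent_increase(783267, 1244954)

if __name__ == "__main__":
    print(f"BioProject growth: {project_growth:.1f}%")
    print(f"BioSample growth:  {sample_growth:.1f}%")
```

On these figures, projects grew by roughly 70% and samples by a little under 60% between the two releases.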

GSA, GSA-Human and OMIX
The Genome Sequence Archive (GSA; https://ngdc.cncb.ac.cn/gsa) (50,51) is an archival database for raw sequence reads, which provides the global communities with free and open services for data submission, data storage and data sharing. GSA for Human (GSA-Human; https://ngdc.cncb.ac.cn/gsa-human) (50), a sub-database of GSA, is a specialized data archive for human genetic omics data with controlled access and security services. As of August 2023, GSA and GSA-Human have collectively accumulated 1 032 023 experiments, 50), as a member of the GSA family, strictly adheres to the FAIR principles and provides users with a platform to publish omics-based research outputs that are citable, shareable, and discoverable. As of August 2023, OMIX has archived 3384 submissions and 15 837 files with a size of 59.34 TB. Approximately 40% of the data files are related to human genetic resources, which are securely shared in a controlled access mode, requiring users to submit a simple application for access.

Database Commons
Database Commons (https://ngdc.cncb.ac.cn/databasecommons) is a global catalog of biological databases that provides easy access and retrieval to a full collection of worldwide biological databases (52). It assesses the impact of databases and offers valuable statistics and trends. Currently, it catalogues a total of 6354 databases from around the world, encompassing 9808 publications and involving about 2100 organizations. This represents growth compared to the previous version in August 2022, which included 5831 databases and 8933 publications. Most databases have been curated by expert curators. In terms of functionality updates, Database Commons has accepted open submissions of databases from institutions and universities around the world since the second half of 2022. Databases related to current research hotspots and frontiers receive particular curation attention. For example, a comprehensive collection of curated long non-coding RNA databases has been compiled to facilitate an extensive review of this field (53). Furthermore, databases on SARS-CoV-2, rice, single cell, spatial omics, and immune research are newly curated. These databases can be easily accessed by clicking on the respective links located below the search box.

Genome warehouse
The Genome Warehouse (GWH; https://ngdc.cncb.ac.cn/gwh) is a valuable public resource for hosting genomic sequences, annotations, and metadata (54). By August 2023, the number of submitted genome assemblies has notably increased to 66 435, compared to 24 781 assemblies in September 2022 (Figure 3D). Among these, 19 350 genome assemblies from 1511 species have been released and published in 278 journal articles, indicating growth compared to 12 887 assemblies and 206 articles in September 2022. The recent data expansion in GWH is driven by Metagenome-Assembled Genomes (MAGs) and binned metagenomes. Notably, this update includes several enhancements such as the integration of 1 782 915 assemblies from INSDC, allowing for enhanced local searchability, browsability, and downloadability, along with detailed information pages for each assembly. Importantly, GWH is enhanced by incorporating a data request management system, which facilitates communication between data owners and applicants seeking controlled access data. Moreover, it is equipped with an advanced search system to enable categorical search and filtering, enhancing accessibility to both archived and integrated genome data. The continued expansion and improvements in GWH make it a valuable resource for advancing genomics research worldwide.

RCoV19
The 2019 Novel Coronavirus Resource (RCoV19; https://ngdc.cncb.ac.cn/ncov) (55-58) is a comprehensive platform for the integration of SARS-CoV-2 genome data, variant monitoring, and risk pre-warning. As of August 2023, RCoV19 has integrated over 16.5 million SARS-CoV-2 sequences and metadata, among which ∼7.7 million have been further identified as complete and high-quality genome sequences for download analysis. Additionally, it has served over 3.5 million visitors from 182 countries/regions worldwide, with more than 17 billion data downloads in total. Over the past year, RCoV19 has undergone significant improvements in functionality. Firstly, it has implemented an advanced genome data curation model with an automated integration pipeline and optimized curation rules, enabling efficient daily data updates. Secondly, RCoV19 offers a global and regional lineage evolution monitoring platform and an outbreak risk pre-warning system, providing comprehensive insights into SARS-CoV-2 evolution and transmission patterns. Thirdly, a powerful interactive mutation spectrum comparison module allows users to analyze and compare mutation patterns, aiding in the detection of potential new lineages. Moreover, RCoV19 incorporates a comprehensive knowledgebase on mutation effects, serving as a valuable resource for retrieving information on the functional implications of specific mutations. In summary, RCoV19 is a crucial scientific resource that provides free, open access to valuable data, relevant information, and technical support in the global fight against COVID-19.

Gene expression nebulas
Gene Expression Nebulas (GEN; https://ngdc.cncb.ac.cn/gen) is a data portal integrating transcriptomic profiles from both bulk and single-cell levels in various conditions across multiple species (59). The current version of GEN has undergone significant improvements and updates, particularly in ontology classification and data volume, with 106 datasets and 5179 samples added. GEN has systematically incorporated 34 gene expression profiling datasets related to 33 cancer types, encompassing 2768 samples. Furthermore, 30 rice-related datasets and 880 samples have been analyzed and included. Moreover, 42 gene expression profiling datasets (28 bulk and 16 scRNA-seq) and 1531 samples related to 10 new species derived from 33 original high-throughput sequencing projects have been added. Compared to the previous release in August 2022, the total number of incorporated datasets has increased from 469 to 575, covering 59 609 samples and 19 231 318 cells from 44 species, including 31 animals, 10 plants, 2 protists and 1 fungus. In terms of functionality, GEN has been improved by upgrading GENToolkit to support expression profiling and multiple downstream analyses of prokaryotic transcriptome data at the bulk RNA-seq level.
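
The dataset and sample figures in this paragraph are additive, which can be checked with a short tally. The sketch below uses only numbers quoted in the text; the group labels and the helper name are hypothetical.

```python
# Datasets and samples added in this update, as quoted in the GEN paragraph.
ADDED = {
    "cancer": {"datasets": 34, "samples": 2768},
    "rice": {"datasets": 30, "samples": 880},
    "new_species": {"datasets": 42, "samples": 1531},
}
PREVIOUS_TOTAL_DATASETS = 469  # release of August 2022
SPECIES = {"animals": 31, "plants": 10, "protists": 2, "fungi": 1}


def total(key: str) -> int:
    """Sum one figure ('datasets' or 'samples') over all added groups."""
    return sum(group[key] for group in ADDED.values())


if __name__ == "__main__":
    print("datasets added:", total("datasets"))
    print("samples added:", total("samples"))
    print("datasets overall:", PREVIOUS_TOTAL_DATASETS + total("datasets"))
    print("species overall:", sum(SPECIES.values()))
```

The three groups sum to the 106 datasets and 5179 samples reported for this update, and adding the 106 datasets to the previous 469 gives the stated total of 575.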

Epigenomics
Editome disease knowledgebase
Editome Disease Knowledgebase (EDK; https://ngdc.cncb.ac.cn/edk) is a comprehensive database of editome-disease associations based on literature curation and integrative analysis (60). In its current version, EDK includes a total of 75 514 editing events, consisting of 826 experimentally validated endogenous and exogenous RNA editing events, as well as 74 688 abnormal editing events. These events span 117 different diseases and are curated from 314 publications. Compared to the previous release in January 2019, the number of experimentally validated editing events has increased significantly from 248 to 826. Furthermore, by systematically integrating and analyzing 48 disease-associated RNA-seq datasets (comprising 2536 samples across 30 tissues) from GEN (59), the updated EDK encompasses a total of 577 341 new disease-associated editing sites, resulting in 18 690 508 abnormal RNA editing events that induce A-to-I and C-to-U RNA editing. In terms of database functionality, EDK has been significantly upgraded with the addition of two user-friendly tools, Editing Identifier and Disease Predictor, which identify RNA editing events and provide a ranked list of editome-disease associations, respectively.

EWAS open platform
EWAS Open Platform (https://ngdc.cncb.ac.cn/ewas) incorporates data, knowledge, and a toolkit for epigenome-wide association studies (EWAS) (61). Compared to the previous version in August 2022, the platform has undergone significant improvements. In terms of data, it has added 13 006 standardized and batch effect-corrected samples, covering 165 tissue types, 90 distinct diseases and 45 varied fields (62). In terms of knowledge, it includes 5203 new high-quality associations covering 47 traits, obtained through manual curation (63). Furthermore, EWAS Open Platform is functionally enhanced by an online analysis tool for batch effect correction, which allows users to integrate data directly from multiple sources (64). Users can obtain methylation levels after noise reduction by uploading the original methylated and unmethylated signal value files or by entering a project ID from NCBI GEO. Currently, the platform encompasses standardized methylation array data from 146 678 samples across 265 fields, integrates 647 747 EWAS associations from 1043 published studies, and offers online tools for batch effect correction, enrichment analysis, annotation, and network visualization. Collectively, EWAS Open Platform aims to advance research into the roles of DNA methylation in development, aging, and diseases.
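To make concrete how a methylation level can be derived from methylated and unmethylated signal values, the following minimal sketch computes the conventional array beta value. It is an illustration only: the function name `beta_value` and the offset of 100 (a commonly used stabilizing constant for methylation arrays) are assumptions, and the noise reduction and batch effect correction performed by the platform's own tool are not reproduced here.

```python
def beta_value(methylated: float, unmethylated: float, offset: float = 100.0) -> float:
    """Conventional methylation-array beta value for one probe.

    beta = M / (M + U + offset), where M and U are the methylated and
    unmethylated signal intensities. The offset (assumed to be 100 here)
    stabilizes the ratio when both intensities are low.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("signal intensities must be non-negative")
    return methylated / (methylated + unmethylated + offset)


# Hypothetical probe intensities, for illustration only.
example_probes = {
    "probe_a": (4500.0, 400.0),   # mostly methylated
    "probe_b": (300.0, 5200.0),   # mostly unmethylated
}

example_betas = {
    probe: beta_value(m, u) for probe, (m, u) in example_probes.items()
}

for probe, beta in example_betas.items():
    print(f"{probe}: beta = {beta:.3f}")
```

Beta values lie between 0 and 1, so they can be read directly as an approximate fraction of methylated molecules at a probe.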

NucMap
NucMap (https://ngdc.cncb.ac.cn/nucmap) is a comprehensive database of genome-wide nucleosome positioning maps across multiple species (65). The current version of NucMap includes 2718 entries of nucleosome positioning information across 35 species, including animals, plants, fungi, and protozoa. In addition to nucleosome positioning data, NucMap integrates other omics information such as mRNA expression, transcription factors (TFs), histones, and methylation data. Importantly, in the past year, the functionality of NucMap has been greatly improved in three respects. Firstly, NucMap now facilitates the interpretation of gene regulation in humans by pre-analyzing and integrating 160 transcriptomes and 249 histone ChIP-seq datasets (covering 31 types of histone modifications) from human samples. Secondly, NucMap provides information on 180 102 474 potential TF binding sites across 27 species, which users can combine with the collected ChIP-seq and RNA-seq data to infer the transcription process. Thirdly, a comparative analysis module has been added to identify differential nucleosome regions, which can help users find potential regulatory regions. In summary, NucMap serves as a valuable resource for investigating the biological role of nucleosomes in genome regulation.

MethBank
The Methylation Bank (MethBank; https://ngdc.cncb.ac.cn/methbank) (66-68) is a comprehensive database of DNA methylation in multiple biological contexts across various species. Compared to last year, MethBank newly incorporates methylomes of two model organisms, Arabidopsis thaliana and Populus trichocarpa, and expands methylation profiles in biological contexts, especially in terms of disease, environment, and development. Currently, MethBank systematically incorporates whole-genome single-base resolution methylomes of 2101 high-quality samples from 241 projects in 25 species, representing a 45% increase in samples over the previous release (1449 samples from 199 projects in 23 species). To characterize DNA methylation signatures in more biological contexts, 168 416 058 methylation profiles of genes, 4 961 814 methylated CpG islands, and 60 105 424 differentially methylated regions (DMRs) are newly provided based on these sequencing data. In addition to the increase in data volume, MethBank is also significantly upgraded by integrating more featured differentially methylated genes (DMGs) associated with biological contexts, growing from 2124 entries to 2905 entries curated from 278 publications across 147 tissues/cell lines, 151 diseases, and 12 biological contexts. To further improve the usability of the DMR toolkit, MethBank has been updated by integrating more species and optimizing enrichment analysis.
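The 45% figure can be checked directly from the release counts quoted above. The short sketch below recomputes the relative growth in samples, projects and species; the helper name `percent_increase` is introduced here purely for illustration, and only the counts themselves come from the text.

```python
def percent_increase(previous: int, current: int) -> float:
    """Relative growth from `previous` to `current`, in percent."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous * 100.0


# (previous release, current release) counts quoted in the MethBank paragraph.
methbank_counts = {
    "samples": (1449, 2101),
    "projects": (199, 241),
    "species": (23, 25),
}

methbank_growth = {
    name: percent_increase(prev, curr)
    for name, (prev, curr) in methbank_counts.items()
}

for name, pct in methbank_growth.items():
    print(f"{name}: {pct:.1f}% increase")
```

The sample count is the quantity that grows by roughly 45%; projects and species grow by smaller proportions.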

TCOD
The Tropical Crop Omics Database (TCOD, https://ngdc.cncb.ac.cn/tcod) is a comprehensive multi-omics platform dedicated to tropical crop research (69). The latest version of TCOD brings substantial enhancements in data volume, gene function annotation and analysis tools. Currently, TCOD contains 34 chromosome-level de novo assemblies, 1 255 004 genes, 282 436 992 unique variants, 88 transcriptomic profiles, and 13 381 germplasm items in 15 representative species, compared to 14 chromosome-level genome assemblies, 565 185 genes, 111 934 324 unique variants and 10 433 germplasm items in five tropical crops in the previous version (September 2022). Furthermore, TCOD improves its functionality by utilizing multiple databases for consistent gene functional annotation and furnishing gene homology relationships across species. In addition to the enhancement of existing tools, a series of new tools such as Primer Design, GO Enrichment, KEGG Enrichment, Synteny Viewer, and Homolog Finder have been developed and deployed in TCOD.

BIG Search
BIG Search (https://ngdc.cncb.ac.cn/search) is a distributed and scalable full-text search engine that covers a large number of biological resources and provides one-stop cross-database search services for the global research community. In its current version, BIG Search integrates both the NGDC internal databases and 55 partner databases (https://ngdc.cncb.ac.cn/partners), resulting in a total of 1.472 billion data entries and over 1.4 terabytes of data. Additionally, it incorporates 35 important NCBI biological databases (70) and 165 biological datasets from EBI (71) through their APIs. BIG Search offers advanced search functions and cross-database search services across these resources, providing users with a more convenient and efficient means of retrieving data.

Concluding remarks
With the exponential growth of multi-omics data, CNCB-NGDC is committed to continuously providing a comprehensive suite of newly developed and updated database resources, aiming to facilitate data submissions and offer value-added annotations and curated knowledge for the global research community. CNCB-NGDC is actively engaged in various ongoing efforts, including but not limited to automating data submission processes, curating data, integrating and analyzing data, upgrading infrastructure for efficient storage and transmission of big data, and developing new tools and pipelines for deep mining of multi-omics data. These endeavors are aimed at supporting the analysis and interpretation of big data in a more streamlined and efficient manner. As one of the major global centers in genomics and bioinformatics, CNCB-NGDC is dedicated to expanding its data resources and services to support knowledge discovery for a wide array of research activities in the life and health sciences.

Figure 1. The core database resources of CNCB-NGDC organized into various categories. These database resources are publicly accessible and searchable through the CNCB-NGDC home page at https://ngdc.cncb.ac.cn. A full list of data resources is shown at https://ngdc.cncb.ac.cn/databases.
433 research articles and 1001 clinical trials.The background knowledge consists of 581 investigational drugs, 17 therapeutic strategies, 45 therapeutic targets, 17 associated diseases, 8 records of pathogenesis and 68 in vitro and in vivo models of NAFLD.The candidate library consists of 1608 repositioning candidates, 147 604 bioactive compounds, 34 419 CMap candidates and 17 704 natural products for NAFLD drug development.The relationships among drug-related entities are presented with knowledge graphs, and AI-powered tools provide chemical structure search, drug-likeness screening, knowledge-based repositioning, and research article annotation.