The 2024 Nucleic Acids Research database issue and the online molecular biology database collection

Abstract The 2024 Nucleic Acids Research database issue contains 180 papers from across biology and neighbouring disciplines. There are 90 papers reporting on new databases and 83 updates from resources previously published in the Issue. Updates from databases most recently published elsewhere account for a further seven. Nucleic acid databases include the new NAKB for structural information and updates from GenBank, ENA, GEO, TarBase and JASPAR. The Issue's Breakthrough Article concerns NMPFamsDB for novel prokaryotic protein families, and the AlphaFold Protein Structure Database has an important update. Metabolism is covered by updates from Reactome, WikiPathways and MetaboLights. Microbes are covered by RefSeq, UNITE, SPIRE and P10K; viruses by ViralZone and PhageScope. Medically-oriented databases include the familiar COSMIC, DrugBank and TTD. Genomics-related resources include Ensembl, UCSC Genome Browser and Monarch. New arrivals cover plant imaging (OPIA and PlantPAD) and crop plants (SoyMD, TCOD and CropGS-Hub). The entire Database Issue is freely available online on the Nucleic Acids Research website (https://academic.oup.com/nar). Over the last year the NAR online Molecular Biology Database Collection has been updated, reviewing 1060 entries, adding 97 new resources and eliminating 388 discontinued URLs, bringing the current total to 1959 databases. It is available at http://www.oxfordjournals.org/nar/database/c/.


New and updated databases
The 31st Nucleic Acids Research database issue, in customary fashion, ranges broadly across biology with a total of 180 papers. Last year's recent record of 90 new databases is matched this year (Table 1) while 83 papers report developments from resources previously covered by NAR. Seven further papers provide updates on databases most recently published elsewhere (Table 2). As usual, the Issue leads off with reports from the major database providers at the European Bioinformatics Institute (EBI), the U.S. National Center for Biotechnology Information (NCBI), and the National Genomics Data Center (NGDC) in China (1-3). This year they are joined by an article from the SIB Swiss Institute of Bioinformatics (4). The following papers on individual databases are, as usual, grouped into categories: (i) nucleic acid sequence and structure, transcriptional regulation; (ii) protein sequence and structure, proteomics; (iii) metabolic and signalling pathways, enzymes and networks; (iv) genomics of viruses, bacteria, protozoa and fungi; (v) genomics of human and model organisms plus comparative genomics; (vi) human genomic variation, diseases and drugs; (vii) plants and (viii) other topics. As usual, many papers defy easy categorisation and so it is recommended that readers browse the full list.
The 'Nucleic acid databases' section contains a report on a major new repository of nucleic acid structural information, named NAKB (5). Updated weekly, this resource is the successor of the Nucleic Acid Database (6) and provides new annotations, visualisations, search options and database links for Protein Data Bank (PDB) (7) structures that contain nucleic acids. Other interesting new databases cover specific classes of sequence: Ribocentre-switch covers sequence, structure, function and applications of riboswitches (8), the UTex Aptamer Database (9) captures literature on aptamers, notably harnessing undergraduate power to increase curation capacity, and TeloBase covers telomere sequences across the tree of life, mining both the literature and sequence databases (10). For sequence information in general, the INSDC consortium (11) members all report updates (12-14): perhaps the most notable is the DDBJ paper, which emphasises how their activities increasingly reach beyond nucleic acid sequences, most recently to metabolomics with the establishment of the MetaboBank (14). Elsewhere, two new databases cover full-length sequences from long read technologies. FLIBase (15) covers isoforms in normal and cancerous human tissues, while FL-circAS focuses specifically on circular RNAs and their splice forms (16). They are joined, for the first time in NAR, by the popular circAtlas for vertebrate circular RNAs, reporting a tripling of content, inclusion of clinical cancer samples and, importantly, adoption of a standardised circRNA nomenclature (17). Specific epitranscriptomic modifications are covered by updates from m6A-Atlas (18) and m7GHub (19) while RMBase (20) covers data related to 73 distinct modifications in 62 species, and the general database of RNA modifications, pathways and compounds, MODOMICS, adds novel modifications, both those seen experimentally in ribosome structures and artificial modifications used as experimental tools (21).
In gene expression and regulation, the foundational NCBI Gene Expression Omnibus (GEO) database returns (22) after a gap of 11 years to reflect on the gradual replacement of array-based data with next-generation sequence information, and the growing proportion of studies with a single-cell focus (also very visible in this Issue, see below). The venerable TarBase contributes an update (23). Finally, major databases of transcription factor (TF) binding sites publish updates: HOCOMOCO reports its latest collection of carefully processed and curated human and mouse TF motifs (24) while JASPAR celebrates its 20th anniversary with useful increases in curated content across taxa and a reflection on the challenges to be addressed as the spread of deep learning methods across biology continues (25).
The protein section contains a 'Breakthrough Article' in the form of the NMPFamsDB paper (26). This database contains the results of mining metagenome sequences for novel families that are unrelated to those in the well-known Pfam database (27) or reference proteomes. Each entry in this veritable gold mine is comprehensively presented as a sequence alignment and (in most cases) an AlphaFold 2 model (28), alongside genome context, taxonomic mapping and geographical distribution. Equally momentous is an update article from the AlphaFold Protein Structure Database (29), which now covers around 90% of the sequence space in the UniProt database (30) with state-of-the-art protein models. Improvements to the website include, for example, a new sequence search by BLAST (31), easy access to different clusters of models, and improved interaction with content, both coordinates and the crucial predicted aligned error (PAE). Another update from the EBI covers EMDB, the Electron Microscopy Data Bank, which reports continued experimental growth, responses to rapid technological change, and plans for validating data and metadata (32).
Elsewhere among protein-related databases, interesting new arrivals include inCLusive (33), capturing information on over 400 non-canonical amino acids and their incorporation into proteins, and MultifacetedProtDB (34) with >1000 multi-functional, also known as moonlighting, proteins. Two new databases, SingPro (35) and SPDB (36), capture single-cell proteomics data and each offers sophisticated visualisation and analytical options. Elsewhere a pair of new databases cover Molecular Dynamics simulations, surely a significant growth area in the post-AlphaFold 2 era. BioExcel-CV19 (37) majors on simulations of COVID-19 proteins but establishes an infrastructure designed to be reused elsewhere, while ATLAS (38) covers simulations of almost 1400 proteins selected to span most of structure space as defined by high level X-class domains in the ECOD hierarchical classification (39). Another notable new name is UniTmp, which brings together a number of well-known resources for transmembrane proteins, including previous Database Issue regulars (40, 41), into an integrated one-stop shop (42). A trio of well-known returning databases cover proteins that are not conventionally folded. The update from PED (43), covering ensembles of intrinsically disordered proteins (IDPs), reports a tripling of size and the new inclusion of purely computationally generated ensembles from, e.g., machine learning methods. It is complemented by a report from DisProt (44) which emphasises the importance of adoption of community standards in the protein disorder field (45) and reports an interesting bi-directional relationship with Gene Ontology (46), whose terms are used for annotation, but whose vocabulary is also expanded with input from the DisProt team. IDPs commonly contain linear interaction motifs of the kind documented by ELM (47), which reports interesting new motifs in, e.g., RNA processing and protein degradation, and the news that ELM is now integrated into InterPro (48). Finally, two databases returning after 10 or more years cover protein-ligand interactions observed in the PDB: BioLiP (49) returns with its analysis of biologically relevant bound ligands, while MESPEUS (50) focuses on bound metals, usefully including sites that lie at crystal lattice interfaces, and offering pan-PDB coordination number distributions and measures of deviation from ideal geometry.
In the section for metabolism, signalling and enzymes, two new databases focus on cell-cell communications. CellCommuNet (51) exploits single cell RNAseq information to illuminate human and mouse cell-cell communication networks in health and disease while MACC (52) focuses on the role of metabolites and offers a bespoke visualisation tool for metabolite-cell regulatory networks. In the field of gene clusters, the popular antiSMASH database contributes an update (53) and is joined by the new ABC-HuMi (54), which considers distributions across different human microbiomes associated with different tissues. Metabolic pathways in general are amply covered by updates from three important returning databases: Reactome (55) now covers more than 11 000 human proteins (around 56%) and the paper discusses approaches to cover the remainder; WikiPathways (56) reaches almost 2000 pathways and 27 species, and benefits from an intriguing new tool to recognise terms in static pathway figures from the literature (pfocr.wikipathways.org); and PathBank (57) reports massively increased content, better pathway images and pathway enrichment tools. Finally, the major metabolomics resource MetaboLights contributes an update (58) reporting six-fold growth in studies since the last report, but also showcasing the MetaboLights Labs for exposing and distributing relevant computational workflows.
Viruses are particularly well represented in the next section, which they share with microbes. New arrivals VarEPS-influ (59), RVdb (60) and COV2Var (61) focus respectively on influenza, with a particular aim of understanding and evaluating the risks of future epidemics; on rhinovirus, responsible for many common colds and linked to more serious conditions; and on SARS-CoV-2 genetic variation among 13 million genome sequences. Returning virus databases include PhageScope (62), which covers almost a million comprehensively annotated bacteriophage sequences, and ViralZone (63), the beautifully illustrated educational virology resource. Eukaryotes are covered firstly by the interesting new P10K database (64), the home of the Protist 10 000 Genomes Project. This has already covered almost half of protist orders by using single cell techniques to avoid difficulties in culturing, and has revealed intriguing variability in genetic codes among ciliates. There are also the returning VEuPathDB (65) for eukaryotic pathogens and hosts, which adds new data types, namely AlphaFold 2 structure predictions and single cell transcriptomics; and UNITE (66), the extremely popular resource for eukaryote molecular identification, which expands from a fungal focus to cover all eukaryotes. Finally, two new databases map sequence data geographically: SPIRE (67) includes curated global metagenomes annotated with habitat and geography, while GDPF (68) maps prokaryotic protein families in terms of global distribution and environmental abundance.
In the human, model organism and comparative genomics section, no fewer than four papers cover ageing. Open Genes (69), a new arrival, majors on easy access to carefully curated and scored (meta)data of diverse types (gene expression, methylation, protein activity, genetic variants and so on) in both human and model organism orthologs. The popular returning resource HAGR (70) brings updates to its set of six database components, and two new databases, AgeAnnoMO (71) and HALL (72), each report impressive coverage of multi-omics data relating to ageing and longevity. Elsewhere the trends towards single cell and spatial expression data are again evident. For example, CellSTAR (73) is a new database for single-cell transcriptomics which offers annotation of hundreds of different cell types and markers across 18 species and 139 tissues. Spatial transcriptomics is covered by two new databases, STOmicsDB (74) and CROST (75), which each cover a range of species, tissues and diseases, and offer a superb range of visualisation options. Notably, the latter offers a cancer module integrating other kinds of relevant information such as copy number variation and epigenomics. In comparative genomics, two cornerstone resources provide updates. Ensembl (76) reports efforts on its Rapid Release site to provide quick annotations for outputs of biodiversity projects such as the Darwin Tree of Life; work in the area of food security, from livestock to 'orphan crops' to pests; and continued engagement with users around new features on its new Beta website. An update on the UCSC Genome Browser (77) reports on new species browsers, new data tracks for model species, and even new forms of data visualisation such as sequence logos. Finally, Monarch updates (78) on its efforts to map genes, phenotypes and diseases across model organisms. Their Knowledge Graph is suitable for machine learning methods, both for extracting new biological insights and for querying by their new ChatGPT plugin. Monarch is far from the only database looking at such large language models to summarise and present content to the user: it will clearly be important to understand both the opportunities and the risks, and to share best practice in the area.
As usual, cancer has a strong presence in the section on human genomic variation, diseases and drugs. COSMIC, the foundational database of variants and clinical data, returns to report new features such as a mutational signatures catalogue and an Actionability section to record clinical trials linked to particular variants (79). A valuable pair of new databases, CDS-DB (80) and ClinicalOmicsDB (81), focus on transcriptomic responses to cancer drug treatment. Another pair of new databases focus on fusion proteins, the result of gene fusion events: FusionPDB (82) catalogues and annotates fusion protein sequences, additionally taking some through to AlphaFold 2 modelling and small molecule screening; while FusionNeoAntigen (83) predicts neoantigens that span fusion gene breakpoints, testing their putative interaction with HLA molecules by explicit docking studies. A further duo of new arrivals, SCAR (84) and SORC (85), each cover spatial omics in cancer and offer similarly comprehensive resources for single cell transcriptomics data, biomarkers, cell-cell communications and so on. For drug targets, two major resources submit updates: the Therapeutic Target Database (86) features a major new focus on methods for predicting druggability while DGIdb (87) has new data sources, a new interface and improved FAIRness (i.e. findability, accessibility, interoperability and reusability). Major drug and chemical databases contributing updates are DrugBank (88), which reports a range of improvements including inclusion of mass spectrometry data and concise graphics illustrating drug mechanisms and metabolic pathways; and ChEMBL (89), which now covers 2 million unique compounds and 20 million bioactivity measurements. A number of databases address drug behaviour in the body, including returning databases VARIDT (90) and INTEDE (91) for drug transporters and metabolism respectively. The former now covers the influence of microbiota, post-translational modifications and differential expression on drug transporters while the latter offers superbly detailed and presented information on drug metabolic pathways and their intermediates. The new DRMref (92) uses scRNA-seq to identify drug resistance-related molecular mechanisms across different drugs, cell types and cancers.
Variant interpretation remains a key scientific challenge and two returning databases publish updates. The very popular CADD (93) reports improved performance from addition of new component scores, including from natural language methods, while VarCards (94) is designed especially to support clinical interpretation of variants by genetic counsellors. A trio of returning databases linking ncRNA to disease publish updates: HMDD (95) grows substantially and covers new categories of data such as exosomal or virus-encoded miRNAs; LncRNADisease (96) doubles in size and now distinguishes between causative and correlative disease relationships; and CircRNADisease (97) grows 20-fold in content and includes new information, for example on circRNA mutations and cancer. Finally, two major ontologies report updates in this section: the Human Disease Ontology celebrates 20 years (98) with a new Knowledgebase and provides insights into the exhaustive community discussions driving expansion of the resource; while the Human Phenotype Ontology reports similar collaborative efforts, as well as impressive progress in translating the ontology into other languages (99).
The plant database section features a particularly rich crop of new databases. Two cover plant imaging: OPIA (100) has half a million images and links genomic rice and wheat data to phenotypic i-traits (characteristics from image analysis such as plant height); while the PlantPAD paper (101) offers interesting case studies relating to their image data, such as machine learning for automated disease diagnosis. Other new resources cover crop plants: SoyMD (102) offers comprehensive multi-omics data on soybean to aid breeding efforts; TCOD (103) offers a variety of data related to 15 tropical crops; and CropGS-Hub (104) links genotype and phenotype in seven major crops. Elsewhere, PPGR (105) covers multi-omics and related analytical tools for the study of woody perennial plants, including 60 genome assemblies. A pair of new databases cover plant metabolomics: RefMetaPlant (106) constructs a reference metabolome from mass spectrometry measurements of more than 150 plants across the five major phyla while PMhub (107) notably includes genomic and transcriptomic information to enable genetic analysis of metabolites.
The final section features databases not readily accommodated elsewhere. LIPID MAPS returns after a long absence (108) to support lipidomics research with manual curation of literature, a defined lipid classification, and a renewed focus on FAIR principles. The UniLectin update (109) reports a new focused HumanLectome module while the popular Vesiclepedia database (110) now extends its coverage to new classes of extracellular particle such as exomeres and supermeres.

NAR online molecular biology database collection
Since the previous update, our NAR online Molecular Biology Database Collection (freely available at http://www.oxfordjournals.org/nar/database/c/) has been extensively updated, reviewing 1060 entries. A total of 97 new resources listed in this issue were added to the database and 388 obsolete entries have been removed. Further curation efforts were made reviewing recent updates and user feedback, bringing the total to 1959 databases in this new update. We encourage authors to submit their updates to XMF at xose.m.fernandez@gmail.com in plain text, ideally according to the template found at http://www.oxfordjournals.org/nar/database/summary/1.
In times of fast-moving technological change, we are keen to hear from the community about any changes we could make to the database collection to enhance its value. Feedback can be sent to nardatabase@gmail.com.

Table 2. Updated descriptions of databases most recently published elsewhere