GenBank ® is a comprehensive database that contains publicly available DNA sequences for more than 165 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in the UK and the DNA Data Bank of Japan helps to ensure worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at http://www.ncbi.nlm.nih.gov .
Received September 15, 2004; Revised and Accepted October 5, 2004
GenBank ( 1 ) is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotation, built and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the US National Institutes of Health (NIH) in Bethesda, MD.
NCBI builds GenBank primarily from the submission of sequence data from authors and from the bulk submission of expressed sequence tag (EST), genome survey sequence (GSS) and other high-throughput data from sequencing centers. The US Patent and Trademark Office (USPTO) also contributes sequences from issued patents. GenBank incorporates sequences submitted to the EMBL Data Library ( 2 ) in the United Kingdom and the DNA Data Bank of Japan (DDBJ) ( 3 ) as part of a long-standing international collaboration among the three databases in which data are exchanged daily to ensure a uniform and comprehensive collection of sequence information. NCBI makes the GenBank data available at no cost over the Internet, via FTP and through a wide range of web-based retrieval and analysis services ( 4 ).
ORGANIZATION OF THE DATABASE
GenBank continues to grow at an exponential rate with 7.9 million new sequences added over the past 12 months. As of Release 143 in August 2004, GenBank contained over 41.8 billion nucleotide bases from 37.3 million individual sequences. Complete genomes ( http://www.ncbi.nlm.nih.gov/Genomes/index.html ) represent a growing portion of the database, with over 50 of more than 180 complete microbial genomes in GenBank deposited over the past year. The number of eukaryote genomes for which coverage and assembly are good continues to increase as well, with over 20 such assemblies now available, including that of the reference human genome.
Database sequences are classified and can be queried using a comprehensive sequence-based taxonomy ( http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html ) developed by NCBI in collaboration with EMBL and DDBJ and with the valuable assistance of external advisers and curators. Over 165 000 named species are represented in GenBank and new species are being added at the rate of over 2000 per month. About 19% of the sequences in GenBank are of human origin and 13% of all sequences are human ESTs. After Homo sapiens , the top species in GenBank in terms of number of bases are Mus musculus , Rattus norvegicus , Danio rerio , Zea mays , Oryza sativa , Drosophila melanogaster , Gallus gallus and Canis familiaris .
GENBANK RECORDS AND DIVISIONS
Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, bibliographic references and a table of features ( http://www.ncbi.nlm.nih.gov/collab/FT/index.html ) listing areas of biological significance, such as coding regions and their protein translations, transcription units, repeat regions, and sites of mutations or modifications.
The files in the GenBank distribution have traditionally been partitioned into ‘divisions’ that roughly correspond to taxonomic groups such as bacteria (BCT), viruses (VRL), primates (PRI) and rodents (ROD). In recent years, divisions have been added to support specific sequencing strategies. These include divisions for EST, GSS, high-throughput genomic (HTG) and high-throughput cDNA (HTC) sequences, making a total of 17 divisions. For convenience in file transfer, the larger divisions, such as EST and PRI, are partitioned into multiple files for the bimonthly GenBank releases on NCBI's FTP site.
Expressed sequence tags
ESTs continue to be the major source of new sequence records and gene sequences, comprising over 12 billion nucleotide bases in GenBank release 143. Over the past year, the number of ESTs has increased by over 29% to a total of 23.4 million sequences representing over 740 different organisms. The top five organisms represented in the EST division are H.sapiens (5.7 million records), M.musculus (4.2 million records), Ciona intestinalis (684 000 records), R.norvegicus (617 000 records) and D.rerio (575 000 records). As part of its daily processing of GenBank EST data, NCBI identifies through BLAST searches all homologies for new EST sequences and incorporates that information into the companion database, dbEST ( http://www.ncbi.nlm.nih.gov/dbEST/index.html ) ( 5 ). The data in dbEST is processed further to produce the UniGene database ( http://www.ncbi.nlm.nih.gov/UniGene/ ) of more than 700 000 gene-oriented sequence clusters representing over 50 organisms, as described in detail previously ( 4 ).
Sequence-tagged sites (STSs) and Genome survey sequences (GSSs)
The STS division of GenBank ( http://www.ncbi.nlm.nih.gov/dbSTS/index.html ) contains over 379 000 sequences including anonymous STSs based on genomic sequence as well as gene-based STSs derived from the 3′ ends of genes and ESTs. These STS records usually include primer sequences, annotations and PCR conditions.
The GSS division of GenBank ( http://www.ncbi.nlm.nih.gov/dbGSS/index.html ) has grown over the past year by 50% to a total of 9.6 million records for over 430 organisms and comprises over 5.7 billion nucleotide bases. GSS records are predominantly single reads from bacterial artificial chromosomes (‘BAC-ends’) used in a variety of genome sequencing projects. The most highly represented species in the GSS division are Z.mays (1.8 million records), M.musculus (1.5 million records), H.sapiens (898 000 records) and C.familiaris (854 000 records). The human data has been used ( http://www.ncbi.nlm.nih.gov/genome/clone ) along with the STS records in tiling the BACs for the Human Genome Project ( 6 ).
High-throughput genomic and high-throughput cDNA sequences
The HTG division of GenBank ( http://www.ncbi.nlm.nih.gov/HTGS/ ) contains unfinished large-scale genomic records that are in transition to a finished state ( 7 ). These records are designated as Phase 0–3 depending on the quality of the data. Upon reaching Phase 3, the finished state, HTG records are moved into the appropriate organism division of GenBank. As of release 143 of GenBank, the HTG division comprised over 11 billion base pairs of sequence.
The HTC division of GenBank accommodates high-throughput cDNA sequences. HTCs are of draft quality but may contain 5′-untranslated regions (5′-UTRs) and 3′-UTRs, partial coding regions and introns. HTC sequences that are finished and of high quality are moved to the appropriate organism division of GenBank. GenBank release 143 contained more than 319 000 HTC sequences totaling over 392 million bases. One project generating HTC data was described previously ( 8 ) and other projects are listed at http://www.ncbi.nlm.nih.gov/genome/flcdna/ .
Sequence identifiers and accession numbers
Each GenBank record, consisting of both a sequence and its annotations, is assigned a stable and unique identifier, the accession number, which remains constant over the lifetime of the record even when there is a change to the sequence or annotation. The DNA sequence within a GenBank record is also assigned a unique identifier, called a ‘gi’, that appears on the Version line of GenBank flatfile records following the accession number. A third identifier of the form ‘Accession.version’, also displayed on the Version line of flatfile records, consolidates the information present in both the gi and accession numbers. An entry appearing in the database for the first time has an ‘Accession.version’ identifier equivalent to the Accession number of the GenBank record followed by ‘.1’ to indicate the first version of the sequence for the record, e.g.:
VERSION AF000001.1 GI: 987654321
When a change is made to a sequence given in a GenBank record, a new gi number is issued to the sequence and the version extension of the ‘Accession.version’ identifier is incremented. The accession number for the record as a whole remains unchanged and the older sequence remains available under the old ‘Accession.version’ identifier and gi.
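The relationship among the three identifiers can be illustrated with a minimal Python sketch. It assumes that the Version line always follows the layout of the example above; the names `VersionLine`, `parse_version_line` and `is_sequence_update` are hypothetical and are not part of any NCBI software.

```python
import re
from typing import NamedTuple


class VersionLine(NamedTuple):
    accession: str  # stable identifier for the record as a whole
    version: int    # incremented each time the sequence changes
    gi: int         # identifier for this particular sequence


# Layout assumed from the example "VERSION AF000001.1 GI: 987654321".
_VERSION_RE = re.compile(r"^VERSION\s+([A-Z_]+\d+)\.(\d+)\s+GI:\s*(\d+)\s*$")


def parse_version_line(line: str) -> VersionLine:
    """Split a flatfile Version line into accession, version and gi."""
    match = _VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognizable VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return VersionLine(accession, int(version), int(gi))


def is_sequence_update(old: VersionLine, new: VersionLine) -> bool:
    """True if `new` is a later sequence version of the same record."""
    return (
        new.accession == old.accession  # accession never changes
        and new.version > old.version   # version extension is incremented
        and new.gi != old.gi            # a new gi is issued to the sequence
    )
```

Under these assumptions, `parse_version_line` applied to the example line yields the accession `AF000001`, version 1 and the gi from that line, and `is_sequence_update` expresses the rule that a sequence change keeps the accession but advances both the version and the gi.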
A similar system tracks changes in the corresponding protein translations using ‘Accession.version’ identifiers composed of a protein accession number, e.g. AAA00001, followed by a version number. These identifiers appear as qualifiers for CDS features in the Features portion of a GenBank entry, e.g. /protein_id=‘AAA00001.1’. Protein sequence translations also receive their own unique gi number, which appears as a second qualifier on the CDS feature, e.g. /db_xref=‘GI:1233445’.
Whole Genome Shotgun (WGS) sequence and identifiers
WGS sequences appear in GenBank as sets of WGS contigs, many of them bearing annotations, originating from a single sequencing project. These sequences are issued accession numbers consisting of a four-letter project ID, followed by a two-digit version number and a six-digit contig ID. Hence, the WGS accession number ‘AAAA01072744’ is assigned to contig number ‘072744’ of the first version of project ‘AAAA’. WGS sequencing projects have contributed over 4 000 000 contigs to GenBank and these primary sequences have been used to construct some 237 000 large-scale assemblies of scaffolds and chromosomes. WGS project contigs for H.sapiens , C.familiaris , Pan troglodytes , Drosophila , Saccharomyces and more than 100 other organisms and environmental samples are available. For a complete list of WGS projects with links to the data, see http://www.ncbi.nlm.nih.gov/Genbank/WGSprojectlist.html .
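The structure of a WGS accession number described above can be decomposed with a short Python sketch. The pattern covers only the 4 + 2 + 6 layout given in the text; `WgsAccession` and `parse_wgs_accession` are hypothetical names chosen for illustration.

```python
import re
from typing import NamedTuple, Optional


class WgsAccession(NamedTuple):
    project: str  # four-letter project ID
    version: int  # two-digit project version
    contig: str   # six-digit contig ID, kept as text to preserve leading zeros


# Four letters, two digits, six digits, as described in the text.
_WGS_RE = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


def parse_wgs_accession(accession: str) -> Optional[WgsAccession]:
    """Return the parts of a WGS accession, or None if the layout differs."""
    match = _WGS_RE.match(accession)
    if match is None:
        return None
    return WgsAccession(match.group(1), int(match.group(2)), match.group(3))
```

For the example in the text, `parse_wgs_accession("AAAA01072744")` gives project `AAAA`, version 1 and contig `072744`; a conventional accession such as `AF000001` does not match the layout and returns `None`.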
BUILDING THE DATABASE
The data in GenBank and in the collaborating databases, EMBL and DDBJ, are submitted primarily by individual authors to one of the three databases, or by sequencing centers as batches of EST, STS, GSS, HTC, WGS or HTG sequences. Data are exchanged daily with DDBJ and EMBL so that the daily updates from the NCBI servers incorporate the most recently available sequence data from all sources.
Direct submission
Virtually all records enter GenBank as direct electronic submissions ( http://www.ncbi.nlm.nih.gov/Genbank/index.html ), with the majority of authors using the BankIt or Sequin programs. Many journals require authors with sequence data to submit the data to a public database as a condition of publication.
GenBank staff can usually assign an accession number to a sequence submission within 2 working days of receipt, and do so at a rate of almost 700 per day. The accession number serves as confirmation that the sequence has been submitted and allows readers of articles in which the sequence is cited to retrieve the data. Direct submissions receive a quality assurance review that includes checks for vector contamination, proper translation of coding regions, correct taxonomy and correct bibliographic citations. A draft of the GenBank record is passed back to the author for review before it enters the database. Authors may ask that their sequences be kept confidential until the time of publication. Since GenBank policy requires that deposited sequence data be made public when the sequence or accession number is published, authors are instructed to inform the GenBank staff of the publication date of the article in which the sequence is cited in order to ensure a timely release of the data. Although only the submitting scientist is permitted to modify sequence data or annotations, all users are encouraged to report lags in releasing data or possible errors or omissions to GenBank at firstname.lastname@example.org .
NCBI works closely with sequencing centers to ensure timely incorporation of bulk data into GenBank for public release. GenBank offers special batch procedures for large-scale sequencing groups to facilitate data submission, including the program ‘tbl2asn’, described at http://www.ncbi.nlm.nih.gov/Sequin/table.html .
Third Party Annotation (TPA)
TPA records are designed to support the reporting of published, experimentally confirmed sequence annotation by a scientist other than the original submitter of the primary sequence record in DDBJ/EMBL/GenBank. TPA sequences may be created by assembling a number of primary sequences. The format of a TPA record (e.g. BK000016) is similar to that of a conventional GenBank record but includes the label ‘TPA:’ at the beginning of each Definition Line and the keywords ‘Third Party Annotation; TPA’ in the Keywords field. The Comment field of TPA records lists the primary sequences used to assemble the TPA sequence; the Primary field provides the base ranges of the primary sequences that contribute to the TPA sequence.
TPA submissions to GenBank may be made using either BankIt or Sequin, but TPA sequences are not released to the public until their accession numbers or sequence data and annotation appear in a peer-reviewed biological journal. For more information on TPA, see http://www.ncbi.nlm.nih.gov/Genbank/tpa.html .
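The two textual markers of a TPA record, the ‘TPA:’ label on the Definition Line and the keywords ‘Third Party Annotation; TPA’, lend themselves to a simple check. The Python sketch below is illustrative only: it assumes that flatfile fields start with the upper-case labels `DEFINITION` and `KEYWORDS` in the first column and that continuation lines are indented, and the function name `looks_like_tpa` is hypothetical.

```python
def looks_like_tpa(flatfile_text: str) -> bool:
    """Check a flatfile record for the two TPA markers described in the text."""
    fields = {"DEFINITION": "", "KEYWORDS": ""}
    current = None  # field currently being read, if it is one we track
    for line in flatfile_text.splitlines():
        label = line.split(" ", 1)[0]
        if label in fields:
            current = label
            fields[current] = line[len(label):].strip()
        elif line.startswith(" ") and current is not None:
            # Indented line: assumed to continue the field being read.
            fields[current] += " " + line.strip()
        else:
            current = None  # some other field has started

    has_label = fields["DEFINITION"].startswith("TPA:")
    # Keywords are assumed to be separated by ';' and closed by a period.
    terms = {term.strip().rstrip(".") for term in fields["KEYWORDS"].split(";")}
    has_keywords = {"Third Party Annotation", "TPA"} <= terms
    return has_label and has_keywords
```

The check requires both markers, so a conventional record whose Definition Line merely mentions TPA elsewhere in the text would not be classified as a TPA record.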
Removal of 350 kb sequence length limit on GenBank records
In 1995, the DDBJ/EMBL/GenBank International Nucleotide Sequence Collaboration databases agreed to a 350 kb limit on the size of most database sequence records in order to conform to the limitations on sequence length of existing molecular biology software. Exceptions were made in the cases of HTG sequence, assemblies of WGS project data and large eukaryotic genes. The large records that were broken into multiple 350 kb segments to conform to the standard were represented in the GenBank ‘CON’ division as sets of assembly instructions to allow the transparent display and download of the full record using tools such as NCBI's Entrez. Owing to the greater ability of current software programs to efficiently handle long sequences, the 350 kb limit was removed by the Database Collaborators as of June 2004. Although the removal of the limit has immediately allowed many genomes, such as bacterial genomes, to be represented in GenBank as single sequences, it will still be desirable from the standpoint of data transfer and analysis to break some very long sequences, such as portions of eukaryotic genomes, into smaller segments. In these cases, CON division records for the entire sequence will continue to contain assembly instructions to allow the seamless display and download of the sequence.
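The arithmetic behind the former segmentation can be shown with a brief Python sketch. It computes only the base ranges into which a long sequence would fall under a 350 kb limit; it does not reproduce the actual layout of CON division assembly instructions, and the names `SEGMENT_LIMIT`, `segment_ranges` and `reassemble` are hypothetical.

```python
from typing import List, Sequence, Tuple

SEGMENT_LIMIT = 350_000  # the former 350 kb limit, in bases


def segment_ranges(total_length: int,
                   limit: int = SEGMENT_LIMIT) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) ranges covering the sequence."""
    if total_length < 0 or limit <= 0:
        raise ValueError("total_length must be >= 0 and limit must be > 0")
    return [
        (start, min(start + limit - 1, total_length))
        for start in range(1, total_length + 1, limit)
    ]


def reassemble(segments: Sequence[str]) -> str:
    """Join segment sequences, in order, back into the full sequence."""
    return "".join(segments)
```

Under this scheme an 800 000 base sequence falls into three ranges, 1 to 350 000, 350 001 to 700 000 and 700 001 to 800 000, and concatenating the segments in order restores the full sequence.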
BankIt
About one-third of author submissions are received through NCBI's web-based data submission tool, BankIt ( http://www.ncbi.nlm.nih.gov/BankIt ). Using BankIt, authors enter sequence information directly into a form, edit as necessary and add biological annotation, such as coding regions or mRNA features. Free-form text boxes, list boxes and pull-down menus allow the submitter to further describe the sequence without having to learn formatting rules or restricted vocabularies. BankIt validates submissions, flags many common errors and checks for vector contamination using a variant of BLAST called VecScreen, before creating a draft record in GenBank flat file format for the submitter to review. BankIt is the tool of choice for simple submissions, especially when only one or a small number of records are to be submitted ( 7 ). BankIt can also be used by submitters to update their existing GenBank records.
Sequin and tbl2asn
NCBI also offers a standalone multi-platform submission program called Sequin ( http://www.ncbi.nlm.nih.gov/Sequin/index.html ) that can be used interactively with other NCBI sequence retrieval and analysis tools. Sequin handles simple sequences such as a cDNA, as well as segmented entries, phylogenetic studies, population studies, mutation studies, environmental samples and alignments for which BankIt and other web-based submission tools are not well-suited. Sequin has convenient editing and complex annotation capabilities and contains a number of built-in validation functions for quality assurance. In addition, Sequin is able to accommodate large sequences, such as that of the 5.6 Mb Escherichia coli genome, and read in a full complement of annotations via simple tables. Versions for Macintosh, PC and Unix computers are available via anonymous FTP at ‘ ftp.ncbi.nih.gov ’ in the ‘sequin’ directory. Once a submission is completed, submitters can email the Sequin file to email@example.com .
Submitters of large, heavily annotated genomes may find it convenient to use ‘tbl2asn’, referenced above under ‘Direct submission’, to convert a table of annotations generated via an annotation pipeline into an ASN.1 record suitable for submission to GenBank.
RETRIEVING GENBANK DATA
The ENTREZ system
The sequence records in GenBank are accessible via Entrez ( http://www.ncbi.nlm.nih.gov/Entrez/ ), a robust and flexible database retrieval system that covers over 20 biological databases containing DNA and protein sequence data, genome mapping data, population sets, phylogenetic sets, environmental sample sets, gene expression data, the NCBI taxonomy, protein domain information, protein structures from the Molecular Modeling Database, MMDB ( 9 ) and MEDLINE references via PubMed. The Entrez sequence databases are taken from a variety of sources and therefore include more sequence data than is available within GenBank alone.
BLAST sequence-similarity searching
Sequence-similarity searches are the most frequent and basic type of analysis performed on the GenBank data. NCBI offers the BLAST ( http://www.ncbi.nlm.nih.gov/BLAST/ ) family of programs to locate regions of similarity between a query sequence and database sequences ( 10 , 11 ). BLAST searches may be performed on NCBI's website, or using a set of standalone programs distributed by FTP. BLAST is discussed in more detail elsewhere ( 4 ).
Obtaining GenBank by FTP
NCBI distributes the GenBank releases in the traditional flat-file format as well as in the Abstract Syntax Notation (ASN.1) format used for internal maintenance. The full bimonthly GenBank release and the daily updates, which also incorporate sequence data from EMBL and DDBJ, are available by anonymous FTP from NCBI at ( ftp.ncbi.nih.gov ) as well as from two mirror sites, at the San Diego Supercomputer Center ( ftp://genbank.sdsc.edu/pub/ ) and at Indiana University ( ftp://bio-mirror.net/biomirror/genbank/ ). The full release in flat-file format is available as compressed files in the directory ‘genbank’, with a non-cumulative set of updates contained in ‘daily-nc’. A script is provided in the ‘tools’ directory of the GenBank FTP site to convert a set of daily updates into a cumulative update.
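The idea of folding non-cumulative daily updates into one cumulative update can be sketched in Python as follows. This is not the script distributed in the ‘tools’ directory, whose behavior is not described here; it assumes that each daily update has already been parsed into (accession, record text) pairs and that the most recent copy of a record should replace any earlier one. The name `cumulative_update` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (accession, flat-file text); parsing is assumed done


def cumulative_update(daily_updates: Iterable[Iterable[Record]]) -> List[Record]:
    """Merge daily updates, oldest first, keeping the latest copy of each record."""
    latest: Dict[str, str] = {}
    for day in daily_updates:
        for accession, text in day:
            # Drop any earlier copy so the record moves to the end of the
            # output, reflecting the day on which it was last updated.
            latest.pop(accession, None)
            latest[accession] = text
    return list(latest.items())
```

Keying on the accession number relies on its stability: because the accession remains constant when a record's sequence or annotation changes, a later daily update can safely supersede an earlier one for the same record.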
GenBank, National Center for Biotechnology Information, Building 38A, Room 8S-803, 8600 Rockville Pike, Bethesda, MD 20894, USA. Tel: +1 301 496 2475; Fax: +1 301 480 9241.
http://www.ncbi.nlm.nih.gov/ (NCBI Home Page), firstname.lastname@example.org (submission of sequence data to GenBank), email@example.com (revisions to GenBank entries and notification of release of ‘confidential’ entries), firstname.lastname@example.org (general information about NCBI and services).
If you use the GenBank database in your published research, we ask that this paper be cited.