-
PDF
- Split View
-
Views
-
Cite
Cite
Krishnamurthy Subramanian, Bryan Payne, Felix Feyertag, David Alvarez-Ponce, The Codon Statistics Database: A Database of Codon Usage Bias, Molecular Biology and Evolution, Volume 39, Issue 8, August 2022, msac157, https://doi.org/10.1093/molbev/msac157
- Share Icon Share
Abstract
We present the Codon Statistics Database, an online database that contains codon usage statistics for all the species with reference or representative genomes in RefSeq (over 15,000). The user can search for any species and access two sets of tables. One set lists, for each codon, the frequency, the Relative Synonymous Codon Usage, and whether the codon is preferred. Another set of tables lists, for each gene, its GC content, Effective Number of Codons, Codon Adaptation Index, and frequency of optimal codons. Equivalent tables can be accessed for (1) all nuclear genes, (2) nuclear genes encoding ribosomal proteins, (3) mitochondrial genes, and (4) chloroplast genes (if available in the relevant assembly). The user can also search for any taxonomic group (e.g., “primates”) and obtain a table comparing all the species in the group. The database is free to access without registration at http://codonstatsdb.unr.edu.
Introduction
Most amino acids are encoded by multiple synonymous codons. Despite encoding for the same amino acid, some synonymous codons are used significantly more often than others, a phenomenon known as codon usage bias. Species significantly differ in their codon preferences—for instance, glutamic acid is preferentially encoded by GAG in human, whereas the same amino acid is preferentially encoded by GAA in Escherichia coli (Ikemura 1982; Sharp et al. 2010). In addition, genes within any given genome differ in their patterns of codon usage. In particular, gene expression levels significantly correlate with gene-specific metrics of codon usage such as the Effective Number of Codons (ENC; Wright 1990), the Codon Adaptation Index (CAI; Sharp and Li 1987), or the frequency of optimal codons (Fop; Ikemura 1985) (e.g., Gouy and Gautier 1982).
Codon preferences can be affected by a number of factors, including the genome’s nucleotide composition (e.g., AT-rich genomes tend to use codons ending in A or T) and translational selection (codons that are translated by highly abundant tRNAs are translated faster and with fewer errors; e.g., Ikemura 1985; Hershberg and Petrov 2008).
Understanding codon preferences across the different species and genes is important not only to understanding genome evolution, but can also inform tasks such as heterologous expression, gene prediction, or phylogenetic inference (e.g., Gustafsson et al. 2004; Christianson 2005; Rota-Stabelli et al. 2013). In addition, the patterns of codon usage of viruses tend to be similar to those of their host species (e.g., Shackelton et al. 2006). Despite the relevance of maintaining species- and gene-specific codon usage information, existing databases have not been updated in a long time, focus on specific taxa, and/or do not provide gene-specific metrics (Nakamura et al. 2000; Hilterbrand et al. 2012; Athey et al. 2017).
Implementation
For each of the species with reference or representative genomes in the RefSeq database (release 207), we chose one full assembly (in order of preference, the one labeled as “reference,” the one with the highest assembly level, or the most recent one) and retrieved the corresponding coding sequences (CDSs) file. Using that file as input, a number of tables were pre-computed using an R pipeline. To avoid codon redundancy, only one CDS per gene was used (if multiple were available, the longest one was chosen). The web interface was created using PERL CGI.
For each species, we computed the total frequency of each codon, and used the information to compute the Relative Synonymous Codon Usage (RSCU) of each codon. For each gene, we computed the GC content for the entire CDS (GC), the GC content at third codon positions (GC3), the ENC, and the RSCU for each codon.
For species with over 1,000 genes, we also compared genes inferred to be highly expressed (bottom 10% ENC values) with genes inferred to be lowly expressed (top 10% ENC values). Codons with significantly higher RSCU values in the highly expressed gene set (according to a Mann–Whitney U test) were considered preferred/optimal. We then computed the Fop for each gene. The highly expressed gene set was also used as reference to compute the CAI of each gene.
The Codon Statistics Database
We present the Codon Statistics Database, an online database that contains codon usage information for all species with reference or representative genomes in RefSeq (over 15,000). The user can search for any species or taxonomic group by taxonomic ID (e.g., “9606”), scientific name (e.g., “Homo sapiens”), or common name (e.g., “human”), and select an option from a drop-down menu.
If a species is selected, the user is directed to a table that lists, for each codon, the encoded amino acid, the total count in the genome, the RSCU, and whether the codon is preferred or unpreferred (fig. 1). The user can visualize and download equivalent tables for (1) all nuclear genes (default option), (2) nuclear genes encoding ribosomal proteins (this subset is included since such proteins are often highly expressed and thus subjected to strong codon bias), (3) mitochondrial genes, and (4) chloroplast genes (if such gene sets are available in the relevant genome assembly). For viruses, only one option including all genes is available. Additionally, for each gene set, the user can download a tab-delimited file (.tsv) listing the following statistics for each gene: GC, GC3, ENC, CAI, and Fop.

Species summary. Codon statistics corresponding to all human nuclear genes are shown.
If a taxonomic group with multiple species is selected (e.g., “7215,” “Drosophila,” or “fruit flies”), the user is presented with a table comparing all the species in the group (fig. 2). The user has the option to visualize either codon counts or RSCU values. Preferred codons in each species are marked with asterisks.

Taxonomic group summary. Codon preferences for species in the genus Drosophila are shown.
Acknowledgments
We are grateful to Alejandra Nores for assistance with web design, and to the Office of Information Technology of the University of Nevada, Reno for computational resources. This work was supported by Nevada INBRE (funded by grant P20GM103440, National Institute of General Medical Sciences, National Institutes of Health) and by the National Science Foundation (grant MCB 1818288).
Data Availability
All data used in this work are derived from the RefSeq database.
References
Author notes
Krishnamurthy Subramanian and Bryan Payne contributed equally to this work.