DBM-DB: the diamondback moth genome database

The Diamondback Moth Genome Database (DBM-DB) is a central online repository for storing and integrating genomic data of the diamondback moth (DBM), Plutella xylostella (L.). It provides comprehensive search tools and downloadable datasets for scientists to study comparative genomics, biological interpretation and gene annotation of this insect pest. DBM-DB contains assembled transcriptome datasets from multiple DBM strains and developmental stages, and the annotated genome of P. xylostella (version 2). We have also integrated publicly available ESTs from NCBI and a putative gene set from a second DBM genome (KONAGAbase) to enable users to compare different gene models. DBM-DB was developed with the capacity to incorporate future data resources, and will serve as a long-term and open-access database that can be conveniently used for research on the biology, distribution and evolution of DBM. This resource aims to help reduce the impact DBM has on agriculture using genomic and molecular tools. Database URL: http://iae.fafu.edu.cn/DBM/


Introduction
The diamondback moth (DBM), Plutella xylostella (L.), has a worldwide distribution and is one of the most destructive insect pests of cruciferous food crops (1,2). Annual pest management costs for controlling DBM are approximately US$2 billion; however, if yield losses attributed to insect damage are included, overall estimates escalate to US$4-5 billion (3,4). Effective integrated pest management strategies rely on the rotation of insecticide sprays, although biological control can be remarkably effective against DBM (2,3,5). Overreliance or overuse of insecticides can have negative consequences on DBM control, including rapid development of insecticide resistance (6,7) and the suppression of beneficial parasitoid populations.
Although a global pest, DBM is also an excellent system for studies on comparative genomics, ecological entomology, morphogenesis, insecticide resistance, migration, phylogenetic evolution and interactions with host plants and/or natural enemies (4). Through sequencing the DBM genome and stage-specific transcriptomes, it is hoped new mechanisms for control will be identified, along with a greater understanding of this insect's biology. Next-generation sequencing technology has driven major advances in DBM genomics. Baxter et al. constructed a sequence-based genetic linkage map of the DBM genome using restriction-site associated DNA sequencing (RAD-Seq) (8). Subsequently, several DBM transcriptomes were sequenced by different organizations (9-11), and, in 2013, the DBM draft genome (Fuzhou-S strain) was publicly released (12). The genome was sequenced using the Illumina platform with a strategy that combined whole genome shotgun (WGS) data with 100 800 sequenced fosmid clones. Recently, the genome sequence of a second DBM strain (Bt-toxin susceptible strain PXS) was generated using the Roche 454 platform and the data were released at KONAGAbase (13).
© The Author(s) 2014. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Here, we present the DBM genome database (DBM-DB), an organism-specific database that coordinates the genomic resources available for this insect. The database provides researchers with user-friendly access to the genome sequence of the Fuzhou-S strain and related genomic and transcriptomic sequence data. DBM-DB provides a centralized database for the DBM research community, which can access it using a simple and intuitive interface. It also provides a platform for DBM research scientists to manually check gene model annotations and submit information detailing missing genes and/or misannotated genes to our centre (dbm@iae.fjau.edu.cn).

Genome assembly version 2
The Fuzhou-S genome was sequenced using the Illumina platform, and de novo assembled with custom software (Rabbit) that incorporated 100 800 fosmid clones and whole genome shotgun data that were both sequenced to a depth of >200X (12). As two divergent haplotypes may be retained within an assembly, we used the Poisson distribution-based K-mer statistic (12) to identify allelic regions containing >40% unique K-mers. Masking these redundant genomic regions with "n" characters produced the DBM genome assembly version 2. This version release included 1819 scaffolds with an N50 of 737 kb, of which 171 scaffolds were assigned to 31 linkage groups (8,12). The statistics of DBM genome version 2 are summarized and compared with those of the DBM genome described in KONAGAbase (Table 2).
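The N50 quoted above is the scaffold length at which scaffolds of that length or longer together contain at least half of all assembled bases. The following is a minimal sketch of that calculation, not the procedure used for the published assembly; the function name and the toy lengths are illustrative assumptions and are not taken from the DBM data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest, and the N50 is the
    length of the sequence at which the running total first reaches
    half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths in kb (hypothetical, for illustration only).
toy_lengths = [900, 737, 500, 300, 200, 100]
print(n50(toy_lengths))
```

With real data, the input would be the lengths of the scaffolds read from the assembly FASTA file.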

Official gene set version 1
The DBM whole-genome gene prediction was performed using a combination of approaches (Figure 1). First, genes were obtained using de novo prediction with the Augustus, Genscan and SNAP tools, which generated 19 073 gene objects. Second, homology prediction was conducted against four insect species: Drosophila melanogaster, Tribolium castaneum, Anopheles gambiae and Bombyx mori. Gene models generated through de novo and homology prediction were integrated using GLEAN (14); the transcriptomes generated from RNA-seq were then integrated to produce the Official Gene Set version 1 (OGSv1) containing 18 071 genes (denoted as "Px+number", for example Px018071) (12). The 18 071 predicted DBM genes were annotated using BLAST tools to predict gene function via homology from the Swiss-Prot and TrEMBL datasets in the UniProt database. Other gene annotations were conducted using the Gene Ontology (GO) (http://www.geneontology.org/) (15,16), Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/) (17,18) and InterPro (http://www.ebi.ac.uk/interpro/) (19) databases. As a result, functional information was obtained for 15 195 (84.08%) of the DBM OGSv1 genes.
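The "Px+number" naming scheme can be handled programmatically when cross-referencing gene lists. The sketch below assumes that the numeric part is always zero-padded to six digits, which is inferred only from the single example Px018071; the function names are hypothetical.

```python
import re

# Assumption: identifiers are "Px" followed by exactly six digits,
# as suggested by the example Px018071.
GENE_ID_PATTERN = re.compile(r"^Px(\d{6})$")


def format_gene_id(number):
    """Build an OGSv1-style identifier from an integer."""
    if not 1 <= number <= 999_999:
        raise ValueError(f"gene number out of range: {number}")
    return f"Px{number:06d}"


def parse_gene_id(gene_id):
    """Return the integer part of an OGSv1-style identifier."""
    match = GENE_ID_PATTERN.match(gene_id)
    if match is None:
        raise ValueError(f"not a Px-style gene identifier: {gene_id!r}")
    return int(match.group(1))


print(format_gene_id(18071))
print(parse_gene_id("Px018071"))
```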

Transcriptome de novo assembly
RNA-seq transcriptome datasets were generated from six different DBM samples, including eggs, larvae, pupae and adults of the insecticide-susceptible Fuzhou-S reference strain and larvae of chlorpyrifos- and fipronil-resistant strains (CRS, FRS). The samples were sequenced and de novo assembled into 171 262 non-redundant sequences (unigenes), of which 38 255 were functionally annotated. In OGSv1, 16 150 genes were expressed with RPKM (reads per kb per million reads) values ≥1 (9,12). A summary of the unigenes generated from our transcriptome datasets is presented alongside the unigene dataset described in KONAGAbase (Table 3). (Table 4).
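RPKM normalizes a read count by both transcript length (in kb) and sequencing depth (in millions of mapped reads). The sketch below uses the standard definition of RPKM and treats a gene as expressed when its value reaches 1; the function names and the toy counts are illustrative assumptions, not values from the DBM datasets.

```python
def rpkm(reads_mapped_to_gene, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    # 1e9 = 1e3 (bp to kb) * 1e6 (reads to millions of reads)
    return reads_mapped_to_gene * 1e9 / (gene_length_bp * total_mapped_reads)


def is_expressed(rpkm_value, threshold=1.0):
    """Apply an expression cut-off; the default of 1 follows the text."""
    return rpkm_value >= threshold


# Toy example: 500 reads on a 2 kb gene in a library of 25 million reads.
value = rpkm(500, 2000, 25_000_000)
print(value, is_expressed(value))
```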

Database organization
DBM-DB is an extensive online database that catalogues the DBM genomic data published by You et al. and He et al. (9). It is organized as a user-friendly web application with four primary components, Search, Overview, BLAST and GBrowse, which are interlinked through the Gene Information component (Figure 2).

Gene information
The Gene Information held within DBM-DB can be readily accessed using the four online components: Overview, Search, BLAST and GBrowse, as shown in Figure 2. A custom PHP script was developed to generate a dynamic HTML page summarizing the information for each gene in OGSv1, and the MySQL database management system was used to store and manage the DBM-DB datasets. Information on each of the 18 071 OGSv1 genes can be found in the Gene Information component, which contains the scaffold location, UniProt similarity description, Gene Ontology (GO) term, KEGG pathway annotation, protein domain annotation, CDS sequence, protein sequence and gene sequence (including introns) in FASTA format.
In addition, gene expression data generated by RNA-seq are provided as a foundation for the study of differential gene expression. The gene location is linked to GBrowse, which enables gene structure visualization and provides UniProt, GO, KEGG and InterPro database accession numbers where available. Each gene structure can be downloaded in GFF3 format from the Gene Information page. Nucleotide and protein sequences in FASTA format can also be obtained through the links provided (Figure 2).
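Gene structures downloaded in GFF3 format are plain text with nine tab-separated columns per feature. The sketch below shows one way such a line could be parsed; the function name is hypothetical and the example line is invented for illustration (scaffold name, source and coordinates are not taken from DBM-DB).

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dictionary of typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 percent-encodes reserved characters in attributes.
            attributes[unquote(key)] = unquote(value)
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }


# Hypothetical feature line, for illustration only.
example = "scaffold_1\tGLEAN\tmRNA\t1000\t5000\t.\t+\t.\tID=Px000001;Name=Px000001"
record = parse_gff3_line(example)
print(record["attributes"]["ID"], record["end"] - record["start"] + 1)
```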

Overview
A total of 171 scaffolds were assigned to 28 of 31 linkage groups, which represent different chromosomes (8,12). The Overview component in DBM-DB contains information listing the scaffolds that have been assigned to specific linkage groups. The Linkage Groups List option enables users to browse all scaffolds with linkage group assignment, and the All Scaffolds Information List enables users to browse or search for specific scaffolds (Figure 2). Furthermore, the All Scaffolds Information List provides data outlining the ... The KONAGAbase unigenes were assembled from the EST/mRNA sequences from NCBI, the ESTs from midgut, egg and testes, and the RNA-seq contigs of fourth-instar DBM larvae.

BLAST server
To facilitate sequence homology searches, we implemented the basic local alignment search tool (BLAST) (21). Users can search against DBM sequences including genomic scaffolds, transcriptomic unigenes and OGSv1 CDS or proteins. The scaffolds, unigenes and gene CDS sequences can be searched with nucleotide queries using the blastn or tblastx options. Searches against the protein sequence database can be run with blastp for protein queries and blastx for nucleotide queries. In addition, we developed a set of PHP scripts that call the BLAST program and customize its output so that each DBM-DB subject ID is linked to the corresponding Gene Information component.
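The pairing of query type, database type and BLAST program described above can be summarized as a small lookup. This is only an illustrative sketch of that selection logic, not the PHP code used by DBM-DB; the function and argument names are assumptions, and only the four programs named in the text are covered.

```python
def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a BLAST program for a query/database combination.

    query_type and database_type are "nucleotide" or "protein".
    translate_both selects tblastx (both sides translated) instead of
    blastn and is only meaningful for nucleotide vs nucleotide searches.
    """
    pair = (query_type, database_type)
    if translate_both and pair != ("nucleotide", "nucleotide"):
        raise ValueError("translate_both requires nucleotide query and database")
    programs = {
        ("nucleotide", "nucleotide"): "tblastx" if translate_both else "blastn",
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",
    }
    if pair not in programs:
        # Combinations not listed in the text are rejected here.
        raise ValueError(f"unsupported combination: {pair}")
    return programs[pair]


print(choose_blast_program("nucleotide", "nucleotide"))
print(choose_blast_program("nucleotide", "protein"))
```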

Genome visualization
The genome browser (GBrowse) is a tool that integrates databases and interactive web pages for visualizing genome information (22). GBrowse can display a specific DBM scaffold with the following: (i) the corresponding annotation and structure of our OGSv1 genes; (ii) homologous, functionally annotated unigenes; (iii) DBM ESTs from NCBI and (iv) the putative PXS gene set from KONAGAbase. Users can therefore view and navigate genomic scaffolds, which include information for gene annotations, gene structure (based upon OGSv1), ESTs and PXS genes. This enables users to simultaneously view independent datasets when assessing gene models. CDS and gene tracks are linked to the Gene Information component, and external links to GenBank and KONAGAbase are available by clicking the EST or PXS gene alignment tracks (Figure 2).

Download page
On the download page, both FTP and HTTP links are provided for users to download entire datasets, as required.
The FTP site of DBM-DB (ftp://iae.fafu.edu.cn/pub) contains genomic scaffolds (draft genome version 1 and version 2), predicted OGSv1 gene sequences in FASTA format and gene structures in GFF3 format. Gene annotation is also provided, including gene functional descriptions and KEGG, GO and InterPro domain annotations. DBM transcriptomes from egg, larvae, pupae and adult tissues are available for download, along with the combined de novo assembled transcriptome (All-Unigene assembly version 1) in FASTA format plus their expression information. In addition, some useful files are available, which include alignments between scaffolds/fosmid contigs and different DBM sequences (ESTs, PXS genes, functionally annotated unigenes).
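Downloaded FASTA files can be read with a few lines of code. The sketch below is a minimal reader that yields (header, sequence) pairs; the function name is hypothetical and the two toy records are invented for illustration rather than taken from the DBM-DB files.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input standing in for a downloaded file (hypothetical records).
toy = io.StringIO(">Px000001\nATGGCC\nTTAA\n>Px000002\nATGAAATAG\n")
records = list(read_fasta(toy))
print([(name, len(seq)) for name, seq in records])
```

For a real file, the same function can be applied to a handle returned by open() on the downloaded path.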

System implementation
DBM-DB was developed on a Linux system using several common software packages, including PHP, the Apache web server, the MySQL database management system and Perl FastCGI (Figure 3). Several custom PHP scripts were developed to make the database flexible, interactive and intuitive so that users can readily access and obtain the information they need for either molecular analysis or practical application. In addition, the Generic Genome Browser (GBrowse) package, a component of the Generic Model Organism Database (GMOD) project, was used for genome data visualization, which allows users to obtain information on gene structures based on the DBM genome assembly. To enable searches against the DBM genome, a local BLAST tool was installed in the DBM-DB system.

Future work
DBM-DB provides a large-scale set of genomic data and a convenient tool for further research on the genomics, genetics and molecular biology of P. xylostella and other insect species. The database was designed with room to accommodate future data, and efforts will be made to regularly update and upgrade the data resources. We aim to improve access to both transcriptome and genome data in the future. Future resources to be developed include digital gene expression profiling of different developmental stages or tissues, data supporting microRNA expression and the metagenomics of DBM midguts. Genome resources will be updated when appropriate, including improved scaffolds, assignment of additional scaffolds to chromosomes using genetic mapping, more precise gene prediction and functional annotation, and upcoming information on DBM phylogeography. Further, DBM sequences from the NCBI database as well as DBM-related publications will be integrated into DBM-DB. To further support the capability of DBM-DB to serve the research community, new web tools are being developed to allow more efficient and effective use of the DBM genomic information housed in DBM-DB.