An update of KAIKObase, the silkworm genome database

Abstract KAIKObase was established in 2009 as the genome database of the domesticated silkworm Bombyx mori. It provides several gene sets and genetic maps as well as the genome annotation obtained from the sequencing project of the International Silkworm Genome Consortium in 2008. KAIKObase has been used widely for silkworm and insect studies, even though some predicted genes are erroneous because of misassembly and gaps in the genome. In 2019, we released a new silkworm genome assembly that shows improved gap closure and covers more and longer gene models. The new genome and new gene models therefore need to be incorporated into KAIKObase. In this article, we present the updated contents of KAIKObase and the methods used to generate, integrate and analyze the data sets. Database URL: https://kaikobase.dna.affrc.go.jp


Introduction
Bombyx mori L., the domesticated silkworm, has lived alongside humans for thousands of years, giving rise to the sericulture industry and providing materials for clothing and artwork. It lost its ability to move, fly and forage during the domestication process, making it an ideal experimental animal in the laboratory. As a consequence, the silkworm has long been studied as a model organism for revealing the genetic mechanisms of insect physiology (1-5). It is also a useful reference and platform for studying lepidopteran pests. Lepidopteran pests cause great damage to crops and vegetables, and those that develop pesticide resistance have become an acute problem. Therefore, there is an urgent need to reveal the mechanisms of pesticide resistance for pest monitoring and management. With the help of the silkworm, the genetic basis of resistance to several pesticides has been identified in other lepidopteran species (6)(7)(8)(9), and molecular diagnostic methods could be developed based on these discoveries (10). Moreover, in recent years, its ability to produce silk proteins in bulk in cocoons has made it an ideal protein factory for producing proteins of interest at low cost through genetic engineering. Transgenic silkworms have been applied widely, including in the production of antibodies, drugs and cosmetic materials such as collagen (11,12). Silk itself has also attracted great attention as a material for diverse biomedical uses (13)(14)(15)(16). Precise genomic and genetic information is thus needed to make full use of the silkworm with genetic engineering technologies. Consequently, a database of silkworm genomic resources can not only serve as the basis of silkworm studies but also facilitate research in diverse fields.
The high-quality genome sequences and genetic maps of the silkworm were first released independently by a Japanese group and a Chinese group in 2004 (17,18), and an upgraded version was then released through the collaboration of these two groups in 2008 (19). KAIKObase (20) was constructed based on the 2008 genome, providing a wide range of knowledge including physical and genetic linkage maps, as well as gene structures and annotations. It also provides keyword, position and sequence similarity search services for researchers in entomology, pest management, biomaterials and other fields. Since its launch in 2009, KAIKObase has remained a steady and indispensable knowledge base of silkworm genetics for researchers, with almost 1 million accesses per year.
Since it started to provide services, KAIKObase has been updated several times: integrating other silkworm-related resources (version 2), switching to a Chado database (version 3.0.0), adding a full-length cDNA sequence data set (21) (version 3.1.0) and, in the most recent update in 2013 (version 3.2.2), adding gene description pages that include sequences, expression, automated functional annotation and orthologous genes of related insect and pest species. However, the genome assembly was not updated in any of these versions. In 2019, an improved genome assembly generated using PacBio long reads and Illumina short reads was finally released (22). Many gaps were closed in the new genome, and more genes with a longer average length were predicted than in the 2008 genome (16 880 genes of 1551 bp on average in the new genome versus 14 623 genes of 1224 bp on average in the 2008 genome). Therefore, KAIKObase needs to be updated with the latest genome and gene information. In addition, the information used for gene annotation is out of date because the annotation has not been updated since 2013. In the past several years, databases that are widely used for functional annotation of genes, such as the National Center for Biotechnology Information (NCBI) non-redundant (nr) protein database (23), InterPro (24) and Gene Ontology (GO) (25), have been updated frequently, and the amount of data has increased tremendously. As a result, the genes in KAIKObase also need to be re-annotated using the latest information.
For the latest KAIKObase (version 4) introduced here, we used the improved silkworm genome as the reference genome. The gene structures of newly predicted protein-coding genes (hereafter, new gene models) and of gene contigs assembled from the transcriptome data (26) (hereafter, reference transcripts) are visible in the genome browser. Detailed annotations are available for each predicted gene. We updated the sequence search service by adding more gene sequences (new gene models and reference transcripts) to search against. We also manually curated genes related to detoxification, target genes of pesticides and genes related to silk production, aiming to provide accurate gene information to users. Through these updates, we anticipate that KAIKObase will remain an irreplaceable database of silkworm genome and genetics in the coming decades. In addition to the above updates, we provide a list of the manually curated genes because they are frequently investigated. Among these genes, 246 are related to detoxification [52 ATP-binding-cassette (ABC) transporters, 87 carboxylesterases (COEs), 23 glutathione S-transferases (GSTs) and 84 cytochrome P450 genes (CYPs)] and had already been curated (22), while 16 target genes of pesticides and 7 genes related to silk production were curated here (Table 1 and Supplementary Table 1).

Overview of the update
We prepared several downloadable files for the users, including the new genome assembly sequences and sequences of new gene models that are also available in SilkBase (http://silkbase.ab.a.u-tokyo.ac.jp), the correspondence between new gene models and Gene set A, and functional annotation of the new gene models and curated genes. All of these files are available through the following

Genetic markers on new genome
To reflect the chromosomal positions of SNP markers, BAC-end sequences and FPCs on the new genome assembly, we mapped these sequences using BLASTN (version 2.2.30+) with e-value thresholds of 1e-200 (for SNP markers and FPCs) and 0.1 (for BAC-end sequences). An SNP marker or FPC was considered successfully mapped if the query sequence was found at only one chromosomal position and the aligned region covered more than 80% of the query sequence. All of the 1532 SNP markers were mapped onto the new genome assembly, while 4726 of 4754 FPCs (99.4%) could be mapped. A BAC-end sequence was considered successfully mapped if the aligned region covered more than 50% of the query sequence. Approximately 97% of the BAC-end sequences (133 242 out of 137 219) could be mapped onto the new genome assembly.
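The mapping criteria above can be illustrated with a minimal Python sketch. It is not the pipeline used for KAIKObase; it assumes BLASTN hits are available in the default 12-column tabular layout, that query lengths are supplied separately, and that "only one chromosomal position" means exactly one hit passes the e-value threshold. The names `Hit`, `parse_tabular`, `mapped_queries` and `percent` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    qstart: int
    qend: int
    evalue: float


def parse_tabular(lines):
    """Parse BLAST tabular lines (assumed default 12-column layout)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        # Columns 7 and 8 are query start/end, column 11 is the e-value.
        hits.append(Hit(f[0], f[1], int(f[6]), int(f[7]), float(f[10])))
    return hits


def mapped_queries(hits, query_lengths, max_evalue, min_coverage, require_unique):
    """Return {query: hit} for queries that satisfy the mapping criteria.

    Assumption: a query is 'unique' when exactly one hit passes the
    e-value threshold. When uniqueness is not required, the hit with the
    lowest e-value is used.
    """
    by_query = defaultdict(list)
    for h in hits:
        if h.evalue <= max_evalue:
            by_query[h.query].append(h)

    mapped = {}
    for query, qhits in by_query.items():
        if require_unique and len(qhits) != 1:
            continue
        best = min(qhits, key=lambda h: h.evalue)
        coverage = (abs(best.qend - best.qstart) + 1) / query_lengths[query]
        if coverage > min_coverage:
            mapped[query] = best
    return mapped


def percent(mapped, total):
    """Percentage of mapped sequences, rounded to one decimal place."""
    return round(100.0 * mapped / total, 1)


# Hypothetical example records (not real KAIKObase data).
example = [
    "snp1\tchr1\t100.0\t500\t0\t0\t1\t500\t1000\t1499\t0.0\t900",
    "snp2\tchr1\t100.0\t500\t0\t0\t1\t500\t2000\t2499\t0.0\t900",
    "snp2\tchr2\t100.0\t500\t0\t0\t1\t500\t3000\t3499\t0.0\t900",
    "snp3\tchr3\t99.0\t300\t3\t0\t1\t300\t4000\t4299\t1e-150\t500",
]
lengths = {"snp1": 500, "snp2": 500, "snp3": 500}
snp_mapped = mapped_queries(parse_tabular(example), lengths,
                            max_evalue=1e-200, min_coverage=0.8,
                            require_unique=True)
```

In this toy input only `snp1` passes: `snp2` has two qualifying positions and `snp3` fails the e-value threshold. For BAC-end sequences the same function could be called with `max_evalue=0.1`, `min_coverage=0.5` and `require_unique=False`, since the text states no uniqueness requirement for them. The reported proportions follow from the counts: `percent(4726, 4754)` and `percent(133242, 137219)`.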

Description page of new gene models
We created 'description pages' (Figure 2) for the 16 880 new gene models to provide detailed information, including chromosomal positions, nucleotide and amino acid sequences, corresponding gene accession(s) in the previous KAIKObase, assignment of domains and motifs, orthologous genes in closely related insects including well-known lepidopteran pests, and expression patterns. The methods and results for generating this information are described below.
Corresponding gene accession in the previous KAIKObase

Expression pattern
The transcriptome data (26)

Manually curated gene families
We curated several predicted gene models manually to provide verified intron-exon structures and sequences for our users. The curation focused on genes related to pesticide resistance and silk production, which have drawn much attention for their applications in a wide range of fields. We collected the complete coding sequences of 16 target genes of pesticides and 6 genes related to silk production (fibroin and sericin) from the NCBI nr nucleotide database for gene curation. Exonerate (version 2.2.0) (44) was used with the est2genome alignment model to align the complete coding sequences onto the genome assembly and identify the correct gene models for these sequences. Nine of the 16 target genes, all 3 sericin genes and 1 of the fibroin genes were mapped onto the genome as gene models different from the predicted ones (Table 1). Three target genes of pesticides, BmTargetGene-01, BmTargetGene-02 and BmTargetGene-03, were mapped onto the same positions as the curated ABC transporters BmABC-39, BmABC-34 and BmABC-30, respectively. We also retrieved the intron-exon structure of a newly identified sericin protein (sericin 4) from the work of Dong et al. (45) and compared it with the reference transcript MSTRG.2610.1 for the curation of sericin 4. These curated genes are accessible in the genome browser in an independent track, and description pages were created for them. As with the description pages of new gene models, these pages contain positional information, sequences, functional annotations and orthologous genes, but they do not include expression patterns. The method used to identify orthologous genes in other species is the same as mentioned above. We further investigated orthologs of the four detoxification-related gene families in different species (Supplementary Tables 1 and 2).
ABC transporters are ubiquitous across all phyla and are fundamental for importing essential nutrients and exporting toxins (46,47); consistent with this, our data showed that they are highly conserved among all of the lepidopteran species and even among insects and other animals. In our data, COEs, GSTs and CYPs are less conserved and relatively species specific among the different lepidopterans, which may indicate that they contribute to the detoxification of different targets in each species, as they are well known to be involved in detoxifying a wide range of xenobiotics (48)(49)(50)(51). The 16 target genes of pesticides are relatively conserved among all the species compared with other genes (Supplementary Table 2), since most of the target genes have essential cellular functions as transporters, receptors or channels. Although these genes are conserved between insects and humans, pesticides are generally less toxic to mammals than to insects because of their specificity to insects (52,53).
Sericin and fibroin are essential for silk production in the silkworm, forming the coat and the core of the silk, respectively. Orthologs of sericin proteins could not be identified in other species, showing that these proteins are highly specific to the silkworm. On the other hand, orthologs of silkworm fibroin proteins can be found in other lepidopteran species (Supplementary Table 2). Fibroin P25 protein is orthologous among the silkworm and all of the six lepidopteran species, while light-chain fibroin (L-fibroin) is missing in both ecotypes of S. frugiperda and heavy-chain fibroin (H-fibroin) is missing in D. plexippus and the corn ecotype of S. frugiperda (Supplementary Table 1). However, we searched MonarchBase (32) and LepidoDB (http://bipaa.genouest.org/is/lepidodb/spodoptera_frugiperda/) and found that there are genes annotated as H-fibroin in D. plexippus (http://monarchbase.umassmed.edu/tools3/Get_gene.cgi?id=DPOGS204188) and S. frugiperda (several genes such as GSSPFG00007524001-PA, SFRURICE0000006288-PA, etc. can be accessed from the search form at https://bipaa.genouest.org/sp/spodoptera_frugiperda_pub/ with the keyword 'fibroin heavy chain'). Since the properties of silk are largely determined by H-fibroin (54), D. plexippus and S. frugiperda may have silk properties different from those of the silkworm. We also noticed that S. frugiperda harbors L-fibroin that has homologs in another Spodoptera species, Spodoptera litura, again suggesting different silk properties. It may be worthwhile to investigate the fibroin proteins of D. plexippus and S. frugiperda in the search for new silk materials.
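As a rough illustration of this curation step, the Python sketch below builds an Exonerate command line and compares the exon structure of an aligned coding sequence with that of a predicted gene model. Only the est2genome model comes from the text; the `--bestn` and `--showtargetgff` options, the file names, the exon coordinates and the function names `exonerate_command` and `compare_models` are assumptions made for this example, not the settings actually used.

```python
def exonerate_command(cds_fasta, genome_fasta):
    """Build an Exonerate command using the est2genome model.

    Assumption: reporting only the best alignment (--bestn 1) and
    GFF output on the target (--showtargetgff yes) are illustrative
    choices, not settings stated in the text.
    """
    return [
        "exonerate",
        "--model", "est2genome",
        "--bestn", "1",
        "--showtargetgff", "yes",
        "--query", cds_fasta,
        "--target", genome_fasta,
    ]


def compare_models(aligned_exons, predicted_exons):
    """Compare two gene models given as lists of (start, end) exons."""
    aligned, predicted = set(aligned_exons), set(predicted_exons)
    return {
        "identical": aligned == predicted,
        "only_in_aligned": sorted(aligned - predicted),
        "only_in_predicted": sorted(predicted - aligned),
    }


# Hypothetical coordinates: the second exon differs between the models.
aligned = [(100, 200), (300, 450), (600, 700)]
predicted = [(100, 200), (300, 400), (600, 700)]
result = compare_models(aligned, predicted)
command = exonerate_command("curated_cds.fa", "genome.fa")
```

A gene whose `identical` flag is false would be a candidate for replacing the predicted model with the curated one; the command list could be passed to `subprocess.run` on a system where Exonerate is installed.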

Conclusion and future perspectives
KAIKObase has supported research on the silkworm and other insects since 2009 and now provides the latest genetic and genomic information to the scientific community. The new genome and gene models offer more accurate sequences and gene sets than the old ones, and the genes were annotated with the latest information. The updated KAIKObase will continue to contribute to research on the silkworm and other insects, as well as to the sericulture industry and biotechnological applications.
The decreasing cost of sequencing makes it easy to collect large-scale sequence data in a short time. It is therefore expected that high-quality genome assemblies of various silkworm lineages will be determined, broadening our knowledge of the silkworm. Meanwhile, genetic markers obtained using next-generation sequencing methods, such as restriction-site associated DNA sequencing and genotyping by sequencing, are becoming popular and preferred in a wide range of fields, including population genotyping, quantitative trait loci (QTL) mapping and breeding (55). Our future objectives for updating KAIKObase include collecting more silkworm genomes and genetic markers as population-level data to keep it a comprehensive and indispensable repository for silkworm research.

Supplementary data
Supplementary data are available at Database Online.