The DNA transposon piggyBac (PB) is a newly established mutagen for large-scale mutagenesis in mice. We have designed and implemented an integrated database system called PBmice (PB Mutagenesis Information CEnter) for storing, retrieving and displaying the information derived from PB insertions (INSERTs) in the mouse genome. This system is centered on INSERTs, with information including their genomic locations and flanking genomic sequences, the expression levels of the hit genes, and the expression patterns of the trapped genes if a trapping vector was used. It also archives mouse phenotyping data linked to INSERTs, and allows users to conduct quick and advanced searches for genotypic and phenotypic information relevant to a particular INSERT or a set of INSERTs. Sequence-based information can be cross-referenced with other genomic databases such as Ensembl. BLAST and GBrowse tools used in PBmice offer enhanced search and display for additional information relevant to INSERTs. The total number and genomic distribution of PB INSERTs, as well as the availability of each PB insertional LINE, can also be viewed with user-friendly interfaces. PBmice is freely available at http://www.idmshanghai.cn/PBmice or http://www.scbit.org/PBmice/ .
Large-scale mutagenesis is critical to the functional characterization of mammalian genomes. In mice, gene targeting, chemical mutagenesis and insertional mutagenesis are available to achieve this goal. The International Knockout Mouse Consortium (IKMC) has recently been formed to systematically produce null alleles for individual genes in mice ( 1–3 ). This approach focuses on annotated protein-coding regions but provides limited insight into the rest of the genome. In contrast, ENU ( N -ethyl- N -nitrosourea) based chemical mutagenesis has the capacity to introduce mutations throughout the entire genome ( 4–11 ). Several large-scale ENU mutagenesis studies have been conducted in recent years and have produced a large collection of mutant mice with interesting phenotypes. However, the mutations in the majority of these mutant strains remain to be mapped to the single-gene level. Insertional mutagenesis has been shown to be one of the most efficient means to generate a large number of identifiable mutants throughout the entire genome of several model organisms ( 12–17 ). With an appropriate insertion vector, insertional mutagenesis can produce a large number of mutations at low cost and high speed. Each insertion site is tagged with the vector sequence and thus can be easily mapped.
Retroviruses have been used in mouse genetics for more than two decades ( 18 , 19 ). They led to the discovery of a number of proto-oncogene and tumor suppressor loci ( 19 ). However, successful application of retroviruses in large-scale insertional mutagenesis remains limited to gene trapping in ES cells ( 20 , 21 ). In recent years, progress in introducing transposon tools into mammalian genetic studies has offered new opportunities for mutagenesis in mice ( 22–30 ). Several transposons, such as Sleeping Beauty (SB), Minos , Tol2 , Mos1 and piggyBac (PB), were shown to be active in mammalian cells and/or in mice. Among them, the PB transposon appears to be a promising option. Appreciation of PB as an important genetic manipulation tool in various organisms including mice has been growing rapidly in recent years, as accumulating evidence has indicated that PB elements are capable of efficient transposition in diverse insects, flatworms and mice, as well as in mammalian cell lines ( 28 , 29 , 31–38 ). PB elements, originally found in the cabbage looper moth Trichoplusia ni , are DNA transposons carrying a unique functional transposase (PBase) of 594 amino acids ( 39–41 ). PB has been reported to be more active than Tol2 , Mos1 and SB in mammalian cells ( 30 ), and has been shown to be an effective mutagen for insertional mutagenesis in mice.
Currently, a genome-wide PB mutagenesis project is underway in the Institute of Developmental Biology and Molecular Medicine (IDM) at Fudan University in Shanghai, China. To collect and disseminate the large amount of genetic and phenotypic information from this project, we have designed and implemented an integrated database system called PBmice (PB Mutagenesis Information CEnter), which provides a user interface to query and display data obtained from PB insertions (INSERTs) in the mouse genome and their characterizations.
SYSTEM DESIGN AND IMPLEMENTATION
Experimental data in PBmice are derived from the on-going project of PB insertional mutagenesis in mice at IDM. Each PB transposition event produces a PB insertion, which is termed an INSERT in PBmice and is carried by a mouse line (LINE). The flanking genomic sequence of an INSERT is obtained by inverse PCR, and its location in the mouse genome is determined by a WU-BLAST 2.0 search against the Ensembl mouse genome sequences (Ensembl release 45, based on GenBank m36 mouse assembly, http://www.ensembl.org/Mus_musculus/index.html ) ( 42 ). The expression level of the gene hit by an INSERT is determined by real-time RT-PCR. Other information about the INSERT, such as the mouse strain and LINE carrying the INSERT and the PB construct used to generate the INSERT, is also collected in the database. When a trapping vector is used to generate the INSERT, the expression pattern of the reporter gene of the vector, an indication of the expression pattern of the endogenous gene affected by the INSERT, is also examined and collected in the database. Associated information of a LINE carrying one or more INSERTs includes the existence of live animals and storage status, as well as its phenotyping data. When an INSERT hits a functional genomic unit, such as an annotated protein-coding gene, a gene for a microRNA, or a regulatory element, this interruption of genomic function may appear as phenotypes in the hosting LINE. Such characterizations are also recorded and documented in PBmice. In addition, other information, such as gene annotation data, is downloaded from Ensembl and GenBank.
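The relationship between INSERTs and LINEs described above can be sketched as a small data model. This is only an illustration of the one-LINE-to-many-INSERTs structure: the class and field names are assumptions made for this sketch and are not the actual PBmice schema. The example values (INSERT AF0-47T6, LINE Pkd2, strain FVB/NJ, chromosome 5, ENSMUSG00000034462) are taken from the Pkd2 example given in this article.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Insert:
    """One PB insertion (INSERT). Field names are illustrative, not the PBmice schema."""
    name: str
    chromosome: Optional[str] = None           # None if the INSERT is not yet mapped
    flanking_sequence: Optional[str] = None    # obtained by inverse PCR
    gene_hit: Optional[str] = None             # e.g. an Ensembl gene ID
    expression_level: Optional[float] = None   # from real-time RT-PCR
    construct: Optional[str] = None            # PB construct used to generate the INSERT


@dataclass
class Line:
    """One mouse line (LINE), which carries one or more INSERTs."""
    name: str
    strain: str
    inserts: List[Insert] = field(default_factory=list)
    status: List[str] = field(default_factory=list)       # e.g. "live animal"
    phenotypes: List[str] = field(default_factory=list)   # phenotype categories

    def genes_hit(self) -> List[str]:
        """Return the sorted, de-duplicated gene IDs hit by this LINE's INSERTs."""
        return sorted({i.gene_hit for i in self.inserts if i.gene_hit})


# Example built from the Pkd2 case described in the text.
pkd2_insert = Insert(name="AF0-47T6", chromosome="5", gene_hit="ENSMUSG00000034462")
pkd2_line = Line(name="Pkd2", strain="FVB/NJ", inserts=[pkd2_insert])
```

Keeping the gene as an attribute of the INSERT rather than of the LINE mirrors the statement that a LINE may carry more than one INSERT.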
A Java tool has been developed for PBmice to integrate gene data from Ensembl and GenBank. The MGI-ID of a gene is used as the key to merge the information on the same gene downloaded from Ensembl and GenBank. Gene information contains gene names with synonyms, descriptions, genomic locations, transcript information and IDs from existing databases, such as Ensembl, GenBank and Mouse Genome Informatics (MGI) ( http://www.informatics.jax.org/ ), which link to the detailed information on the gene in those databases. Additionally, users can obtain protein information from ExPASy ( http://us.expasy.org/ ) and gene function information from the Gene Ontology Database ( http://www.geneontology.org/ ). The integrated items are converted into data files in GFF format ( http://www.sanger.ac.uk/Software/formats/GFF/ ) for retrieval and display with GBrowse ( http://www.gmod.org/wiki/index.php/Gbrowse ), allowing for a more straightforward and user-friendly view.
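The integration step can be illustrated with a minimal sketch, written here in Python rather than Java for brevity. It is not the PBmice tool itself: the record layout, the function names, the rule that the first source wins on conflicting fields, the `PBmice` source label and the attribute tags are all assumptions. The MGI ID and strand in the usage example are placeholders, and the coordinates reuse the chromosome 5 interval quoted in the Pkd2 search example purely for illustration. The output follows the nine tab-separated columns of GFF (seqname, source, feature, start, end, score, strand, frame, attributes).

```python
def merge_by_mgi_id(ensembl_records, genbank_records):
    """Merge gene records from two sources, using the MGI ID as the join key.

    Each record is a dict that must contain an 'mgi_id' entry. When both
    sources provide the same field, the Ensembl value is kept (an arbitrary
    choice made for this sketch).
    """
    merged = {}
    for records in (ensembl_records, genbank_records):
        for record in records:
            entry = merged.setdefault(record["mgi_id"], {})
            for key, value in record.items():
                entry.setdefault(key, value)  # keep the first value seen
    return merged


def to_gff_line(gene):
    """Format one merged gene record as a single tab-separated GFF line."""
    attributes = 'Gene "{}" ; MGI "{}"'.format(gene["name"], gene["mgi_id"])
    columns = [
        gene["chrom"],      # seqname
        "PBmice",           # source (hypothetical label)
        "gene",             # feature
        str(gene["start"]),
        str(gene["end"]),
        ".",                # score: not applicable
        gene["strand"],
        ".",                # frame: not applicable
        attributes,
    ]
    return "\t".join(columns)


# Placeholder records: the MGI ID, GenBank ID and strand are NOT real values.
ensembl = [{"mgi_id": "MGI:PLACEHOLDER", "name": "Pkd2", "chrom": "5",
            "start": 104699752, "end": 104746120, "strand": "+",
            "ensembl_id": "ENSMUSG00000034462"}]
genbank = [{"mgi_id": "MGI:PLACEHOLDER", "name": "Pkd2",
            "genbank_id": "GENBANK_PLACEHOLDER"}]

genes = merge_by_mgi_id(ensembl, genbank)
gff_line = to_gff_line(genes["MGI:PLACEHOLDER"])
```

Joining on a single stable identifier avoids having to reconcile gene names and synonyms, which can differ between the two sources.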
INSERTs and INSERT-related data form the central core of PBmice. This system provides detailed information about an INSERT and associated information of the LINE carrying the INSERT and the gene hit by the INSERT.
To search for an INSERT of interest, either the quick search or the advanced search method can be used ( Figure 1 ). The quick search allows a user to conduct a simple search with just the name of an INSERT, a LINE, a strain or a phenotype. A default entry results in a complete list of all INSERTs. Partial-word matching is also supported by the query engine. Advanced search is provided for users with more defined interests to narrow down the search area, for example (i) to search for INSERTs by their position relative to a known INSERT or gene; (ii) to search for INSERTs in a certain genomic region on a chromosome of interest; (iii) to search for INSERTs associated with an expression pattern characterized for a certain organ in an animal at a certain developmental stage or (iv) to search for INSERTs associated with a certain phenotype. During an advanced search, the first and second criteria can each be combined with the remaining criteria, but not with each other.
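The combination rule for the four advanced search criteria can be expressed as a simple check. The sketch below is an assumption about how such a rule might be encoded; the class name, field names and tuple layouts are hypothetical and do not describe the PBmice query engine.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AdvancedQuery:
    """Hypothetical container for the four advanced search criteria."""
    near_feature: Optional[Tuple[str, int]] = None   # (i) (INSERT or gene name, distance in kb)
    region: Optional[Tuple[str, int, int]] = None    # (ii) (chromosome, start bp, end bp)
    expression: Optional[Tuple[str, str]] = None     # (iii) (organ, developmental stage)
    phenotype: Optional[str] = None                  # (iv) phenotype category


def is_valid_combination(query: AdvancedQuery) -> bool:
    """Criteria (i) and (ii) may each be combined with (iii) and (iv),
    but not with each other."""
    return not (query.near_feature is not None and query.region is not None)


# Allowed: a gene-proximity search combined with a phenotype filter.
allowed = AdvancedQuery(near_feature=("Pkd2", 0), phenotype="Gross Anatomy")

# Not allowed: gene proximity and a chromosome region in the same query.
rejected = AdvancedQuery(near_feature=("Pkd2", 0),
                         region=("5", 104699752, 104746120))
```

Both positional criteria restrict the genomic interval to be searched, which is presumably why only one of them is accepted per query.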
The INSERT Detail interface presents the detailed information of an INSERT after a successful search, including the chromosomal location and flanking genomic sequence of the INSERT, the expression level of the gene that is hit by the INSERT, and the PB construct used to generate the INSERT ( Figure 2 ). Among the items displayed in the interface, the Chromosome Location Prox item refers to the genomic position immediately preceding the INSERT, while the Chromosome Location Dist item refers to the position immediately following the INSERT. The Gene hit item shows the Ensembl ID of the gene hit by the INSERT, which is clickable for users to obtain detailed information about this gene. The Orientation item is defined as follows: the orientation of PB is defined as from PBL to PBR ( 29 ), while the chromosome orientation is defined by Ensembl. If the orientation of PB is the same as the orientation of the inserted chromosome, the orientation of the INSERT is denoted as ‘+’; otherwise, it is denoted as ‘−’. The DNA Sample item shows the name of the DNA sample used for mapping the INSERT. The Expression Level item indicates the expression level of the gene hit by the INSERT in the mutant. The GBrowse item offers an annotated view of a 40 kb genomic fragment with the position of the INSERT in the middle and is linked to the GBrowse tool embedded in PBmice for a user-adjustable view of the annotation of the genomic fragment around the INSERT. The LINE item contains the name of the mouse LINE that harbors this INSERT. The titles of the remaining information categories are self-explanatory.
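The Orientation and GBrowse items can be illustrated with a short sketch. It assumes that the chromosome coordinates of the genomic bases adjacent to the PBL and PBR ends are known, so that PB points along the chromosome when the PBL-side coordinate is the smaller one. The function names, parameter names and the clamping of the window at position 1 are assumptions made for this sketch, and the '-' returned here is a plain ASCII hyphen.

```python
def insert_orientation(pbl_flank_pos: int, pbr_flank_pos: int) -> str:
    """Return '+' if PB (read from PBL to PBR) runs in the same direction as
    the chromosome coordinates, and '-' otherwise.

    pbl_flank_pos / pbr_flank_pos are the chromosome coordinates of the
    genomic bases next to the PBL and PBR ends (hypothetical inputs).
    """
    if pbl_flank_pos == pbr_flank_pos:
        raise ValueError("the two flanking positions must differ")
    return "+" if pbl_flank_pos < pbr_flank_pos else "-"


def gbrowse_window(prox: int, dist: int, width: int = 40_000, chrom_length=None):
    """Return (start, end) of a window of about `width` bp centered on the INSERT.

    prox and dist are the positions immediately preceding and following the
    INSERT; the window is clamped to the chromosome ends.
    """
    center = (prox + dist) // 2
    half = width // 2
    start = max(1, center - half)
    end = center + half
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


# Illustrative coordinates only; they do not describe a real INSERT.
orientation = insert_orientation(100_000, 100_001)
window = gbrowse_window(100_000, 100_001)
```

With these illustrative inputs the orientation is '+' and the window spans positions 80 000 to 120 000.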
In the PBmice database, the information of a LINE is associated with the INSERT(s) it carries. The LINE Detail interface reveals the properties of a LINE carrying one or more INSERTs, including the status, expression pattern and phenotyping data of the LINE ( Figure 3 ). In this interface, the Status item indicates the storage status of the LINE (live animal, frozen sperm and/or frozen embryo) and related information for strain distribution. When a LINE carries trapping vector-related INSERT(s), the Expression pattern item will indicate any information that was characterized to some extent for the expression of the reporter protein. The Phenotype item provides clickable categories of phenotypes characterized for this LINE. The detailed phenotypic data contains not only the description of the phenotype and assays used to depict the phenotype, but also the sample information, such as the heterozygosity or homozygosity, the developmental stage, the sex of the animals or the organ examined. Users can also retrieve related INSERTs/LINEs based on similarity of phenotypes through advanced search ( Figure 1 b).
PBmice also provides information on a gene that is hit by one or more INSERT(s). The Gene Information interface and the Transcript interface together show the identity and associated information of a gene ( Figure 4 a and b). The Gene Information interface presents the name and synonyms, the description and the genomic location of a gene and the links to outside databases, such as Ensembl, MGI and GenBank ( Figure 4 a). The Transcript interface displays links to existing databases from which users can retrieve detailed information about each transcript, protein and gene function associated with the gene shown in Figure 4 a ( Figure 4 b). UniProt-ID(s), EMBL-ID(s) and GenBank-Protein-ID(s) are linked to information about the protein encoded by this gene, stored in ExPASy containing the UniProt Knowledgebase (Swiss-Prot and TrEMBL), through the Multi-Protein Survey System (MPSS) developed by Bioinformation Center, Shanghai Institutes for Biological Sciences ( http://www.biosino.org/MPSS/index.jsp ). GO-ID(s) are linked to the associated gene function information from the Gene Ontology database. UNIGENE-ID(s) are linked to the associated transcript information from GenBank.
Here we take the INSERT AF0-47T6, which hits the gene Pkd2 , as an example to navigate PBmice. Pkd2 was cloned in 1996 and proved to contribute to autosomal dominant polycystic kidney disease (ADPKD) when mutated ( 43 ). Patients suffering from ADPKD grow cysts in their kidneys, eventually leading to kidney failure. ADPKD is the most common inherited disease affecting about half a million Americans and more than 1.5 million Chinese ( 44 , 45 ). To search for any existing INSERT that hits Pkd2 , either quick search or advanced search can be used ( Figure 1 ). Entering Pkd2 in the search box of the Quick Search interface ( Figure 1 a) leads to a Search Result interface displaying the following information ( Figure 5 ): (i) INSERT Name AF0-47T6, which is clickable to display the INSERT Detail interface for AF0-47T6 ( Figure 2 ); (ii) Mouse strain FVB/NJ carrying AF0-47T6; (iii) Chromosome 5 on which AF0-47T6 is located; (iv) LINE Name Pkd2 carrying AF0-47T6, which also links to the LINE Detail interface for Pkd2 ( Figure 3 ) and (v) Gene hit by AF0-47T6, given as its Ensembl ID ENSMUSG00000034462, which leads to the Gene Information and Transcript interfaces for the Pkd2 gene ( Figure 4 a and b). In the LINE Detail interface for Pkd2, two types of phenotypes have been characterized for this LINE, including detailed information that LINE Pkd2 has phenotypes similar to those of the previously described Pkd2 knockout mice ( 29 , 46 ). Advanced Search can also be used to retrieve any INSERT relevant to Pkd2 ( Figure 1 b). Currently, three options are available in the Advanced Search interface to reach this goal: (i) search for any INSERT within 0 kb of the Gene Pkd2 ; (ii) search for any INSERT on Chromosome 5 within 104 699 752–104 746 120 bp and (iii) search for any INSERT associated with any phenotype falling into either category Gross Anatomy or category Growth/Size/Lethality.
Submission of the query using option (i) leads to an intermediate interface that lets users choose the gene of interest before continuing, because more than one gene name contains the four characters, pkd2. Submission of the query using option (ii) or (iii) leads directly to a Search Result interface similar to that shown in Figure 5 .
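The first two advanced search options, and the partial-word match that triggers the intermediate gene-selection step, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and record layouts are hypothetical, the position given for AF0-47T6 is a made-up value inside the quoted interval, the second INSERT and the additional gene names are placeholders, and the quoted chromosome 5 interval is treated as the Pkd2 gene span for the purpose of the example.

```python
def match_gene_names(names, fragment):
    """Case-insensitive partial-word match on gene names."""
    fragment = fragment.lower()
    return [name for name in names if fragment in name.lower()]


def inserts_in_region(inserts, chrom, start, end):
    """Option (ii): INSERTs on `chrom` whose position lies in [start, end]."""
    return [ins for ins in inserts
            if ins["chrom"] == chrom and start <= ins["pos"] <= end]


def inserts_near_gene(inserts, gene, distance_kb=0):
    """Option (i): INSERTs within `distance_kb` of the gene span.

    A distance of 0 kb keeps only INSERTs inside the gene itself.
    """
    pad = distance_kb * 1000
    return inserts_in_region(inserts, gene["chrom"],
                             gene["start"] - pad, gene["end"] + pad)


# Hypothetical data: positions and the second INSERT are invented for illustration.
inserts = [
    {"name": "AF0-47T6", "chrom": "5", "pos": 104_720_000},
    {"name": "HYPOTHETICAL-1", "chrom": "5", "pos": 90_000_000},
]
pkd2 = {"name": "Pkd2", "chrom": "5", "start": 104_699_752, "end": 104_746_120}

candidates = match_gene_names(["Pkd2", "Pkd2l1", "Pkd1"], "pkd2")
by_gene = inserts_near_gene(inserts, pkd2, distance_kb=0)
by_region = inserts_in_region(inserts, "5", 104_699_752, 104_746_120)
```

Because the partial match returns more than one candidate name, a user-facing search would need the intermediate selection step described above before running option (i).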
BLAST and GBrowse tools are provided for enhanced search and user-friendly access to additional information relevant to each INSERT. The BLAST tool allows users to find any INSERT that lies within a mouse genomic DNA fragment of interest by submitting the sequence of that fragment. The GBrowse tool offers users not only a simple and direct view of the genomic location of any INSERT in PBmice and the annotation of the mouse genome with user-defined criteria, but also links to the INSERT Detail interface in PBmice and the Gene Report interface in Ensembl. The Statistics interface presents the total number and distribution of INSERTs on the mouse chromosomes. A pie chart categorizes the genomic locations of all INSERTs into introns, exons, intergenic regions or ND (undetermined). An introduction and step-by-step search examples are provided in the Help interface of PBmice.
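The four categories shown in the Statistics pie chart can be derived from gene annotations along the following lines. This sketch is not the PBmice implementation: the function name, the record layout and the rule that an exon hit in any overlapping gene takes precedence over an intron hit are assumptions, and the coordinates are invented. All genes passed in are assumed to lie on the same chromosome as the INSERT.

```python
from collections import Counter


def classify_insert(pos, genes):
    """Classify an INSERT position as 'exon', 'intron', 'intergenic' or 'ND'.

    pos   -- position on the chromosome, or None if the INSERT is unmapped
    genes -- list of dicts with 'start', 'end' and 'exons' (list of
             (start, end) pairs), all on the same chromosome as the INSERT
    """
    if pos is None:
        return "ND"
    inside_gene = False
    for gene in genes:
        if gene["start"] <= pos <= gene["end"]:
            inside_gene = True
            if any(s <= pos <= e for s, e in gene["exons"]):
                return "exon"   # an exon hit in any gene takes precedence
    return "intron" if inside_gene else "intergenic"


# Invented annotation: one gene with two exons.
genes = [{"start": 100, "end": 1000, "exons": [(100, 200), (800, 1000)]}]
positions = [150, 500, 2000, None, 900]

# Tally of categories, i.e. the numbers behind a pie chart.
tally = Counter(classify_insert(p, genes) for p in positions)
```

For the invented positions above, the tally contains two exon hits and one each of intron, intergenic and ND.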
AVAILABILITY AND FUTURE DIRECTIONS
PBmice is freely available at http://www.idmshanghai.cn/PBmice or http://www.scbit.org/PBmice/ . It is maintained by the IDM at Fudan University. All data sources used by PBmice are reviewed and updated quarterly. During the next 3 years, development of this system will include the following: (i) new INSERTs will be released from the on-going genome-wide PB mutagenesis project, which aims to create 10 000 insertions in mice; (ii) corresponding changes will be made following any substantial change in the mouse genome sequence information from new releases of linked databases such as Ensembl; in particular, the positions of existing INSERTs will be remapped with a new round of BLAST searches to provide updated information for users; (iii) a dynamic data management system will be incorporated into PBmice so that collaborators can access primary data prior to its release and (iv) a policy and practical mechanism will be developed to allow dissemination of mutant strains and relevant reagents for non-profit academic research.
This research was supported by grants from Chinese Key Projects for Basic Research (973) (Grant No. 2006CB806700, 2005CB321905 and 2002CB512801), Hi-tech Research and Development Project (863) (Grant No. 2007AA022012), National Natural Science Foundation of China (Grant No. 60473124), Shanghai Pujiang Program (Grant No. 05PJ14024) and Program for New Century Excellent Talents in University. We thank Rick Yan from Plano, Texas (TASM) for reading the early version of the manuscript. Funding to pay the Open Access publication charges for this article was provided by Hi-tech Research and Development Project (863) (Grant No. 2007AA022012).
Conflict of interest statement . None declared.