MICdb3.0: a comprehensive resource of microsatellite repeats from prokaryotic genomes

MICdb is a comprehensive relational database of perfect microsatellites extracted from completely sequenced and annotated genomes of bacteria and archaea. The current version, MICdb3.0, is an updated and revised version of MICdb2.0. Compared with MICdb2.0, the current release offers much larger genome coverage, improved presentation of query results and a user-friendly administration module for managing Simple Sequence Repeat (SSR) data (addition of new genomes, deletion of obsolete data, etc.); certain features deemed redundant have been removed. The new web interface to the database, Microsatellite Analysis Server (MICAS) version 3.0, has been improved by the addition of high-quality visualization tools that display query results as pie charts and bar graphs. All query results and graphs can be exported in different formats for further analysis. MICAS3.0 is also equipped with a genome comparison module for pair-wise comparison of genomes with regard to their microsatellite distribution. The advanced search module can filter repeats by criteria such as motif or repeat size and coding or non-coding location, and can sort the results. The MICdb database has also been made portable, so that it can be administered by any person with the necessary administrative privileges. The MICdb3.0 database and analysis server can be accessed for free from www.cdfd.org.in/micas. Database URL: http://www.cdfd.org.in/micas


Introduction
Microsatellites, also known as Simple Sequence Repeats or Short Tandem Repeats, are tandem repetitions of nucleotide motifs of size 1-6 bp (1). They are ubiquitous in nature and are found in almost all organisms ranging from viruses to humans (2). Microsatellites are distributed throughout genomes and are found in both coding and non-coding regions (3). These repeats interest many researchers owing to their unique nature, significance and application in various fields. Microsatellite regions undergo mutations (point mutations as well as changes of repeat number) more frequently than other genomic regions (4). Mutations in microsatellites in coding and non-coding regions are known to affect the processes of transcription and translation and have also been implicated in several diseases (5-7). Microsatellites are the most widely used genetic markers and are also applied in fields such as DNA fingerprinting, linkage analysis, forensics, paternity studies, etc. (8,9). During the past decade, microsatellites have [...]
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
To date, many organism-specific microsatellite databases (10-18), including MICdb (19), are widely used by researchers. MICdb, developed by us, is a relational database of perfect microsatellites extracted from known prokaryotic genomes. MICdb is queried through a graphical interface called Microsatellite Analysis Server (MICAS) (20). MICdb and MICAS had previously been upgraded twice; they have now been upgraded further to MICdb3.0 and MICAS3.0 by adding new genomes as well as new tools and interfaces for search and analysis.
The new database holds microsatellite data extracted from the completely sequenced prokaryotic genomes published in the NCBI repository. Note that other genomes may have been sequenced but are not yet available at NCBI; such genomes are not part of MICdb. MICAS3.0 has been developed in such a way that it can be integrated into other genome databases, and we will provide the necessary assistance for this. The following sections describe the various enhancements of MICdb3.0 compared with the earlier versions.

Database Construction
The microsatellites were identified and extracted from the completely sequenced prokaryotic genomes downloaded from the NCBI genome repository (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria) using IMEx (21,22) with the following parameters (repeat type: perfect; minimum repeat number: mono: 6, di: 3, tri: 2, tetra: 2, penta: 2, hexa: 2). Following Saunders et al. (23), we extracted the perfect repeats with tract lengths of at least 6 bp. IMEx was chosen because it performs better than many other available tools for microsatellite identification (24). To incorporate the data into MICdb, which was constructed using MySQL (www.mysql.com), the output files of IMEx were parsed using computer programs developed in C and Java. The database is composed of 27 tables.
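The thresholds quoted above can be illustrated with a minimal Python sketch of perfect-repeat detection. This is not the IMEx algorithm, whose internals are not described here; the function names, the returned fields and the choice to skip motifs that are themselves repeats of a shorter unit (so that a run of A is not also reported as an AA repeat) are assumptions made for illustration only.

```python
# Minimum repeat number per motif size (bp), as quoted in the text above.
MIN_REPEATS = {1: 6, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}


def is_minimal_motif(motif):
    """Return True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_perfect_ssrs(sequence):
    """Scan a DNA string for perfect tandem repeats (hypothetical helper).

    Returns a list of dicts with 1-based, inclusive start/end coordinates.
    """
    seq = sequence.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        i = 0
        while i + size <= len(seq):
            motif = seq[i:i + size]
            # Skip ambiguous bases and motifs such as "AA" or "CAGCAG".
            if not set(motif) <= set("ACGT") or not is_minimal_motif(motif):
                i += 1
                continue
            # Count exact consecutive copies of the motif starting at i.
            repeats = 1
            while seq[i + repeats * size:i + (repeats + 1) * size] == motif:
                repeats += 1
            if repeats >= min_rep:
                end = i + repeats * size
                hits.append({"motif": motif, "repeats": repeats,
                             "start": i + 1, "end": end})
                i = end  # continue scanning after the reported tract
            else:
                i += 1
    return sorted(hits, key=lambda h: (h["start"], len(h["motif"])))
```

For example, `find_perfect_ssrs("TTCAGCAGCAGTT")` reports a single (CAG)3 tract spanning positions 3 to 11. With these thresholds every reported tract is at least 6 bp long, in line with the criterion taken from Saunders et al.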
MICAS3.0, the web interface to MICdb3.0, provides three data access modules: 'Browse' (search by alphabetical order of genomes), 'Advanced Search' (search by user criteria) and 'Pair-wise Comparison of Genomes' (compare genomes for microsatellite distribution and densities). The server has been developed using HTML and CSS. Server-side scripting uses PHP, and AJAX is used for asynchronous page updates. Both MICAS and MICdb are hosted on a Linux server running the Apache web server, and special care has been taken to ensure the interactivity and user-friendliness of the system.

Features and Enhancements
MICdb3.0 and MICAS3.0 offer many features that help users analyse microsatellites on the fly. The following sub-sections describe the features and enhancements of the new versions of MICAS and MICdb.

Updated genome repository
The current version of MICdb hosts the microsatellite data of 5043 prokaryotic sequences: 4772 bacterial sequences (including 2118 plasmid sequences) and 271 archaeal genome sequences. The earlier versions covered far fewer genomes: MICdb1.0 (19) hosted only 83 genomes in total, whereas MICdb2.0 hosted data of 487 genomes (178 bacterial genomes + 288 viral genomes + 21 archaeal genomes). MICdb3.0, unlike its previous version, does not host the repeat data of viruses, as a separate and exclusive microsatellite database for viral genomes, the Viral Microsatellite Database (VMD) (17), already exists. The MICdb database can be updated from time to time using the admin module.

Visualization module
MICAS, the web interface of MICdb, includes a dynamic visualization module that generates high-quality graphs and charts depicting the distribution and frequencies of the microsatellites found in the queried genomes. The user can obtain the summary of each genome (Figure 1) in the form of pie and bar charts.
The list of SSRs is displayed in tabular format with details such as the repeat motif, the number of iterations, the start and end co-ordinates of each SSR, the nucleotide composition of each SSR, a link to the protein information if the SSR falls in a coding region and an option to design primers separately for each SSR. Clicking on the co-ordinates displays the complete SSR sequence along with its flanking sequence and summary information for that SSR (Figure 2). An option to export all SSRs in various formats (.xls, .csv and .txt) has also been provided, so that users can download the SSRs and use them in further analysis.
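The per-genome summary behind such charts amounts to a frequency count over motif sizes. The following Python sketch shows one way this tally could be computed; the function name and the output layout (count and percentage per class) are assumptions for illustration and do not describe the MICAS implementation.

```python
from collections import Counter

# Class names for motif sizes 1-6 bp.
SIZE_NAMES = {1: "mono", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}


def summarize_by_motif_size(motifs):
    """Map each motif-size class present to (count, percentage of all SSRs)."""
    counts = Counter(SIZE_NAMES[len(m)] for m in motifs)
    total = sum(counts.values())
    return {
        name: (counts[name], round(100.0 * counts[name] / total, 1))
        for name in SIZE_NAMES.values()
        if counts[name]  # classes with no SSRs are omitted
    }
```

The resulting dictionary holds exactly the values a pie chart (percentages) or a bar chart (counts) would plot.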

Advanced search module
To help users retrieve repeats that match specific search criteria, MICAS3.0 provides an advanced search module. Using this module, users can select a genome of interest, specify their search criteria and filter repeats accordingly. The advanced search module can filter repeats of a particular size (mono, tri, tetra, etc.); retrieve repeats of a particular pattern (CAG, poly A, etc.); set the minimum repeat number for each motif size; and restrict the results to coding or non-coding regions. Moreover, users can define the output format (HTML, Excel, CSV or Text) and sort the results by motif, motif size or tract length (Figure 3).
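The filtering described above maps naturally onto a parameterized SQL query. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL; the table name `ssr`, its columns and the function names are hypothetical, since the actual MICdb schema (27 tables) is not given here. For brevity it applies a single minimum repeat number rather than one per motif size.

```python
import sqlite3

# Whitelisted sort keys; only these fixed expressions ever reach ORDER BY.
SORTABLE = {
    "motif": "motif",
    "motif_size": "LENGTH(motif)",
    "tract_length": "LENGTH(motif) * repeats",
}


def make_demo_db():
    """Create an in-memory database with a hypothetical `ssr` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ssr (genome TEXT, motif TEXT, repeats INTEGER, "
        "start_pos INTEGER, end_pos INTEGER, coding INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ssr VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("demo", "A", 7, 10, 16, 0),
            ("demo", "CAG", 4, 100, 111, 1),
            ("demo", "CAG", 2, 300, 305, 1),
            ("demo", "AT", 3, 500, 505, 0),
            ("other", "CAG", 5, 1, 15, 1),
        ],
    )
    return conn


def search(conn, genome, motif_size=None, motif=None, min_repeats=None,
           coding=None, sort_by="tract_length"):
    """Return SSR rows of one genome matching the optional filters."""
    clauses, params = ["genome = ?"], [genome]
    if motif_size is not None:
        clauses.append("LENGTH(motif) = ?")
        params.append(motif_size)
    if motif is not None:
        clauses.append("motif = ?")
        params.append(motif.upper())
    if min_repeats is not None:
        clauses.append("repeats >= ?")
        params.append(min_repeats)
    if coding is not None:
        clauses.append("coding = ?")
        params.append(1 if coding else 0)
    sql = (
        "SELECT motif, repeats, start_pos, end_pos, coding FROM ssr WHERE "
        + " AND ".join(clauses)
        + " ORDER BY " + SORTABLE[sort_by] + ", start_pos"
    )
    return conn.execute(sql, params).fetchall()
```

User-supplied values are passed as bound parameters, and the sort key is looked up in a fixed dictionary, so no user text is concatenated into the SQL string; an unknown sort key raises a `KeyError`.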

Results export module
Researchers usually extract the microsatellite data of a particular genome and use it for further statistical analysis. Hence, an option to download the results in usable formats has been provided. The results of user queries to MICdb can be exported into different formats such as Text, CSV and Excel. The graphs generated by the visualization module of MICAS can also be downloaded as PNG, JPEG or SVG images as well as in PDF format. An option to print the output graphs has also been provided.
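As an illustration of the tabular export, the following Python sketch serializes SSR records as CSV or tab-delimited text using the standard csv module. The field names and the function name are assumptions; the Excel format is omitted because it would require a third-party library.

```python
import csv
import io

# Hypothetical column layout for one exported SSR record.
FIELDS = ["motif", "repeats", "start", "end", "region"]


def export_ssrs(rows, fmt="csv"):
    """Serialize SSR records (dicts keyed by FIELDS) as CSV or tab-delimited text."""
    delimiter = {"csv": ",", "txt": "\t"}[fmt]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Either output can be loaded directly into a spreadsheet or a statistics package for downstream analysis.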

Admin module
As the number of sequenced genomes is increasing rapidly, most microsatellite databases are not kept up to date. To avoid this problem, MICdb3.0 is equipped with an admin module, a graphical user interface for updating the microsatellite data of new genomes from time to time. The MICdb admin needs to log in (with a valid user id and password) to the admin module to manage the microsatellite and genome data in the database. The admin module can be used to add microsatellite data of new genomes as and when they become available at the NCBI genome repository, to edit the SSR data of an existing genome and to delete unwanted or redundant data from the database. The homepage of the admin module lists the newly added or modified genomes on the NCBI FTP server that are not present in MICdb, with an update button against each genome. A single click automatically downloads the FNA and PTT files of that genome to the MICAS server, submits the files to IMEx for SSR extraction and finally inserts the records into the database. Note that, because we use the annotations available at NCBI, any errors in those annotations are carried over into MICdb. Because annotations may also be updated at NCBI, any such update is identified and the data are updated automatically by the admin module. The Edit feature of the admin module allows corrections to the meta-data and the microsatellite data of a genome. Similarly, unwanted and redundant data can be deleted directly using the delete option of the admin module. A snapshot of the admin module can be found in Figure 5.
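The listing of new or modified genomes can be thought of as a comparison between the remote repository and the local database. The text does not state how changes are detected; the Python sketch below assumes, purely for illustration, that each genome carries a last-modified date on both sides. The function name and the genome identifiers are hypothetical.

```python
from datetime import date


def genomes_needing_update(remote, local):
    """Compare two mappings of genome identifier -> last-modified date.

    Returns (new, modified): identifiers absent from the local database,
    and identifiers whose remote copy is newer than the local one.
    """
    new = sorted(g for g in remote if g not in local)
    modified = sorted(g for g in remote if g in local and remote[g] > local[g])
    return new, modified


# Hypothetical example data; real identifiers and dates would come from
# the NCBI FTP listing and from the MICdb genome table.
remote = {
    "genome_A": date(2013, 5, 1),
    "genome_B": date(2013, 6, 15),
    "genome_C": date(2013, 7, 2),
}
local = {
    "genome_A": date(2013, 5, 1),
    "genome_B": date(2013, 1, 10),
}
new, modified = genomes_needing_update(remote, local)
```

Each identifier in `new` or `modified` would then go through the download, extraction and insertion steps described above.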

Funding
Computing infrastructure used for this work has been supported by the core grant of CDFD.