-
PDF
- Split View
-
Views
-
Cite
Cite
Veenu Aishwarya, P. C. Sharma, UgMicroSat db : database for mining microsatellites from unigenes , Nucleic Acids Research, Volume 36, Issue suppl_1, 1 January 2008, Pages D53–D56, https://doi.org/10.1093/nar/gkm811
- Share Icon Share
Abstract
Microsatellites, also known as simple sequence repeats (SSRs) or short tandem repeats (STRs), have extensively been exploited as molecular markers for diverse applications. Recently, their role in gene regulation and genome evolution has also been discussed widely. We have developed UgMicroSat db (Unigene MicroSatellite d ata b ase), a web-based relational database of microsatellites present in unigene sequences covering 80 genomes. UgMicroSat db allows microsatellite search using multiple parameters like microsatellite type (simple perfect, compound perfect and imperfect), repeat unit length (mono- to hexa-nucleotide), repeat number, microsatellite length and repeat sequence class. Microsatellites can also be retrieved by specifying EST, cDNA, CDS identity or by using Gene Index, GenBank, UniGene IDs. The database also provides information about trinucleotide repeats encoding various amino acids. Such codon repeats can be searched by specifying characteristics of coded amino acids like charge (basic, acidic or neutral), polarity (polar or non-polar), and their hydrophobic or hydrophilic nature. The nucleotide sequences of the target UniGenes are also provided to facilitate primer designing for PCR amplification of the desired microsatellite. UgMicroSat db is available at http://ipu.ac.in/usbt/UgMicroSatdb.htm .
INTRODUCTION
Microsatellites represent arrays of 1–6 bp tandem repeats in DNA. These sequences have proved very useful as molecular markers in diverse areas of genetic research including genome characterization and mapping ( 1 ). Recently, microsatellites have also been implicated to play some role in gene regulation and genome evolution ( 2–4 ). Association of trinucleotide repeats with many human diseases has been reported over the years ( 5 ). Some important examples include Huntington's disease and spinocerebellar ataxia (SCA) caused by expansion of CAG repeats ( 6 , 7 ), and oculopharyngeal muscular dystrophy caused by GCG expansion ( 8 ). Some other reports suggest association of trinucleotide repeats with various forms of cancer ( 9 , 10 ). (A) n repeats too have been assigned both cancer causing ( 11–13 ) and tumor-suppressive functions ( 14 ). Interestingly, trinucleotide repeats are now also being considered as potential therapeutic agents that act by triggering RNAi pathway as they are highly specific in silencing the mutant transcripts containing complementary repeats ( 15 ). In Drosophila, expansion of CAG repeats in homeobox gene DLX6 leads to cell death ( 16 ).
The importance of microsatellites has been appreciated in plant systems also. Microsatellites derived from EST sequences (EST–SSRs) have found immense utility in various research projects in recent years ( 17 ). EST–SSR markers have been preferred over genomic-SSR markers for plant improvement programmes owing to their higher interspecific transferability rate. Moreover, EST–SSRs are proposed to be the better candidates for gene tagging. EST based microsatellite markers have been developed for apricot and grape ( 18 ), barley ( 19 ) and wheat ( 20 ), to name a few. However, unlike animal systems, significance of microsatellites in transcriptional activities has not been well documented in plant systems.
Considering vast utility of microsatellites in the fields of medicine and agriculture, many research groups have attempted to characterize their abundance, distribution and genomic localization using in silico methods ( 21–23 ). Such tools enable research scientists to exploit microsatellites for different applications with greater efficiency and specificity. EST–SSR databases have been developed and released in public domain by different groups. Examples of such databases include PlantSSR database ( http://www.genome.clemson.edu/projects/ssr/ ) developed by Clemson University Genomics Institute (CUGI) and COS–EST–SSRs for cereals and legumes ( http://intranet.icrisat.org/gt1/ssr/ESTSSRClustersubmit.asp ). CMD (Cotton Microsatellite Database; 24 ) and SilkSatDb ( 25 ) provide microsatellite data from genomic sequences as well as EST sequences. Satellog ( 26 ) catalogs a number of disorders associated with mutations in microsatellite sequences near or within genes in humans. InSatDb ( 27 ) provides a compilation of microsatellites in five insect species. Such databases offer useful resources for various research activities aiming towards development of microsatellite markers and also for investigations focusing on deciphering the functional roles of microsatellites.
The databases and resources described above and elsewhere ( 28 ) remain specific and limited in their content and purpose. In particular, microsatellites within the transcriptionally active regions of the genome have not received the desired attention over a wide range of taxa. Such information is, however, necessary for undertaking various transcriptome based experimental studies. Therefore, there is a need to develop a platform for mining genic microsatellites to ensure, firstly their better utilization as molecular markers, secondly to understand fundamental questions concerning their abundance, distribution and evolution and thirdly to attribute putative function, if any, to these repeats. Considering this requirement, we have developed UgMicroSat db (Unigene MicroSatellite database), a relational database that provides information on microsatellites present within the unigenes across 80 eukaryotic genomes. A classified range of search options facilitate a user friendly and specific extraction interface. The database is so designed that unigene sequences harboring microsatellite(s) of interest can be extracted and further used, for example, in cross amplification experiments. We hope that information retrieved from the database may be helpful in opening new frontiers of basic and applied research on microsatellites.
CONSTRUCTION OF DATABASE
UgMicroSat db provides a catalogue of microsatellites occurring in unigene sequences of eukaryotic organisms belonging to a wide range of taxonomic groups. UniGene sequences were downloaded from The National Center for Biotechnology Information (NCBI) and scanned using a simple sequence repeat mining tool called MISA ( 19 ) that extracted microsatellite motifs and wrote them in tab delimited text files. This raw microsatellite information was processed using a set of C++ codes and Perl scripts. VMI_PRCS, a C++ code, processed statistics like size and position of microsatellites within the unigene sequences. VUG_PRCS, another C++ code, processed unigene IDs and sequences in the desired format. The data was reassembled using a Perl script VDATA_ASMBL and a data file was created. The data file was further formatted and then imported as a table in a MS-ACCESS database. Similar approach was adopted for all the individual sets of unigene sequences of different species. The overall scheme of database construction is explained in Figure 1 . The quick retrieval of information from UgMicroSat db has been ensured by creating small, specific sub-databases for different groups of organisms. Furthermore, within each sub-database, individual organism has been represented by a separate table. A parent database that indexes all the sub-databases and the tables therein maintains fast, efficient and precise communication with these sub-databases. The graphical user interface was constructed using Active Server Pages (ASP). The overall architecture of the database has been outlined in Figure 2 .


ACCESSING DATABASE
UgMicroSat db can be accessed to extract simple (perfect) repeats and compound (perfect and imperfect) repeats. Microsatellites can be mined using various search options viz. repeat unit length (mononucleotide to hexanucleotide), repeat sequence (motif search), microsatellite length, host cDNA, CDS, EST name, GenBank ID, gene index number or UniGene ID and microsatellite class search ( 29 ). For trinucleotide repeats, the database also gives data for codon repeats i.e. repeats that code for amino acids. The option ‘amino acid codon search’ allows search for repeats that code for all the 20 amino acids. Further, search can also be made for codon repeats on the basis of characteristics of amino acids like charge (basic, acidic and neutral), polarity (polar and non-polar) and hydrophobicity or hydrophilicity. The database also allows batch download, such that a user can download all the microsatellites mined in response to a particular query in a text file. The database is linked with NCBI for the retrieval of detailed information on unigene sequences based on their genbank IDs. Finally, the database allows the user to design primers for PCR amplification of the specific microsatellite locus by providing the selected unigene sequences harboring the particular microsatellite, and is also linked to Primer3, a primer designing tool ( 30 ). A quick help mode is provided with examples on how to fill in the search options for easy reference and navigation for the user. The search options are extensively explained with the help of some case studies on the database website. The database can serve as an immense source of information in understanding the microsatellite dynamics in the transcriptionally active regions. For example, information pertaining to a trinucleotide repeat CTC, whose length is between 10 and 20 base pair, present in platelet-derived growth factor alpha polypeptide (PDGFA) of Homo sapiens can easily be searched as shown in Figure 3 . The output also gives the GenBank IDs and gene index number along with the unigene IDs. Further, the details of the unigene and the localization of the microsatellite (start and end positions) are provided and the microsatellites are marked in lower capitals.

User interface for UgMicroSat db , showing (A) various input options, (B) output, and (C) Unigene sequence (only the start sequence, is shown) with microsatellite sequence (CTC)5 and position (52–66).
UTILITY OF THE DATABASE
UgMicroSat db is likely to be accessed by biologists engaged in research with diverse objectives in both plant and animal systems primarily to develop molecular markers and also to understand the functional significance of microsatellites in regulating gene expression and genome evolution. UgMicroSat db offers an important platform for a detailed comparative analysis of microsatellite repeats in genic regions for a wide range of species. The comprehensive options to search for simple and compound microsatellites and to identify the codon repeats in the genic regions allow users to explore new avenues of investigations on these repeats. The availability of unigene sequences for different aspects like designing primers for PCR amplification of desired motifs will facilitate studies on mutability, microsatellite abundance and variability across genomes, etc. Association of these microsatellites with a particular disease or phenotype may be explored by identifying their expansion and contraction possibilities in a given population. Microsatellite data can also be used to investigate various anomalies and disorders using candidate gene approach. Further, such information can be used to design synthetic oligonucleotides representing complementary repeats to be used in RNA interference based silencing to target mutant genes. Such approaches hold much therapeutic value. The database can largely be used to develop EST–SSR markers for various research programmes, particularly on genome mapping and gene tagging. Apart from hosting an extensive form of microsatellite data within the genes, UgMicroSat db is unique in a way as compared to the previously developed databases as it hosts microsatellite data covering a large number of organisms including both lower and higher forms of plants and animals. Relative mining of imperfect repeats of such a diverse range of organisms may provide tools to study the dynamics of microsatellites and also their association with similar or different types of repeats. To conclude, UgMicroSat db will serve as an important starting point whereby extracted information serves as an important input in designing experiments in new directions elucidating novel roles and functions of microsatellites in the unexplored transcriptomes.
FUTURE PERSPECTIVES
At present, UgMicroSat db hosts data on microsatellites occurring in unigene sequences of 80 genomes. UgMicroSat db team aims to update the database commensurating with updation of the NCBI unigene database. The flexible design of the database makes it feasible to increase the size of database to virtually any size without compromising with its fast data retrieval rate.
AVAILABILITY
UgMicroSat db can be freely accessed from http://ipu.ac.in/usbt/UgMicroSatdb.htm
ACKNOWLEDGEMENTS
Funding to pay the Open Access publication charges for this article was waived by the Oxford University Press.
Conflict of interest statement . None declared.
Comments