DemaDb: an integrated dematiaceous fungal genomes database

Many species of dematiaceous fungi are associated with allergic reactions and potentially fatal diseases in human, especially in tropical climates. Over the past 10 years, we have isolated more than 400 dematiaceous fungi from various clinical samples. In this study, DemaDb, an integrated database was designed to support the integration and analysis of dematiaceous fungal genomes. A total of 92 072 putative genes and 6527 pathways that identified in eight dematiaceous fungi (Bipolaris papendorfii UM 226, Daldinia eschscholtzii UM 1400, D. eschscholtzii UM 1020, Pyrenochaeta unguis-hominis UM 256, Ochroconis mirabilis UM 578, Cladosporium sphaerospermum UM 843, Herpotrichiellaceae sp. UM 238 and Pleosporales sp. UM 1110) were deposited in DemaDb. DemaDb includes functional annotations for all predicted gene models in all genomes, such as Gene Ontology, EuKaryotic Orthologous Groups, Kyoto Encyclopedia of Genes and Genomes (KEGG), Pfam and InterProScan. All predicted protein models were further functionally annotated to Carbohydrate-Active enzymes, peptidases, secondary metabolites and virulence factors. DemaDb Genome Browser enables users to browse and visualize entire genomes with annotation data including gene prediction, structure, orientation and custom feature tracks. The Pathway Browser based on the KEGG pathway database allows users to look into molecular interaction and reaction networks for all KEGG annotated genes. The availability of downloadable files containing assembly, nucleic acid, as well as protein data allows the direct retrieval for further downstream works. DemaDb is a useful resource for fungal research community especially those involved in genome-scale analysis, functional genomics, genetics and disease studies of dematiaceous fungi. Database URL: http://fungaldb.um.edu.my


Introduction
The kingdom fungi is made up of large eukaryotic organisms consisting of more than 100 000 species, including unicellular yeasts and multicellular fungi known as moulds and mushrooms (1). Dematiaceous fungi (brown-pigmented) are a large and heterogeneous group of moulds that produce melanin pigment in the cell wall of fungal hyphae or the conidia (2). Dematiaceous fungi occupy a plethora of niches, being found in soil, wood, as well as associated with plants as endophytes, saprophytes, parasites or plant pathogens (3)(4)(5). Until 2008, >130 species from 70 genera of dematiaceous fungi have been implicated in a wide range of human diseases, such as eumycetoma, chromoblastomycosis and phaeohyphomycosis (6). Alternaria spp., Bipolaris spp., Cladophialophora bantiana, Curvularia spp., Exophiala spp., Fonsecaea pedrosoi, Madurella spp., Scedosporium prolificans, Neoscytalidium dimidiatum and Wangiella dermatitidis are among the most important human pathogens commonly found in the tropical and subtropical climates (2). Additional reported cases worldwide further expand the existing long list of potential pathogens.
From 2008 to 2015, we have isolated a total of 437 dematiaceous fungi in the Mycology Unit of University Malaya Medical Centre (UMMC), Malaysia. These clinical isolates were recovered from superficial skin samples, nails, subcutaneous tissues, and nasopharyngeal secretion, blood, and tissue biopsies (7). Among these isolates, one strain of Pyrenochaeta unguis-hominis and two strains of Nigrospora oryzae demonstrated potential multidrug resistance features. In addition, several of the isolates are rare human pathogens or nonreported human pathogens, such as Bipolaris papendorfii (8), Daldinia eschscholtzii (9), Pyrenochaeta sp. (10), Ochroconis sp. (11) and Cladosporium sphaerospermum (12). Phylogenetic relationship of these dematiaceous fungi has been described by Yew et al. (7). The internal transcribed spacer (ITS)-based phylogenetic analysis resolved them into four distinct classes of Dothideomycetes, Sordariomycetes, Eurotiomycetes and one unclassified cluster (7).
The rapid advancement of Next-Generation Sequencing technologies has led to the sequencing of many fungal genomes, paving the way to decipher their biology and the underlying mechanisms of fungal pathogenicity and multidrug resistance. Several web-based analyses are also available for annotation of genes predicted from highthroughput genomic data to gain insight into the fungal living system machinery (13). However, the true challenge is to integrate the multiple sources of genomics data into useful information (14). In this work, we design the DemaDb that enables mycologists to access easily and analyse the genomics data using a genome browser. Currently, a total of eight genomes (B. papendorfii, D. eschscholtzii, P. unguis-hominis, C. sphaerospermum, Ochroconis mirabilis, Herpotrichiellaceae sp. and Pleosporales sp.) have been integrated into the DemaDb. Considering that dematiaceous fungal genomes will be generated from future projects, it is essential to manage and integrate the data generated from different analyses in a more organized manner. The current version of DemaDb is freely available at fungaldb.um.edu.my and an improved version with additional genomes is forthcoming. Figure 1 reveals the database schema of the DemaDb. DemaDb is built using a typical LAMP (Linux, Apache, MySQL and PHP) stack as the back-end components, along with Javascript and CSS (Cascading Style Sheet) for the front-end Web User Interface. The data are processed and stored in the MySQL relational database. The genome information is stored under 'Genome Information' table that records all the detailed information such as taxonomy classification, sequencing statistics, assembly statistics, pictures, references and external links. The information of gene models, annotations and genome browser links for each fungal genome is stored under 'Gene characterization' tables. 'Annotation and Classification' table is used to store various detail annotation, including Pfam, InterProScan, Gene Ontology (GO), EuKaryotic Orthologous Groups (KOG), Kyoto Encyclopedia of Genes and Genomes (KEGG), Carbohydrate-Active enzymes (CAZymes), peptidases, secondary metabolites and virulence factors.

Usage and Utility of Basic Data
A total of eight dematiaceous fungi that isolated from various samples (skin scraping, blood and nasopharyngeal secretion) were collected from University of Malaya between 2011 and 2014 ( Table 1). The genomes of these rare and potential human pathogens were then sequenced and deposited in DemaDb. DemaDb is a web database that built based on a relational database system that contains dematiaceous fungal genomic profiles. The genome in DemaDb has its individual profile page with strain characteristics and genomic details. The information included on this page comprises the strain details, colonial characteristics on Sabouraud Dextrose Agar, microscopic morphology, taxonomic classification, assembly statistics and gene models ( Figure 2A). The individual fungal profiles are linked to the NCBI taxonomic browser and The Catalogue of Life (15) to provide information on taxonomic hierarchy, distribution and ecological environment. Users can also explore the genome sequencing statistics in a single-genome context to gain insight into each sequencing platforms, read type, library size, genome size and sequencing coverage ( Figure 2B). An overview of genome statistics ( Table 2) and gene models (Table 3) of all genomes provides a basic comparative genomic analysis.

Uniform Functional Annotation
Application of the same annotation pipeline to all genomes is necessary for data integration in the DemaDb comparative genomic framework. All the raw data were pre-processed, assembled and functionally annotated using our pipeline. The compilation pipeline is provided in Figure  3. Protein-coding gene models were predicted from repeatmasked genome using GeneMark-ES version 2.3e (16). The annotation of protein-coding gene models was completed using BLAST (Basic Local Alignment Search Tool) alignments of fungal genomes against NCBI non-redundant (nr) protein and SwissProt databases. Individual rRNAs and tRNAs were identified using RNAmmer v1.2 (17) and tRNAscan-SE v1.3.1 (18), respectively. All putative proteins were then functionally annotated. Pfam protein families database (19) and InterproScan 5 (20) were used to identify functional domains and sites in all predicted protein models. GO and KEGG metabolic pathways matches were carried out using local BLAST2GO tools (21). All the predicted proteins were also ascribed to 21 different functional groups based on KOG (22) for additional functional interpretation. The CAZymes was annotated by submitting the predicted protein models to the databases of automated Carbohydrate-active enzyme ANnotation (dbCAN) (23). The peptidases were identified by mapping all protein models against MEROPS database (24). Genomic mapping of fungal secondary metabolite clusters was performed using web-based SMURF (Secondary Metabolite Unknown Regions Finder) (www.jcvi.org/smurf/) (25). The putative virulence factor was predicted using PHI-base (The Pathogen-Host Interaction Database) (26).
The number of genes listed in the categories of KEGG, EC, GO, KOG, Pfam, Interpro, CAZyme, secondary metabolite, peptidase and virulence factor is available for comparison either among genomes in the DemaDb or other genomes outside the DemaDb (Tables 4). Each category is linked to a       detailed annotation report for every predicted protein in the individual genome. For example, users can explore peptidase families, the range of peptidase, active site residues, ligands for catalytic metal ions and E-value for the match in all predicted peptidases (Tables 4). In-depth multidimensional analysis can be performed using these data to obtain clues about their fungal lifestyle, adaptability, mating development, mechanisms underlying pathogenicity and drugs resistance. The details for every predicted gene, including gene ID, NCBI nr annotation, SwissProt annotation, GO annotation, amino acid sequence, protein size, domain sites, EC number, as well as the detailed functional annotation reports are available in the DemaDb (Figure 4). Functional annotations for Herpotrichiellaceae sp. UM 238 and Pleosporales sp. UM 1110 are in progress and will be included in an updated version of DemaDb.

Genome Browser
Generic Genome Browser (GBrowse) developed by GMOD (27) is integrated into DemaDb to allow graphical web visualization of the genomic data. The DemaDb genome browser is accessible through the links at each genome project portal page and genome's gene annotation & classification page, allowing the users to explore the genes of interest in a single-genome context. It also provides the navigation of the genomic regions for all eight fungal genomes in DemaDb, which can be freely switched through the drop-down menu of the data source box. Landmark or Region textbox acts as a universal search box, in which user could search by entering several types of inputs, including, but not limited to, range, gene name, chromosome name and description. It displays the predicted features of gene models along with their functional description ( Figure 5). Single click on each gene track features will bring up a page showing additional annotation data retrieved from DemaDb database. By clicking on the gene ID, users will be brought to the detail description page of the annotation and amino acid sequence of the gene. The mRNA/CDS tracks are linked to additional gene details describing the gene structures and sequences ( Figure 5). Additional tracks such as the DNA/GC Content track, 6-frame translation track and frame usage track can be displayed and configured using the toolbar.

Pathway Browser
The DemaDb Pathway Browser based on the KEGG pathway database enables users to analyse molecular interaction and reaction networks for all KEGG annotated genes. Pathway Browser provides a summary of metabolic pathways of each fungus and number of genes in each pathway. Users can select the pathways that they wish to analyse without the need for laborious search for all genes in every KEGG pathway. The top 15 hit pathways are shown in the Pathway Browser ( Figure 6[TQ2]). Individual elements such as EC number and genes that are involved in specific pathway are linked to the corresponding pages for additional details. Users can also freely search for pathway map for all KEGG annotated genes ( Figure 6).

Downloadable Files
All assembly, nucleotides and proteins data can be downloaded in FASTA format to accommodate users who wish to pre-process the data based on additional parameters or conduct large-scale comparative genomic analysis.

Future Perspective
DemaDb will be updated whenever a new genome and its detailed analysis are completed. Details for each release will be displayed in the homepage. Users are also encouraged to deposit and integrate their fungal genomes into DemaDb, which will be coordinated by the principal investigator of DemaDb. The users can send their assembled sequences (fasta format), predicted genes (fna format) or functional annotated proteins (faa format) directly to the principal investigator where the contact details are available in the 'Contact' section. The quality of the raw data will be accessed and reviewed, and, if approved, will be annotated and deposited into the DemaDb. Such collective works are important for comparative genomic analysis to gain a better understanding of dematiaceous fungi functional diversity and evolution. Up to date, we have produced a total of 398 ITS sequences from various dematiaceous fungi recovered from different anatomical sites. This large set of data will be archived into an updated version of DemaDb. Apart from the ITS sequences, isolation source as well as macroscopic and microscopic characteristics of these isolates will be incorporated to provide comprehensive information of each isolate. Also, a taxonomic engine will be constructed to allow users to identify their fungal isolate by conducting a similarity search against our curated database, and an ITS-based phylogenetic tree will be generated for taxonomic classification. This extensive collection of data would serve as a referral point for fungi isolated in tropical countries, such as Malaysia. As some fungal species appeared to be geographically restricted (28), this would provide a glimpse of idea on the commonly encountered dematiaceous fungal pathogen in Malaysia. To the best of our knowledge, there is no integrated view of dematiaceous fungal gene expression profiles obtained from RNA-sequencing (RNA-seq). Currently, we are performing RNA-seq of these clinical isolates. In the future, the DemaDb database will collect these RNA-seq data and use a standardized method to identify the gene expression levels. Integration of RNA-seq data into the genome browser will be one of the new features in the upcoming DemaDb database. Conflict of interest. None declared. Figure 6. The layout of Pathway Browser for B. papendorfii UM 226 genome. The total number of genes and EC numbers for each pathway is displayed in the Pathway Browser page. Single click on each EC Number or gene will bring up a page showing additional information of genes that are involved in the specific pathway. Each KEGG annotated gene is also linked to a pathway map. The coloured EC numbers indicate that the genes are mapped to that EC number in the pathway.