Abstract

The methylotrophic yeast Hansenula polymorpha is a recognised model system for the investigation of peroxisomal function, special metabolic pathways such as methanol metabolism, nitrate assimilation and thermostability. Strain RB11, an odc1 derivative of the H. polymorpha isolate CBS4732 (synonymous with ATCC34438, NRRL-Y-5445, CCY38-22-2), has been developed as a platform for heterologous gene expression. The scientific and industrial significance of this organism is now being met by the characterisation of its entire genome. The H. polymorpha RB11 genome comprises approximately 9.5 Mb and is organised as six chromosomes ranging in size from 0.9 to 2.2 Mb. Over 90% of the genome was sequenced with high accuracy and assembled into 48 contigs organised on eight scaffolds (supercontigs). After manual annotation, 5933 open reading frames (ORFs) were predicted, of which 4767 showed significant homology to entries in a non-redundant protein database. The remaining 1166 ORFs showed no significant similarity to known proteins. The number of ORFs is comparable to that of other sequenced budding yeasts of similar genome size.

Introduction

Yeasts constitute an important group of industrial microorganisms. Its long tradition of human use and the extensive knowledge of its genetics and physiology have made the baker's yeast Saccharomyces cerevisiae a eukaryotic model organism for basic research and industrial applications [1]. In 1996, it became the first eukaryotic organism for which the complete genome sequence was established [2]. The initial focus on S. cerevisiae has since been extended by investigations of a range of alternative yeast species. As a consequence, the number of fully or partially sequenced budding yeast genomes has continued to grow. Among others, a comparative genomic exploration of 13 species selected from the hemiascomycetous yeasts was conducted [3].

The methylotrophic yeast Hansenula polymorpha (syn. Pichia angusta) is one of the most important industrially applied non-conventional yeasts [4,5]. H. polymorpha is a ubiquitous yeast species occurring naturally in spoiled orange juice, maize meal, the gut of various insect species and soil. It grows as white to cream, butyrous colonies and does not form filaments [6]. H. polymorpha isolates are homothallic and reproduction occurs vegetatively by budding. H. polymorpha belongs to the fungal family Saccharomycetaceae, subfamily Saccharomycetoideae [6,7]. Most research has been performed with three basic strains designated H. polymorpha DL-1, CBS4732 and NCYC495. These strains are of independent origin and unclear relationship and exhibit different features, including different chromosome numbers. Depending on strain and separation conditions, between two and seven chromosomes can be distinguished [8,9]. Strain CBS4732 (syn. ATCC34438, NRRL-Y-5445, CCY38-22-2) was originally isolated from soil irrigated with waste water from a distillery in Pernambuco, Brazil [10]. Its odc1 derivatives LR9 [11] and RB11 [12] have been developed as hosts for heterologous gene expression [12]. Recombinant compounds produced in these hosts include enzymes like the feed additive phytase [13,14], anticoagulants like hirudin and saratin [15–17] and an efficient vaccine against hepatitis B infection [18–20]. The significance of H. polymorpha in basic research stems largely from studies focussed on peroxisome homeostasis [21] and nitrate assimilation [22].

Although much is known about the physiology, biochemistry and ultrastructure of this yeast (for a review see the monograph on H. polymorpha [4]), little information is available about its genomic structure and function [23]. Several groups worldwide initiated studies on its genome several years ago. Within the comparative genome analysis of 13 hemiascomycetous yeasts mentioned above, part of the H. polymorpha (P. angusta) genome sequence was established using a partial random sequencing strategy with a coverage of 0.3 genome equivalents. Using this approach, about 3 Mb of raw sequence data of the H. polymorpha genome was obtained [3]. We performed a genome analysis aimed at higher coverage, using a BAC-to-BAC approach. This work has now culminated in a comprehensive genome analysis of this organism. A first description of the data generated is provided in this study. Access to the genome data can be granted upon request (G.G.) and after signing a Material Transfer Agreement. Access has already been granted to six academic groups working on various aspects of functional genomics of H. polymorpha.

The present paper describes the results of the sequencing and characterisation of 8.733 Mb assembled into 48 contigs. The sequence covers over 90% of the estimated total genome content of 9.5 Mb located on six chromosomes ranging in size between 0.9 and 2.2 Mb [23]. The established sequence contains 5933 ORFs.

Materials and methods

Construction of the genomic BAC library

For sequencing, H. polymorpha strain RB11, an odc1 derivative of wild-type strain CBS4732, was selected [12]. For the construction of the genomic BAC library of H. polymorpha, the vector pBACe3.6 was used and prepared according to Osoegawa et al. [24]. H. polymorpha cells from a 50 ml YPD (1% yeast extract, 2% peptone, 2% glucose) culture were washed twice with TSE buffer (25 mM Tris–HCl, 300 mM sucrose, 25 mM EDTA, pH 8) and resuspended in TSE buffer. Agarose plugs were then prepared from these cells according to the Bio-Rad manual of the Chef DR II pulsed-field gel electrophoresis system (PFGE system) using 1.5% low melting point agarose. Pre-electrophoresis was carried out on a Bio-Rad PFGE system. Partial digestion of genomic DNA was carried out according to Osoegawa et al. [25] using Sau3AI for restriction. Gel electrophoresis was carried out on a Bio-Rad PFGE system according to the conditions given at Rod Wing's homepage (Clemson University, Genomics Institute, construction of BAC libraries protocol: 6 V cm−1, 90 s pulse, 13°C, 18 h). Agarose digestion with gelase, ligation and transformation were carried out using the same protocol. Subsequent electroporation of DH10B cells (Invitrogen) was again carried out according to Osoegawa et al. [25], and bacteria were plated onto 2×YT plates supplemented with chloramphenicol as selecting agent. Clones obtained from that procedure were picked and used to inoculate 1.2 ml of 2×YT supplemented with chloramphenicol. These bacterial cultures were used to prepare glycerol stocks in 96-well microtitre plate format as a resource for all subsequent work.

Construction of shotgun libraries from BAC DNA

Large-scale preparations of BAC DNA were carried out using the Large-Construct kit from Qiagen (Qiagen GmbH, Hilden, Germany; cat. no. 12462). After sonication and enzymatic repair of the ends, fragments of the desired size (usually 1.2–1.5 kb) were isolated from a 1% preparative agarose gel using the MinElute Gel Extraction kit (Qiagen, cat. no. 28604) and inserted into a SmaI-digested and alkaline phosphatase-treated pUC19 vector [26]. Ligation was carried out with the Rapid Ligation kit (Roche) according to the manufacturer's protocol. The ligation mixture was then desalted using a QIAquick kit (Qiagen, cat. no. 28304) according to the instructions of the supplier, except that the elution step was carried out with ddH2O. One-tenth volume of the eluted DNA was used for transformation of competent Escherichia coli DH10B cells using a Genepulser II device (Bio-Rad). Then, 1 ml of Luria–Bertani (LB) medium [26] was added and the cells were incubated for 1 h at 37°C. Aliquots of 1/200 and 1/20 volume of the transformed cells were plated onto Petri dishes containing LB agar, ampicillin, X-Gal and isopropyl β-D-thiogalactopyranoside (IPTG) [26] and grown overnight at 37°C to determine the yield of recombinant clones. Usually the transformation rate was greater than 10^8 transformants per μg vector DNA and the white:blue ratio was approximately 10:1 or better.

Plasmid preparation of shotgun clones

For subsequent DNA sequencing, plasmid DNA from white colonies was isolated after growth in 1.2 ml 2×YT cultures containing ampicillin for 24 h at 37°C and shaking at 220 rpm. Plasmid purification of shotgun clones was carried out using the REAL Prep 96 kit (Qiagen, cat. no. 26173).

DNA sequencing

DNA sequencing reactions were set up using BigDye Terminator v 2.0 cycle sequencing chemistry (Applied Biosystems, cat. no. 4314416) and purified using the DyeEx 96 (Qiagen, cat. no. 63183). Sequencing data were generated using ABI Prism 3700 sequence analyzers.

Sequence assembly

Base calling and quality checks were carried out using Phred [27]. Sequences were assembled with Phrap and editing was performed after import into gap4. BAC assemblies and raw data were visualised and edited using the STADEN package (version 4.5; developed by Roger Staden et al.; http://www.mrc-lmb.cam.ac.uk/pubseq/staden_home.html).

Automated bioinformatic annotation

Fully automated annotation was carried out using the ConSequence™ software system provided by Qiagen (based on Pedant-Pro™ from Biomax Informatics AG) [28].

Results and discussion

Genome sequencing

A BAC library with more than 17× coverage was constructed in pBACe3.6 and characterised by end-sequencing and restriction digestion. Insert sizes of BAC clones ranged from below 50 kb to over 100 kb per clone. A total of 2880 BAC clones were generated with an average insert size of 65 kb. In all, 4892 BAC-end sequences were generated with an average read length of 483 bases (phred20). The BAC-end sequencing success rate was 85.5%. In total, 213 BAC clones were selected for analysis, of which 188 BACs representing the minimal tiling path were shotgun sequenced BAC-by-BAC. Sequencing coverage of BACs was 8.27-fold on average (Fig. 1). Of these BACs, 162 assembled into a single contig, 15 into two contigs, 9 into three contigs and 2 into four contigs.
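The library figures above can be cross-checked with simple arithmetic. The following Python sketch is illustrative only: the variable and function names are ours, and the 9.5 Mb genome size is the estimate quoted elsewhere in this paper. It computes the nominal clone coverage (total cloned DNA divided by genome size) and tallies the contig distribution of the shotgun-sequenced BACs.

```python
# Illustrative cross-check of the BAC library figures quoted in the text.
# All names are hypothetical; the input values are taken from the paper.

GENOME_SIZE_KB = 9_500   # estimated genome size (9.5 Mb)
N_CLONES = 2880          # BAC clones in the library
MEAN_INSERT_KB = 65      # average insert size per clone


def nominal_coverage(n_clones, mean_insert_kb, genome_kb):
    """Nominal library coverage = total cloned DNA / genome size."""
    return n_clones * mean_insert_kb / genome_kb


# Shotgun-sequenced BACs, grouped by the number of contigs per BAC.
contigs_per_bac = {1: 162, 2: 15, 3: 9, 4: 2}

coverage = nominal_coverage(N_CLONES, MEAN_INSERT_KB, GENOME_SIZE_KB)
total_bacs = sum(contigs_per_bac.values())
single_contig_fraction = contigs_per_bac[1] / total_bacs

print(f"nominal coverage: {coverage:.1f}x")
print(f"shotgun-sequenced BACs: {total_bacs}")
print(f"BACs closed into one contig: {single_contig_fraction:.1%}")
```

The nominal value counts every clone at the mean insert size, so it is expected to lie somewhat above the coverage reported for the library; the tally of BACs reproduces the 188 clones of the minimal tiling path.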

Fig. 1. Summary of sequencing statistics.

Genome assembly

The BAC library constructed covers the genome 18-fold. The 4892 BAC-end sequences from those clones yielded approximately 2.4 Mb of raw data, covering 25% of the genome (at 1×). On average, one BAC-end sequence is located every 2 kb along the genome, suggesting an estimated genome size of about 9.78 Mb. Pulsed-field gel electrophoresis of H. polymorpha RB11 chromosomes revealed six bands, and the sum of the molecular masses of the chromosomal DNA bands suggested a genome size of about 9–10 Mb [5] (Table 1). Mapping the end sequences onto the growing and eventually final genomic sequence showed a very even distribution of those end sequences with no local clustering, underlining the good random cloning of large genomic sub-fragments into this BAC library. The only exceptions were clones and end sequences falling into the rDNA region of the genome. No further large repetitive regions were noticed. Smaller repeat regions have all been resolved for each individual BAC. Further, no repeats were found within BAC/BAC overlapping regions that could have confounded a correct BAC-to-BAC assembly. In addition to the BAC-to-BAC assembly based on overlapping regions, all BAC-end sequences with their forward/reverse constraints per clone, as well as sizing information for individual BAC clones, were used to layer a BAC map on top of the resulting assemblies. The consistency of the assembly was checked against that BAC map for each BAC/BAC overlap and assembly. No discrepancies were detected between any single BAC/BAC overlap assembly and the BAC map backbone.
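The BAC-end statistics in this paragraph follow from three numbers: the count of end reads, their mean length and their mean spacing. A minimal Python sketch of that arithmetic is given below; the constant names are ours, and the 9.5 Mb reference size is the genome estimate used throughout the paper.

```python
# Back-of-the-envelope check of the BAC-end statistics (values from the text).
# Constant names are hypothetical and chosen for readability only.

N_END_READS = 4892           # BAC-end sequences
MEAN_READ_BP = 483           # average phred20 read length
END_SPACING_BP = 2_000       # about one BAC end every 2 kb
GENOME_SIZE_BP = 9_500_000   # estimated genome size used as reference

raw_bp = N_END_READS * MEAN_READ_BP                 # total BAC-end raw data
fraction_covered = raw_bp / GENOME_SIZE_BP          # share of genome at 1x
genome_estimate_bp = N_END_READS * END_SPACING_BP   # spacing-based estimate

print(f"raw BAC-end data: {raw_bp / 1e6:.2f} Mb")
print(f"fraction of genome at 1x: {fraction_covered:.1%}")
print(f"genome size estimate: {genome_estimate_bp / 1e6:.2f} Mb")
```

Under these inputs the raw data amount to about 2.4 Mb, roughly a quarter of the genome, and the spacing-based estimate is about 9.78 Mb, in line with the values stated above.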

Table 1. Overview of genome organisation and assembled sequences in supercontigs

Chromosome karyotype | Size (Mb) | Chromosome marker | Sequencing supercontig | Size (bp)
I | 0.95 | URA3; CPY (PRC1); GAP | | 968 770
II | 1.25 | rDNA (5.8S, 18S, 26S) | | 983 699
III | 1.5 | HARS1 | | 1 220 583
IV | 1.7 | PEP4 (PRA1); TPS1 | | 1 290 524
V | 1.9 | MOX | | 1 306 376
VI | 2.2 | FMD | | 1 494 936
 | | | | 218 529
 | | | | 1 250 065
Total | 9.5 | | | 8 733 482

The genome was assembled into 48 contigs. Using clones that physically bridge known gaps, these contigs could be joined logically into eight supercontigs with a unique total size of 8.733 Mb, representing the six known chromosomes; gene markers have been assigned to the electrophoretically separated chromosomes [5] (Table 1 and Fig. 2). Sequence overlaps between individual BACs with a total size of 1.521 Mb (approximately 15% of the total sequence generated) were used to measure the sequencing accuracy, which was determined to be 99.998%, or fewer than 1.75 errors per 100 kb. As the same technologies, expertise and work scheme were applied to all sequencing work, we conclude from this analysis that more than 90% of the total genome was sequenced with this high accuracy of 99.998%. The estimated 10% of the genome not yet sequenced includes telomeric regions, approximately 45–50 additional rDNA repeats (approximately 0.3 Mb in total) and small gaps, some of which are indicated as boxes in Fig. 2. These results indicate that using end sequencing to map the BAC clones allowed high accuracy and eventual direct alignment onto the assembled genomic contigs, as well as sequence comparisons between all sequences obtained during the course of the project (BACs, but also shotgun sequences from three different shotgun libraries with inserts in the 1, 3 and 6–8 kb range).
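The relation between a percentage accuracy and an error rate per 100 kb can be made explicit. The Python sketch below is a hedged illustration with hypothetical function names; it also assumes that "total sequence generated" means the unique assembled sequence plus the BAC/BAC overlaps, which the text does not state explicitly.

```python
# Relating percentage accuracy to errors per 100 kb (values from the text).
# Function names are hypothetical; see the stated assumption on the
# denominator used for the overlap fraction.

OVERLAP_BP = 1_521_000     # BAC/BAC overlap used for the accuracy estimate
ASSEMBLED_BP = 8_733_482   # unique assembled sequence (Table 1 total)


def errors_per_100kb(accuracy_pct):
    """Convert an accuracy in percent to expected errors per 100 kb."""
    return (100.0 - accuracy_pct) / 100.0 * 100_000


def accuracy_from_errors(errors_100kb):
    """Convert errors per 100 kb back to an accuracy in percent."""
    return 100.0 * (1.0 - errors_100kb / 100_000)


# Assumption: total sequence generated = unique sequence + overlaps.
overlap_fraction = OVERLAP_BP / (ASSEMBLED_BP + OVERLAP_BP)

print(f"99.998% accuracy = {errors_per_100kb(99.998):.2f} errors per 100 kb")
print(f"1.75 errors per 100 kb = {accuracy_from_errors(1.75):.5f}% accuracy")
print(f"overlap share of generated sequence: {overlap_fraction:.1%}")
```

An error rate of 1.75 per 100 kb corresponds to 99.99825% accuracy, which rounds to the 99.998% quoted above, and the overlap share under the stated assumption is close to the approximately 15% given in the text.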

Fig. 2. Overview of supercontigs. The framed numbers within a stretch of BACs representing the respective supercontigs indicate the approximate size of a particular gap between neighbouring ends.

Genome organisation

The Pedant-Pro™ Sequence Analysis Suite was used for gene identification. From the 8.73 Mb of sequence, 5933 ORFs encoding proteins longer than 80 amino acids were extracted. ORFs whose sequence is entirely contained within another reading frame were excluded from the analysis. Seventy shorter ORFs (<80 amino acids) with significant BLAST similarities were extracted manually. Of the ORFs, 4767 show significant similarity to entries in a non-redundant protein database, and 4109 of these show significant similarity to ORFs from S. cerevisiae. The remaining 1166 ORFs have no significant similarity to known sequences. A total of 410 ORFs are shorter than 100 amino acids. ORF numbers are not directly comparable between genomes, because automatic gene-prediction methods differ and because the genomes themselves differ. Only an in-depth analysis will allow the number of questionable ORFs to be evaluated, and it may reduce the number of ORFs shorter than 100 amino acids. Calculation of gene density and protein length from the gene numbers gave an average ORF distance of 1472 bp and an average protein length of 437 amino acids. No experiments have yet been performed to evaluate these predictions.

Introns have been identified by homology to known proteins and confirmed using GeneWise [29]. In a preliminary analysis, 91 intron-containing genes were identified in this way. These include all genes identified previously [3] as intron-containing genes. In total, 80 tRNAs were identified, corresponding to all 20 amino acids. Of approximately 50 rRNA clusters [5], seven have been fully sequenced. Although they represent only about 10% of the estimated total number of rDNA repeats present in H. polymorpha, all seven fully sequenced repeats are completely identical and have a precise length of 5033 bp.

The main functional categories and their automatically predicted distribution in the gene set are: transposable elements, 1%; energy, 5%; cellular communication and signal transduction mechanism, 6%; protein synthesis, 6%; cell rescue, defense and virulence, 9%; cellular transport and transport mechanisms, 12%; cell cycle and DNA processing, 12%; protein fate (folding, modification, destination), 12%; transcription, 14%; and metabolism, 23% (Fig. 3). Localisation was assigned to 2858 ORFs.

Fig. 3. Functional comparison of S. cerevisiae and H. polymorpha gene content (general functional categories).

Comparison with S. cerevisiae sequences

The comparative genomic analysis of closely related organisms allowed us to identify species-specific genes and to estimate the rates of sequence divergence of the derived proteins. Comparing the genomic organisation of S. cerevisiae with that of H. polymorpha reveals differences and similarities at different levels (Table 2 and Figs. 3 and 4). The H. polymorpha genome exhibits an overall GC content of 47.9%, compared to 38.1% for the S. cerevisiae genome. The amino acid composition properties are essentially driven by GC content. The size of the S. cerevisiae genome is 13.5 Mb (sequenced non-redundant genome length 12 156 kb), compared to 9.5 Mb (sequenced non-redundant genome length 8733 kb) for H. polymorpha. For the comparison of H. polymorpha with S. cerevisiae we used the MIPS comprehensive yeast genome database CYGD [30]. It includes 6449 genes, of which 471 are marked as questionable. As the exact gene number of S. cerevisiae is still under debate in the literature [3], we have taken all MIPS genes into account for the comparisons. S. cerevisiae contains 6449 ORFs with an average distance of 1885 bp, compared to 5933 ORFs in H. polymorpha with an average distance of 1472 bp. The gene density in H. polymorpha thus appears higher than that in S. cerevisiae when the number of ORFs in the two organisms is related to the size of the respective genomes. An exhaustive synteny analysis has been performed between H. polymorpha and S. cerevisiae. It revealed clusters of up to eight syntenic proteins in both organisms. Six clusters were found to contain six syntenic proteins, two clusters contain seven syntenic proteins and one cluster contains eight syntenic proteins (Table 3).
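The average ORF distances quoted here can be reproduced by dividing the sequenced non-redundant genome length by the number of ORFs. The Python sketch below uses the values listed in Table 2; the dictionary layout and function name are our own illustrative choices, not part of the annotation pipeline.

```python
# Average ORF distance = sequenced genome length / number of ORFs.
# Input values are those listed in Table 2; names are hypothetical.

GENOMES = {
    "S. cerevisiae": {"sequenced_bp": 12_156_307, "orfs": 6449},
    "H. polymorpha": {"sequenced_bp": 8_733_442, "orfs": 5933},
}


def mean_orf_distance(sequenced_bp, orfs):
    """Return the sequenced length per ORF, in bp."""
    return sequenced_bp / orfs


for name, genome in GENOMES.items():
    distance = mean_orf_distance(genome["sequenced_bp"], genome["orfs"])
    density = 1000 / distance  # ORFs per kb of sequenced genome
    print(f"{name}: one ORF per {distance:.0f} bp ({density:.2f} ORFs per kb)")
```

Rounded to whole base pairs, this gives 1885 bp for S. cerevisiae and 1472 bp for H. polymorpha, which is the basis for the statement that gene density appears higher in H. polymorpha.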

Table 2. Comparisons of the S. cerevisiae and H. polymorpha genomes

Feature | S. cerevisiae | H. polymorpha
Genome size (Mb) | 13.5 | ∼9.5
Sequenced non-redundant genome length (bp) | 12 156 307 | 8 733 442
GC content (%) | 38.1 | 47.9
Number of ORFs (with similarities) | 6449 (5978) | 5933 (4767)
Average ORF distance (bp) | 1885 | 1472
Average protein length (aa) | 471 | 437
Number of tRNAs | 278 | 80

Fig. 4. Functional comparison of S. cerevisiae and H. polymorpha gene content (functional categories of metabolism).

Table 3. Synteny analysis between H. polymorpha and S. cerevisiae

H.p. BAC | H.p. ORF | BLAST E value | S.c. ORF | S.c. description | S.c. Chr.
cqbh_00 | orf129 | 7.00E−24 | ypr185w | APG13 – protein required for the autophagic process | 16
cqbh_00 | orf158 | 4.00E−57 | ypr186c | PZF1 – TFIIIA (transcription initiation factor) | 16
cqbh_00 | orf155 | 2.00E−39 | ypr187w | RPO26 – DNA-directed RNA polymerase I, II, III 18 kDa subunit | 16
cqbh_00 | orf135 | 0.0 | ypr189w | SKI3 – antiviral protein | 16
cqbh_00 | orf121 | 4.00E−69 | ypr190c | RPC82 – DNA-directed RNA polymerase III, 82 kDa subunit | 16
cqbh_00 | orf117 | 6.00E−50 | ypr191w | QCR2 – ubiquinol-cytochrome-c reductase 40 kDa chain II | 16
cqgr.00 | orf129 | 1.00E−42 | ylr403w | SFP1 – zinc finger protein | 12
cqgs.00 | orf143 | 1.00E−101 | ylr405w | similarity to Azospirillum brasilense nifR3 protein | 12
cqag_00 | orf148 | 3.00E−44 | ylr406c | RPL31B – 60S large subunit ribosomal protein L31.e.c12 | 12
cqhn.00 | orf161 | 4.00E−10 | ylr407w | hypothetical protein | 12
cqgr.00 | orf168 | 0.0 | ylr409c | strong similarity to Schizosaccharomyces pombe β-transducin | 12
cqhm.00 | orf177 | 0.0 | ylr410w | VIP1 – strong similarity to S. pombe protein Asp1p | 12
cqan_00 | orf362 | 2.00E−12 | yjr086w | STE18 – GTP-binding protein γ subunit of the pheromone pathway | 10
cqan_00 | orf357 | 5.00E−21 | yjr088c | weak similarity to S. pombe hypothetical protein SPBC14C8.18c | 10
cqan_00 | orf324 | 1.00E−123 | yjr090c | GRR1 – required for glucose repression and for glucose and cation transport | 10
cqan_00 | orf304 | 3.00E−77 | yjr091c | JSN1 – suppresses the high-temperature lethality of tub2-150 | 10
cqan_00 | orf248 | 1.00E−32 | yjr092w | BUD4 – budding protein | 10
cqan_00 | orf231 | 1.00E−14 | yjr093c | FIP1 – component of pre-mRNA polyadenylation factor PF I | 10
cqan_00 | orf230 | 1.00E−24 | yjr094w-a | RPL43B – 60S large subunit ribosomal protein | 10
cqga.00 | orf27 | 5.00E−38 | ygr091w | PRP31 – pre-mRNA splicing protein | 7
cqga.00 | orf19 | 1.00E−167 | ygr092w | DBF2 – ser/thr protein kinase related to Dbf20p | 7
cqga.00 | orf15 | 4.00E−46 | ygr093w | similarity to hypothetical S. pombe protein | 7
cqga.00 | orf30 | 0.0 | ygr094w | VAS1 – valyl-tRNA synthetase | 7
cqga.00 | orf42 | 3.00E−28 | ygr095c | RRP46 – involved in rRNA processing | 7
cqga.00 | orf56 | 7.00E−45 | ygr096w | similarity to bovine Graves disease carrier protein | 7
cqfq.00 | orf75 | 4.00E−30 | ygl191w | COX13 – cytochrome-c oxidase chain VIa | 7
cqfq.00 | orf72 | 1.00E−145 | ygl190c | CDC55 – ser/thr phosphatase 2A regulatory subunit B | 7
cqfq.00 | orf70 | 8.00E−33 | ygl189c | RPS26A – 40S small subunit ribosomal protein S26e.c7 | 7
cqfq.00 | orf68 | 4.00E−40 | ygl187c | COX4 – cytochrome-c oxidase chain IV | 7
cqfq.00 | orf64 | 5.00E−36 | ygl185c | weak similarity to dehydrogenases | 7
cqfq.00 | orf60 | 3.00E−87 | ygl184c | STR3 – strong similarity to Emericella nidulans and similarity to other cystathionine β-lyase and Cys3p | 7
cqav_00 | orf272 | 8.00E−24 | ygl111w | weak similarity to hypothetical protein S. pombe | 7
cqav_00 | orf276 | 2.00E−55 | ygl110c | similarity to hypothetical protein SPCC1906.02c S. pombe | 7
cqav_00 | orf330 | 1.00E−25 | ygl106w | MLC1 – Myo2p light chain | 7
cqav_00 | orf315 | 8.00E−69 | ygl105w | ARC1 – protein with specific affinity for G4 quadruplex nucleic acids | 7
cqav_00 | orf294 | 2.00E−62 | ygl103w | RPL28 – 60S large subunit ribosomal protein L27a.e | 7
cqav_00 | orf292 | 2.00E−14 | ygl102c | questionable ORF | 7
cqav_00 | orf263 | 1.00E−100 | ygl100w | SEH1 – nuclear pore protein | 7
cqbp_00 | orf17 | 2.00E−46 | ydr447c | RPS17B – ribosomal protein S17.e.B | 4
cqbp_00 | orf21 | 1.00E−125 | ydr448w | ADA2 – general transcriptional adapter or co-activator | 4
cqbp_00 | orf26 | 5.00E−67 | ydr449c | similarity to hypothetical protein S. pombe | 4
cqbp_00 | orf75 | 2.00E−63 | ydr450w | RPS18A – ribosomal protein S18.e.c4 | 4
cqbp_00 | orf68 | 5.00E−15 | ydr451c | YHP1 – strong similarity to Yox1p | 4
cqbp_00 | orf189 | 1.00E−135 | ydr452w | PHM5 – similarity to human sphingomyelin phosphodiesterase (PIR:S06957) | 4
cqaq_00 | orf216 | 2.00E−22 | ydr362c | TFC6 – TFIIIC (transcription initiation factor) subunit, 91 kDa | 4
cqaq_00 | orf202 | 2.00E−75 | ydr365c | weak similarity to Streptococcus M protein | 4
cqaq_00 | orf191 | 3.00E−23 | ydr367w | similarity to hypothetical protein SPAC26H5.13c S. pombe | 4
cqaq_00 | orf165 | 2.00E−96 | ydr372c | similarity to hypothetical S. pombe protein | 4
cqaq_00 | orf180 | 1.00E−139 | ydr375c | BCS1 – mitochondrial protein of the CDC48/PAS1/SEC18 (AAA) family of ATPases | 4
cqaq_00 | orf251 | 1.00E−104 | ydr380w | ARO10 – similarity to Pdc6p, Thi3p and to pyruvate decarboxylases | 4
cqaq_00 | orf245 | 1.00E−20 | ydr381w | YRA1 – RNA annealing protein | 4
cqaq_00 | orf236 | 2.00E−20 | ydr382w | RPP2B – 60S large subunit acidic ribosomal protein | 4
cqdw.p1 | orf217 | 7.00E−82 | ydr061w | similarity to E. coli modF and photorepair protein phrA | 4
cqdw.p1 | orf208 | 0.0 | ydr062w | LCB2 – serine C-palmitoyltransferase subunit | 4
cqdw.p1 | orf271 | 1.00E−28 | ydr067c | similarity to YNL099c | 4
cqdw.p1 | orf228 | 4.00E−90 | ydr069c | DOA4 – ubiquitin-specific protease | 4
cqdw.p1 | orf257 | 5.00E−34 | ydr071c | similarity to Ovis aries arylalkylamine N-acetyltransferase | 4
cqdw.p1 | orf250 | 7.00E−57 | ydr072c | IPT1 – mannosyl diphosphorylinositol ceramide synthase | 4
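The S. cerevisiae chromosome of each entry in Table 3 can be read directly from the systematic ORF name: the second letter encodes the chromosome (A for chromosome 1 up to P for chromosome 16), the third letter the chromosome arm (L or R), and the final letter the coding strand (w or c). The following Python sketch decodes this convention; the function name and regular expression are ours and serve only as an illustration.

```python
# Decode the chromosome number from an S. cerevisiae systematic ORF name,
# e.g. "ypr185w" -> 16. Helper name and pattern are hypothetical.
import re

_SYSTEMATIC_NAME = re.compile(
    r"^y([a-p])([lr])(\d{3})([wc])(-[a-z])?$", re.IGNORECASE
)


def chromosome_of(orf_name):
    """Return the chromosome number (1-16) encoded in a systematic name."""
    match = _SYSTEMATIC_NAME.match(orf_name)
    if match is None:
        raise ValueError(f"not a systematic ORF name: {orf_name!r}")
    # Second letter of the name: A = chromosome 1, ..., P = chromosome 16.
    return ord(match.group(1).lower()) - ord("a") + 1


for name in ("ypr185w", "ylr403w", "yjr094w-a", "ygr091w", "ydr447c"):
    print(name, "-> chromosome", chromosome_of(name))
```

Applied to the ORF names in Table 3, this assigns the ypr, ylr and yjr entries to chromosomes 16, 12 and 10, and the ygr/ygl and ydr entries to chromosomes 7 and 4, respectively.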

Overall, 80 nuclear tRNA genes were identified in the H. polymorpha genome sequence (Table 4), compared to 278 tRNA genes found in S. cerevisiae. Despite this difference, both yeasts have nearly the same number of different tRNA species: 40 in H. polymorpha and 41 in S. cerevisiae. The lower number of tRNA genes in H. polymorpha is consistent with the tRNA analysis of RST sequences from Pichia sorbitophila [3], a close relative of H. polymorpha. One-third of the P. sorbitophila genome was found to contain only 23 nuclear tRNA genes. The estimated number for the complete P. sorbitophila genome (∼70) is thus comparably low.
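The estimate for P. sorbitophila is a linear extrapolation from the sequenced fraction of the genome, and the contrast between the two sequenced yeasts can be expressed as mean gene copies per tRNA species. The Python sketch below illustrates both calculations with the numbers quoted above; the helper name is hypothetical, and uniform distribution of tRNA genes across the genome is an assumption of the extrapolation.

```python
# Extrapolating a tRNA gene count from a partial genome sample
# (hypothetical helper; input values are those quoted in the text).


def extrapolate_count(count_in_sample, sampled_fraction):
    """Scale a count observed in a genome fraction to the whole genome.

    Assumes the genes are spread evenly over the genome.
    """
    if not 0 < sampled_fraction <= 1:
        raise ValueError("sampled_fraction must be in (0, 1]")
    return count_in_sample / sampled_fraction


# P. sorbitophila: 23 nuclear tRNA genes in about one-third of the genome.
p_sorbitophila_total = extrapolate_count(23, 1 / 3)

# Mean gene copies per distinct tRNA species (total genes / distinct species).
copies_per_species = {
    "H. polymorpha": 80 / 40,
    "S. cerevisiae": 278 / 41,
}

print(f"P. sorbitophila, extrapolated: ~{p_sorbitophila_total:.0f} tRNA genes")
for organism, copies in copies_per_species.items():
    print(f"{organism}: {copies:.1f} copies per tRNA species")
```

The extrapolation yields about 69 genes, in line with the estimate of ∼70, and the copy numbers show that the difference between the two yeasts lies in gene redundancy rather than in the repertoire of tRNA species.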

Table 4. Nuclear tRNA genes identified in the H. polymorpha genome

tRNA species Anticodon H. polymorpha S. cerevisiae 
tRNA-Ala AGC 11 
tRNA-Ala UGC 
tRNA-Arg ACG 
tRNA-Arg CCG 
tRNA-Arg CCU 
tRNA-Arg UCU 11 
tRNA-Asn GUU 10 
tRNA-Asp GUC 16 
tRNA-Cys GCA 
tRNA-Gln CUG 
tRNA-Gln UUG 
tRNA-Glu CUC 
tRNA-Glu UUC 
tRNA-Gly CCC 
tRNA-Gly GCC 16 
tRNA-Gly UCC 
tRNA-His GUG 
tRNA-Ile AAU 13 
tRNA-Ile UAU 
tRNA-Leu AAG 13 
tRNA-Leu CAA 10 
tRNA-Leu CAG 
tRNA-Leu GAG 
tRNA-Leu UAA 
tRNA-Leu UAG 
tRNA-Lys CUU 14 
tRNA-Lys UUU 
tRNA-Met CAU 10 
tRNA-Phe GAA 10 
tRNA-Pro AGG 
tRNA-Pro UGG 10 
tRNA-Ser AGA 11 
tRNA-Ser CGA 
tRNA-Ser GCU 
tRNA-Ser UGA 
tRNA-Thr AGU 11 
tRNA-Thr CGU 
tRNA-Thr UGU 
tRNA-Trp CCA 
tRNA-Tyr GUA 
tRNA-Val AAC 14 
tRNA-Val CAC 
Total  80 278 
Different tRNAs  40 41 

Relevant genes of the mating system and the pheromone signal transduction pathway are listed in Table 5. The data indicate that H. polymorpha contains several genes attributed to the regulation of mating, such as STE3, STE6, GPA1, STE18, CDC42, STE50 and STE11. These findings suggest that a conserved mitogen-activated protein kinase pathway might regulate mating in H. polymorpha. In addition, H. polymorpha contains a gene that corresponds to the mating-type regulatory protein gene at the HMR locus of Kluyveromyces lactis (HMRa1). Cryptic mating-type loci such as HMRa1 in S. cerevisiae and K. lactis act as reservoirs of mating-type information during mating-type switching in homothallic yeast strains. The function of this homologue in H. polymorpha remains unknown.

Table 5. Mating-specific genes in H. polymorpha

Hp ORF   Length (aa)  BLAST hit    Hit length (aa)  BLASTP score  Function
BJ_37    215          Kl_YCR097w   126              154           mating-type regulatory protein, silenced copy at HMR locus
BO_26    433          Sc_STE3      470              509           pheromone a-factor receptor
CA_130   1227         Sc_STE6      1290             1167          ATP-binding cassette transporter protein
BI_65    700          Sc_STE11     738              690           pheromone response
AG_50    398          Sc_STE50     364              223           pheromone response
AN_362   127          Sc_STE18     110              126           G protein γ subunit
AY_145   295          Sc_GPA1      472              130           G protein α subunit
AL_42    197          Sc_CDC42     192              248           G protein
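
As a rough illustration of how the entries in Table 5 might be screened, the following Python sketch compares the length of each H. polymorpha ORF with that of its BLAST hit and flags pairs whose lengths differ substantially. The values are transcribed from Table 5, reading the second length column as the length of the BLAST hit. The function names and the 0.8 ratio cut-off are assumptions made for this example only; they are not part of the annotation procedure used in this study.

```python
# Rows transcribed from Table 5:
# (Hp ORF, Hp length in aa, BLAST hit, hit length in aa, BLASTP score)
TABLE5 = [
    ("BJ_37", 215, "Kl_YCR097w", 126, 154),
    ("BO_26", 433, "Sc_STE3", 470, 509),
    ("CA_130", 1227, "Sc_STE6", 1290, 1167),
    ("BI_65", 700, "Sc_STE11", 738, 690),
    ("AG_50", 398, "Sc_STE50", 364, 223),
    ("AN_362", 127, "Sc_STE18", 110, 126),
    ("AY_145", 295, "Sc_GPA1", 472, 130),
    ("AL_42", 197, "Sc_CDC42", 192, 248),
]


def length_ratio(hp_len, hit_len):
    """Return shorter length divided by longer length (1.0 means equal)."""
    return min(hp_len, hit_len) / max(hp_len, hit_len)


def flag_divergent(rows, threshold=0.8):
    """List ORFs whose length ratio to the BLAST hit is below the threshold.

    The default threshold of 0.8 is an arbitrary, illustrative choice.
    """
    return [
        orf
        for orf, hp_len, _hit, hit_len, _score in rows
        if length_ratio(hp_len, hit_len) < threshold
    ]


if __name__ == "__main__":
    for orf, hp_len, hit, hit_len, score in TABLE5:
        ratio = length_ratio(hp_len, hit_len)
        print(f"{orf:8s} {hit:12s} ratio={ratio:.2f} score={score}")
    print("Flagged:", flag_divergent(TABLE5))
```

With this illustrative cut-off, BJ_37 (215 vs. 126 residues) and AY_145 (295 vs. 472 residues) are flagged. A length difference alone does not argue against homology; it only marks pairs where the alignment may cover just part of one protein and a closer look could be worthwhile.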

Acknowledgments

Erika Wedler, Kathleen Balke, Nicole Lokmer, and Dörte Möstl are acknowledged for their excellent technical work during the entire DNA sequencing phase of the project.

References

[1] Joseph, R. (1999) Yeasts: production and commercial uses. In: Encyclopedia of Food Microbiology, Vol. 3 (Robinson, R.K., Batt, C.A. and Patel, P.D., Eds.), pp. 2335–2341. Academic Press, San Diego, CA.
[2] Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J.D., Jacq, C., Johnston, M., Louis, E.J., Mewes, H.W., Murakami, Y., Philippsen, P., Tettelin, H. and Oliver, S.G. (1996) Life with 6000 genes. Science 274, 563–567.
[3] Feldmann, H. (Ed.) (2000) Génolevures. Genomic exploration of the hemiascomycetous yeasts. FEBS Lett. 487, 1–150.
[4] Gellissen, G. (2000) Heterologous protein production in methylotrophic yeasts. Appl. Microbiol. Biotechnol. 54, 741–750.
[5] Gellissen, G. (Ed.) (2002) Hansenula polymorpha - Biology and Applications. Wiley-VCH, Weinheim.
[6] Barnett, J.A., Payne, R.W. and Yarrow, D. (2000) Yeasts: Characteristics and Identification, 3rd edn. Cambridge University Press, Cambridge.
[7] Middelhoven, W.J. (2002) History, habitat, variability, nomenclature and phylogenetic position of Hansenula polymorpha. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 1–7. Wiley-VCH, Weinheim.
[8] Marri, L., Rossolini, G.M. and Satta, G. (1993) Chromosome polymorphism among strains of Hansenula polymorpha. Appl. Environ. Microbiol. 59, 939–941.
[9] Lahtchev, K. (2002) Basic genetics of Hansenula polymorpha. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 8–20. Wiley-VCH, Weinheim.
[10] Morais, J.O.F. and Maia, M.H.D. (1959) Estudos de microorganismos encontrados em leitos de despéjos de caldas de destilarias de Pernambuco. II. Uma nova espécie de Hansenula, H. polymorpha. Anais de Escola Superior de Química, Universidade do Recife 1, 15–20.
[11] Roggenkamp, R., Hansen, H., Eckart, M., Janowicz, Z. and Hollenberg, C.P. (1986) Transformation of the methylotrophic yeast Hansenula polymorpha by autonomous replication and integration vectors. Mol. Gen. Genet. 202, 302–308.
[12] Suckow, M. and Gellissen, G. (2002) The expression platform based on H. polymorpha strain RB11 and its derivatives - history, status and perspectives. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 105–123. Wiley-VCH, Weinheim.
[13] Mayer, A.F., Hellmuth, K., Schlieker, H., Lopez-Ulibarri, R., Oertel, S., Dahlems, U., Strasser, A.W.M. and van Loon, A.P.G.M. (1999) An expression system matures: a highly efficient and cost-effective process for phytase production by recombinant strains of Hansenula polymorpha. Biotechnol. Bioeng. 63, 373–381.
[14] Papendieck, A., Dahlems, U. and Gellissen, G. (2002) Technical enzyme production and whole-cell biocatalysis: application of Hansenula polymorpha. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 255–271. Wiley-VCH, Weinheim.
[15] Avgerinos, G.C., Turner, B.G., Gorelick, K.J., Papendieck, A., Weydemann, U. and Gellissen, G. (2001) Production and clinical development of a Hansenula polymorpha-derived PEGylated hirudin. Sem. Thromb. Hemostas. 27, 357–371.
[16] Barnes, C.S., Krafft, B., Frech, M., Hofmann, U.R., Papendieck, A., Dahlems, U., Gellissen, G. and Hoylaerts, M.F. (2001) Production and characterization of saratin, an inhibitor of von Willebrand's factor-dependent platelet adhesion to collagen. Sem. Thromb. Hemostas. 27, 337–347.
[17] Bartelsen, O., Barnes, C.S. and Gellissen, G. (2002) Production of anticoagulants in Hansenula polymorpha. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 211–228. Wiley-VCH, Weinheim.
[18] Janowicz, Z.A., Melber, K., Merckelbach, A., Jacobs, E., Harford, N., Comberbach, M. and Hollenberg, C.P. (1991) Simultaneous expression of the S and L surface antigens of hepatitis B and formation of mixed particles in the methylotrophic yeast, Hansenula polymorpha. Yeast 7, 431–433.
[19] Schaefer, S., Piontek, M., Ahn, S.-J., Papendieck, A., Janowicz, Z.A. and Gellissen, G. (2001) Recombinant hepatitis B vaccines - characterization of the viral disease and vaccine production in the methylotrophic yeast, Hansenula polymorpha. In: Novel Therapeutic Proteins - Selected Case Studies (Dembowsky, K. and Stadler, P., Eds.), pp. 245–274. Wiley-VCH, Weinheim.
[20] Schaefer, S., Piontek, M., Ahn, S.-J., Papendieck, A., Janowicz, Z.A., Timmermans, I. and Gellissen, G. (2002) Recombinant hepatitis B vaccines - disease characterization and vaccine production. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 175–210. Wiley-VCH, Weinheim.
[21] Van der Klei, I.J. and Veenhuis, M. (2002) Hansenula polymorpha: a versatile model organism in peroxisome research. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 76–94. Wiley-VCH, Weinheim.
[22] Siverio, J.M. (2002) Biochemistry and genetics of nitrate assimilation. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 21–40. Wiley-VCH, Weinheim.
[23] Waschk, D., Klabunde, J., Suckow, M. and Hollenberg, C.P. (2002) Characteristics of the Hansenula polymorpha genome. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 95–104. Wiley-VCH, Weinheim.
[24] Osoegawa, K., de Jong, P.J., Frengen, E. and Ioannou, P.A. (1999) Construction of bacterial artificial chromosome (BAC/PAC) libraries. Current Protocols in Human Genetics, 5.15.1–5.15.33.
[25] Osoegawa, K., Woon, P.Y., Zhao, B., Frengen, E., Tateno, M., Catanese, J.J. and de Jong, P.J. (1998) An improved approach for construction of bacterial artificial chromosome libraries. Genomics 52, 1–8.
[26] Sambrook, J., Fritsch, E.F. and Maniatis, T. (1989) Molecular Cloning, A Laboratory Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
[27] Ewing, B., Hillier, L., Wendl, M.C. and Green, P. (1998) Base-calling of automated sequencer traces using phred. Genome Res. 8, 175–194.
[28] Frishman, D., Albermann, K., Hani, J., Heumann, K., Metanomski, A., Zollner, A. and Mewes, H.W. (2001) Functional and structural genomics using PEDANT. Bioinformatics 17, 44–57.
[29] Birney, E. and Durbin, R. (2000) Using GeneWise in the Drosophila annotation experiment. Genome Res. 10, 547–548.
[30] Mewes, H.W., Frishman, D., Güldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morgenstern, B., Munsterkotter, M., Rudd, S. and Weil, B. (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 30, 31–34.