The entire genome of the bacterium Mycoplasma pneumoniae M129 has been sequenced. It has a size of 816 394 base pairs with an average G+C content of 40.0 mol%. We predict 677 open reading frames (ORFs) and 39 genes coding for various RNA species. Of the predicted ORFs, 75.9% showed significant similarity to genes/proteins of other organisms while only 9.9% did not reveal any significant similarity to gene sequences in databases. This permitted us tentatively to assign a functional classification to a large number of ORFs and to deduce the biochemical and physiological properties of this bacterium. The reduction of the genome size of M.pneumoniae during its reductive evolution from ancestral bacteria can be explained by the loss of complete anabolic (e.g. no amino acid synthesis) and metabolic pathways. Therefore, M.pneumoniae depends in nature on an obligate parasitic lifestyle which requires the provision of exogenous essential metabolites. All the major classes of cellular processes and metabolic pathways are briefly described. For a number of activities/functions present in M.pneumoniae according to experimental evidence, the corresponding genes could not be identified by similarity search. For instance, we failed to identify genes/proteins involved in motility, chemotaxis and management of oxidative stress.
The bacterium Mycoplasma pneumoniae has a genome size of ∼800 kb and completely lacks a cell wall. The bacterium is surrounded by a cytoplasmic membrane only, which contains cholesterol as an indispensable component. Mycoplasma pneumoniae is a human pathogen, causing ‘atypical pneumonia’ ( 1 ) usually in older children and young adults. As a surface parasite, it attaches to the host's respiratory epithelium by means of a differentiated terminal structure termed attachment organelle or tip structure. For a long time, research activities mainly focused on pathogenicity-related topics such as studies on cytadherence ( 2 ), vaccination and diagnosis ( 3 ). Mycoplasma pneumoniae was not considered as an organism suitable for basic studies partly because of its fastidious growth requirements and partly because of the lack of established standard genetic tools like conjugation or transformation with self-replicating vectors ( 4 ). These disadvantages can be compensated now to a large extent by the methods of molecular biology.
Morowitz pointed out in 1984 that mycoplasmas would be suitable candidates for defining the genetic constitution of a minimal self-replicating cell ( 5 ). The advantage of these bacteria for such studies ( 6 , 7 ), mainly due to their small genome size, was so obvious that several initiatives were started to sequence five different mycoplasma genomes: Mycoplasma genitalium ( 8 , 9 ), M.pneumoniae ( 10 ), Mycoplasma capricolum ( 11 ), Mycoplasma mycoides ( 12 ) and a species from the related genus Ureaplasma, Ureaplasma urealyticum ( 13 ). So far, only the complete sequence of the M.genitalium genome has been published ( 9 ) which, with 580 070 bp, is the smallest bacterial genome known so far. In the genus Mycoplasma, M.pneumoniae and M.genitalium are the closest related species. We report in this publication the complete nucleotide sequence of the genome of M.pneumoniae, which thus provides information on a second small bacterial genome. All M.pneumoniae genes which had been already sequenced were reanalyzed except for the P1 operon ( 14 ). Our sequencing strategy, early results and a detailed description of M.pneumoniae as an experimental system have been recently published ( 10 ).
Materials And Methods
The strain Mycoplasma pneumoniae M129 (ATCC 29342) in the 18th broth passage was used to construct an ordered cosmid library containing the complete genome ( 15 ). This cosmid library was the basis for the DNA sequence analysis. We selected this specific bacterial strain because it has been used in cytadherence and pathogenicity studies ( 2 , 16 , 17 ). The strain in the 20th broth passage was still infectious in hamsters (H. Brunner, unpublished data).
Using the enzymatic dideoxy chain-termination method ( 18 ), the sequence data for this study were exclusively generated on a fluorescent-based sequence-gel reader (Model 373A, Applied Biosystems). Sequencing strategies and methods were as described in Hilbert et al. ( 10 ).
Computer assisted analysis
Sequence assembly, map drawing and multiple alignments were done with the Lasergene program package (DNA STAR).
Other analyses were performed with the HUSAR (Heidelberg Unix Sequence Analysis Resources) program package release 4.0 at the German Cancer Research Center, Heidelberg, Germany. This package is based on the GCG program package version Unix-8.1 of the Genetics Computer Group, Wisconsin. For searching the DNA and protein databases [SWISS-PROT ( 19 ) and PIR ( 20 )] the FASTA ( 21 ) and BLAST ( 22 ) programs (BLASTX, BLASTN and BLASTP) were used. Conserved motifs in proteins and peptides were identified by using the program PROSITE ( 23 ). Open reading frames (ORFs) were calculated by the program FRAMES allowing AUG (or GUG, UUG) as start codons using the Mycoplasma translation table where UGA codes for tryptophan ( 24 ). The G+C content was calculated by the program WINDOW. Codon usage was calculated with the program CODONFREQUENCY.
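The ORF calculation described above can be illustrated with a minimal sketch. It is not the FRAMES program: the function name, the coordinate convention and the default cut-off are assumptions made for illustration only. It scans one strand in three frames, accepts ATG, GTG and TTG as start codons and, because UGA codes for tryptophan in the Mycoplasma translation table, treats only TAA and TAG as stop codons.

```python
# Illustrative sketch only; find_orfs and its defaults are hypothetical names and values.
START_CODONS = {"ATG", "GTG", "TTG"}  # DNA equivalents of AUG, GUG, UUG
STOP_CODONS = {"TAA", "TAG"}          # TGA is not a stop here: it codes for Trp


def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) tuples for ORFs on the given strand.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    For each stop codon only the most upstream start codon is reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length is counted in codons before the stop codon.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

A complete analysis would also scan the reverse complement and would need a rule for choosing among alternative start codons; both are left out of this sketch.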
The programs TopPred 1.1.1 (Manuel G. Carlos, Ecole Normale Superieure, Laboratoire de Genetique Moleculaire, Paris, France) and PSORT ( 25 ) ( http://psort.nibb.ac.jp/ ) were used for the prediction of transmembrane domains and the membrane topology of proteins.
Each ORF analysis is accessible as a File Maker Pro (Claris) database which can be accessed at our world wide web (www) site ( http://zmbh.uni-heidelberg.de/M_pneumoniae ). It contains, besides genome and cosmid position of each ORF/gene, data about expression, availability of antibodies, comments, literature, PROSITE patterns, amino acid composition and database search homology scores. All the annotations in this paper were done on the basis of the highest score values.
The complete M.pneumoniae sequence has been annotated in GenBank (NCBI) with the accession number U00089.
Results And Discussion
The strategy and methodology for sequencing the complete genome have been described by us recently ( 10 ). A total of 2 415 202 nucleotides of primary sequence data were provided by 6385 sequencing reactions. Each strand of the genome was completely sequenced at least once. The direct sequencing approach, combining primer walking with a limited shotgun strategy based on a complete cosmid and plasmid library, considerably facilitated the assembly of the individual sequences to the entire genome sequence. The average redundancy of the sequencing was 2.95 (calculated for both strands). This very low redundancy was achieved by the use of 5095 oligonucleotides.
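As a quick check on the figures in this paragraph, the redundancy and the mean yield per reaction can be recomputed from the reported totals. The variable names are illustrative; the three input numbers are taken from the text.

```python
# Inputs as reported in the text; variable names are illustrative.
GENOME_BP = 816_394      # size of the finished genome
RAW_BASES = 2_415_202    # nucleotides of primary sequence data
REACTIONS = 6_385        # number of sequencing reactions

redundancy = RAW_BASES / GENOME_BP        # roughly 2.96-fold, close to the reported 2.95
mean_read_length = RAW_BASES / REACTIONS  # roughly 378 nucleotides per reaction

print(f"redundancy: {redundancy:.2f}")
print(f"mean read length: {mean_read_length:.0f} nt")
```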
The complete M.pneumoniae genome has a size of 816 394 bp and a G+C content of 40.0 mol%. Altogether 677 open reading frames (ORFs) and 39 genes coding for various RNA species were predicted. All ORFs were sorted into categories according to their proposed functions ( Tables 1 and 2 ; Fig. 1 ). Only 333 ORFs (49.2%) were functionally assigned, based on significant sequence similarities to genes or proteins from other organisms with known functions (e.g. ribosomal proteins) or at least known categories of function (e.g. proteins involved in cytadherence). Significant similarities to proteins without known function from other bacteria, mostly M.genitalium, were shown for 181 proposed ORFs (26.7%). We also included in this group those M.pneumoniae proteins which were identified in protein extracts of M.pneumoniae by monospecific antibodies or by the N-terminal amino acid sequences of enriched proteins ( 26 , 27 ). The group of ORFs without significant similarity or without indication for their in vivo expression comprised 109 members (16.1%); 42 of them carry characteristic motifs, which are not sufficient for defining a function. Examples of such motifs are the leucine zipper (29 cases; referred to all predicted ORFs), the typical prokaryotic lipoprotein sequence pattern (46 cases) or ATP- and GTP-binding sites (73 cases). In addition, all predicted gene products were analyzed by programs for structure predictions, e.g. coiled-coil structures (29 cases) or transmembrane segments (275 cases). The latter result supports the analysis of cell fractionation experiments which indicate that the membrane fraction contains ∼50% of the total proteins estimated by SDS-PAGE. About 8% of the genome is composed of repetitive DNA elements RepMP1, RepMP2/3, RepMP4 and RepMP5, while only 67 of all predicted ORFs (9.9%) code for a product without any similarity to a known RNA or protein.
Finally, 58 gene families were defined comprising 298 proteins with at least two but frequently with more paralogs; these are proteins with similarities within the same species (see www pages).
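The percentages quoted for the ORF categories can be reproduced from the counts. The short sketch below uses the counts from the text; the dictionary keys are paraphrased labels, and the groups are listed as quoted without assuming that they partition the 677 ORFs.

```python
# Counts as quoted in the text; the labels are paraphrased for illustration.
TOTAL_ORFS = 677
categories = {
    "functionally assigned": 333,
    "similar to proteins of unknown function": 181,
    "no significant similarity or no evidence of expression": 109,
    "no similarity to any known RNA or protein": 67,
}

percentages = {name: round(100 * n / TOTAL_ORFS, 1) for name, n in categories.items()}

# ORFs with significant similarity to genes/proteins of other organisms.
with_similarity = round(100 * (333 + 181) / TOTAL_ORFS, 1)

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
print(f"with significant similarity: {with_similarity}%")
```

The first two groups together give the 75.9% quoted for ORFs with significant similarity to genes or proteins of other organisms.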
The proposed ORFs are not equally distributed over the genome. A lower coding density coincides with regions of lower or higher G+C content than the average. There are regions with a G+C content of up to 56 mol%. These regions code almost exclusively for the gene P1 and gene ORF6 of the P1 operon, the repetitive DNA sequences RepMP4, RepMP2/3, RepMP5 and tRNAs (for details see www pages).
The P1 protein, the main adhesin, is essential for adherence of M.pneumoniae to its host cell ( 28 ), and the ORF6 gene product, which is only found as cleavage products, namely a 40 and a 90 kDa protein, instead of the expected 130 kDa protein, is involved in an as yet unknown manner in cytadherence ( 14 ). Gene P1 contains a copy each of RepMP2/3 and RepMP4 and gene ORF6 one of RepMP5 ( 29 ). In addition, several copies of each of these repetitive DNA sequences can easily be recognized by their relatively high G+C content ( Fig. 2 ).
At the other extreme is the proposed origin of replication around nucleotide position 205 000 (pcosMPK05, dnaA region), with a G+C content of only 26 mol% ( 10 ).
Other regions with a low G+C content do not show a similar obvious coding pattern, but proposed ORFs coding for lipoproteins or the hsd modification/restriction system are frequently located in these regions.
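Regional differences in G+C content of the kind discussed above are usually located with a sliding window. The following sketch is not the WINDOW program; the function names and the default window and step sizes are assumptions chosen for illustration.

```python
def gc_content(seq):
    """Return the G+C content of a DNA sequence in percent."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=1000, step=500):
    """Return (start, gc_percent) pairs for overlapping windows along seq.

    Window and step sizes are illustrative defaults, not values from the text.
    """
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Windows well above the genome average of 40.0 mol% would point to candidates such as the P1 operon and the RepMP copies, and windows well below it to regions such as the proposed origin of replication.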
The total length of all coding regions is 724 174 bp. The average coding density of 88.7% was calculated for the M.pneumoniae genome which gives an average gene size of 1011 bp. Similar coding densities have also been estimated for the smaller M.genitalium genome ( 9 ) and for the genome of Haemophilus influenzae which is more than twice as large ( 30 ). The length of the proposed proteins in M.pneumoniae ranges from 37 (4.3 kDa) to 1882 (209.4 kDa) amino acids ( Fig. 3 ). One of the largest proteins is the cytadherence accessory protein HMW2 (F10_orf1818) and the smallest identified protein is the 37 amino acid ribosomal protein L36 (GT9_orf37). For practical reasons we introduced at the beginning of the sequence analysis a cut-off point of 100 amino acids for proposed proteins unless we found smaller proteins such as some of the ribosomal proteins during the initial BLASTX homology search. All intergenic or non-coding regions were reanalyzed with a cut-off point of 50 amino acids and searches were done for specific small proteins. However, we cannot exclude the possibility that some of the smaller proteins, not showing similarities to known proteins from other organisms, have been missed in our analysis.
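The coding density and average gene size follow directly from the totals given above. In the sketch below the variable names are illustrative, and dividing by the sum of ORFs and RNA genes (677 + 39) is an assumption that happens to reproduce the reported average of 1011 bp.

```python
# Totals as reported in the text; variable names are illustrative.
GENOME_BP = 816_394
CODING_BP = 724_174
N_ORFS = 677
N_RNA_GENES = 39

coding_density = 100 * CODING_BP / GENOME_BP           # about 88.7%
# Assumption: the average is taken over ORFs plus RNA genes (716 genes).
average_gene_size = CODING_BP / (N_ORFS + N_RNA_GENES)  # about 1011 bp

print(f"coding density: {coding_density:.1f}%")
print(f"average gene size: {average_gene_size:.0f} bp")
```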
The codon usage of M.pneumoniae is summarized in Table 3 . We compared it for all proposed genes, for the subsets of genes with a low G+C (content below 35 mol%) and high G+C content (between 50 and 56 mol%) and for all 50 ribosomal protein genes (42.8 mol%) as an example for frequently translated genes. Codon usage of the low and high G+C content subfractions is clearly influenced by the DNA composition, favouring either codons with G/C or A/T at the third position. The codon usage pattern differs also for the complete genome and for genes which are frequently expressed like the ones coding for ribosomal proteins.
The most frequently used codons are AUU (Ile, 4.6%); AAA (Lys, 4.6%); UUU (Phe, 4.3%); GAA (Glu, 4.2%) and UUA (Leu, 3.9%) and the most common amino acids are Leu (10.3%), Lys (8.5%), Ile (6.6%), Ala (6.6%) and Val (6.5%). The high value for Lys is in agreement with the relatively high percentage of proposed proteins with calculated isoelectric points between pH 9 and 12 ( Fig. 4 ). The least frequently used codons are UGC (Cys, 0.2%); CGA (Arg, 0.25%); AGG (Arg, 0.29%); AGA (Arg, 0.4%) and UGU (Cys, 0.55%).
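Codon frequencies of the kind reported here, and the G+C bias at the third codon position, can be computed along the following lines. This is a sketch and not the CODONFREQUENCY program; the function names are illustrative, and the input is assumed to be an in-frame coding sequence.

```python
from collections import Counter


def codon_usage(cds):
    """Return codon frequencies in percent for an in-frame coding sequence.

    Codons are reported in RNA notation (U instead of T), as in the text.
    A trailing incomplete codon is ignored.
    """
    cds = cds.upper().replace("T", "U")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return {}
    counts = Counter(codons)
    return {codon: 100.0 * n / len(codons) for codon, n in counts.items()}


def gc3(cds):
    """Return the G+C content (percent) at third codon positions."""
    thirds = cds.upper()[2::3]
    if not thirds:
        return 0.0
    return 100.0 * (thirds.count("G") + thirds.count("C")) / len(thirds)
```

Comparing `gc3` between the low and the high G+C gene subsets would make the compositional influence on the third codon position explicit.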
All M.pneumoniae gene products were classified ( Tables 1 and 2 ), with some minor modifications, in accordance with criteria introduced for Escherichia coli ( 31 ) and adapted for the classification of putative genes from H.influenzae. We added ‘cytadherence associated proteins’ to the category of cell envelope-surface structures, since evidence is mounting that M.pneumoniae possesses a cytoskeleton-like organization which stabilizes the bacterium and protects it against osmotic lysis ( ). The category of transport and binding proteins was altered by subdivision into three groups, namely into PTS, ABC and other transport systems. To facilitate the orientation on the gene map we added a list which contains all proposed ORFs and RNAs in numerical order ( Table 4 ).
More details on this very general analysis will be made public on the www ( http://www.zmbh.uni-heidelberg.de/M_pneumoniae ).
DNA replication and repair
The central enzyme for DNA replication in bacteria is the DNA polymerase III holoenzyme ( 32 ), which consists of 10 subunits in E.coli, a DNA polymerase subunit α and nine accessory proteins (θ, ε, τ, γ, δ, δ′, χ, ψ and β). Mycoplasma pneumoniae codes for two potential α subunits (the gene name in the literature is either dnaE or polC). Both proposed α subunits, A19_orf872 and B01_orf1443, differ in length and also in their degree of similarity to the α subunits from E.coli and Bacillus subtilis. The protein from B01_orf1443 shares the highest similarity with the α subunit from Gram-positive bacteria including the motif for a 3′–5′ exonuclease activity which is typical for these bacteria. In contrast, the ORF A19_orf872 is most similar to the α subunit from E.coli and does not contain a 3′–5′ exonuclease domain. The 3′–5′ exonuclease activity in E.coli is encoded by a separate gene (dnaQ), which has not been found in M.pneumoniae. Of the other subunits which build the DNA polymerase III holoenzyme in E.coli ( 32 ) only the subunits β (dnaN), δ′ (holB), γ and τ (dnaX) are present in M.pneumoniae, indicating a simplified replication complex compared with the Gram-negative bacteria E.coli and H.influenzae. Presently, it cannot be excluded that other proteins replace these subunits in M.pneumoniae. A true comparison with a phylogenetically closer related Gram-positive bacterium like B.subtilis is not possible since the Bacillus DNA polymerase III holoenzyme complex has not been defined as yet and the nucleotide sequence of the entire B.subtilis genome has not been completed.
Mycoplasma pneumoniae does not code for a DNA polymerase I (polA)-like DNA repair enzyme. Instead, we find a truncated polA gene (A19_orf291) comprising only the 5′–3′ exonuclease domain, whereas in E.coli and B.subtilis the polA gene is much larger and codes for the 5′–3′ exonuclease and a 5′–3′ polymerase-specific domain.
Experimental results on DNA polymerase enzymatic activities in mycoplasmas are confusing. It was claimed that the DNA polymerase III of Mollicutes lacks the 3′–5′ exonuclease proofreading activity in general ( 33 ) and this was taken as an explanation for the observed genetic instability of many Mollicutes species ( 4 ). Recently, the nucleotide sequence of the polC gene of Mycoplasma pulmonis and experimental results on enzyme purification and characterization of enzyme activities were published ( 34 ). The results indicated that the polC gene from M.pulmonis also codes for a 3′–5′ exonuclease, and that the size of the predicted PolC protein, 1435 amino acids, is very similar to the PolC homolog B01_orf1443 in M.pneumoniae and that the polymerase could be inhibited by compounds specific for PolC proteins of Gram-positive bacteria. Furthermore, the authors provided some experimental evidence for a second, smaller enzyme with DNA polymerase activity. Considering the characterization data of DNA polymerase activities in M.pulmonis and the nucleotide sequence data on DNA polymerase genes of M.pneumoniae and M.genitalium ( 9 , 35 ), one can conclude that at least these three Mycoplasma species have two DNA polymerase (polC) genes coding for a larger protein (≈1400 amino acids) with a 3′–5′ exonuclease activity and with the highest sequence similarities to the Gram-positive B.subtilis polymerase III. Therefore it is unlikely that an increased mutation frequency is caused by the DNA replication process. The nucleotide sequence of the smaller Pol III homolog (≈100 kDa) of M.pneumoniae and M.genitalium ( 9 , 35 ) resembles more the polC gene from the Gram-negative E.coli. This is also emphasized by the absence of the 3′–5′ exonuclease domain in the proposed genes.
The gene for the smaller, Gram-negative typical PolC has not yet been found in M.pulmonis, but during the purification of the larger PolC, a second polymerase activity lacking exonuclease activity has been identified. The function of the exonuclease negative DNA polymerase can only be elucidated experimentally and it remains to be seen if it can substitute for the function of the polymerase I (PolA) in combination with the proposed 5′–3′ exonuclease of the truncated polA gene (A19_orf291). This topic has been also discussed for M.genitalium ( 35 ).
In addition to the DNA polymerase many more gene products are necessary for DNA replication, e.g. initiation, elongation and termination ( 32 ). The most obvious functions missing in M.pneumoniae according to the sequence analysis are an RNaseH for primer removal and a protein for the termination of replication.
The number of genes involved in DNA repair is considerably smaller in M.pneumoniae than in the ‘standard’ eubacteria E.coli and B.subtilis or even H.influenzae with the smaller genome.
Mycoplasma pneumoniae codes only for 13 of the genes known to be involved in excision repair of DNA, recombination and SOS repair. Thus the genes recB, recC, recD, recG and ruvC involved in recombination are missing as well as the genes recN, recO, recQ and recR involved in SOS repair in E.coli . Nevertheless, a rudimentary stock of enzymes has been conserved in M.pneumoniae to permit homologous recombination [RecA, Ssb, PolA (see above), GyrA, GyrB, RuvA and RuvB] ( 36 ), excision repair ( 37 ) and a kind of truncated SOS repair ( 38 ). In particular missing is the lexA gene which plays a central role in regulating the SOS response including the expression of the recA gene in other bacteria.
We were also unable to find components of the so-called mismatch-repair system encoded by the mutS, mutL and mutH genes. Since bacteria which normally carry the mut genes show a reduced genetic stability if these genes are mutated, it seems likely that the absence of these genes in mycoplasmas causes an increased mutation rate ( 65 ).
The DNA-dependent RNA polymerase of M.pneumoniae is coded by the conserved genes rpoA (α subunit), rpoB (β subunit), rpoC (β′ subunit) and rpoE (δ′ subunit). The only sigma factor found (H91_orf499) shares the highest similarity with the sigma factor SigA from B.subtilis ( 39 ). Presently, not enough experimental data are available for defining promoter sequences in M.pneumoniae. The promoters of only three genes/operons have been determined experimentally by primer extension. These genes are the P1 operon ( 14 ), the ribosomal RNA operon ( 40 ) and F10_orf405 ( 27 ). The -10 region and to a lesser extent the -35 region of these three examples are comparable with consensus promoter sequences in B.subtilis ( 41 ). Termination of transcription seems to be independent of the termination factor Rho, since the corresponding gene could not be found. Transcription stops on typical terminator sequences which are short interrupted palindromic regions followed by a run of U residues. The Nus transcription termination factors, of which NusA (E07_orf540) and NusG (D09_orf320) are present, may play a role in the termination of transcription. NusB and NusC are absent. NusA is involved in termination and NusG in antitermination in other bacteria. Finally, GreA promotes elongation by the RNA polymerase by utilizing a novel transcript-cleavage reaction ( 42 ).
Gene expression and regulation
Regulation of gene expression in M.pneumoniae has not been studied so far. Therefore we do not know how this bacterium coordinates the synthesis of those gene products which are essential for reproduction. Also, M.pneumoniae has to sense and respond to environmental changes. This requires a signal transduction system. The presence of only one sigma factor (sigA, H91_orf499) which is also the only one of all proposed proteins showing the characteristic helix-turn-helix (HTH) motif, suggests that the response to external stimuli is not controlled by the level of expression of alternative sigma factors.
The presence of a cis-acting conserved palindromic repeated sequence in front of four heat shock genes, similar to the ‘CIRCE’ element first identified in B.subtilis ( 43 ), and the identification of the proposed repressor (C09_orf351, hrcA), indicate that the heat shock response in M.pneumoniae is regulated by the interaction of this repressor with the CIRCE element, and provide an example for a negative regulation of gene expression in M.pneumoniae.
The two-component signal transduction system ( 44 ), consisting of a sensor and a response regulator, which has been found in many prokaryotic and eukaryotic organisms is believed to be essential for all cells. Nevertheless, based on sequence similarity we were unable to detect any such system in M.pneumoniae .
Concerning other proteins with regulatory functions we identified several GTP-binding proteins and other proteins like the virulence associated protein vacB (K04_orf726). These regulatory proteins act by unknown mechanisms.
The translation machinery of M.pneumoniae is rather extensive. About 15% of all proposed ORFs are involved in translation, including 19 tRNA synthetases, 50 ribosomal proteins, various factors and enzymes, 33 tRNAs, one ribosomal RNA operon with one copy of each 5S, 16S and 23S rRNA ( 45 ), and a gene coding for the 10Sa RNA. The conservation of the 10Sa RNA, which functions as tRNA and mRNA and is implicated in trans-translation ( 66 ), is interesting in evolutionary terms. Three exceptions are noteworthy: the lack of the ribosomal protein S1, of the peptide chain release factor 2 (RF2) and of the glutaminyl-tRNA synthetase. So far, quite a number of Gram-positive bacteria including Bacillus or Lactobacillus species also lack the S1 protein and the glutaminyl-tRNA synthetase ( 46 ).
One of the functions of the S1 protein is to bind the mRNA to the 30S small ribosomal subunit. Therefore, it was argued that ribosomal binding sites in front of many genes ( 47 ) of B.subtilis compensate for the missing S1 protein. The Shine-Dalgarno sequences are so well conserved, that they could be used routinely as a good indicator for proposing ORFs in the B.subtilis genome sequencing projects, but this does not apply to M.pneumoniae . The Shine-Dalgarno sequence is in many instances not well conserved or missing altogether, even in genes for which we know the translational initiation sites from independent studies.
Of the 20 standard tRNA synthetases, the glutaminyl-tRNA synthetase is the only one not detected in M.pneumoniae. Studies on tRNA synthetases in Gram-positive bacteria have indicated that this enzyme is dispensable. Bacillus subtilis solves this problem by charging the tRNA Gln first with glutamate which is subsequently converted to glutamine by an amido transferase. The glutamyl-tRNA synthetase aminoacylates both tRNA Glu and tRNA Gln. The corresponding amido transferase has not yet been identified in M.pneumoniae; therefore it is still an open question as to how glutamine is bound to its tRNA.
Finally, the modified codon usage by M.pneumoniae, reading UGA as tryptophan instead of a stop codon, requires the absence of the peptide chain release factor 2 (RF2) and the presence of the release factor 1 (RF1). The latter recognizes the stop codons UAG and UAA and RF2 the stop codons UGA and UAA. Since the UGA codon is frequently located within a gene it is essential to exclude RF2 to prevent the premature termination of proteins.
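The consequence of reading UGA as tryptophan can be made concrete by translating the same sequence with two genetic codes. The sketch below builds both tables from the standard compact encoding (bases in the order T, C, A, G); the function and variable names and the example sequence are illustrative.

```python
from itertools import product

BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

# Amino acids for codons in TCAG order; "*" marks a stop codon.
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = dict(zip(CODONS, STANDARD_AA))

# Mycoplasma code: identical to the standard code except TGA (UGA) -> Trp.
MYCOPLASMA_CODE = dict(STANDARD_CODE, TGA="W")


def translate(cds, code):
    """Translate an in-frame DNA sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = code[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


example = "ATGTGAAAATAA"  # hypothetical gene fragment with an internal TGA
print(translate(example, STANDARD_CODE))    # M   (terminated at TGA)
print(translate(example, MYCOPLASMA_CODE))  # MWK (TGA read as Trp)
```

Under the standard code the product is truncated at the internal TGA, which is the premature termination that the exclusion of RF2 avoids.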
Surface structure, cytadherence-associated proteins and cell division
This category comprises the adhesins and the cytadherence associated proteins, including the components of the cytoskeleton-like structure, the function of which is probably to stabilize and maintain the shape of the wall-less mycoplasma, to direct proteins to certain regions in the membrane and to keep them in these positions ( 2 ). Adherence to the receptor(s) of the host cell depends on the tip structure. The correct assembly of the adhesin P1 (E07_orf1627) and the 30 kDa adhesin-related protein on the tip structure (H08_orf274) is necessary for attachment. The tip structure is an interesting example for bacterial cellular asymmetry ( 48 ).
The cytadherence-associated proteins were originally defined by hemadsorption-negative mutants which had lost certain proteins like the so called high molecular weight proteins HMW1, HMW2 and HMW3, the adhesin P1 and the proteins named A, B and C ( 2 , 28 ). B and C are most probably the gene products of the ORF6 gene of the P1 operon (40 kDa protein = C, 90 kDa protein = B). The gene for A is still unknown. Another criterion for a putative protein of the cytoskeleton-like structure is its partitioning into the Triton X-100 insoluble fraction after treating M.pneumoniae with this detergent. This fraction is ill defined and comprises ∼50 proteins, of which only a subfraction is associated with the cytoskeleton and/or cytadherence. The following proteins have been identified as most likely components of a cytoskeleton ( 2 ): HMW1 (H08_orf1018), HMW2 (F10_orf1818; Krause, submitted), HMW3 (H08_orf672), P200 (D02_orf1036o) ( 49 ), P65 (F10_orf405) ( 27 ). These proteins, with the exception of HMW2, share some common peculiar features, like an extended acidic proline rich domain and an abnormal migration in SDS-PAGE ( 49 ). The adhesin P1 is mainly distributed in the membrane fraction and to a lesser extent in the Triton X-100 insoluble fraction ( 50 ).
A large number of proposed ORFs contain sequences with high similarities to subregions of either the P1 protein or the ORF6 gene product of the P1 operon. The coding DNA sequences correspond to the repetitive DNA sequences RepMP2/3 (P1), RepMP4 (P1) and RepMP5 (ORF6). Preliminary experiments indicate that the proposed ORFs are not expressed under standard laboratory conditions. It has been observed that another independent isolate of M.pneumoniae, the strain FH, carries a different copy of RepMP2/3, RepMP4 and RepMP5 in its P1 operon than the M.pneumoniae strain M129 which is the subject of this paper ( 51 , 52 ). All experimental data so far show that only the repetitive sequences which are part of the P1 operon are expressed. The exchange of these copies presumably takes place by gene conversion as was indicated by DNA sequence analysis of the corresponding RepMP5 sequences in M.pneumoniae strains M129 and FH. The situation is different for RepMP1, copies of which seem to be part of several expressed proteins. RepMP1-specific antibodies recognize several proteins on western blots of M.pneumoniae protein extracts ( 26 ).
Little is known about cell division in M.pneumoniae. The lack of mutants, especially of conditional mutants, has prevented a detailed analysis. So far, the two proteins FtsZ and FtsH are classified as cell division proteins in analogy to their function in other bacteria ( 53 ). Other genes involved in chromosome partitioning or septum formation have not been identified in M.pneumoniae. Interesting problems to study might include the possible interaction of FtsZ with components of the cytoskeleton-like structure, which seems to play a key role in cell division, or the effects of cellular asymmetry on cell division and the formation of daughter cells. Other genes known to be involved in cell division in E.coli, the muk and min genes or additional fts genes, were not found in M.pneumoniae ( 53 ).
Altogether 46 proteins were identified as lipoproteins based on the following characteristic lipoprotein-specific features ( 54 ): (i) one or more basic amino acids among the first 5–7 amino acids of the N-terminus, (ii) a hydrophobic signal peptide and (iii) a cysteine residue immediately downstream of the signal peptide, which is available for modification by the transfer of the diacylglyceryl moiety from glycerophospholipid to its sulfhydryl group. The precursor prolipoprotein with the modified cysteine is subsequently cleaved in M.pneumoniae by a specific signal peptidase (signal peptidase II). The modified cysteine will then be the first amino acid of the processed protein. The cleavage site including the cysteine and the three (positions -3, -2 and -1) upstream located amino acids, is to some extent conserved (-3: 37×L, 6×F, 1×A, 1×V; -2: 19×S, 10×A, 8×T, 6×V, 2×I; -1: 37×A, 7×S, 1×G).
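The three lipoprotein-specific features listed above can be turned into a rough screening heuristic. The sketch below is an assumption-laden illustration, not the procedure used for the annotation: the function name, the choice of K and R as basic residues, the hydrophobic residue set and all thresholds are hypothetical.

```python
BASIC = set("KR")              # assumption: Lys and Arg count as basic residues
HYDROPHOBIC = set("AVLIMFW")   # assumption: a simple hydrophobic residue set


def looks_like_lipoprotein(protein, n_region=7, max_signal=40,
                           min_hydrophobic_fraction=0.6):
    """Heuristic test for an N-terminal lipoprotein signal (illustrative only).

    (i)   at least one basic residue within the first n_region residues,
    (ii)  a mostly hydrophobic stretch following the N-terminal region,
    (iii) a cysteine directly after that stretch, within max_signal residues.
    """
    protein = protein.upper()
    if not any(aa in BASIC for aa in protein[:n_region]):
        return False
    for pos in range(n_region + 1, min(len(protein), max_signal)):
        if protein[pos] != "C":
            continue
        h_region = protein[n_region:pos]
        fraction = sum(aa in HYDROPHOBIC for aa in h_region) / len(h_region)
        if fraction >= min_hydrophobic_fraction:
            return True
    return False
```

A stricter screen could additionally require the partly conserved residues at positions -3 to -1 reported above (for example L at -3 and A or S at -1), at the cost of missing lipoproteins with less typical cleavage sites.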
The number of lipoproteins in M.pneumoniae is relatively high compared with the Gram-negative bacteria E.coli and H.influenzae . Even in the closely related M.genitalium only 21 putative lipoproteins could be found by analyses of the published data ( 9 ).
The lipoproteins of M.pneumoniae can be divided into six subgroups based on sequence similarities; also included in these groups are proteins with similarities to lipoproteins but without the lipoprotein signature at the N-terminal end. Quite a number of these proposed genes with high similarities are organized in tandem. For instance seven lipoproteins and one protein without the lipobox but with otherwise extended similarities are located between genome positions 249 627 and 256 463 (cosmid pcosMPE09). A gene family, with 13 proposed ORFs including five lipoproteins, is located between 306 862 and 320 524 (cosmid pcosMPD02). Presently it is unclear whether all of the proposed genes are expressed.
In vivo labelling of M.pneumoniae with 14C-labelled palmitic acid and protein analysis by SDS-PAGE reveal, instead of the expected 46 lipoproteins, only between 20 and 25 lipoproteins (Pyrowolakis, unpublished data). This discrepancy could be explained by a regulated expression which only allows some of the several tandemly organized lipoproteins to be synthesized, by labelling with palmitic acid that was not sensitive enough, or by some lipoproteins carrying fatty acids other than palmitic acid. Only four of all the proposed lipoproteins show significant similarities to other bacterial genes besides the ones from M.genitalium. These include A05_orf380V [high affinity transport system P37 with unknown specificity from Mycoplasma hyorhinis ( 55 )], D09_orf384 (aerobic glycerol-3-phosphate dehydrogenase, glpD), H03_orf213 (uridine kinase) and D02_orf207 [ATP synthase b subunit (atpF)].
The processing of the prolipoprotein to the mature lipoprotein in E.coli requires the three enzymes prolipoprotein diacylglyceryl transferase, prolipoprotein signal peptidase and apolipoprotein transacylase. We find in M.pneumoniae only the transferase which catalyzes the thioether linkage between the diacylglycerol and the cysteine and the peptidase which cleaves in front of the cysteine following the signal peptide. The transacylase could not be identified in either M.pneumoniae or M.genitalium ( 9 ). Therefore it is still an open question whether a third fatty acid is linked to the cysteine by an amide bond as has been found for lipoproteins of E.coli.
The absence of a periplasmic space provides a reason for the existence of a large number of lipoproteins. For surface-exposed proteins which have to function on the outside, anchoring them to the M.pneumoniae cell membrane via long chain fatty acids is an efficient solution. Already known examples are substrate-binding proteins of transport systems or proteins possibly involved in antigenic variation for evasion of the immune system of the host, as has been shown for other mycoplasmas ( 56 ). Nothing is known about the fate of the cleaved signal peptides, i.e. whether they are degraded or recycled.
In light of the scarcity of metabolic pathways and the marked dependence on exogenous nutrients ( Table 1 , Fig. 5 ), we expected M.pneumoniae to code for many transport systems to compensate for its inability to synthesize essential compounds like amino acids. Three different transport systems, mainly involved in import, were found in M.pneumoniae: (i) the ABC transporter system ( 57 ) consisting of two ATP-binding, two membrane-spanning and one substrate-binding domain, which are frequently present on separate polypeptides, although sometimes two or three different domains are located on the same polypeptide (D12_orf634 or D12_orf623), (ii) the phosphoenolpyruvate: carbohydrate phosphotransferase system (PTS) ( 58 ) and (iii) facilitated diffusion systems with transmembrane proteins functioning as specific carriers. According to the present status of annotation, Mycoplasma pneumoniae carries 43 genes involved in the above-mentioned transport systems. In addition, there are several proposed proteins with 6 or 12 transmembrane segments which are candidates for membrane-spanning domains of transport systems. The relatively low number of proteins listed in Table 1 indicates that at least some of the systems might not be very substrate specific, e.g. the transport systems for amino acids. Transport systems for histidine and glutamine, an ORF showing significant similarity to a probable aromatic amino acid permease from yeast, and an ABC transport system for oligopeptides were identified based on similarity of the ATP-binding domains of ABC transporters.
Surprisingly, we could not identify a transport system for the precursors for RNA and DNA synthesis, namely adenine, guanine, uracil and thymine which are essential components of mycoplasma growth media.
In this context one has to be aware of the ambiguity in the identification of ABC transport proteins on the basis of sequence similarity of the ATP-binding proteins with respect to the predicted substrate to be transported, since database searches indicate numerous candidates with different specificities but with very similar, high score values. All the annotations in this paper were done on the basis of the highest score values. Therefore it might be possible that the predicted specificity disagrees with the in vivo activity in M.pneumoniae . Additional information from similarities to transmembrane domains or the substrate-binding proteins is only rarely at hand, since, in general, similarities among these domains are not well conserved. Even in positive examples, the score values are relatively low. Sometimes additional circumstantial evidence is derived from an operon-like organisation of the genes coding for ABC transporters, e.g. the unspecified ABC transporter consisting of the proteins P69, P29 and P37 from nucleotide 519 560 to 523 050 (A05_orf542, A05_orf244 and A05_orf380V). A05_orf542 could act as the membrane-spanning domain, A05_orf244 as the ATP-binding domain and A05_orf380V, as a putative lipoprotein which could function as a substrate-binding protein. These proteins were also identified by their significant similarity to the corresponding genes in M.hyorhinis ( 55 ).
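The ambiguity described above can be illustrated with a minimal sketch. It assigns the description of the highest-scoring hit, as was done for the annotations here, and additionally reports hits with a different predicted specificity whose scores are nearly as high. The record layout, the 5% tolerance and the example scores are illustrative assumptions, not values from this study.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One database hit for a query ORF (hypothetical record layout)."""
    description: str
    score: float


def annotate_by_top_score(hits, tolerance=0.05):
    """Return (best_description, ambiguous_alternatives).

    The best hit is the one with the highest score. Any other hit whose
    score lies within `tolerance` (a fraction of the best score) and whose
    description differs is reported as an ambiguous alternative.
    """
    if not hits:
        return None, []
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    best = ranked[0]
    cutoff = best.score * (1.0 - tolerance)
    alternatives = [
        h.description
        for h in ranked[1:]
        if h.score >= cutoff and h.description != best.description
    ]
    return best.description, alternatives


# Made-up scores for illustration only; not data from this study.
example_hits = [
    Hit("histidine ABC transporter, ATP-binding protein", 412.0),
    Hit("glutamine ABC transporter, ATP-binding protein", 405.0),
    Hit("phosphate ABC transporter, ATP-binding protein", 230.0),
]
best, alternatives = annotate_by_top_score(example_hits)
print(best, alternatives)
```

A non-empty list of alternatives marks an ORF whose predicted substrate should be treated as provisional until supported by other evidence, such as an operon-like gene arrangement.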
In M.pneumoniae the ABC transport system for oligopeptides consists of two different transmembrane [G07_orf376 = amiD (= oppC in B.subtilis); G07_orf389a = oppB] and ATP-binding domains (G07_orf851 = oppF, G07_orf423 = oppD). It is also organized in an operon-like arrangement from nucleotide 750 865 to 756 948. In striking contrast to B.subtilis, the substrate-binding domain (oppA) is absent in M.pneumoniae . Since an oppA homolog is also absent in M.genitalium a sequencing or annotation error seems unlikely. It remains to be experimentally determined whether the substrate-binding protein is dispensable or is part of one of the transmembrane or ATP-binding proteins. It is also possible that one or more of the lipoproteins function as substrate-binding proteins.
There is also evidence for bacterial ABC export systems in M.pneumoniae ( 59 ). For example D12_orf634 (msbA), D12_orf623 (pmd1) and D02_orf660 (lcnDR3) have the conserved ATP binding motif and the membrane-spanning domains on the same polypeptide. In addition D12_orf623 and D12_orf634 show also significant similarities to multidrug resistance proteins of different organisms.
Among the proposed PTS transport systems, we identified one for glucose and one for mannitol. They are similar to the homologous systems from several Gram-positive bacteria, with the EIIA and EIIBC domains on two separate polypeptides for the mannitol transport system and with all three domains (EIIABC) of enzyme II in one polypeptide for the glucose transport system.
Besides glucose and mannitol, fructose also seems to be imported by the PTS system. According to our data the fructose-permease II component R02_orf694 (fruA) contains all three domains of enzyme II in one gene (EIIABC). In addition, R02_orf694 and the 1-phosphofructokinase (fruK, R02_orf300) are probably in one operon, but we do not find fruF which is also part of the fructose operon in enteric bacteria ( 58 ).
Both Gram-positive and Gram-negative bacteria have a well conserved protein translocation system. The components of the well characterized E.coli system ( 60 ) include cytosolic chaperones or regulators [trigger factor, SecB, DnaK, SRP (a ribonucleoprotein composed of 4.5 S RNA and Ffh) and FtsY] which deliver the protein to a membrane receptor (SecA). The receptor is also supposed to function as a motor, pushing the protein across the membrane via specific protein channels (SecY, SecG, SecE, SecD and SecF). The proteins to be secreted carry an N-terminal signal peptide which is removed by a signal peptidase (SPaseI). Two routes of export have been proposed, either via SecB and SecA or via SRP. The protein secretion system in M.pneumoniae is less complex ( Table 1 ). So far, the trigger factor, DnaK, SRP, FtsY and SecA have been identified. Of the channel-forming proteins only SecY is present; SecG, SecF, SecE, SecD and the cytosolic receptor protein SecB are missing. Also absent is the signal peptidase SPaseI, although computer-assisted motif prediction programs indicate the presence of corresponding substrates (signal peptides). The simplified protein export system might reflect the fact that M.pneumoniae is surrounded only by a cytoplasmic membrane. Another problem concerns refolding of secreted proteins, which are normally exported in an unfolded state. Refolding might be catalyzed by chaperones which have to function on the cell surface ( 60 ). This might pose a special problem for wall-less bacteria in general, since they do not possess a periplasmic space which could prevent proteins from diffusing away. Anchoring the proposed chaperones on the cell surface as lipoproteins would be a possible way to solve this problem.
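The comparison with the E.coli export system can be summarized as a simple set difference. The sketch below uses only the component names given in the text; the grouping labels are informal and chosen here for illustration.

```python
# Components of the E.coli export system as listed in the text.
# Group names are informal labels introduced for this sketch.
ECOLI_EXPORT = {
    "cytosolic chaperones/regulators": {
        "trigger factor", "SecB", "DnaK", "SRP", "FtsY",
    },
    "membrane receptor/motor": {"SecA"},
    "channel": {"SecY", "SecG", "SecE", "SecD", "SecF"},
    "signal peptidase": {"SPaseI"},
}

# Components reported in the text as identified in M.pneumoniae.
MPN_IDENTIFIED = {"trigger factor", "DnaK", "SRP", "FtsY", "SecA", "SecY"}


def missing_by_group(reference, identified):
    """For each functional group, list reference components not identified."""
    return {
        group: sorted(members - identified)
        for group, members in reference.items()
    }


missing = missing_by_group(ECOLI_EXPORT, MPN_IDENTIFIED)
for group, names in missing.items():
    print(f"{group}: {', '.join(names) if names else 'none missing'}")
```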
Nucleotide synthesis: purine and pyrimidine salvage pathways
Guanine, guanosine, uracil, thymine, thymidine, cytidine, adenine and adenosine may serve as precursors for nucleic acids and nucleotide coenzymes, as determined in nutritional studies of Mollicutes . These components can be used for the synthesis of ribonucleotides by the salvage pathway as predicted from the enzymes listed ( Table 1 , Fig. 5 ). The ribonucleotides are converted to deoxyribonucleotides by ribonucleoside-diphosphate reductase, an enzyme complex formed by the gene products of nrdE (F10_orf721) and nrdF (F10_orf339). Adenine, guanine and uracil can be metabolized directly to the corresponding nucleoside monophosphates by the enzymes adenine phosphoribosyltransferase (apt, F11_orf133), hypoxanthine-guanine phosphoribosyltransferase (hpt, K05_orf175) and uracil phosphoribosyltransferase (upp, B01_orf178). Adenylate, guanylate and uridylate kinases catalyze the generation of ADP, GDP and UDP. Surprisingly, we could not find the nucleoside diphosphate kinase (ndk), the key enzyme for the conversion from NDP to NTP. This finding is in agreement with data from the genomic sequence analysis of M.genitalium .
Another important enzyme, the CTP synthetase which converts UTP to CTP, is also missing. Therefore the only route towards CTP appears to lead from cytidine to CMP by uridine kinase (H03_orf213) and on to CDP by cytidylate kinase (P01_orf217). Deoxythymidine monophosphate (dTMP) could be synthesized either by thymidine kinase (tdk, B01_orf191) or by thymidylate synthase (thyA, F10_orf328).
It will be of special interest to experimentally identify the enzyme(s) of M.pneumoniae which convert NDPs to NTPs, since such an enzymatic activity seems to be essential.
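The reasoning about the missing NDP to NTP step can be traced with a small reachability sketch over the salvage reactions named in the text. The enzyme labels are informal shorthand rather than annotation names, and the model is deliberately restricted to these reactions: ATP formation by glycolysis or the ATP synthase is left out, so only GTP, UTP and CTP are examined.

```python
from collections import deque

# (substrate, product, enzyme) triples for the salvage reactions named in
# the text. Labels are informal shorthand chosen for this sketch.
REACTIONS = [
    ("adenine", "AMP", "apt"),
    ("guanine", "GMP", "hpt"),
    ("uracil", "UMP", "upp"),
    ("cytidine", "CMP", "uridine kinase"),
    ("AMP", "ADP", "adenylate kinase"),
    ("GMP", "GDP", "guanylate kinase"),
    ("UMP", "UDP", "uridylate kinase"),
    ("CMP", "CDP", "cytidylate kinase"),
    ("GDP", "GTP", "ndk"),
    ("UDP", "UTP", "ndk"),
    ("CDP", "CTP", "ndk"),
    ("UTP", "CTP", "CTP synthetase"),
]

# Enzymes reported as identified; ndk and CTP synthetase were not found.
IDENTIFIED = {
    "apt", "hpt", "upp", "uridine kinase", "adenylate kinase",
    "guanylate kinase", "uridylate kinase", "cytidylate kinase",
}


def reachable(precursors, enzymes):
    """Breadth-first search for all compounds that can be produced from
    the precursors using only reactions whose enzyme is available."""
    seen = set(precursors)
    queue = deque(precursors)
    while queue:
        compound = queue.popleft()
        for substrate, product, enzyme in REACTIONS:
            if substrate == compound and enzyme in enzymes and product not in seen:
                seen.add(product)
                queue.append(product)
    return seen


precursors = ["adenine", "guanine", "uracil", "cytidine"]
targets = {"GTP", "UTP", "CTP"}

missing_without_ndk = targets - reachable(precursors, IDENTIFIED)
missing_with_ndk = targets - reachable(precursors, IDENTIFIED | {"ndk"})
print(sorted(missing_without_ndk), sorted(missing_with_ndk))
```

Within this restricted model the diphosphates are reachable from the medium components, but none of the three triphosphates is reachable until an enzyme with NDP kinase activity is added.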
Carbohydrate metabolism and energy conservation
The ability to metabolize glucose and/or arginine and to use them for ATP synthesis is one of the key features in the classification of Mollicutes. Mycoplasma pneumoniae is listed in Bergey's manual of systematic bacteriology as a glucose fermenter but not as an arginine-hydrolyzing species ( 61 ). This contrasts with our sequencing results, since the three enzymes involved in the arginine degradation pathway, arginine deiminase (H03_orf438), ornithine carbamoyltransferase (H10_orf273) and carbamate kinase (F10_orf309), are present according to our sequence data. The arginine deiminase gene occurs twice, but one copy is inactive due to a frameshift mutation resulting in two proposed ORFs (H10_orf198 and H10_orf238) corresponding to the N-terminal and C-terminal halves of a complete deiminase. The change in reading frame was also confirmed by sequencing of directly amplified genomic DNA. All these proposed ORFs are organized in an operon-like arrangement except for the deiminase (H03_orf438), which seems to be expressed as a single gene located far away from the mentioned operon. Included in this operon is a proposed protein (F10_orf565) with 12 predicted transmembrane domains, indicative of a putative permease.
Glucose, fructose and mannitol are transported by the PTS system into the cell and further degraded by the Embden-Meyerhof-Parnas (EMP) pathway to pyruvate. All enzymes required for this pathway have been identified. The second pathway for metabolizing glucose, the pentose phosphate pathway, is incomplete in M.pneumoniae . We found only the enzymes ribulose-5-phosphate-3-epimerase and transketolase ( Fig. 5 ). Glucose-6-phosphate dehydrogenase (G6Pde), 6-phosphogluconate dehydrogenase (6PGde) and a transaldolase are missing. These data agree with enzymatic studies showing that G6Pde and 6PGde are absent in mycoplasmas ( 62 ).
Pyruvate can be further metabolized by two alternative reactions, either to lactate by lactate dehydrogenase (K05_orf312) or to acetyl-CoA by the pyruvate dehydrogenase complex and further to acetate by the phosphotransacetylase (A05_orf320, pta) and the acetate kinase (G12_orf390, ackA). The pyruvate dehydrogenase complex consists of E1α (F11_orf358a) and E1β (F11_orf327), the two subunits of the pyruvate dehydrogenase, the dihydrolipoamide acetyltransferase E2 (F11_orf402) and the dihydrolipoamide dehydrogenase E3 (F11_orf457). The corresponding genes are clustered (nt 549 943–557 431; pcosMPF11); this cluster also contains the genes coding for NADH oxidase (nox, F11_orf479) and lipoate protein ligase (lplA, F11_orf339). The latter enzyme joins lipoic acid in an amide linkage to the ε-amino group of a lysine residue of the dihydrolipoamide acetyltransferase.
Membrane phospho- and glycolipid synthesis
In M.pneumoniae strain FH the following membrane phospho- and glycolipids have been found: digalactosyldiacylglycerol, trigalactosyldiacylglycerol, glucosylgalactosyldiacylglycerol, phosphatidylglycerol (PG) and diphosphatidylglycerol (DPG) ( 63 ). Since M.pneumoniae FH and M.pneumoniae M129 are very similar we assume that both strains carry essentially the same genes for phospho- and glycolipid synthesis.
About 10 genes are required for the synthesis of the above-mentioned lipids, but according to our DNA sequence analysis only three of the expected genes could be unambiguously identified. They code ( Fig. 5 ) for the enzymes 1-acylglycerol–3-phosphate acyltransferase (plsC; gene name in Saccharomyces cerevisiae is slc1), phosphatidic acid cytidyltransferase (cdsA) and glycerolphosphate phosphatidyltransferase (pgsA). These enzymes are involved in the biochemical pathway for the synthesis of PG and DPG. Missing are the glycerol–3-phosphate acyltransferase (plsB) catalysing the synthesis of 1-acylglycerol–3-phosphate (acyl-G3P) from glycerol–3-phosphate (G3P), the phosphatidylglycerol phosphate phosphatase which converts phosphatidylglycerol–3-phosphate to PG and finally the cardiolipin synthetase (cls) which synthesizes DPG from PG. Interestingly, we find a gene homologous to the plsX gene from E.coli which is involved in membrane lipid synthesis in an undefined manner. The glycolipid synthesis could start with phosphatidic acid and would probably require a phosphatidic acid phosphatase and several UDP-glucosyl- or galactosyltransferases. None of these enzymes could be identified by similarity searches in databases.
As expected from biochemical studies, no gene involved in fatty acid or cholesterol synthesis was detected in the sequence analysis. These components are incorporated as such from the medium.
An interesting enzyme is the proposed carnitine palmitoyltransferase encoded by C09_orf600, which might be involved in the modification of exogenous phosphatidylcholine ( 67 ).
It is impossible to address each proposed M.pneumoniae gene in this paper. We have tried to cover the most important categories of functions and to point to genes which should be present but could not be found by the methods applied. Typical examples are the missing nucleoside diphosphate kinase for the conversion of (d)NDPs to (d)NTPs, and the substrate-binding domain (oppA) of the oligopeptide ABC transporter. In addition, we could not find any indication for a number of genes/proteins which should be there based on experimental evidence. Mycoplasma pneumoniae has been shown to be motile and to exhibit chemotactic behaviour ( 64 ). Motility genes are difficult to identify since the motility of M.pneumoniae is independent of pili or flagella and potential candidate genes are not yet known. Therefore, any progress in this field depends on the isolation of mutants. Furthermore, none of the components of the chemotactic signal pathway, the Che proteins, which are well conserved among bacteria, or any other ‘two-component signal transduction system’ could be detected. Chemotactic behaviour in M.pneumoniae is difficult to study. While it might be possible that these bacteria are chemotaxis negative, only additional experiments will clarify this point.
It has been reported that M.pneumoniae produces hydrogen peroxide, which is considered to be a pathogenicity factor ( 17 ). To protect itself from oxidative stress, one would therefore expect the bacterium to possess the standard enzymes dealing with such stress factors, like catalase, superoxide dismutase or peroxidase, but we have no similarity-based evidence that these enzymes exist in M.pneumoniae . Experimental data on this topic are also inconsistent ( 62 ).
The results of our sequence analysis explain quite well the kind of changes which have led to the observed reduction of the genome size in M.pneumoniae from the presumed genome size of several million base pairs of the ancestral bacteria. The main cause is the loss of complete anabolic (no amino acid synthesis) and metabolic pathways and of genes for the synthesis of complex structures like the bacterial cell wall which requires a large number of genes. In addition, for several processes like DNA repair, DNA recombination, cell division or protein secretion, the number of genes involved is smaller than in the more complex bacteria.
No significant changes were observed in the size of individual genes, which resemble more or less their counterparts in E.coli or B.subtilis . The occasionally observed smaller intergenic regions, like those found in the ATPase operon, do not appear to significantly contribute to the overall genome size reduction.
In contrast with the loss of complete pathways we frequently observed the amplification of complete genes or segments of genes (see sections on lipoprotein families or on the repetitive DNA sequences RepMP2/3, RepMP4 and RepMP5). In these two instances the obvious advantage would be the potential of expressing antigenic variants of surface-exposed proteins.
The various truncated genes which are also present as full-length copies, e.g. arginine deiminase (H03_orf438 and H03_orf238), DNA primase (H91_orf620 and D12_orf212) and the dihydrofolate reductase (H10_orf506 and F10_orf160), might be relics of recombination events which took place in the course of evolution.
Finally among the many proposed proteins are a few which share the highest similarity over their entire length with a eukaryotic protein. The most prominent examples are the pre-B cell enhancing factor (pbeF, D09_orf451) and the carnitine palmitoyltransferase II precursor (cpt2, C09_orf600). Both might be candidates for examples of horizontal gene transfer, but at the present state of analysis a definitive answer cannot be given.
It will be the main task of future studies to reconcile the experimental evidence and the DNA sequence-based predictions, i.e. to identify the genes for observed functions and vice versa, and to assign functions to proposed open reading frames with hitherto unknown functions.
One obvious topic is the comparative analysis between the completely sequenced genomes of the closely related species M.pneumoniae and M.genitalium ( 9 ). Since the present paper is already very voluminous we decided to publish this analysis in an additional paper (Himmelreich et al., in preparation).
We thank R. Frank and A. Bosserhoff for the synthesis of oligonucleotides, B. Reiner for her expertise in computer data analysis, Raphael Mosbach for his technical assistance concerning hardware problems, U. Leibfried for technical assistance, I. Schmidt for preparing the manuscript, D. Hofmann and H. Göhlmann for reading of the manuscript and H. Schaller for financial assistance and his encouragement throughout our work. We thank S. Razin, A. Wieslander, K. Dybvig, K. Sitaraman, R. Walker, H. Neimark and R. Miles who read drafts of this publication. Their corrections, critical comments and suggestions helped us very much. This research was supported by a grant from the Deutsche Forschungsgemeinschaft (He 780/5-1 to He 780/5-4) and by the Fonds der Chemischen Industrie.