Based on the genomic sequence data of Escherichia coli K-12, we have constructed a complete set of individually cloned genes encoding Histidine-tagged proteins, with or without fused GFP, for functional genomic analysis. Each clone encodes the product of a predicted ORF with a Histidine tag and seven spacer amino acids attached at the N-terminal end, and five spacer amino acids and GFP at the C-terminal end. Sfi I restriction sites are generated at both the N- and C-terminal boundaries of the ORF upon cloning, which enables easy transfer of the ORF to other vector systems by cutting with Sfi I. Expression of the cloned ORF is under the control of an IPTG-inducible promoter, which is strictly repressed by the lacI q repressor gene product. The set of cloned ORFs described here should provide a unique resource for systematic functional genomic approaches including (i) construction of DNA microarrays, (ii) production and purification of proteins, (iii) analysis of protein localization by monitoring GFP fluorescence and (iv) analysis of protein–protein interactions.
More than 200 complete microbial genome sequences are now available from publicly accessible databases, such as DDBJ at the National Institute of Genetics, GenBank at the National Center for Biotechnology Information and EMBL-EBI at the European Bioinformatics Institute. 1 Among them, two prokaryotes, the gram-positive bacterium Bacillus subtilis and the gram-negative bacterium Escherichia coli , as well as the unicellular eukaryote Saccharomyces cerevisiae have been used extensively as model organisms for basic biological research. 2–4 Complete genome sequences revealed that although these microorganisms are among the most thoroughly studied genetic systems, <50% of their respective genes had been characterized experimentally. Genome sequence data are valuable resources not only as information complementary to traditional biological approaches but also for the development of new approaches known as ‘functional genomics’, a research area comprising computational analysis of complete genomes followed by experimental testing of emerging hypotheses. 5 , 6
Genome sequence information throws new light on the nature of and interrelationships among bacterial genes, including ancestries, gene families, modules and motifs in comparative terms. 7–9 Comparison of the chromosomal location of genes within and between species might also reveal aspects of genome evolution and functional coupling between genes. 10–12 Detailed information on the structure and function of the genome, and of individual gene products in particular, is most abundant and comprehensive for E. coli , and approximately half of the gene products have been characterized by using a variety of genetic, biochemical and molecular biological techniques. On the basis of extensive data resulting from these analyses, genes of known function were classified into a number of distinct categories. 13 The importance of the E. coli genome as a leading model system seems to have gained rapid recognition for future studies of functional genomics and systems biology.
To further clarify the nature of genes of unknown function, however, a variety of resources, such as individually cloned genes and disruption or deletion mutants for each of the predicted genes, would be particularly useful. Complete genome sequence data have permitted the construction of such resources in Mycoplasma genitalium , Bacillus subtilis and Saccharomyces cerevisiae . 14–17 Construction of genome-wide clones in E. coli has a long history, including plasmid clones, 18 cosmid libraries, 19–21 and phage lambda clones. 22 , 23
Furthermore, an increasing number of proteins having two or more distinct functions, called ‘moonlighting proteins’, have been reported recently. 24 It would therefore be a good opportunity to reevaluate the reported functions of known genes and to explore new functions by using large-scale resources such as the cloned genes reported here as well as deletion and other mutants.
Here, we describe a complete clone set of Escherichia coli genes that should allow systematic functional analyses not only of genes of unknown function but also of genes of known function, whose roles can be re-examined. Each predicted ORF, excluding its start and stop codons, was amplified by PCR and cloned into a multicopy plasmid vector. Each product is Histidine tagged at its N-terminal end and fused to GFP at its C-terminal end. Expression of the cloned genes is directed by the P T5- lac promoter, which can be activated by IPTG but is normally repressed by lacI q placed in cis .
2. Materials and Methods
2.1. Strains and growth conditions
E. coli K-12 strain AG1 [ recA1 endA1 gyrA96 thi-1 hsdR17 ( r K− m K+ ) supE44 relA1 ], which exhibits high transformation efficiency and was purchased from Stratagene, Inc., was used for all plasmid construction and other experiments.
Cells were usually grown in Luria–Bertani (LB) medium [1% Bacto Tryptone (BD Diagnostic Systems), 0.5% yeast extract (BD Diagnostic Systems), 0.5% NaCl] containing 50 µg/ml of ampicillin (Meiji Seika Kaisha, Ltd), 30 µg/ml of kanamycin (Wako Pure Chemical Industries, Ltd) or chloramphenicol (Nacalai Tesque, Inc.) as required. Isopropyl-β- d -thiogalactoside (IPTG) was used at 0.1 mM to induce expression of cloned gene.
2.2. PCR primers
A pair of PCR primers was designed to amplify each predicted ORF of E. coli from the second codon to the last amino acid codon, omitting the initiation and termination codons. Each primer contains 3 or 2 additional bases at its 5′ end followed by 20 or 21 bases of ORF-specific sequence for the N- or C-terminus, respectively. Additional nucleotides were included to facilitate directional cloning and to generate Sfi I restriction sites. Thus, all N-terminal primers have the sequence 5′-GCC(20N)-3′, where 20N is specific for each ORF beginning at the second codon. Similarly, all C-terminal primers have the sequence 5′-(21N)CC-3′, where 21N is specific to each ORF ending with the last amino acid codon. The resulting PCR fragment has the sequence 5′-GCC-(second through last amino acid codon)-GG-3′. All primers were phosphorylated before use in PCR.
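The primer rule described above can be sketched in Python. This is an illustration only, not the authors' software: the function names and the example ORF are hypothetical, and the sketch assumes that the C-terminal primer is synthesized as the reverse complement of the last 21 coding bases with its two extra bases (CC) at the 5′ end, so that the coding strand of the amplicon ends in GG.

```python
# Hypothetical sketch of the primer design rule (not the authors' code).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def design_primers(orf: str):
    """Return (forward primer, reverse primer, expected amplicon).

    `orf` is the coding strand from the initiation codon through the
    termination codon; both of these codons are omitted from the amplicon.
    """
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    core = orf[3:-3]  # second codon through last amino acid codon
    if len(core) < 21:
        raise ValueError("ORF too short for 20/21-base specific regions")
    forward = "GCC" + core[:20]                      # 5'-GCC(20N)-3'
    # Assumption: extra CC at the 5' end of the bottom-strand primer.
    reverse = "CC" + reverse_complement(core[-21:])
    amplicon = "GCC" + core + "GG"                   # coding strand of product
    return forward, reverse, amplicon


if __name__ == "__main__":
    example_orf = "ATGAAACTGACCGATATTCAGGAAGCGTTTCATTAA"  # made-up ORF
    fwd, rev, amp = design_primers(example_orf)
    print("forward :", fwd)
    print("reverse :", rev)
    print("amplicon:", amp)
```

Because the reverse primer anneals to the coding strand, its reverse complement should coincide with the 3′ end of the expected amplicon, which gives a simple consistency check on any designed pair.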
In the following special cases, primer sequences were modified to circumvent fortuitous generation of Sfi I restriction sites. (i) If the ORF sequence happens to be ATG NNN NNG GCC N….N NNN GGC CNN NNN TAA, the PCR-amplified fragment would be GCC NNN NNG GCC N….N NNN GGC CNN NNN GG and the resultant plasmid after cloning would have the sequence ‘… GGCCCTGAG(GGCC NNN NNG GCC) N….N NNN GGC CNN NNN (GGCC TATGCGGCC)…’, generating additional Sfi I sites (underlined) besides those at the peripheries (in parentheses). To avoid this complication, the G residue of the third NNG codon was replaced by an appropriate residue without changing the amino acid encoded, and the third residue of the last amino acid codon was changed from GGC to GGT.
(ii) If the third residue of the last amino acid codon is G, and the ORF sequence is ‘ATG NN….NN NNG TAA’, the PCR-amplified fragment would have the sequence ‘GCC ATG NN….NN NNG GG’, and this fragment has an Sfi I site even if the last G residue was unexpectedly removed by exonuclease activity contaminating the polymerase or restriction enzyme. To avoid this, the last amino acid codon was replaced by a synonymous codon that lacks G at the third residue. However, when the last codon was Met (ATG) or Trp (TGG), no modification was made because no synonymous codon exists.
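A screen for the special cases described above might look like the following sketch. It relies on the Sfi I recognition sequence GGCC(N)5GGCC and takes the vector sequence flanking the blunt insert from the junctions quoted in the text (…GGCCCTGAGG upstream and CCTATGCGGCC… downstream). The function names are hypothetical, and the code only flags ORFs that need attention; choosing the synonymous codon is left to the user.

```python
import re

# Sfi I recognizes GGCCNNNNNGGCC; the lookahead also finds overlapping sites.
SFI_SITE = re.compile(r"(?=(GGCC[ACGT]{5}GGCC))")

# Vector sequence flanking the blunt insert, read from the quoted junctions.
UPSTREAM_FLANK = "GGCCCTGAGG"
DOWNSTREAM_FLANK = "CCTATGCGGCC"


def sfi_sites(seq: str) -> list:
    """Return 0-based start positions of every Sfi I site in `seq`."""
    return [m.start() for m in SFI_SITE.finditer(seq.upper())]


def cloned_region(orf: str) -> str:
    """Flank + amplicon (GCC + codon 2..last amino acid codon + GG) + flank."""
    core = orf.upper()[3:-3]
    return UPSTREAM_FLANK + "GCC" + core + "GG" + DOWNSTREAM_FLANK


def extra_sfi_sites(orf: str) -> list:
    """Positions of Sfi I sites other than the two intended boundary sites."""
    region = cloned_region(orf)
    intended = {0, len(region) - 13}  # each site is 13 bases long
    return [pos for pos in sfi_sites(region) if pos not in intended]


def last_codon_needs_change(orf: str) -> bool:
    """Case (ii): last amino acid codon ends in G and has a synonym."""
    last_codon = orf.upper()[-6:-3]
    return last_codon.endswith("G") and last_codon not in ("ATG", "TGG")


if __name__ == "__main__":
    clean = "ATGAAACTGACCGATATTCAGGAAGCGTTTCATTAA"   # made-up ORF
    case_i = "ATGAAACTGGCCGATATTCAGGAAGCGTTTCATTAA"  # NNG GCC near the start
    print(extra_sfi_sites(clean), extra_sfi_sites(case_i))
```

Checking the assembled junction rather than the ORF alone matters here, because the fortuitous sites in case (i) only arise once the ORF sequence is joined to the bases contributed by the primers and the vector.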
2.3. PCR amplification
PCR amplification was performed in 96-well plates. To minimize PCR errors during amplification, 1 U of KOD DNA polymerase (TOYOBO Co., Ltd.) was used in a 25 µl reaction mixture containing 20–30 ng of Kohara phage clone DNA or E. coli genomic DNA, 1.0 µM of each primer and 200 µM dNTPs. KOD DNA polymerase is a recombinant form of the Thermococcus kodakaraensis DNA polymerase that has 3′→5′ exonuclease-dependent proofreading activity and exhibits very high fidelity. 25 , 26 Reactions were performed according to the manufacturer's instructions and the amplified fragments were blunt-ended by using KOD polymerase. PCR was run for 25 cycles of 95°C for 15 s, 64°C for 15 s and 72°C for 4 min, followed by a final incubation at 72°C for 5 min. All reactions were performed with an automatic sequencing reaction robot, PRIZM877 (Applied BioSystems), to avoid errors arising from manual operation.
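As a quick check on the reaction set-up, the cycling program and primer amount stated above can be written down and totalled. The sketch below is illustrative only; the names are hypothetical, and the run time ignores ramping between temperatures, which depends on the instrument.

```python
# Cycling program as stated in the text: (temperature in deg C, hold in s).
CYCLE_STEPS = [(95, 15), (64, 15), (72, 240)]
N_CYCLES = 25
FINAL_EXTENSION = (72, 300)


def nominal_run_seconds(steps=CYCLE_STEPS, n_cycles=N_CYCLES,
                        final=FINAL_EXTENSION) -> int:
    """Sum of hold times only; ramp times are not included."""
    per_cycle = sum(hold for _temp, hold in steps)
    return n_cycles * per_cycle + final[1]


def primer_pmol(conc_um: float = 1.0, volume_ul: float = 25.0) -> float:
    """Amount of each primer per reaction (micromolar x microlitre = pmol)."""
    return conc_um * volume_ul


if __name__ == "__main__":
    seconds = nominal_run_seconds()
    print(f"nominal hold time: {seconds} s ({seconds / 60:.1f} min)")
    print(f"each primer: {primer_pmol():.0f} pmol per 25 µl reaction")
```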
About 99% of ORFs (4217) could be successfully amplified by KOD DNA polymerase; many of the unsuccessful ORFs were >2 kb, and most of them (∼90%) could be amplified by DyNAzyme (Finnzymes Oy). The remaining ORFs (∼1% of total) were prepared by using LATaq DNA polymerase (TAKARA BIO, Inc.).
2.4. PCR fragment purification
All PCR products were purified by agarose gel electrophoresis (0.7–1.5% gel, depending on the length of the amplified fragments) in 1× TAE buffer. Gels were stained with SYBR Gold (Molecular Probes, Inc.) and visualized under 480 nm light to minimize DNA damage. Gels were then photographed and fragment sizes were checked. PCR products of the expected size were cut out and eluted by using MagExtractor (TOYOBO Co., Ltd) according to the manufacturer's instructions.
2.5. Cloning of individual ORFs into pCA24N vector
DNA fragments purified by agarose gel electrophoresis were individually ligated with vector pCA24N DNA that had been digested with Stu I and dephosphorylated. Ligation was performed for 1 h at 16°C by using a ligation kit (TAKARA BIO, Inc.). The ligated DNA was ethanol precipitated, washed with 70% ethanol, vacuum dried, and resuspended in 10 µl of H 2 O. DNA solution (1 µl) was used for electro-transformation, selecting for chloramphenicol resistance on LB plates. Competent cells (50 µl) prepared according to Laboratory Manuals 27 were mixed with 1 µl of DNA solution in an ice-cold 0.2-cm cuvette (Bio-Rad Laboratories, Inc.) and pulsed at 2.5 kV, 25 µF and 200 ohms, followed by addition of 1 ml of SOC medium (2% Bacto Tryptone, 0.5% yeast extract, 10 mM NaCl, 2.5 mM KCl, 10 mM MgCl 2 , 10 mM MgSO 4 , 20 mM glucose). Cells were then transferred to 96-well deep-well plates and allowed to recover for 1 h at 37°C with shaking at 250 r.p.m. before plating on selective media.
2.6. Confirmation of structure of ORF clones
Plasmid clones were purified with the 96-well format Multi Screen Plasmid DNA purification kit (Millipore Corporation) and their structure was confirmed by digestion with Bgl I or Sfi I restriction enzyme followed by agarose gel electrophoresis. Clones having the expected gel pattern were further analyzed by direct sequencing of PCR-amplified products made by using pCA-F primer 5′-GGCGTATCACGAGGCCCTTTCGTCTTCACC-3′ and pCA-R3 primer 5′-TTGCATCACCTTCACCCTCTCCACTGACAG-3′. The primers used for sequencing were F-CA primer 5′-CATTAAAGAGGAGAAATTAACTATGAGAGG-3′ from the His-tagged side, and R-CA primer 5′-CATCTAATTCAACAAGAATTGGGACAACTC-3′ from the GFP side. The sequencing reaction was performed using ABI PRISM model 377, 3700 or 3730, according to the standard Big-dye protocol (Applied BioSystems).
2.7. GFP fluorescence
Colonies were formed on LB plates containing 30 µg/ml chloramphenicol with (1 mM) or without IPTG. GFP fluorescence of individual colonies was visualized by 480 nm light and photographed by IMAGE FREEZER AE-6905 (ATTO CORPORATION).
2.8. Protein purification
Cells producing Histidine-tagged protein from the clone were grown at 37°C in 5 ml of LB medium supplemented with 30 µg/ml chloramphenicol to an OD 600 of 0.3. Samples were taken 2 h after the addition of 1 mM IPTG. The cells were collected by centrifugation (10 000 r.p.m. × 3 min at 4°C) and resuspended in 400 µl of cold buffer I [50 mM sodium phosphate (pH 7.0), 200 mM NaCl, proteinase inhibitor (Hoffmann-La Roche Ltd)]. Thereafter, all manipulations, including affinity chromatography, were done at 4°C. Crude cell extracts were obtained by sonication (5 × 5 s, level 3, Astrason ultrasonic processor) and centrifugation (16 000 r.p.m. × 15 min). Crude cell extracts were loaded onto a 30 µl Nickel (Ni 2+ )-column [prepared according to the manufacturer's instructions (QIAGEN, Inc) and equilibrated with buffer I]. Loaded columns were washed three times with 1 ml of buffer 2 [buffer I containing 20 mM imidazole and 0.05% n -octyl-β-glucoside (Pierce Biotechnology, Inc)] and proteins were eluted with buffer 3 [50 mM Tris–HCl (pH 6.8), 2% SDS, 0.1% bromophenol blue, 10% glycerol, 100 mM DTT, 6 M urea and 250 mM imidazole]. Eluted proteins were analyzed by 7.5–15% gradient SDS–PAGE (BIOCRAFT Co., Ltd.) followed by Coomassie Brilliant Blue staining.
3. Results and Discussions
3.1. Construction of cloning vector pCA24N
To clone all the genes of Escherichia coli , we constructed a plasmid vector with the following properties: (i) high copy number, (ii) IPTG-inducible expression of the cloned ORF, with repression of expression by lacI q , (iii) a Histidine tag attached to the N-terminal end of the ORF, (iv) in-frame fusion with GFP at the C-terminal end, and (v) generation of Sfi I restriction sites at both boundaries of the cloned ORF ( Fig. 1 ) (see below). The construction of pCA24N is shown schematically in Fig. 2 . Expression of a target ORF inserted at the Stu I site is directed by the IPTG-inducible promoter, P T5- lac . Each ORF is fused in-frame with a Histidine tag and seven spacer amino acids at the N-terminal end ( Fig. 3 ).
The gene encoding a mutated form of GFP 28 was placed immediately downstream of ORF so as to produce fusion protein. This is mainly for detecting localization of protein product in the host cell. GFP was also supposed to serve as an indicator for correct cloning of amplified ORF DNA. It was designed, however, to permit removal of the ORF after removing GFP by digestion with Not I enzyme followed by self-ligation. The final structure of the cloned ORF after removing GFP is shown in Fig. 4 . Clones that have Not I restriction site(s) within ORF can be prepared separately by using partial digestion with Not I enzyme.
3.2. Cloning into pCA24N
The number of ORFs used as initial cloning targets was 4276, excluding ORFs encoded by IS elements. Most of them (4267 out of 4276 ORFs) were successfully cloned into pCA24N in-frame and in the correct orientation. Three of the nine unsuccessful ORFs, yhcQ , yibP and aidB , could be cloned only in the opposite orientation, probably because even a small amount of leaky expression from the P T5- lac promoter or upstream region is toxic to the host cell. The remaining six ORFs, eaeH (7104 bp), yheB (2694 bp), yhiH (2685 bp), yagF (1968 bp), ydbD (2313 bp) and btuB (1845 bp), could not be amplified by PCR, partly because of their length. btuB mRNA was reported to form a complex structure that controls ribosome binding and translation. 29 Such a structure might interfere with PCR amplification.
The purified PCR-amplified fragments were then ligated into the Stu I site of the pCA24N vector. After overnight incubation at 37°C, eight colonies were picked for each ORF and suspended in 1 ml of fresh LB medium in 96-well deep-well plates. The same colonies were also streaked on LB plates containing 1 mM IPTG and incubated at 37°C to check for in-frame cloning by observing growth and GFP fluorescence upon overproduction of the GFP fusion protein encoded by each ORF. After overnight incubation at 37°C, plasmid DNA was extracted and purified by using the Multi Screen DNA purification kit, digested with Sfi I or Bgl I restriction enzymes, and analyzed by agarose gel electrophoresis. Four independent clones with the expected structure were chosen and stored as a DNA mixture to minimize loss of intrinsic genetic information during PCR amplification. Boundaries of cloning sites were confirmed by sequencing.
3.3. Growth inhibition by IPTG induction
To investigate the influence of oversupply of target gene products, we induced expression of each ORF by growing patches of cells at 37°C on LB agar medium with or without 1 mM IPTG. Out of 4269 ORFs tested, 3301 ORFs showed some growth inhibitory effect upon IPTG induction ( Fig. 5 ). Of these, 2149 ORFs showed severe growth defects. Among the ORFs, 1158 were predicted to contain membrane-spanning domains and thus to encode membrane proteins; of these 1158 putative membrane proteins, 1032 showed severe growth defects when overproduced.
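The proportions implied by these counts can be worked out with a few lines of Python. The counts are taken directly from the text above; the dictionary keys and function name are hypothetical labels chosen for this sketch.

```python
# Counts quoted in the text; key names are illustrative labels only.
COUNTS = {
    "tested": 4269,             # ORF clones patched with and without IPTG
    "any_inhibition": 3301,     # some growth inhibition upon induction
    "severe": 2149,             # severe growth defect
    "membrane_predicted": 1158, # predicted membrane proteins
    "membrane_severe": 1032,    # predicted membrane proteins, severe defect
}


def fractions(counts: dict) -> dict:
    """Return the three ratios that the paragraph describes."""
    return {
        "inhibited / tested": counts["any_inhibition"] / counts["tested"],
        "severe / inhibited": counts["severe"] / counts["any_inhibition"],
        "membrane severe / membrane": (
            counts["membrane_severe"] / counts["membrane_predicted"]
        ),
    }


if __name__ == "__main__":
    for label, value in fractions(COUNTS).items():
        print(f"{label}: {value:.1%}")
```

Expressing the counts as fractions makes it easier to compare the predicted membrane proteins with the clone set as a whole.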
3.4. Use of ORF clones
Construction of DNA microarray
First, we tried to construct a cDNA microarray by using the ORF clones as DNA templates for PCR amplification. These clones not only function as templates for PCR amplification with a single set of common oligo DNA primers for every target ORF, but also serve as a stable source of DNA fragments. An E. coli DNA microarray is now commercially available from TAKARA BIO, Inc. Applications using these microarrays have already been published. 30–34
Production of purified protein
The present vector was designed as an efficient expression vector, and the protein product of each clone can be easily induced in quantity by IPTG and purified by using a Ni- or Co-NTA column ( Fig. 6 ). Some proteins, especially most of the membrane proteins, were difficult to purify by this method. A few cytosolic proteins were similarly difficult to purify. One possibility is that the His-tag could not function because of the N-terminal 3D structure. Attaching the His-tag at the C-terminal end should also be considered. The purified proteins are also a good resource for further analyses, such as protein biochemistry and crystallography, and as antigens for antibody production. Furthermore, analysis of proteins co-purified with a His-tagged cloned ORF product would reveal candidate interacting proteins. This analysis, using mass spectrometry for efficient identification of interacting proteins, is now underway systematically.
Protein localization detected by fluorescence from GFP
Protein localization is undoubtedly an important piece of information for understanding the function of a protein product. When induced by adding IPTG, however, 62% of all ORF clones inhibited growth of the host cell, and in some cases IPTG induction might lead to the formation of inclusion bodies. To avoid these artifacts, cells were grown without IPTG. The pCA24N vector carries lacI q for strict repression of expression from the P T5- lac promoter in the absence of IPTG; however, a very small proportion of the cells showed GFP fluorescence, probably because of leaky expression. After pilot tests using clones whose products have a known cellular localization, this analysis was carried out systematically (Niki, H., personal communication and kindly provided pictures, Fig. 7 ).
Functional analysis of ORF
A deletion mutation of a target gene might reveal its physiological function through loss of function, but a clone might also provide many clues to its function through oversupply of the target gene product. Applications using the clones have already been published elsewhere. 35–38
3.5. Distribution of clones
Clones are freely available for academic use under a material transfer agreement and are planned to be distributed from multiple sites in Japan and other countries. Initially, these clones are available from Nara Institute of Science and Technology, and requests are accepted through our web page.
3.6. Quality control and related information
Fixing the ORF prediction from a genome sequence, with respect to both functional annotation and the coding region, is a difficult problem, and the predictions are continually evolving. To improve the quality of our plasmid clone library, we have continued our efforts to reconstruct target ORF clones based on the latest ORF prediction.
In November 2003, annotation of the E. coli genes by an international consortium was started, following the correction of the E. coli genome sequence by a Japanese group organized by Horiuchi (Hayashi, K., Morooka, N., Otsubo, E. et al., manuscript in preparation). Recently, a more accurate ORF prediction and annotation was fixed as the first version by the consortium (Riley, M. et al., manuscript in preparation). The total number of ORFs or ORF fragments of E. coli K-12 W3110 strain is 4364. Of these 4364 ORFs, 77 were predicted to be pseudogenes caused by IS insertion or frameshift mutation. Excluding these pseudogenes and IS or phage-related ORFs, 3986 newly assigned ORFs need to be established as plasmid clones in the ASKA library. According to this information, we found that ∼900 plasmid clones in our library required minor modification, mainly at their start sites; almost all of these have already been modified and are ready for use. At the same time, we have been sequencing the whole cloned fragment to check for PCR errors, keeping one correct clone where available.
As mentioned in Materials and Methods, the previous stock of clones consists of mixtures of up to four independent candidates for each target ORF, whose structure had been confirmed by restriction enzyme digestion and sequencing from both sides. Some of them might carry PCR errors, and we will continue our efforts to eliminate such errors by sequencing and re-cloning. The latest information about the ASKA library is available through our web page.
This work was supported by a Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Culture, Sports, Science and Technology of Japan, a grant from CREST, JST (Japan Science and Technology) and in part from NEDO (New Energy and Industrial Technology Development Organization) and Inamori Foundation. We thank Hironori Niki (National Institute of Genetics) for providing pictures of protein localization. We also thank Takashi Yura, Katsumi Isono (Kobe University) and Takashi Horiuchi (The Institute for Basic Biology) for manuscript preparation. Funding to pay the Open Access publication charges for this article was provided by NEDO.