Abstract

cDNA clones have long been valuable reagents for studying the structure and function of proteins. With recent access to the entire human genome sequence, it has become possible and highly productive to compare the sequences of mRNAs to their genes, in order to validate the sequences and protein-coding annotations of each (1,2). Thus, well-characterized collections of human cDNAs are now playing an essential role in defining the structure and function of human genes and proteins. In this review, we will summarize the major collections of human cDNA clones, discuss some limitations common to most of these collections and describe several noteworthy proteomics applications, focusing on the detection and analysis of protein–protein interactions (PPI). These human cDNA collections contain principally two types of cDNA clones. The largest collections comprise cDNAs with full-length protein coding sequences (FL-CDS). Some but not all of these cDNA clones may represent the entire mRNA sequence, but many are missing considerable non-coding UTR sequence, usually at the 5′ end. A second type of cDNA clone, a ‘full-ORF’ (F-ORF) expression clone, is one where the annotated protein-coding sequence, excised of 5′ UTR and 3′ UTR sequence, has been transferred to a vector designed to facilitate transfer to other vectors for protein expression.

MAJOR HUMAN FL-CDS CLONING PROGRAMS

During the past decade, several large-scale government and academic programs have collected and characterized human FL-CDS cDNA clones. The major programs include the NEDO (FLJ) Project, the Kazusa cDNA Project, the Mammalian Gene Collection (MGC), the German Human cDNA Project, the Harvard Institute of Proteomics (HIP) and the ORFeome program of the Center for Cancer Systems Biology (CCSB) at the Dana-Farber Cancer Institute. These programs distribute their sequence information through GenBank, EMBL/EBI or the DNA Database of Japan (DDBJ), all of which exchange data daily through the International Nucleotide Sequence Database.

The following descriptions address only human cDNA clones, although most of these programs also offer cDNA clones for additional organisms. The utility of cDNA collections for proteomics studies has been described in several recent reviews (3–7). Further details on each program are provided in Table 1.

The human full-length sequencing project of Japan NEDO (FLJ)

The largest human FL-CDS cDNA collection is the NEDO (FLJ) project (8,9), a joint program of the Institute of Medical Sciences of the University of Tokyo, the Helix Research Institute and the Kazusa DNA Research Institute (KDRI) (http://www.nedo.go.jp/bio-e/) (10–12). The NEDO (FLJ) collection (http://fldb.hri.co.jp/cgi-bin/cDNA3/public/publication/index.cgi) contains ∼30 000 human FL-CDS clones representing ∼18 500 loci obtained from ∼130 libraries (10) (S. Sugano, personal communication). Most of the clones in this collection were obtained using the ‘oligo-capping’ method that enriches for cDNAs extending to the 5′ end of the mRNA (13–15). Recently, the 18 500 non-redundant FL-CDS clones have been configured as F-ORF Gateway expression clones (N. Goshima, manuscript in preparation; N. Nomura, personal communication). Clones may be requested from NITE-BRC (see Table 1 for URL) and require a signed material transfer agreement (MTA).

The KDRI HUGE collection of long FL-CDS clones

This collection contains FL-CDS cDNAs ranging in size from 3.3 to over 10 kb, representing more than 2000 novel, non-redundant genes (11,12). A major focus of the HUGE project (http://www.kazusa.or.jp/huge/index.html) is to characterize the function of proteins >50 kD. A significant part of the collection has been manually curated with additional information on possible protein function (11). Nearly 1000 of these clones are also available as F-ORF Gateway™ expression clones, and the same set of F-ORF clones is being constructed in the Flexi® Vector system (O. Ohara, personal communication). This collection has recently been expanded to include an additional ∼3000 large cDNA clones, giving a total of ∼5500 clones. All clones are fully sequenced, and are available from the KDRI through a signed MTA.

Mammalian gene collection (MGC)

The MGC (16,17) is an FL-CDS cDNA cloning and sequencing program sponsored and managed by the US National Institutes of Health (NIH) (http://mgc.nci.nih.gov/). Its goal for human clones is to achieve, by 2007, at least one FL-CDS cDNA clone for each of the well-defined 18 368 human RefSeq loci, discussed subsequently (18) (L. Wagner, personal communication). (Note: see Supplementary Material for a list of these genes.) The first ∼12 000 non-redundant human clones were obtained by MGC from >180 cDNA libraries prepared from a wide variety of tissues. Most new human clones deposited over the past 2 years have been obtained by a directed RT-PCR cloning strategy. MGC full-length cDNAs are sequenced to a high standard of quality, and all sequences are compared with the reference genome to identify mismatches, with differences annotated in the GenBank entries (1,16). Clones are distributed, without restriction, through the IMAGE Consortium.

The German human genome project

Since 1997, a consortium of German scientific institutes has cloned and characterized cDNAs for transcripts then missing from other collections, and provided functional information on these transcripts (19,20). Descriptions of more than 10 000 cDNA clones obtained from this program (19,21,22) are provided at http://mips.gsf.de/proj/cdna/Sites/HU_cDNA_Database.htm. The cDNAs are fully sequenced to a high quality (less than one error per 10 000 nt), and many contain the full protein coding sequence. A set of ∼1200 F-ORF Gateway™ expression clones has recently been constructed and will be released in the near future (S. Wiemann, personal communication). Clones are available, without restriction, through the consortium's commercial distributor, RZPD.

Dana-Farber Cancer Institute Center for Cancer Systems Biology (CCSB)

This program (23,57b) is building a steadily expanding collection of human F-ORF expression clones (http://horfdb.dfci.harvard.edu/). Its goal is to generate a complete set of F-ORF ORFeome clones representing all protein-coding sequences in the human genome. The full coding sequence is PCR-amplified from MGC human FL-CDS clones and transferred into Gateway™ Entry vectors (23). The absence of a stop codon in these clones allows users to create both C- and N-terminal fusion proteins in appropriate expression vectors. This collection presently comprises non-redundant F-ORF expression clones for ∼10 000 human genes. Clones are all end-sequenced and distributed without restriction through Open Biosystems.

Harvard Institute of Proteomics (HIP)

HIP offers an ever-growing collection of human F-ORF expression clones in two different expression vector formats (5,6): Gateway™ (∼4000 F-ORFs, most with and without stop codons) and Creator (∼5500 F-ORFs, representing ∼3000 genes, most with and without stop codons). All are fully sequenced. These clones are distributed without restriction through HIP: http://www.hip.harvard.edu/.

HIP also provides a distribution service to research laboratories, called the Shared Plasmid Resource, whereby HIP distributes clones donated to HIP by any outside research laboratory. Shared Plasmid Resource clones are available without restriction to non-commercial requestors, and to commercial investigators with the permission of the donating laboratory.

Commercial sources of cDNA clones

Sizeable collections of human FL-CDS and F-ORF expression clones are also available from commercial vendors, the largest of which are listed in Table 2, together with some properties of their human clone collections.

Useful databases of human cDNA clone sequences and related information

  • The H-Invitational Database (H-Inv): This database is the product of two international H-Inv workshops, held in 2002 and 2004 (10), which convened a diverse group of bioscientists to annotate and manually curate 41 118 FL-CDS, representing up to 21 037 loci, derived from the largest published collections of human cDNA sequences. The H-Inv database browser (http://www.h-invitational.jp/) describes where each cDNA maps to the genome, together with extensive functional information.

  • UCSC Genome Browser Database: The UCSC Browser (http://genome.ucsc.edu/) displays extensive sequence and annotation information on the sequence of human and 11 other vertebrate genomes, and for several model organisms. The browser supports rapid visualization and querying of genes, gene predictions, mRNAs, ESTs, expression and variation data.

  • The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) provides a resource for browsing manually annotated finished sequences of human, mouse, zebrafish and dog genomes. The Vega browser includes detailed genome maps of genes, transcripts, proteins and protein domains.

LIMITATIONS TO CURRENT cDNA COLLECTIONS

Errors in mRNA sequence and annotation

Reliable gene counting, discussed below, relies primarily on mapping cDNA sequences to the human genome, which in turn often reveals variations between the cDNA and genome sequence (1,2). Variation in cDNA sequences can arise either from experimental artifacts introduced into the cDNA sequence or by naturally occurring sequence variation. Experimental artifacts arise during cDNA synthesis, by reverse transcriptase, during the ligation of the cDNA to the vector, and during subsequent cloning steps, as well as from errors in the DNA sequence analysis of the cDNA insert. When RT-PCR is used to clone cDNAs, errors can arise during the DNA amplification process, as well as from the synthetic DNA primers used for PCR.

An accepted standard for mRNA sequences for human, mouse, and more than 2400 other species is the RefSeq program at NCBI (24). RefSeq provides a carefully curated, non-redundant set of full-length mRNA sequences (including 5′ and 3′ UTR), based on genomic, mRNA and protein sequence evidence. A second set of high-quality, full-length human, mouse and rat mRNA sequences is available through the MGC program at the NIH. Launched in early 2000, the MGC exploited recent advances in gene sequencing technology to set high cDNA sequencing standards (<1 error per 50 000 nt). The MGC sequences also are thoroughly curated to identify and eliminate clones with frameshifts or chimeras, and to annotate non-synonymous variation in the cDNA sequences that could be due to experimental artifact (1,16).

Taken together, the RefSeq and MGC sequences provide the most thoroughly curated collection of human cDNAs. Nevertheless, some errors undoubtedly remain in a fraction of these sequences. Furey et al. (1) recently aligned the combined mRNA sequences of MGC and RefSeq to the reference human genome sequence (July 2003 release); they found that EST and other mRNA sequences support natural variation in the genome sequence about four times more often than in the mRNA sequence, implying that the genome sequence is considerably more accurate than the mRNA sequences in these collections (1). After excluding known and probable polymorphisms the authors estimated about one difference per 2500 nt, representing sequencing errors or other experimental artifacts. A second study of MGC sequences (16) estimated that non-synonymous sequence changes (resulting in altered amino acids) because of experimental artifacts may populate about 10% of non-redundant human MGC clones.
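The error rates quoted in this review can be turned into rough per-clone expectations. The sketch below is only an illustration: it assumes errors occur independently along the sequence (a Poisson approximation that the cited studies do not necessarily make), and the 1500-nt coding sequence is a hypothetical example length.

```python
import math

def expected_errors(length_nt: int, errors_per_nt: float) -> float:
    """Expected number of errors in a sequence of the given length."""
    return length_nt * errors_per_nt

def prob_error_free(length_nt: int, errors_per_nt: float) -> float:
    """Probability of zero errors, assuming independent (Poisson) errors."""
    return math.exp(-expected_errors(length_nt, errors_per_nt))

# Rates quoted in the text; the labels are shorthand for this example only.
RATES = {
    "residual differences (about 1 per 2500 nt)": 1 / 2500,
    "German cDNA standard (1 per 10 000 nt)": 1 / 10_000,
    "MGC standard (1 per 50 000 nt)": 1 / 50_000,
}

if __name__ == "__main__":
    cds_length = 1500  # hypothetical coding sequence length in nt
    for label, rate in RATES.items():
        print(f"{label}: expected errors = "
              f"{expected_errors(cds_length, rate):.3f}, "
              f"P(error-free) = {prob_error_free(cds_length, rate):.3f}")
```

Under these assumptions, the per-nucleotide rate and the clone length together determine how many clones must be sequenced to recover an error-free insert.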

Another kind of error can arise in the sequence record from an incorrect annotation of the protein-coding sequence (ORF) within an mRNA. Determining whether a given ATG is the true initiating codon is difficult when the transcript lacks a stop codon in the 5′ UTR upstream of, and in frame with, that ATG, and when there is no definitive protein evidence for the N-terminal portion of the predicted protein. This is a common circumstance, as 39% of human RefSeq mRNAs (24) have no in-frame upstream stop codon (L. Wagner, personal communication). Choosing the starting ATG in these situations often must rely on gene prediction programs, such as GenScan and NSCAN (2,25), and on algorithms that look for evidence of a transition from non-coding to coding sequences around the ATG (17).
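The in-frame upstream stop codon test described above is simple to express in code. The following is a minimal sketch, not the procedure used by RefSeq or MGC; the function name and the 0-based indexing convention are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def has_upstream_inframe_stop(mrna: str, atg_index: int) -> bool:
    """Return True if a stop codon lies 5' of, and in frame with, the ATG.

    mrna      -- transcript sequence (DNA or RNA alphabet), 5' to 3'
    atg_index -- 0-based position of the 'A' of the annotated start codon
    """
    seq = mrna.upper().replace("U", "T")
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    # Step 5' from the ATG one codon at a time, staying in its reading frame.
    for pos in range(atg_index - 3, -1, -3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return True
    return False

if __name__ == "__main__":
    # TAG at positions 0-2 is in frame with the ATG at position 6.
    print(has_upstream_inframe_stop("TAGCCCATGGCCTAA", 6))
    # Here the TAG sits at positions 1-3, out of frame with the ATG.
    print(has_upstream_inframe_stop("CTAGCCATGGCCTAA", 6))
```

A transcript for which this check returns False is one where the 5′ end of the ORF could extend beyond the cloned sequence, so the annotated ATG needs independent support.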

Likewise, when several ATG codons are in phase with a predicted ORF, the identification of the predominant initiating ATG may be ambiguous. The ribosome scanning model (26,27) predicts that the ATG furthest 5′ on the mRNA is preferred by ribosomes as the initiating ATG, unless it is hidden by RNA secondary structure or is competing with a nearby ATG surrounded by a more favorable Kozak consensus sequence (26,27). In some eukaryotic mRNAs, however, protein synthesis appears to initiate at different levels at two or more ATG codons and, rarely, at non-ATG codons (28–30). Unconventional start codons such as these are likely to be missed during the annotation process.
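The sequence context of a candidate ATG can be scored crudely. The sketch below uses a widely quoted simplification of the Kozak consensus, in which the two most influential positions are a purine at −3 and a G at +4 (with the A of the ATG numbered +1). The three-level labels and the function name are illustrative assumptions, not an annotation pipeline, and RNA secondary structure is ignored.

```python
def kozak_context(mrna: str, atg_index: int) -> str:
    """Classify the context of the ATG at atg_index (0-based) as
    'strong', 'adequate' or 'weak' using only positions -3 and +4."""
    seq = mrna.upper().replace("U", "T")
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    minus3 = seq[atg_index - 3] if atg_index >= 3 else ""
    plus4 = seq[atg_index + 3] if atg_index + 3 < len(seq) else ""
    matches = int(minus3 in ("A", "G")) + int(plus4 == "G")
    return {2: "strong", 1: "adequate", 0: "weak"}[matches]

def inframe_atgs(mrna: str, atg_index: int) -> list:
    """List all ATG positions in the same reading frame as atg_index,
    ordered 5' to 3', as candidates under the scanning model."""
    seq = mrna.upper().replace("U", "T")
    start = atg_index % 3
    return [i for i in range(start, len(seq) - 2, 3) if seq[i:i + 3] == "ATG"]

if __name__ == "__main__":
    example = "GCCACCATGGCCATGC"  # hypothetical sequence
    for pos in inframe_atgs(example, 6):
        print(pos, kozak_context(example, pos))
```

Listing candidates 5′ to 3′ with their context mirrors the reasoning in the text: the first ATG is the default choice unless a nearby in-frame ATG has a clearly better context.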

The gold standard for annotated human protein-coding sequences is the Consensus CDS (CCDS) set of 13 142 genes (31) agreed to at every coding nucleotide by the CCDS group, comprising the NIH, National Center for Biotechnology Information (NCBI), European Bioinformatics Institute (EBI), Wellcome Trust Sanger Institute (WTSI) and University of California, Santa Cruz (UCSC).

Incomplete sequence representation in cDNA libraries

An optimally useful cDNA collection should contain at least one representative transcript for each protein-coding human gene. The total number of human genes, however, is still uncertain. The essentially completed human euchromatic genome sequence, published in 2004, suggested that the human genome encodes 20 000–25 000 genes (32). Recent Ensembl and NCBI gene models predict ∼22 000 human genes (L. Wagner, personal communication; Ensembl Human, 2005, http://www.ensembl.org/Homo_sapiens/index.html), similar in number to recent estimates for gene numbers in mouse (33) and rat (16,34). A more conservative and more recent human gene estimate, based on orthologous genes of mouse and dog (35), proposes ∼19 600 human genes (M. Clamp, manuscript in preparation). These sets of annotated protein-coding genes include some genes whose coding sequence is represented partly or entirely by gene predictions, rather than by the sequence of FL-CDS cDNAs.

Based on NCBI genome build 36.1, the number of human protein-coding genes corresponding only to FL-CDS cDNAs annotated on the genome (i.e. excluding gene models) totals 18 368 (as of March 25, 2006). (Note these genes are identified with the following Entrez query against the Gene database at http://www.ncbi.nlm.nih.gov/: homo_sapiens [ORGANISM] AND protein_coding NOT srcdb_refseq_model [PROPERTIES].) This gene set includes all 13 142 CCDS genes. A complete listing of this set of genes is provided in the Supplementary Material. Currently, the MGC contains one or more FL-CDS clones for 75% of the 18 368 MGC gene set and for 80% of the CCDS genes (L. Wagner, personal communication).
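For readers who prefer to run the query above programmatically, one possible approach is to send it to the NCBI E-utilities `esearch` service. This is an assumption on our part, since the text only gives the query string and the NCBI web site. The sketch below only builds the request URL and performs no network access; the query syntax dates from 2006 and may no longer return the same gene set.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public E-utilities search endpoint (assumed here; not named in the text).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query string copied verbatim from the text.
GENE_QUERY = ("homo_sapiens [ORGANISM] AND protein_coding "
              "NOT srcdb_refseq_model [PROPERTIES]")

def build_gene_search_url(term: str = GENE_QUERY) -> str:
    """Return an esearch URL for the Gene database with the given query."""
    params = {"db": "gene", "term": term, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)

if __name__ == "__main__":
    print(build_gene_search_url())
```

Keeping URL construction separate from the request makes the query easy to inspect before it is sent.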

Until recently, MGC and most other large cDNA collections have been built from clones isolated by random screening of multiple cDNA libraries. RNA from a wide variety of tissues is typically used to promote transcript diversity. Nevertheless, random screening approaches preferentially clone cDNAs for the RNAs in highest abundance, which generally derive from a relatively small number of genes. The vast majority of genes, however, are represented at low abundance, with about 1–10 copies of mRNA per cell (37–39). These mRNA frequency distributions likely contribute to the commonly observed drop in yield as libraries are progressively screened for new clones representing unique genes. To improve the yield of clones for new genes, libraries can be treated to normalize or equalize the abundance classes, and libraries can be pre-subtracted of RNA sequences previously isolated (40–42). An alternative approach that is less sensitive to mRNA abundance is to clone individual RT–PCR products, targeting specific mRNAs, using primers based on RefSeq mRNA sequences (43,44).

Other factors can also contribute to the under-representation of genes and their transcripts in cDNA collections, including the low abundance of certain mRNAs that are unique to one or a few tissues, and therefore difficult to obtain in substantial quantity; cDNAs that encode products toxic to the bacterial cells used for cloning; cDNAs that contain inverted or direct repeats that are unstable during the cloning and propagation steps; and cDNAs greater than 4–6 kb in size, which are less efficiently cloned.

Bioinformatic methods suggest that 35–74% of human genes may utilize alternative splicing (45–48), and additional isoforms can result from alternative transcript initiation sites (49) and alternative poly-A addition sites (50,51). The total number of physiologically relevant RNA isoforms is unknown, but specific isoforms are known to play important roles in cells, such as isoforms encoding proteins functioning in ion channels of nerve, muscle and cardiac cells (46,52–54). The variety of RNA isoforms in different human cells is almost certainly under-represented in most current cDNA clone and sequence collections. For example, less than 10% of the 22 400 FL-CDS clones present in MGC appear to represent splice variants (L. Wagner, personal communication). Whatever deficiency of isoforms exists in today's FL-CDS collections will need to be remedied for future proteomics studies that attempt to sample the entire ORFeome.

Finally, single-exon genes and multi-exon genes encoding small proteins also are generally under-represented in current collections, in part by design. Although physiologically relevant proteins are encoded by single-exon genes (55,56), to avoid artifactually short cDNAs MGC and other programs have purposely excluded transcripts encoding proteins of fewer than 100 amino acids, except where there is strong protein evidence for their natural occurrence (16).

Practical limitations of FL-CDS clones

The FL-CDS clones available from some of the largest cDNA collections (Table 1) are generally unsuitable for immediate use in expression studies, often because they lack an appropriate promoter. Variable lengths of 5′ UTR sequence also can potentially encode unwanted amino acids and complicate the design of N-terminal fusion constructs where it is important to maintain the proper reading frame into the CDS. To properly position the coding sequences next to a promoter of choice or next to sequences encoding N-terminal fusion proteins, such as reporter proteins or epitope tags, the 5′ UTR sequences must be removed. Likewise, to prepare C-terminal fusion proteins, the natural stop codon must be removed and variable lengths of 3′ UTR sequence preferably excised.

The protein-coding sequence, with or without its stop codon, can be excised from cDNAs by PCR or occasionally by restriction digestion, provided suitable restriction sites are available. Both methods are time-consuming and can introduce mutations into the subclone. To address this problem, several groups have transferred the protein-coding sequences from these FL-CDS clone collections into specialized expression vector systems.
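In sequence terms, the excision described above amounts to trimming the UTRs and optionally dropping the stop codon. The sketch below illustrates this on a string. The function name and coordinate convention (0-based start, exclusive end that includes the stop codon) are assumptions, and it is not a model of any particular cloning protocol.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def excise_orf(cdna: str, cds_start: int, cds_end: int,
               keep_stop: bool = True) -> str:
    """Return the annotated CDS without 5' and 3' UTR sequence.

    cds_start -- 0-based index of the 'A' of the initiating ATG
    cds_end   -- exclusive end index, just past the stop codon
    keep_stop -- False yields an ORF suitable for C-terminal fusions
    """
    orf = cdna[cds_start:cds_end].upper()
    if len(orf) < 6 or len(orf) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    if not orf.startswith("ATG"):
        raise ValueError("CDS does not begin with ATG")
    if orf[-3:] not in STOP_CODONS:
        raise ValueError("CDS does not end with a stop codon")
    return orf if keep_stop else orf[:-3]

if __name__ == "__main__":
    # Hypothetical clone: 5' UTR + CDS + 3' UTR.
    clone = "GGGCC" + "ATGGCCAAATAA" + "TTTT"
    print(excise_orf(clone, 5, 17))                   # native expression
    print(excise_orf(clone, 5, 17, keep_stop=False))  # C-terminal fusion
```

The validation steps reflect the practical concerns raised in the text: a misplaced boundary shifts the reading frame, and a retained stop codon prevents read-through into a C-terminal tag.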

To conform to popular convention, the full-length protein-coding region (±stop codon), excised of UTR sequence, will hereafter be referred to as an ‘ORF’. [Note: mRNAs with multiple ATG codons near the 5′ end and in different phases of reading frame can potentially encode more than one open reading frame (ORF) of significant length. The annotated CDS of an mRNA is the ORF that is judged to be the most likely protein coding sequence for that mRNA, based on the bioinformatic criteria discussed earlier.] Clones configured in this manner will be called ‘ORF clones’, and ORF clones in expression vectors will be called ‘Expression ORFs’. The set of human ORF clones representing all protein-coding sequences is referred to as the ‘human ORFeome’.

Some of the most commonly used systems for large-scale cloning and expression of F-ORF clones are listed in Table 3. As shown in Figure 1, these vector systems permit the transfer of one or more ORFs from a ‘donor’ vector to one or multiple different ‘acceptor’ vectors, potentially all in a single experiment (57a,b). (In most cases, acceptor vectors are expression vectors with sequences flanking the ORF that promote its transfer from the donor vector.) Moreover, these transfers are performed using a single protocol that positions each ORF into each new acceptor vector in a configuration suitable for native or fusion protein expression, using reactions that maintain the orientation and proper reading frame of the coding sequence and that rarely introduce mutations. To prepare C-terminal fusion proteins, a separate collection of F-ORF clones lacking a stop codon is generally prepared.

All of the systems listed in Table 3, except for the Univector system, use sites flanking both ends of the ORF in a donor vector that are recognized by rare-cutting enzymes (restriction enzymes or recombinases), virtually eliminating inappropriate cleavage within the cDNA and vector backbone. These systems use positive (antibiotic) selection to obtain the desired product, together with counter-selection to reduce background colonies resulting from acceptor vector lacking an insert; together these constraints lead to extremely low backgrounds of unwanted constructs. Some of the features of these systems are listed in Table 3. A large number of human cDNA clones are available from commercial sources, in some cases as ready-to-use F-ORF expression clones in a variety of expression vectors (Tables 2 and 3).

Because these systems are suitable for the transfer of anywhere from a few ORFs to thousands of ORFs within a short period of time, enabling researchers to rapidly create large numbers of F-ORF expression clones, they are being used increasingly for large-scale proteomics studies, as discussed below.

ORFeome Collaboration

Though several collections of ORF expression clones are available (Tables 1 and 2), no single public collection contains ORFs representing all ∼18 000 well-defined RefSeq human genes, and clones for many of these genes are absent from all of the current collections. To address this need, in 2005, MGC, WTSI, CCSB, HIP, DKFZ and the RIKEN Yokohama Institute organized an effort, named the ORFeome Collaboration, to share resources and new human FL-CDS clones, with the aim of building a complete collection of F-ORF expression clones for all well-defined human genes, configured as Gateway Entry clones. These fully sequenced F-ORF clones will be distributed worldwide, without restriction, to academic, government and commercial researchers.

RECENT APPLICATIONS OF HUMAN ORFeome COLLECTIONS

ORFeome collections can be used at every scale of research from experiments on single ORFs to studies of entire ORFeomes. For example, ORFs can be used one at a time to study protein localization or for structural experiments. ORF collections also lend themselves to module-scale experiments, where a particular pathway or biological function can be examined in its entirety.

But the greatest value can be extracted from ORFeome collections when the entire resource is used to carry out large-scale experiments. Until recently, such studies would have been impossible to carry out because of low numbers of cloned ORFs, the lack of a central repository for those ORFs, and because the ORFs that were available were not in the same vector or were not expression ORFs, but rather cloned cDNAs containing 5′ and/or 3′ UTR. Three notable approaches that have begun to flourish with the availability of ORFeome collections are structural genomics (58), proteome-wide mapping of PPI primarily using the yeast 2-hybrid (Y2H) system (59a,b,60), and genome-scale cell-based assays including high-content screening using automated imaging analysis (61).

Structural genomics

In the emerging field of structural genomics, the aim is to lower the cost and expand the coverage of identified protein folds (62). To reach this goal, there have been several large-scale initiatives, such as the Protein Structure Initiative (http://www.structuralgenomics.org/), that aim to generate structures based on available protein-encoding ORFs. Although these centers have not yet taken full advantage of complete mammalian ORFeome collections, they have had some success with earlier ORFeome collections, such as that for Caenorhabditis elegans (58).

Protein interaction mapping

Two-hybrid (2-H) and one-hybrid (1-H) systems

High-throughput Y2H approaches generally consist of testing all available combinations of proteins as DNA-binding domain (DB-X) and activation domain (AD-Y) fusion proteins (63) (Fig. 2). Although early versions of this system were criticized for having a high false positive rate, the implementation of more stringent Y2H systems and the rigorous retesting of interactions have in large part eliminated this concern. In current versions of the Y2H system, low-copy centromeric vectors are used to reduce the expression level of the fusions to avoid spurious interactions. In many cases, auto-activating baits have an intrinsic trans-activating activity and can easily be eliminated, before starting the screen, by testing for reporter gene activity in yeast cells containing only the DB-X vector. Other DB-X auto-activators arise de novo during the screen and are eliminated using a plasmid shuffling counter-selection. In this method, a counter-selection relying on cycloheximide sensitivity is used to eliminate AD-Y from yeast cells to ensure that the Y2H reporter genes are activated only in the presence of both DB-X and AD-Y (64).

The Y2H system has been used for high-throughput PPI mapping for several model systems including yeast, Drosophila melanogaster and C. elegans. Owing to the availability of extensive ORF collections, similar module-scale and proteome-scale PPI maps have recently been generated for human. Maps have been generated for the Smad TGF-beta signaling pathway (65), mRNA degradation factors (66) and proteins linked to Huntington's disease (67).

Two large-scale human PPI maps have recently been published (59b,60). Both groups screened a significant portion of the ORFeome by testing all pair-wise combinations for interaction. Stelzl et al. (59b) used ∼3500 cDNAs from a human fetal brain expression library in addition to ∼2000 MGC ORFs; in contrast, Rual et al. (60) screened a matrix of 8000 ORFs obtained from MGC cDNAs. Stelzl et al. generated a map of 911 high-confidence interactions (‘edges’) among 401 proteins (‘nodes’), whereas Rual et al. constructed a map of 2754 core interactions between 1549 proteins (Fig. 3). Though Y2H has proven to be a powerful and scalable tool for PPI mapping, it results in a high level of false negatives. Therefore, complementary approaches are needed to generate a more complete map of the human interactome.
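The scale of these screens can be appreciated with some simple arithmetic. The sketch below assumes that a full matrix screen tests every ORF as a DB-X fusion against every ORF as an AD-Y fusion (n × n ordered pairs, including self-pairs), which is our reading of ‘all pair-wise combinations’ and is not stated explicitly. The mean degree is the usual 2E/N for an undirected network.

```python
def matrix_tests(n_orfs: int) -> int:
    """Number of DB-X / AD-Y combinations in a full n x n matrix screen."""
    return n_orfs * n_orfs

def mean_degree(n_edges: int, n_nodes: int) -> float:
    """Average number of interaction partners per protein (2E / N)."""
    return 2 * n_edges / n_nodes

if __name__ == "__main__":
    # Figures quoted in the text.
    print("Rual et al. search space:", matrix_tests(8000))
    print("Stelzl et al. mean degree:", round(mean_degree(911, 401), 2))
    print("Rual et al. mean degree:", round(mean_degree(2754, 1549), 2))
```

Comparing the tens of millions of combinations tested with the few thousand interactions recovered illustrates how sparse the detected network is relative to the search space.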

In contrast to the Y2H, the Yeast 1-Hybrid (Y1H) is designed to detect protein–DNA interactions. Y1H protein–DNA interactions are defined using a single hybrid protein, AD-Y, where Y is a known or putative DNA binding protein (DB). Though the Y1H system has not yet been applied to a large collection of human ORFs, its use and scalability have recently been demonstrated in C. elegans (68). As in the Y2H system, reporter gene expression is used as a readout when AD-Y can bind to a sequence of DNA that has been cloned upstream of a reporter gene.

Other techniques used to detect interactions

In contrast to the Y2H system, where interactions occur in the nucleus, the following techniques have been developed to test interactions in other cell types and cellular compartments.

Split ubiquitin

Recently, a two-hybrid system called ‘split ubiquitin membrane Y2H’ has been adapted for large-scale screening (69). This variant of the 2-hybrid system is currently the only one that allows large-scale investigation of integral membrane proteins, a class that cannot be screened using the traditional Y2H system. A yeast interaction map of ∼2000 interactions among ∼500 membrane proteins has been built using this technique (70).

LUMIER

A luminescence-based mammalian protein–protein interactome mapping system (LUMIER) has recently been described. This method was used in a high-throughput manner to study the transforming growth factor-beta (TGF-β) pathway (71). In this mammalian cell-based assay, bait proteins are fused to Renilla luciferase (RL) and the prey proteins are tagged with the FLAG epitope. Interactions are determined by performing an RL enzymatic assay on immunoprecipitates obtained with an anti-FLAG antibody. Though this technology has not yet been applied on a proteome-wide scale, the system could easily be adapted for such studies.

MAPPIT

MAPPIT (mammalian protein–protein interaction trap) uses a cytokine-receptor-based interaction trap to detect protein–protein interactions (72). The interaction between bait and prey reconstitutes the receptor by bringing the activated JAKs into proximity of functional STAT recruitment sites. This recruitment allows for the activation and dimerization of STATs, which then act as transcription factors to drive a reporter gene. Because assayed interactions occur in the cytosolic sub-membrane space, MAPPIT does not rely on nuclear translocation of bait and prey proteins. Another advantage of this technique is that the readout is ligand-dependent, adding a unique level of control to monitor interactions. The use of heterologous receptors fused to different bait proteins can also allow for detection of modification-dependent PPI, such as phosphorylation-dependent interactions that might be too transient to be detected by the standard Y2H. Although MAPPIT can be used for module-scale screens, this procedure has not yet been adapted for proteome-scale screens.

Disrupting interactions

Disrupting PPI, with small bio-molecules and chemical compounds, can reveal a great deal about the protein features that generate interactions. For example, interaction interfaces and interacting domains can be mapped using reverse-Y2H assays. Such systems can also be exploited for drug discovery, where small molecules are identified that can disrupt an interaction of medical relevance. Reverse-Y2H (73) and MAPPIT (72) systems have been developed successfully to screen for disruptions of PPI. These disruptions can be caused by mutations in cis, within the interacting protein molecules, or in trans, by compounds that prevent the interactions from taking place. In the reverse-Y2H system, the interaction between DB-X and AD-Y can be used to drive the expression of URA3. Expression of this reporter leads to the conversion of 5-FOA to 5-FU, a toxin. This counter-selection is used to identify DB-X and AD-Y pairs that can no longer interact, based on their survival on media containing 5-FOA. In the reverse MAPPIT system, a disrupted protein interaction is identified based on the loss of an interaction between a protein fused to an inhibitor of JAK/STAT signaling and a bait fused to a functional cytokine-receptor, thus allowing for a restoration of reporter gene activity.

Cell-based assays

ORFs can also be used to perform high-throughput cell-based assays. Typically, expression ORFs are transfected into mammalian cells using a high-throughput transfection technique. A wide variety of assays have been performed using such methods. Often these assays depend upon technology that allows for ‘high-content screening’ of cells, so that changes in cell shape and/or protein localization can be detected and analyzed in an automated fashion. For example, Harada et al. (61) used high-content screening in a live cell assay to identify proteins which, when over-expressed, increase proliferation. In this study they identified more than 86 cDNAs that gave rise to increased proliferation in a cell line. Other studies have used similar strategies to study localization of proteins in the cell (20,74).

Integration of Y2H data with other large-scale datasets

Interaction maps can be used as a scaffold to integrate different large-scale datasets. One example of this approach, done in yeast, combined PPI data and mRNA expression data to determine the biological role of topological ‘hubs’, defined as proteins with many interaction partners (75). By coupling high-quality interaction data with expression data, the authors were able to show that hubs can be split into two categories: ‘date hubs’, whose expression profiles correlate relatively weakly with those of their partners over a large number of conditions and which therefore interact with their partners at different times or locations; and ‘party hubs’, which show a higher correlation and bind their partners simultaneously. These results support a model of organized modularity in which date hubs represent ‘higher level’ connectors between modules, whereas party hubs function inside modules.
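The distinction between date and party hubs can be illustrated with a minimal Python sketch. It assumes that each hub is scored by the average Pearson correlation between its expression profile and those of its partners; the function names, the toy profiles and the 0.5 cutoff are hypothetical and are not taken from the cited study (75).

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def classify_hub(hub, partners, expression, cutoff=0.5):
    """Label a hub 'party' or 'date' from its mean co-expression with partners.

    The cutoff is an illustrative assumption, not a published value.
    """
    avg = mean(pearson(expression[hub], expression[p]) for p in partners)
    return ("party" if avg >= cutoff else "date"), avg


# Toy expression profiles over four conditions (hypothetical values).
expression = {
    "hubA": [1.0, 2.0, 3.0, 4.0],
    "p1": [2.0, 4.0, 6.0, 8.0],   # co-expressed with hubA
    "p2": [1.5, 2.5, 3.5, 4.5],   # co-expressed with hubA
    "hubB": [1.0, 2.0, 3.0, 4.0],
    "q1": [4.0, 3.0, 2.0, 1.0],   # anti-correlated with hubB
    "q2": [1.0, 3.0, 1.0, 3.0],   # weakly correlated with hubB
}

label_a, avg_a = classify_hub("hubA", ["p1", "p2"], expression)
label_b, avg_b = classify_hub("hubB", ["q1", "q2"], expression)
```

Here hubA, which is co-expressed with both partners, is labelled a party hub, whereas hubB, whose partners follow different expression patterns, is labelled a date hub.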

Another example of data integration was carried out for C. elegans by combining phenotypic profiling and expression profiling data with PPI data (76). By deleting interactions with fewer than two types of functional evidence, a ‘multiple support network’ of more than 300 proteins and 1000 edges was built. This network was shown to harbor two types of models: protein complexes that constitute discrete molecular machines are represented by clusters of nodes whose edges are supported by both PPI and phenotypic correlation data, whereas proteins involved in the same cellular processes without participating in the same biological pathways correspond to nodes whose edges are supported by phenotypic and expression correlation but lack support from Y2H data. Functions of previously unknown proteins were predicted using ‘guilt by association’ and were consistent with the localization patterns of GFP-tagged proteins.
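The filtering step behind such a multiple support network can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the evidence labels ("ppi", "phenotype", "expression"), the function names and the toy protein pairs are hypothetical, and the edge classes simply restate the two models described above rather than reproduce the published analysis (76).

```python
from collections import defaultdict


def multiple_support_network(evidence, min_types=2):
    """Keep edges supported by at least `min_types` distinct evidence types.

    `evidence` maps an evidence type to an iterable of protein pairs.
    """
    support = defaultdict(set)
    for ev_type, pairs in evidence.items():
        for a, b in pairs:
            support[frozenset((a, b))].add(ev_type)  # edges are undirected
    return {edge: types for edge, types in support.items()
            if len(types) >= min_types}


def classify_edge(types):
    """Assign an edge to one of the two models described in the text."""
    if {"ppi", "phenotype"} <= types:
        return "putative complex"
    if {"phenotype", "expression"} <= types and "ppi" not in types:
        return "shared process"
    return "other"


# Toy evidence sets (hypothetical protein names).
evidence = {
    "ppi": [("A", "B"), ("C", "D"), ("G", "H")],
    "phenotype": [("B", "A"), ("E", "F")],
    "expression": [("E", "F"), ("C", "D")],
}

network = multiple_support_network(evidence)
classes = {edge: classify_edge(types) for edge, types in network.items()}
```

The G-H pair, supported by Y2H data alone, is dropped, whereas A-B (PPI plus phenotype) and E-F (phenotype plus expression) fall into the two classes described above.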

The combination of PPI data with other large-scale datasets was also undertaken in both human proteome-wide Y2H studies, described earlier. To validate their interaction data, Rual et al. (60) correlated PPI data with expression studies in human and mouse tissues, conserved upstream motifs, GO term annotations and phenotype data of orthologous genes in the mouse (Fig. 3). Furthermore, they functionally annotated uncharacterized proteins in the interaction map by integrating PPI data with data from the Online Mendelian Inheritance in Man (OMIM) database. This resulted in the identification of 424 interacting protein pairs for which at least one partner was associated with a disease.
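The final step, selecting interacting pairs in which at least one partner has a disease association, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the gene names, the disease labels and the `disease_linked_pairs` function are hypothetical, and a real analysis would draw the associations from OMIM.

```python
def disease_linked_pairs(interactions, disease_genes):
    """Return interacting pairs with at least one disease-associated partner.

    Each result is (protein_a, protein_b, set_of_disease_associated_partners).
    """
    linked = []
    for a, b in interactions:
        associated = {gene for gene in (a, b) if gene in disease_genes}
        if associated:
            linked.append((a, b, associated))
    return linked


# Hypothetical interaction list and disease annotations.
interactions = [("GENE1", "GENE2"), ("GENE2", "GENE3"), ("GENE4", "GENE5")]
disease_genes = {"GENE1": "disorder X", "GENE5": "disorder Y"}

linked = disease_linked_pairs(interactions, disease_genes)
```

In this toy example two of the three pairs are retained, and the uncharacterized partners GENE2 and GENE4 become candidates for annotation through their disease-associated interactors.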

In a similar fashion, Stelzl et al. (59b) evaluated their dataset by comparing it with GO annotation and interaction maps in other species. They also compared their PPIs to the Kyoto Encyclopedia of Genes and Genomes (KEGG), which allowed them to identify proteins that link two or more proteins annotated to act in the same pathway.
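The pathway comparison can be sketched as a search for proteins whose interaction partners include two or more members of the same annotated pathway. The sketch below is a hedged illustration only: the `pathway_linkers` function, the protein names and the pathway labels are hypothetical, and the published analysis (59b) may have used different criteria.

```python
from collections import defaultdict


def pathway_linkers(interactions, pathway_of):
    """Find proteins that interact with >= 2 members of the same pathway.

    `pathway_of` maps a protein to the set of pathways it is annotated to.
    Returns {protein: {pathway: set_of_partners_in_that_pathway}}.
    """
    neighbours = defaultdict(set)
    for a, b in interactions:
        neighbours[a].add(b)
        neighbours[b].add(a)

    linkers = {}
    for protein, partners in neighbours.items():
        by_pathway = defaultdict(set)
        for partner in partners:
            for pathway in pathway_of.get(partner, ()):
                by_pathway[pathway].add(partner)
        hits = {pw: members for pw, members in by_pathway.items()
                if len(members) >= 2}
        if hits:
            linkers[protein] = hits
    return linkers


# Hypothetical interactions and pathway annotations.
interactions = [("X", "P1"), ("X", "P2"), ("Y", "P1"), ("Y", "Q1")]
pathway_of = {"P1": {"pathway_1"}, "P2": {"pathway_1"}, "Q1": {"pathway_2"}}

linkers = pathway_linkers(interactions, pathway_of)
```

Protein X is reported because it binds two members of pathway_1, whereas Y, whose partners belong to different pathways, is not.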

FUTURE APPLICATIONS OF ORFeome CLONES

The integration of PPIs with other large-scale data has proven very useful to build better network models, evaluate Y2H interactions and infer function for previously uncharacterized genes. This integration process, however, is still far from reaching its full potential. In order to gain a more profound understanding of PPIs in cellular networks, the integration of more complete datasets is required. Although most current large-scale studies only use parts of the available proteome, improved large-scale approaches should take advantage of the ORFeome to generate truly proteome-wide datasets. Furthermore, most current studies do not take into consideration alternative splice forms, but rather collapse alternatively spliced transcripts of single genes into a single ORF. Because splice variants and other RNA isoforms commonly follow different expression patterns in time and space, often related to different biological functions (77), their individual interactions should be treated as individual data points.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at HMG Online.

ACKNOWLEDGEMENTS

P.L., D.E.H., S.M. and M.V. were supported by grants from the Ellison Foundation, the NCI, NHGRI and NIGMS awarded to M.V.

Conflict of Interest statement. None declared.

Figure 1. ORF transfer from a donor vector to multiple acceptor vectors. The ORF sequence in one or more donor vectors can be transferred in a single experiment to multiple acceptor vectors, using a single reaction protocol. The ORF in the donor vector is flanked by recombination sites (Gateway, Creator, Magic) or by rare-cutting restriction enzyme sites (Flexi Vector) that permit directional transfer and maintain the desired translational reading frame between the ORF and sequences in the acceptor vectors encoding N-terminal or C-terminal protein tags. Adapted from Hartley et al. (57a), Walhout et al. (57b).

Figure 2. Five types of protein interaction assays. This figure summarizes the most common one- and two-hybrid assays used for PPI screens. (A) A variety of one- and two-hybrid assays have been developed to test PPIs in different cell types and subcellular locations. (1) The MAPPIT assay takes place in the cytosolic sub-membrane space of mammalian cells (72). (2) LUMIER is the only system with an extracellular interaction readout and can detect interactions taking place in any subcellular location (71). (3) The Y1H system screens for protein–DNA interactions and relies on the recruitment of the AD-Y protein to the yeast nucleus (68). (4) The Y2H system requires both fusion proteins to be located inside the yeast nucleus to detect PPIs (63). (5) The split ubiquitin assay is designed to detect PPIs between integral membrane proteins that take place in the yeast membrane (70). (B) In each of the assays presented here, baits and preys are represented in red and dark blue, respectively. (1) In the MAPPIT system, a ligand (L) binds the receptor's ligand-binding domain, activating the receptor-associated JAKs. The mutation Tyr1138->Phe on the cytosolic domain of the receptor prevents the recruitment and activation of STAT. Two additional mutations, Tyr985->Phe and Tyr1077->Phe, eliminate adaptor and/or negative feedback mechanisms. Upon phosphorylation of these binding sites, STATs are recruited and activated, leading to the formation of STAT complexes that subsequently induce luciferase activity or puromycin resistance under the control of the rPAP1 promoter. (2) In the LUMIER assay, RL-tagged baits and FLAG-tagged preys are immunoprecipitated from mammalian cells. Interactions are detected enzymatically in the form of light emission. (3) The Y1H system detects protein–DNA interactions using a single hybrid protein, AD-Y. A positive interaction activates a reporter gene. (4) In the Y2H system, the interaction of two fusion proteins, DB-X and AD-Y, reconstitutes a transcription factor that activates a reporter gene. (5) In the split ubiquitin system, one integral membrane protein is fused to one half of the ubiquitin protein (NubG), whereas the second membrane protein is fused to the other half of the ubiquitin protein and a transcription factor (Cub-PLV). Interacting proteins bring the two halves of ubiquitin into proximity, thereby reconstituting that protein, which is then cleaved by a ubiquitin-specific protease, releasing the transcription factor.

Figure 3. Integrating interaction maps with other large-scale datasets. This PPI network represents a sub-network of the human interaction map generated by Rual et al. All interactions, or edges, represented here have been confirmed by one or more additional functional links. Proteins are depicted as yellow nodes with adjoining gene symbols. Combined physical interactions and functional links between gene- or protein-pairs are depicted as magenta edges (for gene pairs that are co-expressed or share a common conserved upstream motif), green edges (for protein pairs that share a common GO term) or orange edges (for protein pairs that have mouse orthologs that share a common phenotype). Figure adapted from Rual et al. (60).

Table 1.

Government and academic programs to build collections of human FL-CDS clones and F-ORF expression clones

| Source | No. of genes^a | Sequence validation | Stop codon | UTR sequences | Type of clone | Vector system | Clones distributed (MTA req'd) | Miscellaneous (References) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NEDO (FLJ) | 18 500 | Full-length | + | + | FL-CDS | pME18S-FL^b | NITE-BRC (Yes^c) | >30 000 clones from oligo-capping cDNA libraries (S. Sugano, personal communication; N. Nomura, personal communication; 14,15)^e |
| | 18 500 | Full-length | ±^g | | F-ORF | Gateway™, pENTR201 | TBD | (N. Goshima, in preparation; N. Nomura, personal communication) |
| MGC | >13 000 | Full-length | + | + | FL-CDS | Various types | IMAGE (No) | >22 000 clones (78)^f |
| Kazusa | >5 500 | Full-length | + | + | FL-CDS | Various types | KDRI^c (Yes) | ORFs >4 kb |
| | >1 000 | Full-length | ± | | F-ORF | Gateway™, Flexi®^d | KDRI^c,d (Yes) | (79) |
| DKFZ | >10 000 | Full-length | + | + | Many as FL-CDS | | RZPD (No) | (20,22,80,81) |
| | ∼1 200 | Full-length | ± | | F-ORF | Gateway™, pENTR221 | RZPD (No) | |
| CCSB | >10 000 | End sequences | | | F-ORF | Gateway™, pENTR223 | Open Biosystems (No) | (23,57b) |
| HIP | ∼4 000 | Full-length | Most ± | | F-ORF | Gateway™, pENTR201, pENTR221 | HIP (No) | (5) |
| | ∼3 000 | Full-length | Most ± | | F-ORF | Creator™ pDNR-dual | HIP (No) | |

^a Most of these collections also include mRNA isoforms for a small percentage of these genes.

^b The pME18S-FL vector is a mammalian expression vector containing a modified SV40 promoter and the SV40 small t splice donor and acceptor 5′ to the cDNA insert; proteins expressed from this vector contain a short N-terminal fusion peptide from SV40 (13).

^c Clones available without license only to academic researchers for research use (requires MTA).

^d A matching set of ORF expression clones in the Flexi® Vector system is under construction.

^e Clones originating from the University of Tokyo and the Helix Research Institute are distributed through the NITE Biological Resource Center (http://www.nbrc.nite.go.jp/e/hflcdna-e.html), and clones originating from KDRI are distributed through KDRI. Kazusa clones distributed by KDRI: (http://www.kazusa.or.jp/NEDO/clone.req/index.html).

^f MGC clones are distributed through the IMAGE Consortium distributors (see Table 2).

^g ± denotes that expression ORFs are available with and without stop codons.

Table 2.

Commercial sources of full-length human ORF clones

| Commercial provider | Clone collection name | No. of unique genes represented^b | Sequence validation | Stop codon | UTR sequences present | Vector system | Miscellaneous (References) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IMAGE distributors^a | MGC | >13 000 | Full-length | + | + | Mixed types | (78) |
| Genecopoeia | ORF express | ∼13 500 | >70% Full-length | ± | | Gateway™ entry | Genecopoeia cDNA templates |
| | OmicsLink | ∼13 500 | ORF transferred from ORF express^c | ± | | OmicsLink expression vectors (38 types) | Ready-to-use expression clones |
| Invitrogen | Ultimate™ ORF collection | >12 000 | Full-length | + | | Gateway™ entry | MGC cDNA templates |
| Open Biosystems | Freedom™ ORF collection | 1 100 | Full-length | ± | | Creator™ pDNR-dual | cDNA templates (82) |
| | Incyte gene collection | >10 000 | Full-length | + | + | Mixed types | (82) |
| | ORFeome collection | >10 000 | Ends only | | | Gateway™ entry | MGC cDNA templates |
| Origene | TrueClone™ collection | >17 000 | 75% Ends 25% F-L | + | + | pCMV6-XL4 | Isolated from cDNA libraries |
| | FlexClones | >150 | Full-length | ± | | Flexi Vector® | (83) |
| RZPD | Full ORF shuttle | ∼3 700 | Full-length | Most ± | | Gateway™ entry | |
| | Full-ORF expression | ∼550 | Full-length | ± | | Gateway™ expression (15 types) | Ready-to-use attB expression clones |

^a IMAGE distributors: ATCC www.atcc.org; GeneService www.geneservice.co.uk; Invitrogen www.invitrogen.com; Open Biosystems www.openbiosystems.com; RZPD www.rzpd.de.

^b Most of these collections include a small percentage of other mRNA isoforms.

^c Transfer uses a non-PCR method (RecJoin™) with a low likelihood of introducing mutations.

Table 3.

Expression-convenient vector systems for ORF expression: design, strengths and weaknesses

| System | Source | ORF transfer by | Strengths | Weaknesses | References |
| --- | --- | --- | --- | --- | --- |
| Gateway™ | Invitrogen | λ-att recombination | High efficiency, ORF transfer bi-directional; largest number of ORF clones available in this format. Compatible with Multisite™ Gateway, allowing exchange of additional elements, besides ORF. The system has been adapted for large-scale insert transfer by bacterial mating. | Residual 23–25 bp attB sites; most Gateway clone collections lack rare-cutting restriction enzyme sites flanking ORF; Clonase enzyme costs; license required for commercial use. | (57a,b,84a,b) |
| Creator™ | Clontech-Takara | Cre-lox recombination | Low cost for enzyme and no licensing costs for commercial uses. | Residual 34 bp lox-P sites; modest collections of human ORF clones in this format. | (85) |
| MAGIC™ | Steve Elledge (Harvard) and Open Biosystems | E. coli homologous recombination | Flexible homology arms; transfer by mating avoids plasmid prep; low cost; no license required for commercial uses. | Recombination within vector regions that are homologous; no standardized format for recombination sites and no sizeable collections of human clones in this format. Requires flanking recombination sites of ∼50 bp. | (86) |
| Flexi® | Promega | Rare RE+ligation (SgfI/PmeI) | RE sites add only three residual amino acids; low cost; donor libraries can be created in the expression vector; ORF transfer bi-directional for native and N-terminal, but not C-terminal fusions. | Enzymatic digest and ligation required; adds Val to C-terminus of native proteins. Few human ORF clones available in Flexi-compatible vectors. License required for some commercial applications. | (87) |
| Univector/Echo™ | Steve Elledge (Harvard) and Invitrogen | Cre-lox recombination | Low cost | Requires plasmid fusion, with inefficient transfer of large inserts; residual 34 bp lox-P site. Few human ORF available in this format. | (88) |

References

1. Furey, T.S., Diekhans, M., Lu, Y., Graves, T.A., Oddy, L., Randall-Maher, J., Hillier, L.W., Wilson, R.K. and Haussler, D. (2004) Analysis of human mRNAs with the reference genome sequence reveals potential errors, polymorphisms, and RNA editing. Genome Res., 14, 2034–2040.

2. Brent, M.R. (2005) Genome annotation past, present, and future: how to define an ORF at each locus. Genome Res., 15, 1777–1786.

3. Rual, J.F., Hill, D.E. and Vidal, M. (2004) ORFeome projects: gateway between genomics and omics. Curr. Opin. Chem. Biol., 8, 20–25.

4. Pearlberg, J. and LaBaer, J. (2004) Protein expression clone repositories for functional proteomics. Curr. Opin. Chem. Biol., 8, 98–102.

5. Brizuela, L., Braun, P. and LaBaer, J. (2001) FLEXGene repository: from sequenced genomes to gene repositories for high-throughput functional biology and proteomics. Mol. Biochem. Parasitol., 118, 155–165.

6

Marsischky, G. and LaBaer, J. (

2004
) Many paths to many clones: a comparative look at high-throughput cloning methods.
Genome Res.
,
14
,
2020
–2028.

7

Weaver, T., Maurer, J. and Hayashizaki, Y. (

2004
) Sharing genomes: an integrated approach to funding, managing and distributing genomic clone resources.
Nat. Rev. Genet.
,
5
,
861
–866.

8

Ota, T., Suzuki, Y., Nishikawa, T., Otsuki, T., Sugiyama, T., Irie, R., Wakamatsu, A., Hayashi, K., Sato, H., Nagai, K. et al. (

2004
) Complete sequencing and characterization of 21 243 full-length human cDNAs.
Nat. Genet.
,
36
,
40
–45.

9. Yudate, H.T., Suwa, M., Irie, R., Matsui, H., Nishikawa, T., Nakamura, Y., Yamaguchi, D., Peng, Z.Z., Yamamoto, T., Nagai, K. et al. (2001) HUNT: launch of a full-length cDNA database from the Helix Research Institute. Nucleic Acids Res., 29, 185–188.

10. Imanishi, T., Itoh, T., Suzuki, Y., O'Donovan, C., Fukuchi, S., Koyanagi, K.O., Barrero, R.A., Tamura, T., Yamaguchi-Kabata, Y., Tanino, M. et al. (2004) Integrative annotation of 21 037 human genes validated by full-length cDNA clones. PLoS Biol., 2, e162.

11. Kikuno, R., Nagase, T., Nakayama, M., Koga, H., Okazaki, N., Nakajima, D. and Ohara, O. (2004) HUGE: a database for human KIAA proteins, a 2004 update integrating HUGEppi and ROUGE. Nucleic Acids Res., 32, D502–D504.

12. Ohara, O., Nagase, T., Ishikawa, K., Nakajima, D., Ohira, M., Seki, N. and Nomura, N. (1997) Construction and characterization of human brain cDNA libraries suitable for analysis of cDNA clones encoding relatively large proteins. DNA Res., 4, 53–59.

13. Suzuki, Y., Yoshitomo-Nakagawa, K., Maruyama, K., Suyama, A. and Sugano, S. (1997) Construction and characterization of a full length-enriched and a 5′-end-enriched cDNA library. Gene, 200, 149–156.

14. Suzuki, Y. and Sugano, S. (2001) Construction of full-length-enriched cDNA libraries. The oligo-capping method. Meth. Mol. Biol., 175, 143–153.

15. Suzuki, Y. and Sugano, S. (2003) Construction of a full-length enriched and a 5′-end enriched cDNA library using the oligo-capping method. Meth. Mol. Biol., 221, 73–91.

16. Gerhard, D.S., Wagner, L., Feingold, E.A., Shenmen, C.M., Grouse, L.H., Schuler, G., Klein, S.L., Old, S., Rasooly, R., Good, P. et al. (2004) The status, quality, and expansion of the NIH full-length cDNA project: the mammalian gene collection (MGC). Genome Res., 14, 2121–2127.

17. Strausberg, R.L., Feingold, E.A., Grouse, L.H., Derge, J.G., Klausner, R.D., Collins, F.S., Wagner, L., Shenmen, C.M., Schuler, G.D., Altschul, S.F. et al. (2002) Generation and initial analysis of more than 15 000 full-length human and mouse cDNA sequences. Proc. Natl Acad. Sci. USA, 99, 16899–16903.

18. MGC-NIH (2006) http://mgc.nci.nih.gov/.

19. Wiemann, S., Weil, B., Wellenreuther, R., Gassenhuber, J., Glassl, S., Ansorge, W., Bocher, M., Blocker, H., Bauersachs, S., Blum, H. et al. (2001) Toward a catalog of human genes and proteins: sequencing and analysis of 500 novel complete protein coding human cDNAs. Genome Res., 11, 422–435.

20. Simpson, J.C., Wellenreuther, R., Poustka, A., Pepperkok, R. and Wiemann, S. (2000) Systematic subcellular localization of novel proteins identified by large-scale cDNA sequencing. EMBO Rep., 1, 287–292.

21. Wiemann, S., Arlt, D., Huber, W., Wellenreuther, R., Schleeger, S., Mehrle, A., Bechtel, S., Sauermann, M., Korf, U., Pepperkok, R. et al. (2004) From ORFeome to biology: a functional genomics pipeline. Genome Res., 14, 2136–2144.

22. Wiemann, S., Bechtel, S., Bannasch, D., Pepperkok, R. and Poustka, A. (2003) The German cDNA network: cDNAs, functional genomics and proteomics. J. Struct. Func. Genomics, 4, 87–96.

23. Rual, J.F., Hirozane-Kishikawa, T., Hao, T., Bertin, N., Li, S., Dricot, A., Li, N., Rosenberg, J., Lamesch, P., Vidalain, P.O. et al. (2004) Human ORFeome version 1.1: a platform for reverse proteomics. Genome Res., 14, 2128–2135.

24. Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 33, D501–D504.

25. Brent, M.R. and Guigo, R. (2004) Recent advances in gene structure prediction. Curr. Opin. Struct. Biol., 14, 264–272.

26. Kozak, M. (2002) Pushing the limits of the scanning mechanism for initiation of translation. Gene, 299, 1–34.

27. Kozak, M. (2005) Regulation of translation via mRNA structure in prokaryotes and eukaryotes. Gene, 361, 13–37.

28. Hann, S.R. (1994) Regulation and function of non-AUG-initiated proto-oncogenes. Biochimie, 76, 880–886.

29. Kochetov, A.V., Sarai, A., Rogozin, I.B., Shumny, V.K. and Kolchanov, N.A. (2005) The role of alternative translation start sites in the generation of human protein diversity. Mol. Genet. Genomics, 273, 491–496.

30. Touriol, C., Bornes, S., Bonnal, S., Audigier, S., Prats, H., Prats, A.C. and Vagner, S. (2003) Generation of protein isoform diversity by alternative initiation of translation at non-AUG codons. Biol. Cell, 95, 169–178.

31. NCBI (2005) CCDS Database. NLM-NCBI.

32. International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature, 431, 931–945.

33. Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P. et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.

34. Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren, E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E. et al. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428, 493–521.

35. Lindblad-Toh, K., Wade, C.M., Mikkelsen, T.S., Karlsson, E.K., Jaffe, D.B., Kamal, M., Clamp, M., Chang, J.L., Kulbokas, E.J., III, Zody, M.C. et al. (2005) Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438, 803–819.

36. NCBI (2003) Gnomon description. NLM-NCBI.

37. Axel, R., Feigelson, P. and Schutz, G. (1976) Analysis of the complexity and diversity of mRNA from chicken liver and oviduct. Cell, 7, 247–254.

38. Ryffel, G.U. and McCarthy, B.J. (1975) Complexity of cytoplasmic RNA in different mouse tissues measured by hybridization of polyadenylated RNA to complementary DNA. Biochemistry, 14, 1379–1385.

39. Bishop, J.O., Morton, J.G., Rosbash, M. and Richardson, M. (1974) Three abundance classes in HeLa cell messenger RNA. Nature, 250, 199–204.

40. Hirozane-Kishikawa, T., Shiraki, T., Waki, K., Nakamura, M., Arakawa, T., Kawai, J., Fagiolini, M., Hensch, T.K., Hayashizaki, Y. and Carninci, P. (2003) Subtraction of cap-trapped full-length cDNA libraries to select rare transcripts. Biotechniques, 35, 510–516, 518.

41. Bonaldo, M.F., Lennon, G. and Soares, M.B. (1996) Normalization and subtraction: two approaches to facilitate gene discovery. Genome Res., 6, 791–806.

42. Carninci, P., Shibata, Y., Hayatsu, N., Sugahara, Y., Shibata, K., Itoh, M., Konno, H., Okazaki, Y., Muramatsu, M. and Hayashizaki, Y. (2000) Normalization and subtraction of cap-trapper-selected cDNAs to prepare full-length cDNA libraries for rapid discovery of new genes. Genome Res., 10, 1617–1630.

43. Baross, A., Butterfield, Y.S., Coughlin, S.M., Zeng, T., Griffith, M., Griffith, O.L., Petrescu, A.S., Smailus, D.E., Khattra, J., McDonald, H.L. et al. (2004) Systematic recovery and analysis of full-ORF human cDNA clones. Genome Res., 14, 2083–2092.

44. Wu, J.Q., Garcia, A.M., Hulyk, S., Sneed, A., Kowis, C., Yuan, Y., Steffen, D., McPherson, J.D., Gunaratne, P.H. and Gibbs, R.A. (2004) Large-scale RT-PCR recovery of full-length cDNA clones. Biotechniques, 36, 690–696, 698–700.

45. Mironov, A.A., Fickett, J.W. and Gelfand, M.S. (1999) Frequent alternative splicing of human genes. Genome Res., 9, 1288–1293.

46. Modrek, B. and Lee, C. (2002) A genomic view of alternative splicing. Nat. Genet., 30, 13–19.

47. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

48. Johnson, J.M., Castle, J., Garrett-Engele, P., Kan, Z., Loerch, P.M., Armour, C.D., Santos, R., Schadt, E.E., Stoughton, R. and Shoemaker, D.D. (2003) Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science, 302, 2141–2144.

49. Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M.C., Maeda, N., Oyama, R., Ravasi, T., Lenhard, B., Wells, C. et al. (2005) The transcriptional landscape of the mammalian genome. Science, 309, 1559–1563.

50. Beaudoing, E., Freier, S., Wyatt, J.R., Claverie, J.-M. and Gautheret, D. (2000) Patterns of variant polyadenylation signal usage in human genes. Genome Res., 10, 1001–1010.

51. Edwalds-Gilbert, G., Veraldi, K.L. and Milcarek, C. (1997) Alternative poly(A) site selection in complex transcription units: means to an end? Nucleic Acids Res., 25, 2547–2561.

52. Lu, R., Alioua, A., Kumar, Y., Eghbali, M., Stefani, E. and Toro, L. (2006) MaxiK channel partners: physiological impact. J. Physiol. (Lond.), 570, 65–72.

53. Kelley, C.A. and Adelstein, R.S. (1994) Characterization of isoform diversity in smooth muscle myosin heavy chains. Can. J. Physiol. Pharmacol., 72, 1351–1360.

54. Diss, J.K., Fraser, S.P. and Djamgoz, M.B. (2004) Voltage-gated Na+ channels: multiplicity of expression, plasticity, functional implications and pathophysiological aspects. Eur. Biophys. J., 33, 180–193.

55. Sakharkar, K.R., Sakharkar, M.K., Culiat, C.T., Chow, V.T.K. and Pervaiz, S. (2006) Functional and evolutionary analyses on expressed intronless genes in the mouse genome. FEBS Lett., 580, 1472–1478.

56. Sakharkar, M.K., Chow, V.T., Ghosh, K., Chaturvedi, I., Lee, P.C., Bagavathi, S.P., Shapshak, P., Subbiah, S. and Kangueane, P. (2005) Computational prediction of SEG (single exon gene) function in humans. Front Biosci., 10, 1382–1395.

57a. Hartley, J.L., Temple, G.F. and Brasch, M.A. (2000) DNA cloning using in vitro site-specific recombination. Genome Res., 10, 1788–1795.

57b. Walhout, A.J., Temple, G.F. et al. (2000) GATEWAY recombinational cloning: application to the cloning of large numbers of open reading frames or ORFeomes. Meth. Enzymol., 328, 575–592.

58. Luan, C.H., Qiu, S., Finley, J.B., Carson, M., Gray, R.J., Huang, W., Johnson, D., Tsao, J., Reboul, J., Vaglio, P. et al. (2004) High-throughput expression of C. elegans proteins. Genome Res., 14, 2102–2110.

59a. Walhout, A.J., Sordella, R. et al. (2000) Protein interaction mapping in C. elegans using proteins involved in vulval development. Science, 287, 116–122.

59b. Stelzl, U., Worm, U., Lalowski, M., Haenig, C., Brembeck, F.H., Goehler, H., Stroedicke, M., Zenkner, M., Schoenherr, A., Koeppen, S. et al. (2005) A human protein–protein interaction network: a resource for annotating the proteome. Cell, 122, 957–968.

60. Rual, J.F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., Berriz, G.F., Gibbons, F.D., Dreze, M., Ayivi-Guedehoussou, N. et al. (2005) Towards a proteome-scale map of the human protein–protein interaction network. Nature, 437, 1173–1178.

61. Harada, J.N., Bower, K.E., Orth, A.P., Callaway, S., Nelson, C.G., Laris, C., Hogenesch, J.B., Vogt, P.K. and Chanda, S.K. (2005) Identification of novel mammalian growth regulatory factors by genome-scale quantitative image analysis. Genome Res., 15, 1136–1144.

62. Chandonia, J.-M. and Brenner, S.E. (2006) The impact of structural genomics: expectations and outcomes. Science, 311, 347–351.

63. Fields, S. and Song, O. (1989) A novel genetic system to detect protein–protein interactions. Nature, 340, 245–246.

64. Vidalain, P.O., Boxem, M., Ge, H., Li, S. and Vidal, M. (2004) Increasing specificity in high-throughput yeast two-hybrid experiments. Methods, 32, 363–370.

65. Colland, F., Jacq, X., Trouplin, V., Mougin, C., Groizeleau, C., Hamburger, A., Meil, A., Wojcik, J., Legrain, P. and Gauthier, J.M. (2004) Functional proteomics mapping of a human signaling pathway. Genome Res., 14, 1324–1332.

66. Lehner, B. and Sanderson, C.M. (2004) A protein interaction framework for human mRNA degradation. Genome Res., 14, 1315–1323.

67. Goehler, H., Lalowski, M., Stelzl, U., Waelter, S., Stroedicke, M., Worm, U., Droege, A., Lindenberg, K.S., Knoblich, M., Haenig, C. et al. (2004) A protein interaction network links GIT1, an enhancer of huntingtin aggregation, to Huntington's disease. Mol. Cell, 15, 853–865.

68. Deplancke, B., Dupuy, D., Vidal, M. and Walhout, A.J. (2004) A gateway-compatible yeast one-hybrid system. Genome Res., 14, 2093–2101.

69. Obrdlik, P., El-Bakkoury, M., Hamacher, T., Cappellaro, C., Vilarino, C., Fleischer, C., Ellerbrok, H., Kamuzinzi, R., Ledent, V., Blaudez, D. et al. (2004) K+ channel interactions detected by a genetic system optimized for systematic studies of membrane protein interactions. Proc. Natl Acad. Sci. USA, 101, 12242–12247.

70. Miller, J.P., Lo, R.S., Ben-Hur, A., Desmarais, C., Stagljar, I., Noble, W.S. and Fields, S. (2005) Large-scale identification of yeast integral membrane protein interactions. Proc. Natl Acad. Sci. USA, 102, 12123–12128.

71. Barrios-Rodiles, M., Brown, K.R., Ozdamar, B., Bose, R., Liu, Z., Donovan, R.S., Shinjo, F., Liu, Y., Dembowy, J., Taylor, I.W. et al. (2005) High-throughput mapping of a dynamic signaling network in mammalian cells. Science, 307, 1621–1625.

72. Eyckerman, S., Verhee, A., der Heyden, J.V., Lemmens, I., Ostade, X.V., Vandekerckhove, J. and Tavernier, J. (2001) Design and application of a cytokine-receptor-based interaction trap. Nat. Cell Biol., 3, 1114–1119.

73. Endoh, H., Walhout, A.J. and Vidal, M. (2000) A green fluorescent protein-based reverse two-hybrid system: application to the characterization of large numbers of potential protein–protein interactions. Meth. Enzymol., 328, 74–88.

74. Conrad, C., Erfle, H., Warnat, P., Daigle, N., Lorch, T., Ellenberg, J., Pepperkok, R. and Eils, R. (2004) Automatic identification of subcellular phenotypes on human cell arrays. Genome Res., 14, 1130–1136.

75. Han, J.D., Bertin, N., Hao, T., Goldberg, D.S., Berriz, G.F., Zhang, L.V., Dupuy, D., Walhout, A.J., Cusick, M.E., Roth, F.P. et al. (2004) Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature, 430, 88–93.

76. Gunsalus, K.C., Ge, H., Schetter, A.J., Goldberg, D.S., Han, J.D., Hao, T., Berriz, G.F., Bertin, N., Huang, J., Chuang, L.S. et al. (2005) Predictive models of molecular machines involved in Caenorhabditis elegans early embryogenesis. Nature, 436, 861–865.

77. Stamm, S., Ben-Ari, S., Rafalska, I., Tang, Y., Zhang, Z., Toiber, D., Thanaraj, T.A. and Soreq, H. (2005) Function of alternative splicing. Gene, 344, 1–20.

78. MGC-NIH (2006) Mammalian Gene Collection: http://mgc.nci.nih.gov/.

79. Okazaki, N., Imai, K., Kikuno, R.F., Misawa, K., Kawai, M., Inamoto, S., Ohara, R., Nagase, T., Ohara, O. and Koga, H. (2005) Influence of the 3′-UTR-length of mKIAA cDNAs and their sequence features to the mRNA expression level in the brain. DNA Res., 12, 181–189.

80. DKFZ (2006) DKFZ-Wiemann lab standard protocols: http://www.dkfz.de/smp-cell/cell.org/groups.asp?siteID=49.

81. Arlt, D., Huber, W., Liebel, U., Schmidt, C., Majety, M., Sauermann, M., Rosenfelder, H., Bechtel, S., Mehrle, A., Bannasch, D. et al. (2005) Functional profiling: from microarrays via cell-based assays to novel tumor relevant modulators of the cell cycle. Cancer Res., 65, 7733–7742.

82. OpenBiosystems (2006) Mammalian Resources: cDNAs http://www.openbiosystems.com/Genomics/Mammalian%20Resources/cDNA%20Clones/.

84a. House, B.L., Mortimer, M.W. and Kahn, M.L. (2004) New recombination methods for Sinorhizobium meliloti genetics. Appl. Environ. Microbiol., 70, 2806–2815.

84b. Cheo, D.L., Titus, S.A., Byrd, D.R., Hartley, J.L., Temple, G.F. and Brasch, M.A. (2004) Concerted assembly and cloning of multiple DNA segments using in vitro site-specific recombination: functional analysis of multi-segment expression clones. Genome Res., 14, 2111–2120.

85. Clontech (2006) Creator™ Gene Expression Systems: http://www.clontech.com/clontech/products/families/creator/index.shtml.

86. Li, M.Z. and Elledge, S.J. (2005) MAGIC, an in vivo genetic method for the rapid construction of recombinant DNA molecules. Nat. Genet., 37, 311–319.

87. Blommel, P.G., Martin, P.A., Wrobel, R.L., Steffen, E. and Fox, B.G. (2005) High efficiency single step production of expression plasmids from cDNA clones using the flexi vector cloning system. Protein Expr. Purif., December 5 [Epub ahead of print].

88. Liu, Q., Li, M.Z., Leibham, D., Cortez, D. and Elledge, S.J. (1998) The univector plasmid-fusion system, a method for rapid construction of recombinant DNA without restriction enzymes. Curr. Biol., 8, 1300–1309.

Supplementary data