Given the rapid pace at which human disease genes are being identified, there is a need for practical, cost-efficient genetic screening tests. Two-dimensional electrophoretic separation of PCR-amplified gene fragments on the basis of size and base pair sequence, in non-denaturing and denaturing gradient polyacrylamide gels respectively, provides a rapid parallel approach to gene mutational scanning. The accuracy of the denaturing gradient gel electrophoresis (DGGE) component of this system depends strongly on the design of the PCR primers and the melting characteristics of the fragments they encompass. We have developed a fully automated, generally applicable procedure to generate optimal two-dimensional test designs with a minimum of time and effort. Designs were generated for the RB1, TP53, MLH1 and BRCA1 genes that can be readily implemented in research and clinical laboratories as low cost genetic screening tests.
Rapid and accurate identification of DNA sequence heterogeneity is being recognized as of major importance in disease management. Comprehensive testing for gene mutational differences can provide diagnostic and prognostic information, which, in the context of integrated relational databases, could offer the opportunity for individualized, more effective health care. Practical examples include current attempts to initiate pre-symptomatic testing programs by looking for mutations in genes predisposing to such common diseases as breast and colon cancer (1).
Thus far, nucleotide sequencing remains the gold standard for accurate detection and identification of mutational differences. However, the high costs involved in sequencing have prompted the development of a large number of potentially more cost-effective alternative pre-screening techniques, including SSCP and SSCP-derived methods (2,3), chemical or enzymatic mismatch cleavage (4–6) and denaturing gradient gel electrophoresis (DGGE) (7–9). More recently, chip-based oligonucleotide arrays have been tested as a high throughput alternative, but such systems are not yet operational for large human disease genes (10,11).
Of the currently available systems, DGGE appears to be the most powerful method for detecting mutational differences between DNA fragments (12–14). The method involves a comparison of PCR-amplified fragments run side-by-side in a polyacrylamide gel against a gradient of increasing temperature or chemical denaturants. Each mutational variant will melt at a specific position along the gradient, which can be predicted on the basis of the melting theory (15).
A practical way of applying DGGE to mutational scanning of large human disease genes involves PCR amplification of each exon or part thereof, followed by two-dimensional electrophoresis of the mixture, using size separation in combination with broad range denaturing gradients. This method, termed two-dimensional gene scanning (TDGS) (16), allows identification of mutations in multiple exons simultaneously and has now been successfully applied in several laboratories (17–21). The approach is greatly facilitated by the introduction of an extensive multiplexing protocol, which permits amplification of at least 25 fragments simultaneously in a single PCR reaction (18,22).
A serious disadvantage of TDGS, and of DGGE in general, which greatly hinders its rapid application to mutational scanning of novel genes, is the need to pre-design target fragments. Primer pairs have to be designed for optimal PCR amplification, optimal melting behavior of the amplicons and optimal two-dimensional gel distribution, with every criterion limiting the available options. For TDGS tests involving many PCR fragments, manually designing a single set of conditions that is optimal for PCR, melting behavior and two-dimensional fragment distribution is not practical. Until now, DGGE test designs were based on two different software programs. First, each exon of a given gene was analyzed by computational simulation of the DNA melting behavior using the program MELT87 (15). This program allows identification of the different melting domains within the known sequence of a DNA molecule and calculation of their specific Tm values. PCR primers, one of which is attached to a GC clamp (8,23), can be positioned so as to obtain optimal melting behavior of each fragment. Subsequently, primer pairs corresponding to DGGE-optimized fragments need to be evaluated for their suitability in PCR using a primer design program. For TDGS a third criterion involves possible overlap of fragments in the gel.
Here we report the development and empirical testing of a fully automated computer program for TDGS test designs. On the basis of a known gene sequence the program creates optimized DGGE or TDGS designs of large genes with multiple exons within minutes. Its application is illustrated by empirically verified TDGS tests for RB1, TP53, MLH1 and BRCA1.
Materials and Methods
The TDGS computer design program was written by Beltronics Inc. (Newton, MA) following detailed instructions from the first and senior authors of this paper. The main shell that calls all sub-programs was written in Visual Basic (total disk space 4 MB). The program was built around Primer Designer 3 (commercially obtained from Scientific & Educational Software, State Line, PA) and MELT87 (kindly provided by Dr Leonard S.Lerman, Massachusetts Institute of Technology, Cambridge, MA). The percentage of urea/formamide (% UF) was calculated according to % UF = (Tm − 57) × 3.2.
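The conversion from predicted melting temperature to denaturant concentration can be illustrated with a short function. This is a minimal sketch in Python, not the code of the design program itself; the function name `percent_uf` is hypothetical and Tm is assumed to be given in °C.

```python
def percent_uf(tm_celsius: float) -> float:
    """Convert a predicted melting temperature (deg C) to % urea/formamide.

    Uses the relation stated in the text: % UF = (Tm - 57) x 3.2.
    The function name is illustrative only.
    """
    return (tm_celsius - 57.0) * 3.2


if __name__ == "__main__":
    # Print the denaturant percentage for a few example Tm values.
    for tm in (60.0, 67.0, 75.0):
        print(f"Tm = {tm:.1f} C -> {percent_uf(tm):.1f}% UF")
```

Under this relation a fragment with a Tm of 57°C sits at 0% UF, and each additional degree of Tm shifts the predicted position by 3.2% UF.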
Primers were obtained from Genosys Biotechnologies Inc. (The Woodlands, TX). For complete lists of all sequences, except MLH1 and BRCA1, see van Orsouw et al., Rines et al. and Smith et al. (18,28,29). Primer sequences for BRCA1 will be published elsewhere, but will be made available upon request. PCR amplification of gene sequences was carried out using the two-step protocol first described by Li and Vijg (22). Primers for long distance PCR were designed based on published sequences (24–26; primary accession no. X54156) using Primer Designer 3 to amplify the entire gene-coding region for each of the four genes as a 1-plex PCR (TP53), a 6-plex PCR (RB1), a 4-plex PCR (MLH1) or a 7-plex PCR (BRCA1). The LA PCR kit (Takara) was used for long PCR in a PTC-100 thermocycler (MJ Research). Multiplex short PCR was carried out using the long PCR products as template. Between 0.1 and 1.125 µM of each primer was used in a 50 µl reaction with 1 µl long PCR product as template in 20 mM Tris-HCl, pH 8.4, 50 mM KCl, 250 µM each dNTP and 5% formamide. Taq DNA polymerase (2.5 U) (Life Technologies) was added after an initial denaturation at 94°C for 60 s. Cycling conditions for multiplex short PCR and concentrations of MgCl2 varied for different genes and amplifications were carried out in a PTC-100 thermocycler (MJ Research).
Two-dimensional DNA electrophoresis
For RB1, 5 µl of multiplex short PCR product was used per electrophoresis run. For TP53, MLH1 and BRCA1, 5 µl each of the different multiplex groups were combined. One tenth of a volume of loading buffer (0.25% xylene cyanol, 0.25% bromophenol blue, 15% Ficoll and 100 mM Na2EDTA) was added and the mixtures were loaded onto a 6.5% (TP53) or 10% (RB1, MLH1 and BRCA1) polyacrylamide (PAA) non-denaturing size gel (acrylamide:bisacrylamide 37.5:1) in 0.5× TAE buffer. The samples were electrophoresed in the DGGE-4000 electrophoresis apparatus (CBS Scientific Co., Solana Beach, CA) for 5.3 h at 150 V (RB1), 5 h at 120 V (TP53) or 7.5 h at 140 V (MLH1 and BRCA1) at 50°C. After staining the gel with a mixture of equal amounts of SYBR green I and II (Molecular Probes, Eugene, OR) for 20 min, the region containing all fragments of interest (usually between 100 and 600 bp) was cut out and loaded onto a denaturing gradient gel (DGGE). Gradients used were 0–50% UF for RB1, 20–70% UF for TP53, 25–70% UF for MLH1 and 20–65% UF for BRCA1. The second dimension was run for 12 h at 100 V (RB1), 14 h at 120 V (TP53) or 16 h at 100 V (MLH1 and BRCA1). Spot patterns were visualized by SYBR green staining using a FluorImager (Molecular Dynamics, Sunnyvale, CA).
Computerized fully automated design of TDGS target fragments
The combination of extensive multiplex PCR and two-dimensional electrophoretic resolution of the resulting amplicons (TDGS) is schematically depicted for a hypothetical 11 exon gene (Fig. 1).
The computer program developed to automate the tedious design of TDGS tests essentially combines MELT87 and a commercially obtained PCR design program (Primer Designer 3; Scientific and Educational Software, State Line, PA). The main shell was written in Visual Basic and runs on a Pentium-based PC. The program calculates a series of optimal PCR primer pairs surrounding the target fragments (usually exons) and subsequently selects those that show optimal melting behavior, i.e. a two domain structure with a GC clamp as the highest melting domain. For the final selection the program calculates the optimal spot distribution over the two-dimensional gel.
A simplified flow chart of the sequential steps involved in the design of primers for TDGS is depicted in Figure 2. The nucleotide sequence of the gene(s) is imported directly, e.g. as a GenBank file, and the start and end positions of the exons to be included in the test are manually indicated. All the subsequent steps are automated.
Variables in the PCR primer design sub-routine include stability, Tm, %GC and absence of polypurine and polypyrimidine tracts, dimers and secondary structures, such as hairpins (Table 1). Maximum fragment size was set at 500 bp (excluding the GC clamp). If an exon exceeds 500 bp (for example exon 11 of BRCA1) the program creates overlapping fragments, thereby automatically assigning these to different multiplex groups. The final assignment of fragments to a multiplex group depends on two criteria: (i) the annealing temperature of primers should be within ±5°C of the annealing temperature of the first fragment in the group; (ii) no dimer formation between any primers within the same multiplex group should occur.
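The two grouping criteria, together with the rule that overlapping fragments go to different groups, can be sketched as a simple greedy assignment. The Python below is an illustration under stated assumptions and not the program's actual routine: the `Fragment` record, the function names and the caller-supplied `forms_dimer` predicate are all hypothetical, and the order in which fragments are considered is assumed to be the input order.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Fragment:
    name: str
    start: int          # first base of the fragment in the gene sequence
    end: int            # last base of the fragment (inclusive)
    anneal_temp: float  # primer annealing temperature, deg C


def overlaps(a: Fragment, b: Fragment) -> bool:
    """True if the two fragments share at least one base position."""
    return a.start <= b.end and b.start <= a.end


def assign_multiplex_groups(
    fragments: List[Fragment],
    forms_dimer: Callable[[Fragment, Fragment], bool],
    max_delta: float = 5.0,
) -> List[List[Fragment]]:
    """Greedy grouping: place each fragment in the first compatible group.

    A group is compatible when (i) the annealing temperature is within
    max_delta of the group's first fragment, (ii) no primer dimers are
    predicted with any member (forms_dimer is an assumed external check),
    and the fragment overlaps no member of the group.
    """
    groups: List[List[Fragment]] = []
    for frag in fragments:
        for group in groups:
            if abs(frag.anneal_temp - group[0].anneal_temp) > max_delta:
                continue
            if any(overlaps(frag, g) or forms_dimer(frag, g) for g in group):
                continue
            group.append(frag)
            break
        else:
            # No existing group accepted the fragment: open a new one.
            groups.append([frag])
    return groups


if __name__ == "__main__":
    demo = [
        Fragment("11.1", 100, 500, 60.0),
        Fragment("11.2", 450, 900, 61.0),   # overlaps 11.1
        Fragment("12", 2000, 2300, 67.0),   # annealing temperature too far off
        Fragment("13", 3000, 3200, 58.0),
    ]
    for i, grp in enumerate(assign_multiplex_groups(demo, lambda a, b: False), 1):
        print(i, [f.name for f in grp])
```

A greedy pass of this kind does not guarantee the smallest number of groups; it is shown only because it maps directly onto the criteria as stated.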
PCR-optimized primer pairs are listed by rank, from more to less suitable primers. Each primer pair defines a fragment with at least 5 bp between the 5′-end of a primer and the exon-intron boundary (to ensure inclusion of possible splice site variants) and no more than 200 bp into the intronic flanking regions. If primers fulfilling the default parameters are not found these criteria can be modified, beginning with percentage GC, until primers are found. Such cases are separately reported, with the option of rejection. (In a future version of the program modification of criteria will happen automatically; in this testing phase such an option was not implemented.)
A routine check is subsequently carried out for possible homology with sequences in the long distance PCR fragments that serve as the template for the DGGE-optimized PCR (Fig. 1) as well as dimer formation with any of the other primers. If such homologies are found these primers are either re-designed or assigned to different short PCR multiplex groups. Another default value activated at this stage is the minimum fragment length (including the GC clamp) of 100 bp. Theoretically the smallest possible melting domain is estimated to be 50 bp (8). However, fragments of this size no longer fit the electrophoretic conditions standardized for TDGS (see Materials and Methods).
PCR-optimized primer pairs are imported into the melt sub-routine, which positions each of a number of predefined GC clamps (Table 2) at either the 5′- or the 3′-end of the fragments. Sometimes it was necessary to include an additional small GC clamp at the unclamped end of the fragment in order to make the fragment behave as one single melting domain (Fig. 3A, Table 2). Indeed, a small dip in melting temperature can cause smearing of the fragment during DGGE (results not shown). Likewise, a small AT-rich clamp was sometimes attached to the inside of the GC clamp in case the Tm did not decrease quickly enough (Fig. 3B). Several formal criteria were used to determine whether a melting map was satisfactory or should be rejected. The main criterion involves a constant temperature with an accepted deviation of no more than 1°C from the average domain temperature. If no optimal melting variant can be found the program switches to the primer pair next in rank and so on. If none of the primer pairs can be optimized for DGGE by GC clamping the program checks, via a back-loop (Fig. 2), if the exon can be subdivided into one or more fragments which demonstrate optimal behavior. If no optimal primers can be found the default values can be lowered, beginning with the PCR design criteria and ultimately with the melting domain criteria.
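The main acceptance criterion and the fall-through to the next-ranked primer pair can be sketched as follows. This Python fragment assumes that a melting map for the target domain (excluding the GC clamp) is already available as a list of per-position melting temperatures, for example from a MELT87-style calculation; it does not compute the map itself, and the function names are hypothetical.

```python
from statistics import mean
from typing import Optional, Sequence


def domain_is_flat(melt_temps: Sequence[float], max_dev: float = 1.0) -> bool:
    """Accept a melting domain if every position lies within max_dev (deg C)
    of the average domain temperature."""
    if not melt_temps:
        raise ValueError("melting map is empty")
    avg = mean(melt_temps)
    return all(abs(t - avg) <= max_dev for t in melt_temps)


def first_acceptable(
    ranked_maps: Sequence[Sequence[float]], max_dev: float = 1.0
) -> Optional[int]:
    """Walk the candidates in rank order and return the index of the first
    one whose melting map passes, or None if all are rejected."""
    for rank, temps in enumerate(ranked_maps):
        if domain_is_flat(temps, max_dev):
            return rank
    return None


if __name__ == "__main__":
    candidates = [
        [70.0, 70.2, 67.5, 70.1],  # contains a dip, should be rejected
        [71.0, 71.3, 70.9],        # flat enough, should be accepted
    ]
    print(first_acceptable(candidates))
```

A return value of `None` corresponds to the case in which the program would go on to subdivide the exon or relax the default values.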
Each optimized primer pair is automatically translated into a spot in a two-dimensional pattern. Each new spot is subjected to the two-dimensional distribution sub-routine. For this purpose it was necessary to empirically determine the minimum distance for both size and Tm between which two spots could still be observed separately. For size it was found that when this distance increases exponentially from 5 bp for 100–200 bp fragments to 20 bp for fragments of 500–600 bp there is still ample space for distinct spot observation (Table 1). Default values for Tm were more difficult to establish. Since the melting theory provides an approximation rather than a true reflection of a fragment's melting behavior, spot positions cannot always be accurately predicted. We adopted a difference of at least 5% UF between two fragments of identical size as the default value (Table 1). A fragment not fulfilling two-dimensional distribution requirements has to be re-designed. Under the current conditions it can be calculated that a two-dimensional test can consist of ∼100 spots at maximum. At higher numbers excessive overlap requires the use of different gels or an increase in gel resolution.
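The distribution check can be sketched as a pairwise test on predicted spot positions. In the Python below, only the two end points of the size rule (5 bp for 100–200 bp fragments, 20 bp for 500–600 bp fragments) and the 5% UF default come from the text; the intermediate values are an assumption, obtained by doubling the gap every two 100 bp bins, and the actual values are those of Table 1. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Spot:
    name: str
    size_bp: int       # fragment size, first dimension
    percent_uf: float  # predicted % UF, second dimension


def min_size_gap(size_bp: int) -> float:
    """Minimum size separation in bp for a fragment of the given size.

    ASSUMPTION: exponential interpolation between 5 bp (100-200 bp bin)
    and 20 bp (500-600 bp bin); sizes outside that range are clamped.
    """
    bin_index = min(max((size_bp - 100) // 100, 0), 4)
    return 5.0 * 2 ** (bin_index / 2)


def spots_conflict(a: Spot, b: Spot, min_uf_gap: float = 5.0) -> bool:
    """Two spots conflict when they are too close in both dimensions."""
    size_gap = max(min_size_gap(a.size_bp), min_size_gap(b.size_bp))
    too_close_in_size = abs(a.size_bp - b.size_bp) < size_gap
    too_close_in_uf = abs(a.percent_uf - b.percent_uf) < min_uf_gap
    return too_close_in_size and too_close_in_uf


def find_conflicts(spots: List[Spot]) -> List[Tuple[str, str]]:
    """Return the names of all pairs of spots that would overlap."""
    return [
        (a.name, b.name)
        for i, a in enumerate(spots)
        for b in spots[i + 1:]
        if spots_conflict(a, b)
    ]


if __name__ == "__main__":
    demo = [
        Spot("A", 150, 30.0),
        Spot("B", 153, 32.0),
        Spot("C", 153, 40.0),
        Spot("D", 550, 30.0),
        Spot("E", 565, 33.0),
    ]
    print(find_conflicts(demo))
```

Any fragment appearing in the returned pairs would be a candidate for re-design, as described above.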
The output files generated include a list of primer pair sequences with their positions, fragment sizes, predicted % UF of each fragment and the multiplex groups with the average annealing temperature. A simplified list of primers was specifically designed as an order form output file for a synthetic DNA provider. A third output file contains the data for a theoretical two-dimensional spot pattern, which can be imported into any graphic server. Depending on the gene used, a complete design can take between 30 and 120 min (see below).
The program was used to design TDGS tests for four genes, which were subsequently empirically verified. The retinoblastoma tumor suppressor gene RB1 was chosen because a two-dimensional design for this gene had already been carried out manually and empirically tested (18), which allowed the program to be tested by direct comparison. The tumor suppressor gene TP53, the DNA mismatch repair gene MLH1 and the breast and ovarian cancer susceptibility gene BRCA1 were chosen because of the wide applicability of a cost-effective, foolproof mutation detection test for these genes. Figure 4 shows the theoretical and the empirical two-dimensional spot patterns for the RB1, TP53, MLH1 and BRCA1 genes. Overall, the predicted positions of the fragments corresponded well with the empirical ones, with some exceptions. Figure 5 shows the deviations from the predicted positions of all spots of the four genes as a function of GC content of the fragment. Interestingly, the positions of both the relatively GC-rich and the relatively AT-rich fragments appeared to be the most difficult to predict accurately; all nine fragments with deviations >5% UF were either AT-rich (three) or GC-rich (six). The situation for an individual gene is a reflection of this GC-dependent predictability (Fig. 5). It should be noted that in most cases (87 of 96) the difference between predicted and empirically determined position was less than the 5% UF adopted for the spot distribution routine of the program.
The retinoblastoma tumor suppressor gene RB1. The automatically generated design for RB1 included a total of 27 fragments for all exons except exon 1 and took 30 min to make, excluding importing the sequence (Fig. 4a). Like all TDGS designs, it is based on long distance PCR fragments as template, in this case a multiplex of six long fragments (22). Like many of its counterparts, exon 1 of RB1 is very GC-rich and difficult to include in one set of DGGE conditions. Its inclusion in this test would lead to clustering of all other exons in the lower melting regions as compared with the isolated presence of one spot at a melting temperature of 85°C. Inclusion of exon 1 in TDGS tests is possible by making a large temperature jump, i.e. using a step gradient. We have made no attempt to do that and for the moment have left exon 1 out of the design.
A comparison between the earlier and the present version of the RB1 design showed differences for only a few fragments: exons 8, 15/16 and 17. For these exons manual design proved to be difficult and was eventually cut short to avoid endless trial and error. Although all possible mutations were detected using the old design (18), the alternatives found by the computer program were better and logical; exon 8, for example, was divided into two fragments (28).
The TP53 tumor suppressor gene. Because of its relatively small size, ample information is available on mutations in TP53. A TDGS test was designed for the coding region of TP53, i.e. exon 2 to part of exon 11 (Fig. 4b). The total design time (excluding exon indication) was 30 min. Careful manual checking of the design revealed no errors. Exon 4 appeared to consist of three melting domains and it was not possible to re-position the GC clamps so as to obtain one melting domain. Hence, the program split that exon into three fragments. In view of the overlap among the three fragments, a separate multiplex group had to be created for the middle fragment (4.2). The final test design consists of one long PCR fragment of 8.6 kb, which served as the template for two multiplex groups. The TP53 design has also been empirically tested for its capacity to detect mutations in a blind study. As with the RB1 test (18), all known mutations were detected and, thus far, there is no evidence for false positives (28).
The MLH1 DNA mismatch repair gene. The design for MLH1 took 30 min (excluding exon indication). Figure 4c shows the theoretical and empirical TDGS pattern for the MLH1 gene. Because exons 11 and 12 had to be subdivided into overlapping fragments, two multiplex groups are currently being used, with the long PCR carried out as a 4-plex PCR. As in many other genes, exon 1 of MLH1 is GC-rich and, hence, was found to melt at a much higher % UF than most of the other fragments. Thus far, a total of 41 coded samples with previously identified mutations have been analyzed blind with 100% concordance (29).
The breast and ovarian cancer susceptibility gene BRCA1. The tumor suppressor gene BRCA1 contains 24 exons, of which exon 11 contains ∼60% of the coding region. Figure 4d shows the theoretical and empirical two-dimensional patterns for BRCA1. Of all two-dimensional designs discussed this was the most difficult (total design time 2 h), the main reason being the need to make overlapping fragments for the 3.4 kb exon 11. Pre-amplification was accomplished by one 7-plex long PCR. Using the long PCR amplicons as template all 24 exons were amplified in a total of 37 fragments distributed over five multiplex groups. The overlap and sometimes short distances from fragment to fragment necessitated the use of so many multiplex short PCR groups. The non-coding exons 1a, 1b and the non-coding part of exon 24 were excluded. Evaluation of this test design using a panel of coded samples with previously identified mutations is currently ongoing. Thus far, mutations and polymorphisms have been detected in exons 2, 8, 11, 16, 20 and 23.
Now that we are beginning to understand the relationship between genetic defects and their phenotypic expression as complex human diseases, DNA testing for mutational differences between individuals has the potential to individualize health care. Individual genetic variation should have a high predictive value, not only in terms of disease susceptibility but also as a source of information on such variables as drug response and disease progression. However, before these new advances in genetics can be exploited, population-based studies are necessary to yield the integrated relational databases of gene mutational variation and disease phenotypes to optimally use genetic testing results to the benefit of the patient. Thus far, there is a technological gap between disease gene identification and large scale genetic testing. This is dramatically illustrated by the low volume of BRCA1 and BRCA2 genetic testing, in spite of the fact that mutations in these genes affect a substantial fraction of women with a high genetically determined risk of breast cancer. The high cost involved in complete sequencing of these large genes is mainly responsible for the low testing volume and essentially constrains the necessary large scale genetic epidemiology, e.g. as part of clinical intervention trials. Hence the need for cost-effective yet highly accurate gene mutational scanning methods.
As demonstrated by many laboratories, DGGE is a solid and proven means of detecting all possible mutations in a given gene (7–9,12–14). We have previously demonstrated that this system can be adapted to an efficient parallel system for analyzing genes or combinations of genes for all possible mutations at a cost that varies from US$ 30 to US$ 150 for genes or gene combinations involving 15–100 PCR fragments (17,18). The difficulty that remained was the generally recognized prolonged optimization time necessary before a DGGE-based test can be applied in practice in a reproducible manner. In the present paper we address this difficulty and demonstrate that it is possible to completely pre-design a two-dimensional gene test yielding optimized primer pairs for both PCR and DGGE within minutes.
Pre-amplification of target sequences by long distance PCR allows for the extensive multiplex PCR applied in TDGS, which is especially useful for genes with exons spread through a large area of genomic DNA. TDGS can also be applied to consecutive segments of coding DNA, such as the mitochondrial genome and cDNA; this requires the design of overlapping fragments. A potential disadvantage of TDGS is that long distance PCR requires good quality DNA and/or RNA. The method is therefore likely to be less suitable for DNA from paraffin blocks and its application in genetic epidemiology will be limited to fresh or flash-frozen material. It can be argued, however, that extended optimization times and single fragment PCR from such samples would preclude large scale studies anyway.
There are two potentially major implications of the computerized test design as presented in this paper. First, as demonstrated, the program has allowed us to develop optimal test designs for a number of genes that have already been discovered and found to be associated with cancer. These tests are currently used in population-based studies and will soon be used on a limited scale in clinical testing. Their cost compares favorably with the only high accuracy alternative of nucleotide sequencing. More importantly, computerized test design opens up the possibility of creating an optimal set of conditions to analyze multiple genes in parallel, reducing the cost of testing even further. Examples are BRCA1/BRCA2 and MLH1/MSH2 for hereditary breast cancer and colon cancer respectively. The direct practical consequence of the test design program, therefore, is implementation for the first time of comprehensive and practical genetic testing services for major genetic diseases at low cost. Second, in order to apply genetic testing on a population basis, high throughput low cost technology is required. Two-dimensional gene scanning is currently the only low cost test that can be applied in high efficiency mode. Automation of the test design as presented in this paper allows one to rapidly include newly discovered genes in multigene test formats. As stated by Drews (30), there are probably at least 100 common human diseases, most of which have a genetic component. Assuming that on average sequence information on 5–10 genes is sufficiently informative to significantly contribute to management of such diseases, one can envisage a need for population-based mutational scanning of 500–1000 genes.
Many of these genes have already been identified and, based on the current rate of progress in gene discovery programs and the speed at which the human genome program is moving to completion, it is not unlikely that within a decade virtually all major human disease genes will have been identified. Based on our present results, it is perfectly feasible to develop TDGS test designs for all disease genes that have thus far been identified, alone or in combinations. The practical results thus far obtained with TDGS of CFTR (17), RB1 (18), TP53 (28), MLH1 and MSH2 (19,20,29) suggest that cost-effective screening of selected populations for all possible sequence variants, with the ultimate aim of creating a relational database of diagnostic and prognostic information, is no longer a remote possibility.
We thank Drs Charles H.C.M.Buys, Robert M.W.Hofstra and Leonard S.Lerman for helpful comments during this study and critical reading of the manuscript. We are grateful to Roger Stern and Robert Bishop (Beltronics Inc., Newton, USA) for programming the different design steps into a Windows-based format. This work was supported by the Massachusetts Department of Public Health (grant 3408799DO47), the Harvard Nathan Shock Center of Excellence in the Basic Biology of Aging (grant 1P30AG13314-01) and research funds of Beltronics Inc. C.E. is the Lawrence and Susan Marx Investigator in Human Cancer Genetics. Inquiries regarding the use of the computer program described in this paper should be directed to the senior author of this paper (Email: email@example.com).