## Abstract

We have developed statistical models for estimating the failure rate of polymerase chain reaction (PCR) primers using 236 primer sequence-related factors. The model involved 1314 primer pairs and is based on more than 80 000 PCR experiments. We found that the most important factor in determining PCR failure is the number of predicted primer-binding sites in the genomic DNA. We also compared different ways of defining primer-binding sites (fixed length word versus thermodynamic model; exact match versus matches including 1–2 mismatches). We found that the most efficient prediction of PCR failure rates can be achieved using a combination of four factors (number of primer-binding sites counted in different ways plus GC% of the primer) combined into single statistical model GM1. According to our estimations from experimental data, the GM1 model can reduce the average failure rate of PCR primers nearly 3-fold (from 17% to 6%). The GM1 model can easily be implemented in software to premask genome sequences for potentially failing PCR primers, thus improving large-scale PCR-primer design.

## INTRODUCTION

During recent decades, the polymerase chain reaction (PCR) has become a very widely used method, routinely performed in various molecular biology applications. Ideally, there should be a pair of unique primers that amplify the desired target sequence selectively with maximum yield. A common way to deal with PCR problems is to optimize the conditions, such as the concentrations of reagents in the PCR buffer, or to modify the primer annealing temperature ( 1 ). In addition, careful design of primers is crucial for the success of PCR. The list of factors that may influence the success of PCR is long and has been studied by several investigators: the concentrations of the PCR buffer reagents, the primer length and the GC contents of primers and template, simple repeats in the primer sequence, stable secondary structures of both primer and product sequences, etc. ( 2–5 ). However, the circumstances that affect PCR and the molecular mechanisms behind it have become considerably better understood over the years. The most recent studies utilizing PCR incorporate more factors in primer design; for instance, the nucleotide composition of the primer 3′-end and the size of that important region, the nucleotide composition of the PCR product, the secondary structure of the PCR template around the primer annealing sites, the melting temperature (T _{m} ) of the product and the regionalized GC content of the PCR template ( 6–8 ). In addition, some studies have focused specifically on certain classes of factors such as the stability and the uniqueness of the primer 3′-end ( 9 , 10 ) or have concentrated on comparing the software available for PCR primer design ( 11 ).

Selection of nonoptimal primers can lead to amplification of undesired regions or no amplification at all. Careful choice of primers is even more crucial in high-throughput assays, where the total cost of the primers is high. Previously, we have worked on algorithms for masking repeats and from these we have developed a fast algorithm for finding and masking short repeats in the human genome ( 12 ). Following this earlier work, our aim here was to carry out a comprehensive statistical analysis of potential descriptive factors that could influence the failure rates of PCR primers. We prefer to use the term ‘failure rate’ instead of ‘success rate’ because only failures of PCR can be directly predicted from primer sequences. Success of PCR is dependent on many other factors besides primer properties (reagents, equipment, human errors, etc.) and therefore inherently less predictable.

The typical factors influencing the PCR failure rate include purity of genomic DNA, the length of DNA fragments, the precision of DNA concentration measurements, the presence of single nucleotide polymorphisms (SNPs) and the uniqueness of PCR primer sequences. In this article, we present a study of the most significant primer and product sequence-specific factors in reducing the PCR failure rate to offer scenarios for improving masking algorithms for the regions in genomes that might have high PCR failure rates.

## MATERIALS AND METHODS

### Computation

All programs described in this article were executed on a 2.66 GHz Intel Xeon™ processor machine with 6 GB of RAM. For statistical analysis and modeling, SAS software version 9.1.3 (SAS OnlineDoc® 9.1.3., SAS Institute Inc. 2004, Cary, NC, USA) was used. All calculations were performed on assembled chromosome sequences derived from NCBI build 35 ( 13 ). SNP data were retrieved from dbSNP build 125. Human exon and intron sequences were retrieved from UCSC Genome Browser (hg18) using Table Browser tool ( 14 ).

The number of exact-match binding sites for the PCR primers was calculated by the GenomeTester program from the GENOMEMASKER software package ( 12 ). The number of binding sites including mismatches was calculated using the BLAST program ( 15 ) with filtering turned off (-F F). The number of primer-binding sites (exact binding or binding with mismatches) based on thermodynamic affinity was found by the FASTAGREP program (executable available from http://bioinfo.ebc.ee/download/ ), which was executed with default parameter values.

To mask PCR product sequences, GenomeMasker, DUST (ftp://ftp.ncbi.nlm.nih.gov/pub/tatusov/dust/) and RepeatMasker (Smit, A.F.A., Hubley, R. and Green, P. http://www.repeatmasker.org/ ) were used. With RepeatMasker, the RepBase Update ( 16 ) 8.12 library of repeated motifs in the human genome was used. RepeatMasker was executed with the following sensitivity parameters: –s, –q and –qq. DUST was used with default parameters. The masking program GenomeMasker was used with masking letter parameter ‘l’ (lower-case masking) and masking-type parameter ‘target 500 501’.

The T _{m} of each primer was calculated using current thermodynamic tables ( 17 ). The secondary structures of PCR primers and products were calculated using MFOLD 3.2 ( 18 ) and primer–primer dimer bindings were calculated with MULTIPLX ( 19 ). Three additional factors (PROD_AUCGC, PROD_AUCGC2 and PROD_RATIOGC_100) for PCR products were calculated using software published by Benita and co-workers ( 7 ).

### Experimental datasets

Experimental PCR data originated from two large-scale datasets. One was obtained from analysis of 1278 SNPs on human chromosome 22 ( 20 ). In our analysis, we used 1014 out of 1278 primer pairs, most of which were designed with Primer3 software ( 21 ). Approximately 20% of the primer sequences were retrieved from the scientific literature or were designed manually. In the second dataset, we had 300 primer pairs designed by Primer3 from randomly selected regions of the human genomic DNA and we performed 10 experiments with each primer pair. Both primer sets were nonredundant; all primer sequences were unrelated (unique and nonoverlapping). Primer3 was used with no repeat libraries. No other repeat detection or masking procedures were used in the primer design.

PCR primers in these datasets had the following properties. T _{m} varied between 48°C and 75°C, with an average of 60°C. GC contents varied between 27 and 80%, with an average of 56%. Primer lengths varied between 18 and 25 nucleotides (average 21 bp) and PCR product size was 100–600 bp.

### Experimental conditions and testing

PCR conditions for amplifying products were as follows: 15 min preincubation at 95°C, followed by seven touchdown cycles of 20 s at 95°C, 30 s at 66°C (decreasing 1°C per cycle), 30 s at 72°C; seventeen cycles of 20 s at 95°C, 30 s at 58°C, 30 s at 72°C; sixteen cycles of 20 s at 95°C, 30 s at 56°C, 30 s at 72°C and final extension at 72°C for 7 min. DNA was extracted from human blood cells by a modified salting-out method ( 22 ). Each PCR reaction contained 15 ng of genomic DNA in a volume of 10 μl. All PCR reactions were conducted by Asper Biotech Ltd., Tartu, Estonia.

In both sets, each primer pair was tested experimentally at least 10 times (84 142 reactions in total) using DNAs from different individuals. The PCR primer failure rate was estimated by analyzing agarose gel electropherograms. Reactions were counted as positive if they gave clear bands of the expected size. Reactions with smears and multiple bands were counted as negative. The average PCR failure rate for a given PCR primer pair was expressed as the fraction of negative results.

### Statistical methods and development of models

The generalized linear model (GLZ) is a generalization of the general linear model (GLM) that can be used to predict responses both for dependent variables with discrete distributions and for dependent variables that are nonlinearly related to the predictors ( 23 ). The procedure of the SAS software GENMOD is based on the theory of GLZs. In the GENMOD procedure, both the number of successful trials and the total number of events were used, assuming a binomial distribution and logit link function.

To maximize the usage of limited numbers of primers we joined our two datasets into one and thereafter divided the dataset randomly into 10 nonoverlapping segments. In such a sampling procedure, one single segment can be treated as a ‘control set’. For each control set, a ‘training set’ was created from the remaining primer pairs. Ten different models were developed for each training set and tested against the given control set. The significance of the descriptive factors in assessing the failure of PCR was measured by χ ^{2} and by the corresponding *P* -values. Only factors that were statistically significant in both the training and control sets were considered further in the composition of the final models. Factors were included in the final model building if they were statistically significant in at least 5 sub-models out of 10. The *type1* option of the GENMOD procedure (Type I) was used to perform forward stepwise analysis (a cutoff value of *P* < 0.0001 was used) to build the final models. The values of the intercept and regression coefficients were retrieved from the models of training sets. Probabilities of failure ( *P* ) were predicted for each primer pair of the control set on the basis of the parameter estimates retrieved. The formula used by GENMOD procedure for calculating the probability *P* of a failure was

*α*is the intercept (constant),

*β*are regression coefficients (estimates),

_{i}*X*are the independent variables (factor values) and

_{i}*k*denotes the number of factors used in the given model. The decrease of experimental failure rates in the datasets tested was used as a measure for comparing different models.

## RESULTS

The major aims of the current study were: (i) to find the main factors related to the primer sequence that allow to predict the PCR failure rate; (ii) to compare statistical models of different complexities for their ability to predict the failure rate of the PCR reaction conducted with genomic DNA sequences. Models differing in complexity were compared because of the need for a model for large-scale applications, such as masking the low-success rate regions in the entire human genome. Therefore, the resulting algorithm should be able to calculate the failure rate of every possible PCR primer in the eukaryotic genome in reasonable time.

### Factors used to predict PCR failure rate

Two hundred and thirty-six different factors describing primer or primer pair properties were selected for statistical analysis ( Table 1 ). The values of these factors were calculated for each primer pair and for each predicted product sequence (only one PCR product per primer pair was considered here).

**Table 1.**

Factor description | Factor name | GM1 | GM1MM | GM2 | GM2MM | PCR | Number of factors |
---|---|---|---|---|---|---|---|

The number of binding sites of PCR primers (exact, with one and two mismatches allowed) with different word sizes from the 3′-end and from random positions in the primer sequence. | MAX/MIN[ 8 , 10 , 12 , 14 , 15 , 16 ]; | + | + | + | 12 | ||

MAX/MIN _FULL; | + | 2 | |||||

MAX/MIN[ 8 , 10 , 12 , 14 , 15 , 16 ]_RAND; | + | 12 | |||||

MAX/MIN[ 12 , 14 , 15 , 16 ]_1MM; | + | + | 8 | ||||

MAX/MIN_FULL_1MM; | + | 2 | |||||

MAX/MIN[ 12 , 14 , 15 , 16 ] _1MM _RAND; | + | 8 | |||||

MAX/MIN[ 12 , 14 , 15 , 16 ]_2MM; | + | + | 8 | ||||

MAX/MIN_FULL_2MM; | + | 2 | |||||

MAX/MIN[ 12 , 14 , 15 , 16 ] _2MM _RAND | + | 8 | |||||

The number of binding sites of PCR primers (exact, with one and two mismatches allowed) with variable word sizes from the 3′-end and from random positions in the primer sequence. The word size for each primer is extended until three different free energy levels are achieved: ΔG < −10, −15, −20 kcal/mol. | MAX/MIN_DG[ 10 , 15 , 20 ]; | + | + | + | 6 | ||

MAX/MIN_DG[ 10 , 15 , 20 ]_RAND; | + | 6 | |||||

MAX/MIN_DG[ 10 , 15 , 20 ]_1MM; | + | + | 6 | ||||

MAX/MIN_DG[ 10 , 15 , 20 ]_1MM RAND; | + | 6 | |||||

MAX/MIN_DG[ 10 , 15 , 20 ]_2MM; | + | + | 6 | ||||

MAX/MIN_DG[ 10 , 15 , 20 ]_2MM _RAND | + | 6 | |||||

The number of all binding sites of PCR primers (exact, with one and two mismatches allowed) counted with NCBI BLASTN (-F F). | [MAX,MIN]_BLASTALL | + | 2 | ||||

PCR primer length | PRIM_LENGTH_[MAX,MIN] | + | 2 | ||||

GC content of PCR primer with different word sizes from the 3′-end and full primer | PRIM_GC_PRC_[ 8 , 12 , 16 ]_[MAX,MIN]; | + | + | + | + | + | 6 |

PRIM_GC_PRC_[MAX,MIN] | + | 2 | |||||

The free energies of different subsequences from the primer 3′-end | PRIM_ DG[ 3 , 4 , 5 , 6 , 7 , 8 , 9 ]_[MAX,MIN] | + | + | + | + | + | 14 |

DUST score of PCR primer | PRIM_DUS_[MAX,MIN] | + | 2 | ||||

The strongest free energies of the dimers of primers alone and in pairs using local and global alignment approaches | MAX/MIN_PRIM_END1; | + | 2 | ||||

PRIM_PAIR_END1; | + | 1 | |||||

MAX/MIN_PRIM_END2; | + | 2 | |||||

PRIM_PAIR_END2; | + | 1 | |||||

MAX/MIN_PRIM_ANY; | + | 2 | |||||

PRIM_PAIR_ANY | + | 1 | |||||

The strongest secondary structure of the PCR primers in a given pair predicted with MFOLD at 55°C | [MAX,MIN]_PRIM_MFOLD | + | 2 | ||||

The T _{m} of the primer, difference of melting temperatures between the two primers in a given pair and the difference between annealing (used in PCR experiments) and melting temperature | TM_[MAX,MIN]; | + | 2 | ||||

TM_DIFF; | + | 1 | |||||

TM_TA_[MAX,MIN]_DIFF | + | 2 | |||||

Total number of SNPs in both primers and the position of the SNP closest to the 3′-end | NO_OF_SNPS; | + | + | + | + | + | 1 |

ALL_POS_FROM_3_END; | + | + | + | + | + | 1 | |

NO_OF_VALID_SNPS; | + | + | + | + | + | 1 | |

VALID_POS_FROM_3_END | + | + | + | + | + | 1 | |

The terminal and last two nucleotides of primer sequence, also the first nucleotide of amplicon following the primer sequence. These are categorical values (0 – given nuc. is not present in both primers, 1 – is present at least in one primer, 2 – is present in both primers). | PRIM_LAST_ONE_NUC_[A,C,G,T]; | + | 4 | ||||

PRIM_LAST_TWO_NUC_[AA,AC, AG,AT,CC,CG,CT,GG,GT,TT]; | + | 10 | |||||

PROD_FIRST_ONE_NUC_[A,C,G,T] | + | 4 | |||||

The number of predicted products with maximum length of 1000, 3000 and 10 000 nt for exact binding sites with different word sizes from the 3′-end and from random positions in the primer sequence. | PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_1000; | + | 6 | ||||

PROD_FULL_1000; | + | 1 | |||||

PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_1000_RAND; | + | 6 | |||||

PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_3000; | + | 6 | |||||

PROD_FULL_3000; | + | 1 | |||||

PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_3000_RAND; | + | 6 | |||||

PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_10000; | + | 6 | |||||

PROD_FULL_10000; | + | 1 | |||||

PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_10000_RAND | + | 6 | |||||

The number of predicted products with maximum length of 1000, 3000 and 10 000 nt for exact binding sites with variable word sizes from the 3′-end and from random positions in the primer sequence. The word size for each primer is extended until three different free energy levels are achieved: ΔG < −10, −15, −20 kcal/mol. | PROD_DG[ 10 , 15 , 20 ]_1000; | + | 3 | ||||

PROD_DG[ 10 , 15 , 20 ]_1000_RAND; | + | 3 | |||||

PROD_DG[ 10 , 15 , 20 ]_3000; | + | 3 | |||||

PROD_DG[ 10 , 15 , 20 ]_3000_RAND; | + | 3 | |||||

PROD_DG[ 10 , 15 , 20 ]_10000; | + | 3 | |||||

PROD_DG[ 10 , 15 , 20 ]_10000_RAND | + | 3 | |||||

PCR product length | PROD_LENGTH | + | 1 | ||||

GC content of PCR product | PROD_GC_PRC | + | 1 | ||||

Area under the GC curve and above 65% of the PCR product (7) | PROD_AUCGC | + | 1 | ||||

Number of GC windows with values above 65% divided by the length of the PCR product (×100) (7) | PROD_RATIOGC_100 | + | 1 | ||||

PROD_AUCGC × PROD_RATIOGC (7) | PROD_AUCGC2 | + | 1 | ||||

The strongest secondary structure of PCR product predicted with MFOLD at 55°C | PROD_MFOLD_55 | + | 1 | ||||

Percentage of masked nucleotides of PCR product using DUST | PROD_DUST_PRC | + | 1 | ||||

Percentage of masked nucleotides of PCR product using Repeat Masker with different sensitivity parameters (-s, -q, -qq) | PROD_RMs_PRC; | + | 1 | ||||

PROD_RMq_PRC; | + | 1 | |||||

PROD_RMqq_PRC | + | 1 | |||||

Percentage of masked nucleotides of PCR product using GenomeMasker with different word sizes (exact matches) | PROD_GM[ 8 , 10 , 12 , 14 , 16 ]_PRC | + | 5 | ||||

Total number of factors | 236 |

Factor description | Factor name | GM1 | GM1MM | GM2 | GM2MM | PCR | Number of factors |
---|---|---|---|---|---|---|---|

The number of binding sites of PCR primers (exact, with one and two mismatches allowed) with different word sizes from the 3′-end and from random positions in the primer sequence. | MAX/MIN[ 8 , 10 , 12 , 14 , 15 , 16 ]; | + | + | + | 12 | ||

MAX/MIN _FULL; | + | 2 | |||||

MAX/MIN[ 8 , 10 , 12 , 14 , 15 , 16 ]_RAND; | + | 12 | |||||

MAX/MIN[ 12 , 14 , 15 , 16 ]_1MM; | + | + | 8 | ||||

MAX/MIN_FULL_1MM; | + | 2 | |||||

MAX/MIN[ 12 , 14 , 15 , 16 ] _1MM _RAND; | + | 8 | |||||

MAX/MIN[ 12 , 14 , 15 , 16 ]_2MM; | + | + | 8 | ||||

MAX/MIN_FULL_2MM; | + | 2 | |||||

MAX/MIN[ 12 , 14 , 15 , 16 ] _2MM _RAND | + | 8 | |||||

The number of binding sites of PCR primers (exact, with one and two mismatches allowed) with variable word sizes from the 3′-end and from random positions in the primer sequence. The word size for each primer is extended until three different free energy levels are achieved: ΔG < −10, −15, −20 kcal/mol. | MAX/MIN_DG[ 10 , 15 , 20 ]; | + | + | + | 6 | ||

MAX/MIN_DG[ 10 , 15 , 20 ]_RAND; | + | 6 | |||||

MAX/MIN_DG[ 10 , 15 , 20 ]_1MM; | + | + | 6 | ||||

MAX/MIN_DG[ 10 , 15 , 20 ]_1MM RAND; | + | 6 | |||||

MAX/MIN_DG[ 10 , 15 , 20 ]_2MM; | + | + | 6 | ||||

MAX/MIN_DG[ 10 , 15 , 20 ]_2MM _RAND | + | 6 | |||||

The number of all binding sites of PCR primers (exact, with one and two mismatches allowed) counted with NCBI BLASTN (-F F). | [MAX,MIN]_BLASTALL | + | 2 | ||||

PCR primer length | PRIM_LENGTH_[MAX,MIN] | + | 2 | ||||

GC content of PCR primer with different word sizes from the 3′-end and full primer | PRIM_GC_PRC_[ 8 , 12 , 16 ]_[MAX,MIN]; | + | + | + | + | + | 6 |

PRIM_GC_PRC_[MAX,MIN] | + | 2 | |||||

The free energies of different subsequences from the primer 3′-end | PRIM_ DG[ 3 , 4 , 5 , 6 , 7 , 8 , 9 ]_[MAX,MIN] | + | + | + | + | + | 14 |

DUST score of PCR primer | PRIM_DUS_[MAX,MIN] | + | 2 | ||||

The strongest free energies of the dimers of primers alone and in pairs using local and global alignment approaches | MAX/MIN_PRIM_END1; | + | 2 | ||||

PRIM_PAIR_END1; | + | 1 | |||||

MAX/MIN_PRIM_END2; | + | 2 | |||||

PRIM_PAIR_END2; | + | 1 | |||||

MAX/MIN_PRIM_ANY; | + | 2 | |||||

PRIM_PAIR_ANY | + | 1 | |||||

The strongest secondary structure of the PCR primers in a given pair predicted with MFOLD at 55°C | [MAX,MIN]_PRIM_MFOLD | + | 2 | ||||

The T _{m} of the primer, difference of melting temperatures between the two primers in a given pair and the difference between annealing (used in PCR experiments) and melting temperature | TM_[MAX,MIN]; | + | 2 | ||||

TM_DIFF; | + | 1 | |||||

TM_TA_[MAX,MIN]_DIFF | + | 2 | |||||

Total number of SNPs in both primers and the position of the SNP closest to the 3′-end | NO_OF_SNPS; | + | + | + | + | + | 1 |

ALL_POS_FROM_3_END; | + | + | + | + | + | 1 | |

NO_OF_VALID_SNPS; | + | + | + | + | + | 1 | |

VALID_POS_FROM_3_END | + | + | + | + | + | 1 | |

The terminal and last two nucleotides of primer sequence, also the first nucleotide of amplicon following the primer sequence. These are categorical values (0 – given nuc. is not present in both primers, 1 – is present at least in one primer, 2 – is present in both primers). | PRIM_LAST_ONE_NUC_[A,C,G,T]; | + | 4 | ||||

PRIM_LAST_TWO_NUC_[AA,AC, AG,AT,CC,CG,CT,GG,GT,TT]; | + | 10 | |||||

PROD_FIRST_ONE_NUC_[A,C,G,T] | + | 4 | |||||

The number of predicted products with maximum length of 1000, 3000 and 10 000 nt for exact binding sites with different word sizes from the 3′-end and from random positions in the primer sequence. | PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_1000; | + | 6 | ||||

PROD_FULL_1000; | + | 1 | |||||

PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_1000_RAND; | + | 6 | |||||

PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_3000; | + | 6 | |||||

PROD_FULL_3000; | + | 1 | |||||

PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_3000_RAND; | + | 6 | |||||

PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_10000; | + | 6 | |||||

PROD_FULL_10000; | + | 1 | |||||

PROD[ 8 , 10 , 12 , 14 , 15 , 16 ]_10000_RAND | + | 6 | |||||

The number of predicted products with maximum length of 1000, 3000 and 10 000 nt for exact binding sites with variable word sizes from the 3′-end and from random positions in the primer sequence. The word size for each primer is extended until three different free energy levels are achieved: ΔG < −10, −15, −20 kcal/mol. | PROD_DG[ 10 , 15 , 20 ]_1000; | + | 3 | ||||

PROD_DG[ 10 , 15 , 20 ]_1000_RAND; | + | 3 | |||||

PROD_DG[ 10 , 15 , 20 ]_3000; | + | 3 | |||||

PROD_DG[ 10 , 15 , 20 ]_3000_RAND; | + | 3 | |||||

PROD_DG[ 10 , 15 , 20 ]_10000; | + | 3 | |||||

PROD_DG[ 10 , 15 , 20 ]_10000_RAND | + | 3 | |||||

PCR product length | PROD_LENGTH | + | 1 | ||||

GC content of PCR product | PROD_GC_PRC | + | 1 | ||||

Area under the GC curve and above 65% of the PCR product (7) | PROD_AUCGC | + | 1 | ||||

Number of GC windows with values above 65% divided by the length of the PCR product (×100) (7) | PROD_RATIOGC_100 | + | 1 | ||||

PROD_AUCGC × PROD_RATIOGC (7) | PROD_AUCGC2 | + | 1 | ||||

The strongest secondary structure of PCR product predicted with MFOLD at 55°C | PROD_MFOLD_55 | + | 1 | ||||

Percentage of masked nucleotides of PCR product using DUST | PROD_DUST_PRC | + | 1 | ||||

Percentage of masked nucleotides of PCR product using Repeat Masker with different sensitivity parameters (-s, -q, -qq) | PROD_RMs_PRC; | + | 1 | ||||

PROD_RMq_PRC; | + | 1 | |||||

PROD_RMqq_PRC | + | 1 | |||||

Percentage of masked nucleotides of PCR product using GenomeMasker with different word sizes (exact matches) | PROD_GM[ 8 , 10 , 12 , 14 , 16 ]_PRC | + | 5 | ||||

Total number of factors | 236 |

Factors marked by ‘+’ under a model are used in the building of this model.

Calculation of primer-binding sites can be modeled in various ways as shown in Table 1 : for example, using a primer sequence substring of fixed or variable length (based on the thermodynamic approach), finding matches for a string with 100% identity or with mismatches. One of the major goals of this study was to determine whether the number of binding sites of (a substring of) the primer including mismatches gives a better prediction of the PCR failure rate than models that use 100% identity for model binding. Another important goal was to determine whether modeling the primer-binding site thermodynamically would improve the prediction of PCR failure rate.

We also included factors that are not directly related to the parameters of the number of primer-binding sites: primer length, GC% of primers, Gibbs free energy of possible secondary structures and primer–dimers and *T*_{m} of primers. We have also analyzed as factors the significance of known SNPs in primer sequences and energetically stable perfect duplexes composed of the last nine bases of the 3′ end of the primer. The SNPs may cause allele-specific PCR or possible failure in primer-template hybridization.

Factors specific to primer pairs include several parameters associated with the PCR product: the number of predicted PCR products, the length of the product, GC% of the product and secondary structures within the product. Other factors under study are regionalized GC contents and the number of repeat-masked nucleotides within the template DNA counted by different methods. The nucleotide combinations of primer 3′-end and the first amplicon nucleotide following immediately the primer sequence were also studied.

### Five models for predicting PCR failure rate

We have created five different types of models based on the complexity of the calculation of factors included. Model complexity can be measured by several parameters. The most important is the time needed for computing the values of factors in a given model. The calculation of primer-binding sites including mismatches takes more time than the calculation of exact matches because of the substitutions required in a primer sequence when comparing the primer with genomic DNA. For example, there are 49 different sequence variants for a word containing 16 nucleotides (word size = 16) with one mismatch at any position. Executing brute-force scanning of 16-mer oligonucleotides against a DNA template without using any heuristics increases the search time more than 40-fold. To make things more complicated, determination of the free-energy levels of the oligonucleotides makes these calculations even slower. Such thermodynamic-binding site modeling allows us to define a word size that is biophysically more meaningful than fixed length. A sequence containing mostly A and T nucleotides has higher free-energy values than a GC rich sequence. In a PCR reaction under fixed conditions, the former sequence requires more nucleotides identical with the template to hybridize in the annealing process than the latter. Therefore, variable word length must be used for searching binding sites, and more time is required to calculate these sites from genomic DNA.

Factors included in the first four models (described below) contain the properties that can be used without prior information about the primer pairs; we can use a single primer sequence to calculate the values of these factors. Thus, they are suitable for premasking low-success rate regions in the genomic DNA and are therefore called GenomeMasker (GM*) models. The last model requires full sequences of both primers and the PCR product for calculating the values of all factors; therefore it can be considered as an e-PCR model, capable of predicting the outcome of a given primer pair. The overlap of factors between the different models is shown in Figure 1 . Ten factors that were used in all models are: GC content of the 3′-end of a primer (8, 12 and 16 nt) and the number of SNPs and their positions in the primer sequence (16 nt from 3′-end). Additional factors included in different models were:

**Figure 1.**

**Figure 1.**

*GM1 model* is the simplest model, containing 22 different factors. These include parameters associated with the number of exact binding sites in the human genome, modeled by fixed word length.

*GM1MM model* contains basically the same factors, but binding sites with one and two mismatches are also present (38 factors in total). Binding site modeling within this model is similar to many primer design programs in which primer specificity is determined using BLAST software.

*GM2 model* is the first of the thermodynamic models that contains binding site counts (exact matches) with variable length word sizes using three levels of Gibbs free energy: Δ *G*_{37} < −10, −15, −20 kcal/mol (16 factors in total). Those energy levels correspond to the following average word sizes in the human genome: 10 (min = 6, max = 15) 14 (min = 8, max = 22) and 18 (min = 10, max = 29) nt, respectively.

*GM2MM model* includes the counts of binding sites with variable-length words using three free energy levels and one or two mismatches (28 factors in total).

*PCR model* was built from all the factors examined in this study (236 in total).

Selection of the most significant factors for each model and the building of the final models were achieved by 10-fold cross-validation. We divided our combined experimental dataset into 10 subsets as described in the Methods section. For each sub-model, the best factors present in at least half the cross-validated datasets were included in building the final models. Their order in the model was based on the χ ^{2} -values over the whole dataset.

The top four major factors in each model are shown in Table 2 . Including more than four factors in a model did not significantly improve its prediction power (see below). The first two (and thus the most important) predictors for each model are related to the maximum number of primer-binding sites in the human genome ( Table 2 ). Other factors that improve the PCR model are PCR product length and the number of predicted PCR products. The GM1, GM1MM and GM2 models also included one factor that is related to primer GC content.

**Table 2.**

Factor name | χ ^{2} (1) | Model |
---|---|---|

MAX_DG15_2MM | 4862 | PCR |

MAX_DG15_RAND*MAX_DG15_RAND | 1374 | |

PROD_DG20_1000_RAND* PROD_DG20_1000_RAND | 378 | |

PROD_LENGTH*PROD_LENGTH | 298 | |

MAX_DG15_2MM | 4862 | GM2MM |

MAX_DG15_1MM*MAX_DG15_1MM | 1091 | |

MAX_DG20_2MM | 244 | |

MAX_DG20_1MM*MAX_DG20_1MM | 262 | |

MAX_DG20 | 4085 | GM2 |

MAX_DG15*MAX_DG15 | 1106 | |

PRIM_GC_PRC_8_MIN | 386 | |

MIN_DG20*MIN_DG20 | 277 | |

MAX15_2MM | 2854 | GM1MM |

MAX12_1MM*MAX12_1MM | 1681 | |

MAX12 | 1291 | |

PRIM_GC_PRC_16_MAX | 789 | |

MAX16 | 2507 | GM1 |

MAX15*MAX15 | 2394 | |

PRIM_GC_PRC_16_MAX | 1126 | |

MAX14 | 272 |

Factor name | χ ^{2} (1) | Model |
---|---|---|

MAX_DG15_2MM | 4862 | PCR |

MAX_DG15_RAND*MAX_DG15_RAND | 1374 | |

PROD_DG20_1000_RAND* PROD_DG20_1000_RAND | 378 | |

PROD_LENGTH*PROD_LENGTH | 298 | |

MAX_DG15_2MM | 4862 | GM2MM |

MAX_DG15_1MM*MAX_DG15_1MM | 1091 | |

MAX_DG20_2MM | 244 | |

MAX_DG20_1MM*MAX_DG20_1MM | 262 | |

MAX_DG20 | 4085 | GM2 |

MAX_DG15*MAX_DG15 | 1106 | |

PRIM_GC_PRC_8_MIN | 386 | |

MIN_DG20*MIN_DG20 | 277 | |

MAX15_2MM | 2854 | GM1MM |

MAX12_1MM*MAX12_1MM | 1681 | |

MAX12 | 1291 | |

PRIM_GC_PRC_16_MAX | 789 | |

MAX16 | 2507 | GM1 |

MAX15*MAX15 | 2394 | |

PRIM_GC_PRC_16_MAX | 1126 | |

MAX14 | 272 |

All factors are significant at *P* < 0.0001. Asterisks in factor names mark the polynomial regression of given independent variable. χ ^{2} -values illustrate the estimated simultaneous (Type I) effects of the best four factors on each model.

### Four factors are required for robust prediction of failure rate

The next step was to compare the predictive powers of these models using cross-validation datasets. The parameter estimates for the model formulae were calculated with SAS GENMOD procedure for each model using the whole dataset. The predictive power of each model was then measured and compared on the basis of the average fraction of failed primer pairs in those 10 control sets. It is always computationally easier to implement models with fewer factors. Therefore, we created graphs for each model with one to four factors to determine whether the additional factor improves the prediction of the model. The average failure rate of experimentally tested PCR pairs (predictive power of the model) was calculated over all control sets with increasing sensitivity for each model ( Figure 2 ). Sensitivity was defined as the percentage of primer pairs remaining after the failure-prediction model was applied to our cross-validation datasets. The cutoff values were raised from 0 to the given point, where the number of positive (remaining) primer pairs is at a predefined level (10, 20 and 30% etc.).

**Figure 2.**

**Figure 2.**

The predicted failure rate can be approximately halved in each model using only one factor. The simpler models, GM1 and GM1MM, which do not include thermodynamics, were not so successful if only a single factor was included ( Figure 2 A). This is correlated with the statistics in Table 2 , where the binding sites modeled by the thermodynamic approach also gave higher χ ^{2} -values than the exact-matching bindings. However, after more factors were added to the models, the picture was more homogeneous ( Figure 2 B); it seems that simpler models are just as effective in predicting the failure of primers as more complex models. The best model, GM1, helps to achieve a 3-fold (from 17% to 6%) decrease in the failure rate of primers in our dataset.

### Description of the major factors

Our models show that one of the major causes of failure of the PCR reaction is an excessive number of primer-binding sites. The dynamics of the best factors for failure rate is shown in Figure 3 . One can see that alternative binding sites increase the failure rate of the PCR reaction. The effect is very similar for binding sites modeled with or without mismatches and for modeling based on exact matches or thermodynamics ( Figure 3 A). Allowing a maximum of 10 binding sites for both primers in a pair, we can reduce the failure rate from 17% to 10% using only a single factor, which has the greatest impact on the PCR failure rate. To increase it further, we need to include more factors in our models, as described above.

**Figure 3.**

**Figure 3.**

However, the additional factors, such as the GC contents of 3′-ends of primers with different word sizes (8 and 16 nt), improve the simpler models to a level similar to the PCR model. Primers with high GC content tend to give higher PCR failure rates ( Figure 3 B). Although a greater number of predicted PCR products ( Figure 3 C) increases the PCR failure rate, adding this to the PCR model does not affect the failure rate significantly when the given model is compared with one and four factors ( Figure 2 ). A similar effect was observed with PCR product length.

### The GM1 model outperforms RepeatMasker

Primer design can also be conducted from repeat-masked genomic sequences. One purpose of our work was to find a good algorithm for masking regions of genomic sequences that can lead to the design of failing primers for large genomes. We estimated the failure rate of primers designed to the (i) unmasked genome, (ii) RepeatMasker masked genome, (iii) GenomeMasker masked genome and (iv) genome masked by GM1 with four factors, as described in this article ( Table 3 ). We also estimated the appropriate level of sensitivity for the GM1 model by selecting 1000 random regions from the human genome (each 1000 nucleotides long), masked those regions with each method and tried to design primers for the masked sequences with the Primer3 program. Then we recorded the fraction of masked regions for which at least one primer pair could be designed. This approach helps to select appropriate cutoff levels of parameter values for the GM1 model, with a good balance between successful primer design for any desired region of the genome and the PCR failure rate of the designed primers.

**Table 3.**

No masking | RepeatMasker | GenomeMasker | GM1 with 4 factors | ||||||
---|---|---|---|---|---|---|---|---|---|

10% | 20% | 30% | 40% | 50% | |||||

Failure rate ^{a} | 16.9 | 13.8 | 10.2 | 6.0 | 8.9 | 9.5 | 11.2 | 11.4 | |

Primer design possible ^{b} | 100 | 69.3 | 96.4 | 98.5 | 99.3 | 99.6 | 99.6 | 99.7 | |

Sequence masked ^{c} | |||||||||

in genome | 49.5 | 52.2 | 80.7 | 59.7 | 51.4 | 45.5 | 35.6 | ||

in introns | 0 | 12.5 | 38.8 | 83.7 | 63.4 | 54.5 | 47.7 | 37.6 | |

in exons | 4.5 | 28.2 | 86.0 | 69.7 | 61.1 | 54.0 | 40.4 |

No masking | RepeatMasker | GenomeMasker | GM1 with 4 factors | ||||||
---|---|---|---|---|---|---|---|---|---|

10% | 20% | 30% | 40% | 50% | |||||

Failure rate ^{a} | 16.9 | 13.8 | 10.2 | 6.0 | 8.9 | 9.5 | 11.2 | 11.4 | |

Primer design possible ^{b} | 100 | 69.3 | 96.4 | 98.5 | 99.3 | 99.6 | 99.6 | 99.7 | |

Sequence masked ^{c} | |||||||||

in genome | 49.5 | 52.2 | 80.7 | 59.7 | 51.4 | 45.5 | 35.6 | ||

in introns | 0 | 12.5 | 38.8 | 83.7 | 63.4 | 54.5 | 47.7 | 37.6 | |

in exons | 4.5 | 28.2 | 86.0 | 69.7 | 61.1 | 54.0 | 40.4 |

^{a} Fractions of failing primer pairs after using given masking method calculated from the experimental data of 1314 primer pairs.

^{b} Fraction of masked genomic regions for which at least one primer pair could be designed using 1000 random regions from the human genome (each 1000 nt long). Primer3 was used to design primer pairs.

^{c} Fraction of masked nucleotides from three different random sequence sets: 1000 genomic regions (each 1000 nt long), 1000 exonic sequences (average length 150 nt) and 1000 intronic sequences (average length 400 nt). With GenomeMasker and GM1 we have used an option to mask only one nucleotide from 3′-end of the repeated word.

Table 3 shows that using GM1 models with most stringent settings (10% remaining primer candidates) gives a failure rate oftwo to three times lower than that using RepeatMasker (6% versus 14%). At these settings GM1 can still design primers into 98.5% of randomly selected 1 kb long genomic regions (setup similar to SNP region amplification).

Overall genome masking percentage with RepeatMasker was 50%, with GenomeMasker (max 10 binding sites allowed, masking 1 nucleotide) 52% and with GM1 model (with four factors, masking only one nucleotide from 3′-end of the repeated word) 81% of nucleotides of human genome. We also selected 1000 random exonic sequences from all known human genes and observed that they are masked even more extensively by GM1 model (86% of nucleotides masked at most stringent settings), but less extensively by RepeatMasker (5% of exon nucleotides masked) or GenomeMasker (28% of exon nucleotides masked). Higher masking of exon regions by our GM1 method may reflect the ability of GM1 to take GC content of primers into account. Generally GC-rich primers have higher failure rate and therefore GC-rich exon regions are more extensively masked ( Figure 3 B). If this will cause problems in primer design in exon regions, less stringent settings of GM1 can be used for masking and primer design.

## DISCUSSION

The prediction of PCR failure rate on the basis of sequence proves to be an effective way to update current primer design methods. To understand what actually affects the outcome of PCR it is essential to build a model for premasking ineffective primer candidates in the genome. With this study, we want to improve the masking methodology instead of methodology of finding PCR products, commonly known as e-PCR. Although our analysis shows that number of PCR products does not influence PCR failure rate significantly, it might still be necessary to predict how many PCR products a given primer pair would generate. For this we would suggest using e-PCR programs (GenomeTester, mePCR, isPCR, PRIMEX). A comparison of these programs is shown elsewhere ( 12 ).

Our study did not involve statistics of the 3′-end base compositions of PCR primers or their combinations with amplicon bases immediately next to the 3′-end of the primer ( 6 ). This was because the number of experimentally tested primers that we could use in our statistical analysis is small (and so is the number of combinations of base compositions of 3′ ends of primers and the sequences of amplicons flanking that 3′-end). We very quickly ran into the problem of over-fitting the model with those nucleotide combinations present, and therefore decided to remove them later from our factor list. Nevertheless, we have analyzed separately the effects of the last one and the last two nucleotides of the primer 3′-end and the first amplicon nucleotide following primer ( Supplementary Figure 1 ). None of the nucleotide combinations was significant in our datasets although some small individual effects were noticed ( Supplementary Table 1 ). In another study, the most sensitive parameter was the regionalized GC content of the DNA template ( 7 ). In our case, this factor (PROD_AUCGC2) had a little significance in training sets, but not in control sets. One possible explanation may be the character of the regions in which the primers are located. Our study includes primers from random regions over the genome (mostly intergenic), whereas Benita's work ( 7 ) is based on amplifying human exons, in which the GC content is higher than the average for the human genome.

We have managed to reduce the failure rate of PCR from 17% to 6% with models incorporating four factors. This was achieved with enhanced repeat-masking modeling using any of the GM* models before the primer design process. To improve PCR quality even further, optimization of the reaction protocol and selection or amount of reagents are advised in many studies ( 2 , 5 , 24 ). Our experiments are based on the human DNA sequence only, but we expect our models to perform with similar efficiency in other large (eukaryotic) genomes.

The primers in this study were not selected randomly. Most of them were designed by Primer3 and some by other primer design programs. The primer design process eliminates many primers with high-failure rates (primers with single-nucleotide repeats, extreme GC%, etc.). Our statistical models cannot estimate the influence of factors that had already been filtered out in this process.

The models presented here suggest that using more than one factor can make simpler models as effective as complex ones. The GM1 model contains the maximum number of primer-pair-binding sites counted with three different word sizes (14, 15 and 16 nt). Although the most significant factor is MAX16, others complement the model with additional information. This can be compared with the GM1MM model with one factor, which includes binding sites with mismatches. The shorter words in the GM1 model may actually behave like the mismatch versions of the 16 nt words and therefore the failure rates are similar to those obtained with more complex models. On the basis of our earlier findings, modeling of primer-binding sites by mismatches and exact matches are highly correlated with each other ( 12 ). Thus, primers with few exact hits in the genome also have few binding sites with mismatches, and vice versa. Furthermore, the GM1 model can compensate the lack of thermodynamics by evaluating the GC content of the 3′-end of the PCR primer. Primers with higher GC contents tend to bind more strongly to alternative sites. Therefore, two primers with the same number of binding sites and different GC contents may have different PCR failure rates. The combined effect of these factors in GM1 creates behavior similar to GM2.

Taking these results together, it is possible to mask genomic DNA regions against possible primer candidates with high-failure rates using 100% exact matches only. The current version of our GenomeMasker application supports only single-threaded scanning of genomic DNA, and masking all words with given lengths in the genome that appear over the cutoff value defined by the user. Upgrading of this application should include the ‘triple-masking’ procedure, which means counting words of different sizes simultaneously. The cutoff values for number of binding sites’ word sizes should be replaced with failure rate probabilities, where the user can define the strictness of the program by choosing desired failure probability cutoff level.

In summary, the number of binding sites is the strongest predictor of PCR failure rate. The GM1 model, based on exact matches with fixed word sizes, is as efficient as more complex models that include thermodynamics and mismatches. Nonunique PCR primers are one major factor that can cause failure of the PCR reaction in experiments with complex genomes. Alternative binding sites create unwanted amplifications, lowering the yield of primers that can hybridize with the desired region or product, and therefore should be avoided whenever possible. This should give us a good opportunity to create better repeat-masking algorithms and reduce the PCR failure rate.

## SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

## ACKNOWLEDGEMENTS

This work was funded by Enterprise Estonia (EU19730), the Estonian Science Foundation (ETF#6041), target financing grant from Estonian Ministry of Education and Research 0182649s04 and EU FP6 grants ‘ArraySBS’ and ‘CONCO’. The authors would like to thank Asper Biotech Ltd. and Eneli Oitmaa sincerely for conducting the experimental part of this study. The authors also acknowledge Triinu Kõressaar, Priit Palta and Helena Andreson for critical reading of the manuscript. Funding to pay the Open Access publication charges for this article was provided by targeted financing from Estonian Ministry of Education and Research 0182649s04.

*Conflict of interest statement* . None declared.

## Comments