Transcriptional accuracy modeling suggests two-step proofreading by RNA polymerase

Abstract

We suggest a novel two-step proofreading mechanism with two sequential rounds of proofreading selection in mRNA transcription. It is based on the previous experimental observations that the proofreading RNA polymerase cleaves off transcript fragments of at least 2 nt and that transcript elongation after a nucleotide misincorporation is anomalously slow. Taking these results into account, we extend the description of the accuracy of template-guided nucleotide selection beyond previous models of RNA polymerase-dependent DNA transcription. The model derives the accuracy of initial and proofreading base selection from experimentally estimated nearest-neighbor parameters. It is also used to estimate the small accuracy enhancement that arises when the polymerase revisits previous positions following transcript cleavage.


Table S1. Summary of all input information used in the calculations. R is the gas constant, 8.314510 J·K^-1·mol^-1, and T is the temperature, 310 K. The reaction rates in the fourth column are calculated without any free energy difference between ground states.

Surviving table entries (parameter; description; value; source):
k_pre; pre-factor; 10^9 s^-1; from reference: 10^9-10^11 s^-1 M^-1 for second-order rate constants (8)
k_pre-a; pre-factor of association
Source-column entries from other rows: "See reference", "Measured", "The same (7)"
As seen in Table S1, the choice of free energy barriers for the reactions has been guided by experimental observations and measurements, but the barriers have not been extracted directly from the references. In most cases the reference value was measured at a lower temperature or without factors, which facilitates the measurement but gives a much slower reaction rate than the expected in vivo value. To some extent, this has been amended by tuning the parameters to the experimental total transcription time.
Association and dissociation have a common reaction barrier, ΔG‡_a, but association is the faster reaction, both because of the association pre-factor and because the state POST·NTP is generally more stable than POST. In addition, the pre-factor of association is tuned to a high value. Polymerase effects have been experimentally verified for both reactions (3; 2), but since the experimental conditions of the measurements were far from in vivo conditions, we have tuned the parameters to match experimental accuracy measurements (10; 11; 12).
The reaction rates calculated from the reaction rate barriers have been included for comparison, but without any difference in free energy between ground states. Generally, these should be the highest values, as they correspond to the cases when a reaction goes from one sub-state to another with a lower free energy (and higher stability), so that only the reaction rate barrier ΔG‡_barrier is used. When the free energy difference between the sub-states is also included, as for reactions going to a sub-state with higher free energy, the rates are lower. The rates of nucleotide association depend on nucleotide concentration (the value shown uses the concentration of GTP), and k_c and q_c also take other values when multiplied by the polymerase discriminating effect, which decreases k_c and increases q_c by a factor of 50.

Figure S1 shows the error frequency distribution after two steps of proofreading, but without revisiting positions, for our standard parameter set and five sets with altered reaction barriers for translocation (higher ΔG‡_trans), association and dissociation (higher and lower ΔG‡_a), phosphodiester bond formation (higher ΔG‡_c) and transcript cleavage (lower ΔG‡_cut). The barriers are changed by adding or subtracting 2RT (translating to a reaction rate change of a factor 7.4), but the total error distributions are similar. There are many other combinations of parameter changes that we could have chosen to show, especially since the effects of the parameters interact, so this figure does not explore the entire parameter space but gives a brief overview.

Figure S1. Error probability histograms for a few parameter sets. Black: the standard set as presented in Table S1. Beige: like the black one but ΔG‡_trans is increased from 9.2 to 11.2, so that the rate of translocation is decreased. Blue: like the black one but ΔG‡_a is decreased from 9.8 to 7.8, so that the rates of nucleotide association and dissociation are increased. Magenta: like the black one but ΔG‡_a is increased from 9.8 to 11.8, so that the rates of nucleotide association and dissociation are decreased. Green: like the black one but ΔG‡_c is increased from 13.1 to 15.1, so that the rate of phosphodiester bond formation is decreased. Red: like the black one but ΔG‡_cut is decreased from 17.2 to 15.2, so that the rate of transcript cleavage is increased.

Table S2 complements Fig. S1 by showing, for the same parameter sets, the transit times, the ratio of first- to second-step proofreading selection (mean(F_1)/mean(F_2)), and the effect of revisiting on the second proofreading step, expressed as the factor by which the error frequency decreases. This is a further demonstration that the parameter dependence of the model is complex and not easy to predict. From Table S2, we see that the transit time and the relation between the two steps of proofreading vary considerably, but the effect of revisiting positions is always small. It also demonstrates that the chosen parameters were tuned to obtain a transit time close to the experimental transit time of roughly 60 s.
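The factor 7.4 quoted for a 2RT barrier shift follows from an exponential dependence of the rate on the barrier. The sketch below is a minimal illustration, assuming an Arrhenius-type form k = k_pre·exp(−ΔG‡/RT) and assuming that the barriers are tabulated in units of RT (the 2RT shift corresponds to a change of 2.0 in the listed values). The function name is illustrative and not taken from the paper.

```python
import math

K_PRE = 1e9  # pre-factor in s^-1, the k_pre value listed in Table S1


def rate_from_barrier(barrier_rt, k_pre=K_PRE):
    """Rate constant for a barrier given in units of RT.

    Assumes k = k_pre * exp(-barrier / RT); no free energy difference
    between ground states is included.
    """
    return k_pre * math.exp(-barrier_rt)


# Translocation barrier of the standard set and of the "beige" set (Fig. S1).
k_standard = rate_from_barrier(9.2)
k_raised = rate_from_barrier(11.2)

# Raising the barrier by 2RT slows the reaction by exp(2), about 7.4.
fold_change = k_standard / k_raised

print(f"rate at  9.2 RT: {k_standard:.3e} s^-1")
print(f"rate at 11.2 RT: {k_raised:.3e} s^-1")
print(f"fold change for a 2RT shift: {fold_change:.2f}")
```

Under these assumptions the same fold change applies to every barrier in the figure, whichever reaction is altered, because only the barrier difference enters the ratio.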

Revisiting positions
There are two fundamental differences between arriving at an elongation state by transcript elongation or by transcript cleavage. After elongation, the last nucleotide in the transcript has just undergone initial selection and is checked by proofreading for the first time. When instead returning to a position by cleavage of the nucleotides ahead, the last two nucleotides of the transcript in the state will undergo an additional round of proofreading, corresponding to proofreading step two for the penultimate nucleotide, and proofreading step one followed by proofreading step two for the last nucleotide. The accuracy of the respective rounds of proofreading is not affected, but the total accuracy is amplified by the number of "extra" visits to a position. Consequently, we have to consider how often this occurs when determining the sequence effect on proofreading selection.
The dynamics of the elongation states along the RNA are expressed by a system of master equations. The reactions connecting the elongation states, comprising the sub-state reactions, are elongation at a compound rate constant κ and transcript cleavage at a compound rate constant ς, as described in the main text (Fig. 1A) and previously (1). The other fundamental difference between arriving at an elongation state by transcript elongation or by transcript cleavage is the starting sub-state: directly after cleavage the polymerase is in the sub-state POST instead of PRE, the first sub-state after phosphodiester bond formation. Starting in state POST, κ and ς will not be the same as when starting in state PRE. Consequently, the ratio ς/κ in proofreading selection is also different, so that the proofreading selections F_Q for the first and second proofreading steps on returning to a position (Eq. 1 and 2) differ from F, the proofreading selections after initial selection described in the main text (main text Eq. 8 and 9), here referred to as F_K.
Equivalently, for the second proofreading step:

Therefore, we have designed a master equation where each elongation state E_i is represented twice: once as reached by the forward reaction (denoted by subscript K) and once as reached by the backward reaction (denoted by subscript Q):

The relations between the double elongation states are represented in Fig. S2, and described further below in the calculations of the total transcription time. The boundary conditions, that the system starts at state PRE in the first position in the operon, from which it cannot backtrack, and finishes when it reaches the termination site, allow us to calculate the mean time spent in each elongation state (1). The mean time spent in each elongation state divided by the mean time to leave it gives us the mean number of times each state is visited, and thus we know how often each position is revisited. The number of extra rounds of proofreading (RV_i) is the number of revisits before the nucleotide is cleaved off or transcription is terminated, which equals the number of revisits per incorporation:

$$\text{Revisits per incorporated nucleotide} = \frac{\text{Number of visits in backward states}}{\text{Number of visits in forward states}} = RV_i \tag{4}$$

The probability of product formation for an incoming nucleotide at position i equals the probability to go through initial selection, proofreading selection step one, proofreading selection step one after returning to position i a total of RV_i times, proofreading selection step two after step one (1+RV_i) times, and proofreading selection step two after returning to position i+1 a total of RV_i+1 times. Only if the substrate escapes rejection in all these steps, each with its respective probability, is product formed.

In analogy with the main text Eq. 10, the total accuracy per non-cognate substrate and nucleotide position based on this formulation of the probability of product formation becomes:

In Eq. 5, RV_i+1 is the number of revisits to position i+1. Since the misincorporation does not affect the transcription bubble stability after leaving the active site by polymerase translocation, the mean number of returns RV_i or RV_i+1 will not depend on the identity of the incorporated nucleotide, only on the position i. The other variables in Eq. 5 are specific both for the position and the incorporated substrate.
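The verbal description of the product-formation probability can be turned into a small numerical sketch. The code below assumes that the total accuracy is the product of the selection factors, each raised to the number of times that selection is applied: I · F_1K · F_1Q^RV_i · F_2K^(1+RV_i) · F_2Q^RV_(i+1). This is a reading of the text, not a reproduction of Eq. 5, and all numerical values are hypothetical and chosen only for illustration.

```python
import math


def revisits_per_incorporation(backward_visits, forward_visits):
    """RV_i as in Eq. 4: visits in backward states per visit in forward states."""
    return backward_visits / forward_visits


def total_accuracy(I, F1K, F1Q, F2K, F2Q, rv_i, rv_next):
    """Product of selection factors, following the verbal description.

    Assumed form (not copied from the paper's Eq. 5):
    I * F1K * F1Q**RV_i * F2K**(1 + RV_i) * F2Q**RV_{i+1}
    """
    return I * F1K * F1Q ** rv_i * F2K ** (1.0 + rv_i) * F2Q ** rv_next


# Hypothetical selection factors and visit counts, for illustration only.
I, F1K, F1Q, F2K, F2Q = 1000.0, 10.0, 8.0, 5.0, 4.0
rv_i = revisits_per_incorporation(backward_visits=0.02, forward_visits=1.0)
rv_next = revisits_per_incorporation(backward_visits=0.02, forward_visits=1.0)

without_revisits = total_accuracy(I, F1K, F1Q, F2K, F2Q, 0.0, 0.0)
with_revisits = total_accuracy(I, F1K, F1Q, F2K, F2Q, rv_i, rv_next)
enhancement = with_revisits / without_revisits

print(f"accuracy without revisits: {without_revisits:.3e}")
print(f"accuracy with revisits:    {with_revisits:.3e}")
print(f"enhancement from revisiting: {enhancement:.3f}")
```

Because RV enters only as an exponent, a small number of revisits per incorporation yields an enhancement close to one, in line with the statement that the effect of revisiting positions is small.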
We make the assumption that the rest of the transcription bubble, aside from the scrutinized position, is always correct. This means that the possible effect on the transcription bubble stability of the misincorporated base when it exits the polymerase after eight elongations is neglected. The probabilities of misincorporations in the rest of the transcript are, however, included in the reaction rate constants between elongation states (see below) so that the number of revisits includes the increased chance of cleavage of mismatches.
The model still allows only one step of backtracking, even though long backtracking should be included in a model of the total backtracking dynamics. This is, however, not important for the purpose of analyzing the accuracy effect. Since the accuracy enhancement comes from the number of cleavages that return the polymerase to a certain position, it is the total number of cleavages, and not the length of the transcript cleaved off, that determines the accuracy enhancement. For this reason, the effect of long backtracking on accuracy would probably be small, and the effect of revisiting positions on proofreading selection does not motivate its inclusion.

Calculation of total transcription time
However, in this study we extended this set of equations to the double equations including the states reached by cleavage, since the compound reaction rate constants κ and ς are different when the polymerase starts in sub-state POST instead of PRE. Each elongation state in Eq. 6 is instead described by two probabilities in the master equation, representing the probabilities of being in elongation state i after a forward reaction (P(E_K(i))) or a backward reaction (P(E_Q(i))):

To do this, however, we must solve another set of master equations: the master equation of the sub-states needs to be solved for each elongation state. The elongation reaction rate constants κ and ς, at which the polymerase moves from one elongation state to another, are calculated from the reaction rate constants k_c and q_c of phosphodiester bond formation and transcript cleavage, respectively, multiplied by the probability of being in the corresponding sub-state. The probability of being in a sub-state equals the fraction of the mean time to leave the elongation state during which this sub-state is inhabited, so that:

Note that the time τ_i is the total (mean) time spent in the elongation state, and does not equal the mean time to leave it, which is the sum of the mean times of the sub-states. The time τ_i also reflects the relation between the elongation state and its neighbors, taking into account that the state might be visited more than once. The ratio between the total time and the mean time to leave the state gives us the mean number of visits, as mentioned in the main text.
The mean times of the sub-states are described by the integrated master equation:

There are two sets of boundary conditions that give two different solutions. For the elongation states reached by transcript elongation, the system of sub-states starts in state PRE, so that the integral of the time derivative of the probability of being in state PRE is -1. For elongation states reached by transcript cleavage, the system of sub-states instead starts in POST. The two solutions give the two different sets of values for κ and ς, namely κ_K(i) and ς_K(i), and κ_Q(i) and ς_Q(i). These in turn give the two different solutions to ς/κ that constitute the difference between F_1K and F_1Q, as well as between F_2K and F_2Q (Eq. 8 and 9 in the main text and Eq. 1 and 2 here).
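The integrated master equation can be illustrated with a small linear solve. The sketch below assumes a hypothetical linear chain of four sub-states (BACK, PRE, POST, POST_NTP), with cleavage at rate q_c leaving from BACK and bond formation at rate k_c leaving from POST_NTP. The name BACK, the topology and all rate values are assumptions made for illustration and are not the paper's scheme or parameters. Integrating dP/dt = M·P from zero to infinity gives M·τ = −P(0), which is solved for the mean sub-state times τ.

```python
STATES = ["BACK", "PRE", "POST", "POST_NTP"]  # hypothetical sub-state chain
RATES = {  # hypothetical sub-state transition rates in s^-1
    ("PRE", "BACK"): 5.0, ("BACK", "PRE"): 20.0,
    ("PRE", "POST"): 100.0, ("POST", "PRE"): 80.0,
    ("POST", "POST_NTP"): 200.0, ("POST_NTP", "POST"): 50.0,
}
K_C = 30.0  # bond formation, leaves the elongation state from POST_NTP
Q_C = 2.0   # transcript cleavage, leaves the elongation state from BACK


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def compound_rates(start):
    """Return (kappa, sigma, tau) for a system starting in sub-state `start`."""
    idx = {s: i for i, s in enumerate(STATES)}
    n = len(STATES)
    M = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in RATES.items():
        M[idx[dst]][idx[src]] += rate  # gain of dst from src
        M[idx[src]][idx[src]] -= rate  # loss of src
    M[idx["POST_NTP"]][idx["POST_NTP"]] -= K_C  # exit by elongation
    M[idx["BACK"]][idx["BACK"]] -= Q_C          # exit by cleavage
    rhs = [0.0] * n
    rhs[idx[start]] = -1.0  # integral of dP/dt is -1 for the starting state
    tau = solve_linear(M, rhs)
    time_to_leave = sum(tau)  # mean time to leave the elongation state
    kappa = K_C * tau[idx["POST_NTP"]] / time_to_leave
    sigma = Q_C * tau[idx["BACK"]] / time_to_leave
    return kappa, sigma, tau


kappa_K, sigma_K, tau_K = compound_rates("PRE")   # reached by elongation
kappa_Q, sigma_Q, tau_Q = compound_rates("POST")  # reached by cleavage
print(f"start PRE:  kappa={kappa_K:.4f}  sigma={sigma_K:.4f}  sigma/kappa={sigma_K / kappa_K:.5f}")
print(f"start POST: kappa={kappa_Q:.4f}  sigma={sigma_Q:.4f}  sigma/kappa={sigma_Q / kappa_Q:.5f}")
```

In this toy chain the ratio ς/κ is smaller when the system starts in POST, because POST lies closer to the sub-state from which elongation occurs. This is the kind of difference between the K and Q solutions that the text describes.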
There is one more adjustment to the calculation of the total transcription time: the inclusion of errors. As we know, the reaction rate constants differ for cognate and non-cognate substrates. Every time an elongation state is visited, both for the first time and when revisited, there is a chance that the last or second-to-last incorporated substrate is a mismatch. Therefore, the reaction rate constants used in the transcription time calculations are a composite of the cognate and non-cognate reaction rates, each weighted by the average error probability.
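The weighting described above can be sketched as follows, assuming that the cognate rate is weighted by one minus the error probability and the non-cognate rate by the error probability itself. The function name and the numerical values are hypothetical. The factor 50 separating cognate and non-cognate k_c is the polymerase discriminating effect mentioned earlier.

```python
import math


def composite_rate(k_cognate, k_noncognate, error_probability):
    """Error-weighted average of cognate and non-cognate rate constants.

    Assumed weighting: (1 - e) for the cognate rate, e for the non-cognate rate.
    """
    return (1.0 - error_probability) * k_cognate + error_probability * k_noncognate


# Hypothetical example: bond formation is slowed 50-fold for a mismatch.
k_c_cognate = 100.0                    # s^-1, illustrative value
k_c_noncognate = k_c_cognate / 50.0    # discriminating effect, factor 50
error_probability = 1e-3               # illustrative per-position error frequency

k_c_composite = composite_rate(k_c_cognate, k_c_noncognate, error_probability)
print(f"composite k_c: {k_c_composite:.4f} s^-1")
```

With small error probabilities the composite stays close to the cognate value for each individual rate constant, so any sizeable change in the total transcription time has to come from the accumulated effect over all positions and visits.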
The average error probability, equal to the per-position error frequency, is estimated differently for different cases. In a forward elongation state, reached by transcript elongation, the error probability is the error rate after initial selection in the case with only one-step proofreading. This must be true since the last incorporated nucleotide has not yet undergone proofreading and since errors in the penultimate position of the transcript are not detected. The error rate after initial selection is hence:

$$\text{Error rate after initial selection} = \frac{1}{1 + I} \tag{10}$$

In a backward elongation state, reached by transcript cleavage, the error probability in the case with only one-step proofreading is the error rate after initial selection, one round of proofreading in a forward state (F_1K) and an average number of rounds of proofreading in a revisited state (F_1Q). This average number of rounds of proofreading equals half the number of total revisits to the position, so that the error rate becomes:

$$\text{Error rate in revisited state, 1-step proofreading} = \frac{1}{1 + I \cdot F_{1K} \cdot F_{1Q}^{RV_i/2}} \tag{11}$$

With the two-step proofreading, errors in the penultimate position are also detected. In a forward state, reached by elongation, the average error is the error rate at the last position, after initial selection, plus the estimated error rate at the penultimate position when the elongation state after the next is reached by elongation. An error in the penultimate position must have gone through initial selection and one round of step-one proofreading after elongation, but it is also possible that the polymerase has revisited the state of the step-one proofreading, in which case this nucleotide has also gone through step-two proofreading in a forward state. Lastly, it is also possible that this position has been revisited, whereby the penultimate nucleotide has undergone step-two proofreading in the revisited state.

Hence, the error rate in the penultimate position must account for the chance of an error surviving initial selection, one round of step-one proofreading in a forward state (F_1K), an average number of rounds of step-one proofreading in a revisited state (F_1Q), an average number of rounds of step-two proofreading in the forward state (F_2K), and an average number of rounds of step-two proofreading in a revisited state (F_2Q). The average number of rounds of step-two proofreading in a forward state is estimated as half the number of revisits to the preceding position, which gives the error rate of Eq. 12.

Lastly, in a revisited state, reached by transcript cleavage, the average error is again the error rate at the last position, after initial selection, plus the estimated error rate at the penultimate position when the elongation state after the next is reached by elongation. However, when the position is revisited we know for certain that the error in the penultimate position has passed through step-two proofreading in the forward state (F_2K) at least once. Hence, the error rate becomes:

This can be compared to the total accuracy after the full number of revisits (Eq. 5).
The composite reaction rate constant for an arbitrary reaction between elongation states is hence the combination of the cognate and non-cognate rate constants, weighted by the error probabilities estimated above.

The attentive reader will have realized that not only are the error probabilities calculated using the number of revisits, RV, but the number of revisits is also calculated using the error probabilities. This self-reference is resolved by iterating the calculation of revisits until the solution for the total transcription time is stable; for most parameter sets, this is reached in no more than three cycles, starting with the solution for revisits with cognate reaction rates only.

Through these somewhat cumbersome calculations, the total time of transcript elongation for the whole operon is estimated both for the model with only one step of proofreading and for the two-step proofreading. Compared to the calculation used in the previously published model (1), without non-cognate reaction rates and the double elongation states, the total transcription time increased by around 50%, motivating re-tuning of the parameters to match the experimental transcription rate. This increase was mostly due to the non-cognate reaction rate constants.
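The self-consistent iteration can be sketched as a fixed-point loop. The example below is a toy stand-in for the full master-equation solve, not the paper's calculation. It tracks a single position, uses the ratio ς/κ of composite rates as a hypothetical proxy for the number of revisits, assumes an error probability of the form 1/(1 + I·F_1K·F_1Q^(RV/2)), and applies the stopping criterion to the iterated quantity rather than to the total transcription time. All numbers are illustrative.

```python
KAPPA_C, SIGMA_C = 20.0, 0.4      # hypothetical cognate compound rates, s^-1
KAPPA_NC = KAPPA_C / 50.0         # elongation slowed 50-fold after a mismatch
SIGMA_NC = SIGMA_C * 50.0         # cleavage sped up 50-fold after a mismatch
I, F1K, F1Q = 1000.0, 10.0, 8.0   # hypothetical selection factors


def error_probability(rv):
    """Assumed error estimate in a revisited state with one-step proofreading."""
    return 1.0 / (1.0 + I * F1K * F1Q ** (rv / 2.0))


def update_revisits(rv):
    """One cycle: revisits -> error probability -> composite rates -> revisits."""
    e = error_probability(rv)
    kappa = (1.0 - e) * KAPPA_C + e * KAPPA_NC
    sigma = (1.0 - e) * SIGMA_C + e * SIGMA_NC
    return sigma / kappa  # toy proxy for the revisit count, not the paper's solve


def iterate_to_self_consistency(update, x0, rel_tol=1e-6, max_cycles=50):
    """Repeat x <- update(x) until the relative change falls below rel_tol."""
    x = x0
    for cycle in range(1, max_cycles + 1):
        x_new = update(x)
        if abs(x_new - x) <= rel_tol * abs(x_new):
            return x_new, cycle
        x = x_new
    raise RuntimeError("iteration did not converge")


# Start from the solution with cognate reaction rates only.
rv_start = SIGMA_C / KAPPA_C
rv_star, cycles = iterate_to_self_consistency(update_revisits, rv_start)
print(f"revisits, cognate rates only: {rv_start:.6f}")
print(f"self-consistent revisits:     {rv_star:.6f} after {cycles} cycles")
```

Because the error probability is small, each cycle changes the rates only slightly and the loop settles within a few cycles, which matches the qualitative behavior reported in the text.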