Generalised interrelations among mutation rates drive the genomic compliance of Chargaff's second parity rule

Abstract Chargaff's second parity rule (PR-2), whereby the complementary base and k-mer contents match within the same strand of double-stranded DNA (dsDNA), is a phenomenon that has invited many explanations. The strict compliance of nearly all nuclear dsDNA with PR-2 implies that the explanation should be similarly robust. In this work, we revisited the possibility of mutation rates driving PR-2 compliance. Starting from an assumption-free approach, we constructed kinetic equations for unconstrained simulations. The results were analysed for their PR-2 compliance by employing symbolic regression and machine learning techniques. We arrived at a generalised set of mutation rate interrelations, in place in most species, that allow for their full PR-2 compliance. Importantly, our constraints explain PR-2 in genomes out of the scope of the prior explanations based on equilibration under mutation rates with simpler no-strand-bias constraints. We thus reinstate the role of mutation rates in PR-2 through its molecular core, now shown, under our formulation, to be tolerant to previously noted strand biases and incomplete compositional equilibration. We further investigate the time for any genome to reach PR-2, showing that it is generally earlier than the compositional equilibrium, and well within the age of life on Earth.

Note S1. Derivations of the full equilibrium solutions

Note S1.1. The analytic solutions of the mutation rates under the no-strand-bias assumption at equilibrium
Under the no-strand-bias (NSB) assumption, we can consider the example of k_C→T = k_G→A, where the s→u subscript denotes the mutation of the base s into u. Here, k_s→u is the rate constant, which subsumes all processes that initiate and fixate the mutation in dsDNA, also converting the complementary strand. Central to our model is a plausible assumption of the strand-invariance of the mutation rates, i.e. the s→u substitution happens at the same rate regardless of whether s is in the template or complementary strand of the dsDNA. To this end, at whichever rate C mutates to T in the template strand, C converts to T at the same rate in the complementary strand. Therefore, each strand will also undergo the complementary G to A conversions at the same rate as the C to T conversions, hence k_C→T = k_G→A. This symmetry in rate constants significantly simplifies the mutation network from 12 to six independent rate constants:

k_C→T = k_G→A, k_T→C = k_A→G, k_C→A = k_G→T, k_A→C = k_T→G, k_A→T = k_T→A, k_C→G = k_G→C.

The reduced number of independent rate constants under the NSB assumption produces a system of four kinetic equations of the form

dC_s/dt = −C_s Σ_{u≠s} k_s→u + Σ_{u≠s} C_u k_u→s, s, u ∈ {A, G, T, C},

each describing the evolution of the A, G, T and C base contents in fractions, making it possible to infer the time evolution dynamics of genomic base composition using different values for the rate constants and the initial base contents. Taking into account the outlined NSB-driven equalities, we can convert the system of four ODEs into a substitution matrix, M, from which we can solve for the equilibrium base contents; in Mathematica, the system can be specified and solved symbolically. The system is fully solved, and the solutions imply the C_A = C_T and C_G = C_C equalities at equilibrium under the NSB assumption for mutation rates. These solutions also link the mutation rate constants with the equilibrium genome composition.
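To make the construction concrete, the four kinetic equations can be integrated numerically under the NSB equalities. Below is a minimal Python sketch (the original derivation is in Mathematica); the {i, j, k, l, m, n} labelling of the six independent constants and the rate values are illustrative assumptions, not fitted values.

```python
# Minimal sketch: integrate dC_s/dt = -C_s*sum_u k_{s->u} + sum_u C_u*k_{u->s}
# under the NSB rate equalities; rate values below are illustrative only.
BASES = "AGTC"

def nsb_rates(i, j, k, l, m, n):
    """Expand six independent NSB constants into the 12 rates.
    The i..n labelling mirrors the text's six constants (our assumption)."""
    return {"CT": i, "GA": i,   # k_C->T = k_G->A
            "TC": j, "AG": j,   # k_T->C = k_A->G
            "CA": k, "GT": k,   # k_C->A = k_G->T
            "AC": l, "TG": l,   # k_A->C = k_T->G
            "AT": m, "TA": m,   # k_A->T = k_T->A
            "CG": n, "GC": n}   # k_C->G = k_G->C

def evolve(contents, rates, dt=0.05, steps=40000):
    """Forward-Euler integration of the four kinetic equations."""
    c = dict(contents)
    for _ in range(steps):
        d = {b: 0.0 for b in BASES}
        for s in BASES:
            for u in BASES:
                if s != u:
                    flux = rates[s + u] * c[s]  # out of s, into u
                    d[s] -= flux
                    d[u] += flux
        for b in BASES:
            c[b] += dt * d[b]
    return c

eq = evolve({"A": 0.4, "G": 0.1, "T": 0.2, "C": 0.3},
            nsb_rates(0.5, 0.3, 0.2, 0.1, 0.15, 0.05))
```

Any positive rate set with these pairings drives the composition to C_A = C_T and C_G = C_C, which is exactly the equilibrium statement derived above.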
We can express that link for the overall G+C content of any genome, under the NSB equilibrium, as follows:

C_G+C = (k_A→G + k_A→C) / (k_A→G + k_A→C + k_G→A + k_G→T),

with C_A+T = 1 − C_G+C. The G+C content is naturally independent from the k_C→G = k_G→C and k_A→T = k_T→A mutation rate constants.
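As a numerical illustration of this relation, here is a minimal sketch; the {i, j, k, l} labels are our own shorthand for k_C→T (= k_G→A), k_T→C (= k_A→G), k_C→A (= k_G→T) and k_A→C (= k_T→G), and the values are arbitrary:

```python
# Sketch of the NSB equilibrium G+C relation described in the text.
def gc_equilibrium(i, j, k, l):
    # C_G+C = (k_A->G + k_A->C) / (k_A->G + k_A->C + k_G->A + k_G->T);
    # the k_A->T = k_T->A and k_C->G = k_G->C constants do not enter.
    return (j + l) / (i + j + k + l)

def at_equilibrium(i, j, k, l):
    # C_A+T = 1 - C_G+C
    return 1.0 - gc_equilibrium(i, j, k, l)
```

Note that, as the text states, the two "diagonal" constants (A↔T and C↔G exchanges) cancel out of the genome-wide G+C content entirely.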
Note S1.2. The NSB principle can be extended into higher k-meric orders

The NSB principle can lead to significant simplifications when considering oligomers of any size for cross-mutations, hence we can potentially exploit the extra information that dyad, triad, etc. counts may reflect on the genome. In Note S1.1, the rate constants reflected the average singleton mutations across the different sequence contexts found in a genome. Thus, the above model is applicable for finding the individual base contents at the scale of the entire genome. In contrast to this, when we want to recover the k-mer (k ∈ {1, 2, 3, 4, ...}) frequencies in a given genome, we need to consider the rate constants for many more k-mer transitions, as the neighbouring bases alter the mutation rate at a given site. To clarify this, consider the k_CpA→TpA and k_CpG→TpG substitutions in the dyad context. Both substitutions are the result of a C→T mutation. However, the rates for the two substitutions will be different (k_CpA→TpA ≠ k_CpG→TpG) because of the neighbouring base (A vs. G) effects. In this particular example, we know that in the CpG context, the mutation rates for C are substantially elevated [1][2][3][4]. The equalities between the rate constants noted in Note S1.1 hold in the k-mer case as well, with the only difference being that we need to account for the directionalities (5'-3' vs. 3'-5') of the oligonucleotides. Hence, the rate constants of substitutions that are reverse complementary to each other will be the same. We can see this on the example of the ApT→ApT substitution.

The NSB principle extended into dyads. The schematic representation of the cross-mutation network among dyads (a) is constructed based on the tesseract (hypercube). There, the primed superscripts denote the position in the dyad, from the 5' to 3' direction (A'G'' is the same as ApG). Each line should be interpreted as a set of two counter-directed arrows. The models, along with the reduction of the number of independent rate constants, are shown below. b shows an example of the time evolution of dyad contents, modelled without and with context dependence for the mutation rates, starting from arbitrary sets of initial dyad contents and rate constants. The colouring scheme is also random. The convergence to 10 unique dyad content values can be noted in b.
If we assume no context dependence for the mutation rate constants, we can use the same i, j, k, l, m, n rate constants from the 1mer-NSB model in Note S1.1 to describe all the cross-mutations in the hypercube-based cross-mutation model. Of course, this assumption is crude, especially for some dyads where the neighbouring effect can be substantial (for instance, the bases in CpG dyads have substantially higher substitution rates). However, this relatively reduced dimeric model may still be useful for describing the overall dyad contents in genomes and for revealing the magnitude of neighbour effects on substitution rates. The system of dyad-state equations therefore reduces accordingly. This system is still complex for Mathematica, but we can find exact symbolic solutions by carefully exploring the emergent groups in the equations. First, let us assign an arbitrary set of numerical values to the {i, j, k, l, m, n} rate constants. We can then find the numerical solutions. The numerical solution to the system with the arbitrarily assigned coefficients implies that the 16 dyad contents collapse into three groups of equal values, a, b and c, whose groupings include the equalities expected from the oligo version of the second parity rule. Hence, the empirical rule is, again, emergent from the solution of the system. We just have some additional equalities that are present under the assumption of no context dependence for the mutation rate constants. Therefore, for such hypothetical genomes, Chargaff could have observed even more equalities, and the dimeric composition of the whole genome could have been described by just three values, a, b and c. Let us go on and use the found groups (invariant to the i, j, k, l, m, n numeric values, as long as those are greater than 0) to further trim the system and get the symbolic solutions for a, b and c. The trimming of the system can be done by taking only one equation from each of the found three groups.
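The three-group structure can be verified numerically with a minimal Python sketch (not the original Mathematica code); context-free dyad rates are assumed, with our own {i..n} labelling and illustrative values:

```python
# Sketch: evolve all 16 dyad contents with context-free (neighbour-invariant)
# NSB rates and check the convergence to three unique values {a, b, c}
# equal to the cross-products of the 1mer-NSB equilibrium contents.
from itertools import product

BASES = "AGTC"

def nsb_rates(i, j, k, l, m, n):
    # our shorthand for the six independent NSB constants
    return {"CT": i, "GA": i, "TC": j, "AG": j, "CA": k, "GT": k,
            "AC": l, "TG": l, "AT": m, "TA": m, "CG": n, "GC": n}

def dyad_equilibrium(rates, dt=0.05, steps=40000):
    dyads = ["".join(p) for p in product(BASES, repeat=2)]
    c = {d: 1 / 16 for d in dyads}           # arbitrary uniform start
    for _ in range(steps):
        delta = {d: 0.0 for d in dyads}
        for xy in dyads:
            x, y = xy
            for u in BASES:
                if u != x:                    # substitution at position 1
                    f = rates[x + u] * c[xy]
                    delta[xy] -= f
                    delta[u + y] += f
                if u != y:                    # substitution at position 2
                    f = rates[y + u] * c[xy]
                    delta[xy] -= f
                    delta[x + u] += f
        for d in dyads:
            c[d] += dt * delta[d]
    return c

eq = dyad_equilibrium(nsb_rates(0.5, 0.3, 0.2, 0.1, 0.15, 0.05))
```

The 16 contents collapse into three unique values, each equal to the product of the corresponding 1mer-NSB solutions, matching the cross-multiplication statement in the text.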
The solutions to this model will be valuable as we can further compare the outcomes for real genomes with the actual dimeric contents, in which case the distortions from the theoretical values will reveal the presence of strong neighbouring effects on the mutation rates. We shall solve the reduced system by first creating the sparse arrays and solving the system of linear equations in a matrix convention. The solutions will be for the remaining C_AA, C_AC and C_CC dimer fractions that correspond to a, b and c above, representing the full solution to the complete system specified above. We can convert the above into a substitution matrix, M, and solve the resulting linear system in Mathematica. The unique {a, b, c} solution is found and can be easily verified by cross-checking against the numerical solutions for any arbitrary set of non-zero and positive {i, j, k, l, m, n} rate constants. We have shown that, in this approximation, the model always converges to three unique dyad contents at equilibrium, the expressions of which reflect the connection between the genomic dyad content and the underlying individual rate constants. As expected, the three solutions from 2mer-NSB without context dependence for the mutation rate constants are the cross-multiplications of the two unique solutions obtained from 1mer-NSB.

Note S1.3. Application of the mutation rate constants under no-strand-bias in predicting singleton and dyad composition of the chimpanzee genome

Jiang et al. used genome-wide dSNP data from the chimpanzee genome to infer different substitution fractions [5]. The ancestral states of the sites were deduced by comparing the dSNP-containing sequences to the homologous ones in humans. The work also reported the normalised substitution fractions, where each base is found at an equal 25% frequency. Those substitutions are the result of mutations happening within approximately 5-7 million years (t) after the divergence from the human-chimp most recent common ancestor [6,7].
Since t is relatively small, we can assume that the probability of the sites undergoing repeated mutations is negligible. Since the data are normalised to a uniform base content, n_i is always 0.25. Furthermore, both n_i and t cancel out in the context of the equilibrium base content equations derived above. To this end, the normalised substitution fractions can be used in a manner similar to the rate constants, as their scaled versions. Below, we take the reported values, calculate the overall equilibrium base contents (both monomeric and dimeric) in the chimpanzee genome, and compare those with the actual genomic data. The SNP-sequence-context-normalised substitution fractions, in %, for the chimpanzee genome are given in the described publication in terms of f_i→j, the fraction of i→j base (single-nucleotide) substitutions. We use these fractions as a replacement for the {i, j, k, l, m, n} rate constants. First, we average the already negligible differences between those reported substitution fractions that are supposed to be equal by the NSB rate constant equalities described in Note S1.1.
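The averaging step can be sketched as follows; this is a minimal Python illustration (the analysis itself was done in Mathematica), where the pairing of fractions follows the NSB equalities of Note S1.1 and the toy values are placeholders, not the published chimpanzee data:

```python
# Sketch: fractions that the NSB symmetry requires to be equal
# (e.g. f_C->T and f_G->A) are averaged into one effective constant.
NSB_PAIRS = [("CT", "GA"), ("TC", "AG"), ("CA", "GT"),
             ("AC", "TG"), ("AT", "TA"), ("CG", "GC")]

def nsb_average(fractions):
    """fractions: dict with 12 entries like {'CT': ..., 'GA': ...};
    returns six effective constants keyed by the first pair member."""
    return {a: (fractions[a] + fractions[b]) / 2 for a, b in NSB_PAIRS}

# Toy numbers for illustration only (NOT the values from Jiang et al.).
toy = {"CT": 0.20, "GA": 0.30, "TC": 0.10, "AG": 0.20, "CA": 0.05, "GT": 0.15,
       "AC": 0.05, "TG": 0.05, "AT": 0.10, "TA": 0.10, "CG": 0.02, "GC": 0.04}
avg = nsb_average(toy)
```

The six averaged values then play the role of {i, j, k, l, m, n} in the equilibrium solutions above.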
Now we can calculate the equilibrium base contents of the chimpanzee genome using the cross-mutation network solutions, where {C_A, C_G, C_T, C_C} are the individual base contents and {a, b, c} are the dimeric contents under the assumption of neighbour-invariant rate constants, which results in the individual base contents below.

Application of the models on the chimpanzee genome. Comparison of the equilibrium single base contents, predicted via 1mer-NSB, with the experimental ones extracted from the reference genome, a. The correlation between the dyad contents is shown in b, where the prediction is done via 2mer-NSB without context dependence for the mutation rate constants.
Unfortunately, we cannot apply the full hypercube model (without the neighbour invariance), which could have been applied numerically with NSolve in Mathematica had we had the individual substitution fractions coming from the 48 different dimer contexts. Comparing the predictions with the experimental base and dyad contents of the chimpanzee genome reveals a striking correlation. Since our solution is for the equilibrium genome, we can infer from the above Figure a that the chimpanzee genome is slightly off equilibrium and will move towards it by having more G:C|C:G→A:T|T:A mutations, consistent with the prior view on the state of the chimpanzee genome [5]. As for the predicted dyad contents, since we do not have the substitution fractions for the 48 independent dyad crossings to use for the complete dyad inference (2mer-NSB), we have only used the solution to the hypercube model under the assumption of 2mer-NSB without context dependence for the mutation rate constants, using the six rate constants inferred from the single-nucleotide-based substitution data. The agreement is still rather impressive (above Figure b), where, as expected, the neighbour effect distorts the CpG dyad frequency the most, since the bases in CpG have much higher mutation rates owing to the involvement of epigenetic mechanisms. We expect these differences to vanish as better-quality substitution data become available, accounting for the sequence context and enabling the usage of the full 2mer-NSB model for dyads.
Note S2. Machine learning model for classifying PR-2 compliance

Here, we fitted a machine learning classifier to the non-symmetric, uniform-distribution simulation outcome to classify compliance and non-compliance with the PR-2 solutions. This gives an impression of the full potential of the mutation rate constants to proxy the PR-2 compliance status without actually solving the ordinary differential equation systems. To do this, we used the tree-based extreme gradient boosting (XGBoost) machine learning model as our central framework for the model development [8,9]. In gradient boosting, an ensemble of learners is developed, with each iterative learner predicting the residual of the ensemble of prior learners. The combination of decision trees (maximum interaction depth and minimum child weight) as the underlying learner with the gradient boosting process (number of boosted trees, learning rate, subsample percentage and gamma) allows flexible tunability and optimisation of the six hyperparameters; a combination that is commonly used in a wide range of machine learning competitions such as Kaggle (www.kaggle.com) and in predictive modelling [8,10]. Another important reason for selecting this machine learning strategy is that XGBoost allows the extraction of information regarding the importance of features, thereby giving us insight into which of the 12 mutation rate constants are most important for predicting PR-2 compliance.
We employed the XGBoost machine learning model as the central framework for the model development in our study, available in R through the caret library (https://cran.r-project.org/package=caret). For the machine learning model evaluation, we used the receiver operating characteristic (ROC) performance metric (via the MLeval library, https://cran.r-project.org/package=MLeval) because, as we describe in the following paragraph, our observations in the training set are balanced between the classes. Were they skewed, other performance metrics, such as the precision-recall curve, might be more appropriate [11][12][13]. We applied k-fold cross-validation (KCV) for evaluating the model (see Materials and Methods for details on hyperparameters). In a KCV procedure, the data set is randomly split into k smaller parts in order to reduce the risk of over-fitting; this is common practice for training machine learning models in the scientific literature [10,[14][15][16][17]. For instance, if k=5, the training data are split into 5 smaller sets, where, in each iteration, the hyperparameters of the model are trained using k-1 of the folds as the training data set. Next, the model is validated on the remaining part of the data set. Specifically, the remaining fold is used by the trained model as a test set to evaluate the performance metric, which, in our model, is the accuracy. This process is repeated k times, and, at the end, the performance measure is reported as the average of the values across the k folds. Given the size of the training data and the common range of values for k in KCV in the scientific literature, our model was trained with k=6, repeated once.
Using the tolerance values from eukaryotic organisms on the 25 million systems generated from the non-symmetric, uniform-distribution simulation, we naturally end up with a disproportionate amount of non-compliant cases over compliant cases. For the purpose of the machine learning process, we therefore used an equal number of compliant and non-compliant cases for the training set, starting from all 12 mutation rate constants as features. We also performed a principal component analysis on these 12 mutation rate constants for the equilibrated cases without imposing the PR-2 tolerance and found that there was no outstanding principal component (data not shown). The selection of non-compliant cases was done via random sampling (seed = 2022). We normalised the selected data set (seed = 1234) and randomly sampled the seeds for each of the six cross-validation processes, repeated once. In the model training process, we tuned the learning parameters using a grid search approach. The performance metric selected for the development of the machine learning model was the receiver operating characteristic (ROC) curve, which shows the True Positive Rate (TPR) vs. the False Positive Rate (FPR) at varying classification thresholds. TPR is defined as follows:

TPR = TP / (TP + FN), (11)

and FPR is defined as follows:

FPR = FP / (FP + TN).

To quantify the performance of the model across all possible classification thresholds, we compute the Area Under the ROC Curve (AUROC). The feature importance was calculated using the varImp function from the caret package. The optimised model obtained an AUROC value of 0.79 (above Figure a). Given that the model is trained on a random selection of non-compliant PR-2 cases, the feature importance plot of the 12 mutation rate constants would not be representative of the wider data set. To circumvent this problem, we repeated the training process 1000 times using the optimised hyperparameters to obtain an average feature importance plot of the 12 mutation rate constants (below Table). For each random sampling of non-compliant PR-2 cases, we fixed the seed to 123. The diagonal transversion mutation rates (k_A→T; k_T→A; k_G→C; k_C→G) have a consistently higher feature importance compared to all other mutation rates (above Figure b).

The 12 mutation rate constant-based features ranked by relative average importance for the achieved prediction quality. The selection of non-compliant PR-2 cases was done via random sampling. As the model is trained on a random selection of non-compliant PR-2 cases, the feature importance of the 12 mutation rate constants would not be representative of the wider data set. Instead, we repeated the training process 1000 times using the optimised hyperparameters to obtain an average feature importance plot of the 12 mutation rate constants. The barplots represent the average feature importance and the orange bars represent the average ± 1 standard deviation.

Figure S3. Numerical analysis of all-independent mutation rate constants assumed to have a normal distribution. The systems of equations were solved starting from 25% initial contents for all four bases and rate constants randomly and independently drawn from a truncated normal distribution obtained from the Trek methodology [14] in the byr⁻¹ range. 25,000,000 such systems were calculated to produce genomic base contents at the final 4.28 byr time point. The 2-dimensional kernel density estimate scatterplots in a-d present the distribution of G+C content and C_AT content skews from the outcome of the simulation (colours vary with decreasing occurrence frequency from red to blue). The simulations were run four times separately with standard deviation values scaled by an additional multiplier m ∈ {1,2,5,10} in plots a-d, respectively.

Figure S4.
Numerical analysis of the NSB-model symmetry-constrained mutation rate constants assumed to have a truncated normal distribution. The systems of equations were solved starting from 25% initial contents for all four bases and symmetry-constrained rate constants randomly drawn from a truncated normal distribution obtained from the Trek methodology [14] in the byr⁻¹ range. 25,000,000 such systems were calculated to produce genomic base contents at the final 4.28 byr time point. The 2-dimensional kernel density estimate scatterplots in a-d present the distribution of G+C content and C_AT content skews from the outcome of the simulation (colours vary with decreasing occurrence frequency from red to blue). The simulations were run four times separately with standard deviation values additionally scaled by m ∈ {1,2,5,10} multipliers in plots a-d, respectively.

Figure S5. The base content solutions at equilibrium reveal the dependencies between the genomic G+C content and the mutation rate constants (red line). The strand-symmetric mutation rate constants were obtained from 17 species across the eukaryotic and prokaryotic kingdoms [18][19][20][21][22][23][24][25][26][27][28] and overlaid as a scatterplot with colourings based on their associated eukaryotic (purple) or prokaryotic (dark green) kingdoms.

Figure S6. Evaluations of the analytical relations of the 12 independent mutation rate constants compliant with PR-2. (left) Equations generated from Eureqa (now part of DataRobot) [29,30] were used to predict each mutation rate constant, comparing it against the mutation rate constant that arrived at the PR-2-compliant solution in the simulation. Pearson correlation coefficients are indicated on each graph, where the equation for k_A→T has the highest value (R = 0.883) while that for k_G→A has the lowest (R = 0.740). (right) Similar graphs were obtained for the test set (see main paper), where the equation for k_G→C has the highest value (R = 0.872; R = 0.872 also for k_A→T) while that for k_T→C has the lowest (R = 0.729). The diagonal dotted line represents perfect correlation with a slope of 1.

Figure S7. Distribution of time to reach PR-2 compliance ("Chargaff") and genome equilibrium (∆Eq-Ch denoting the gap between the two). 10 million systems with the simulation model were generated, where the initial base content was randomly sampled from the maximum allowed range in eukaryotic and prokaryotic organisms for the first two randomly sampled bases, with the remaining two bases being sampled from the remainder so that the four base contents always sum to 1 for a complete genome. The strand-symmetric mutation rate constants were randomly drawn from a truncated normal distribution, the mutation rate constants of which were obtained from the work of Michael Lynch [28]. This process was performed for each of the six species independently. The green histograms represent the distribution of time to reach PR-2 compliance and the purple histograms represent the distribution of time to reach genome equilibrium (see Materials and Methods for details). The vertical dotted line intercepts the x-axis at 4.28 byr, the maximum current estimate of the age of life on Earth [31].

Figure S8. Examination of the closeness of current PR-2-compliant lifeforms to the fully equilibrated solution. (a) The strand-symmetric mutation rate constants were obtained from 17 species across the eukaryotic and prokaryotic kingdoms [18][19][20][21][22][23][24][25][26][27][28], aligned to the Trek scaling [14] and substituted into the parameters of the 12 sets of mutation rate constant equations. Each species is represented by a unique colour. The Pearson correlation coefficient values are indicated on each graph, where k_C→A obtained the highest value (R = 1.000). (b) Similar graphs were obtained where each graph represents all 12 mutation rate constants from (a) for each of the 17 species. Homo sapiens has the lowest Pearson correlation coefficient (R = 0.525). The diagonal dotted line represents perfect correlation with a slope of 1.
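The initial-content sampling described for Figure S7 can be sketched as follows; this is a hedged Python illustration in which the lo/hi range bounds are placeholders, not the actual per-kingdom maximum allowed ranges:

```python
# Sketch: draw two base contents freely within an allowed range, then fill
# the remaining two from the remainder so all four always sum to 1.
import random

def sample_contents(rng, lo=0.1, hi=0.4):
    """lo/hi are placeholder bounds, not the per-kingdom ranges."""
    bases = ["A", "G", "T", "C"]
    rng.shuffle(bases)                    # randomise which bases go first
    c = {}
    c[bases[0]] = rng.uniform(lo, hi)
    c[bases[1]] = rng.uniform(lo, min(hi, 1 - c[bases[0]]))
    rem = 1 - c[bases[0]] - c[bases[1]]   # left for the last two bases
    c[bases[2]] = rng.uniform(0, rem)
    c[bases[3]] = rem - c[bases[2]]
    return c

c = sample_contents(random.Random(7))
```

With lo = 0.1 and hi = 0.4, the first two draws leave at least 0.2 for the remaining pair, so all four contents stay non-negative and sum to 1.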