Kinetic sequencing (k-Seq) as a massively parallel assay for ribozyme kinetics: utility and critical parameters

Abstract Characterizing genotype-phenotype relationships of biomolecules (e.g. ribozymes) requires accurate ways to measure activity for a large set of molecules. Kinetic measurement using high-throughput sequencing (e.g. k-Seq) is an emerging assay applicable in various domains that potentially scales up measurement throughput to over 106 unique nucleic acid sequences. However, maximizing the return of such assays requires understanding the technical challenges introduced by sequence heterogeneity and DNA sequencing. We characterized the k-Seq method in terms of model identifiability, effects of sequencing error, accuracy and precision using simulated datasets and experimental data from a variant pool constructed from previously identified ribozymes. Relative abundance, kinetic coefficients, and measurement noise were found to affect the measurement of each sequence. We introduced bootstrapping to robustly quantify the uncertainty in estimating model parameters and proposed interpretable metrics to quantify model identifiability. These efforts enabled the rigorous reporting of data quality for individual sequences in k-Seq experiments. Here we present detailed protocols, define critical experimental factors, and identify general guidelines to maximize the number of sequences and their measurement accuracy from k-Seq data. Analogous practices could be applied to improve the rigor of other sequencing-based assays.

S-3.1-a AAGTTTGCTAATAGTCGCAAG Figure S1. Standard curve for qPCR measurement, fitted by 'stats.linregress' function from 'SciPy' Python package. Figure S2. Relative standard deviation vs. mean of total RNA measured for each set of reacted sample triplicates, quantified by spike-in sequence (circles) or qPCR + Qubit (crosses). The mean relative standard deviations are similar for spike-in method (0.165) and for qPCR + Qubit method (0.176).
Text S1. Using customized kinetic models in the 'k-seq' package. Kinetic models with a closed form integrated rate law can be written as a python function, with the independent variables and parameters to be estimated as the input arguments, and the dependent variables as the return values. The function can be passed to the `k_seq.estimate` module for k-Seq fitting.
In general, rate constants can be determined by varying either time points or concentrations. The choice depends on experimental concerns. In the case of the aminoacylation ribozymes studied in this work, varying the initial concentration of substrate BYO was experimentally desirable for the following reasons. First, because BYO hydrolyzes in aqueous solution, fixing the time point simplified the form of the kinetic model for fitting (namely, introducing a constant factor to the exponential rather than a timedependent exponential of an exponential; please see Supporting Text S3 in (1)  As another example, k-Seq data can be fit to second-order kinetics. The integral form for second-order kinetics can be expressed as where ! is the initial amount of the sequence, " is the amount of reacted sequence at time , ! is the initial substrate concentration, and is the second-order rate constant. Let " = 0 where is the reacted fraction. The reacted fraction (as the dependent variable measured in k-Seq experiment) can be written as: This function can be used to fit the k-Seq experiments with varying time or varying initial substrate concentration ! , and a maximum amplitude may be added as a parameter if desired.  Region  Figure S10. Distribution of metric values for sequences from 6 selected regions from the simulated reacted fraction dataset, with various noise level. Histogram bars are stacked for visibility. Both metrics σ * and γ captured the trend of model identifiability as summarized in Table S2. In contrast, Δ failed to capture the difference in model identifiability between sequences with different regions and sample noise. showed good correlation between the two metrics (Spearman's ρ = 0.945, p-value = 0.000); larger variance was observed for sequences with > 2 due to lower counts and noisier measurements.  Table S3: Categorization of analyzable and non-analyzed sequence reads. We used a stringent criterion for joining, namely exact matching of the sequence reads in both directions. While this resulted in a substantial fraction of discarded reads, the benefit of improved sequencing error was important. The next largest category was sequences detected in the unreacted sample but not in the reacted samples; these are presumed to be relatively inactive sequences and tended to be higher-order mutants. To check that the analyzability requirement did not miss highly reactive sequences that were under-represented in the unreacted pool, we determined the mean counts for sequences that were found in the reacted samples but not in the unreacted sample. The highest mean count for those sequences was 3.6 reads, indicating that these sequences were not seen at high enough copy number after reaction to be considered a reliable candidate as an active sequence. However, the possible appearance of such candidates would depend on the specific experimental design and ribozyme being studied. residues and mutation rate , the fraction of a d-th order mutant in the pool is The maximizing the fraction of a single d-th order mutant satisfies + / = 0, thus η / = arg max η + = d/L (S2) and the maximum fraction for a d-th order mutant in the variant pool is We used Equation S3 to determine the optimal maximizing the fraction of a given order of mutant in the pool. For a sample with total reads, the expected counts for a d-th order mutant is + . Figure S14. Expected counts for mutants in mixed variant pools of four wild types given different mutation rate ( ) and 10 7 total reads, calculated using Equation S1. For mutants with > 2, the maximum number of expected counts is not sufficient for all possible d-th order mutant sequences to be estimated accurately (i.e., counts are < 10-100 at this sequencing depth), showing a limitation of the variant pool library.
Text S3. Effects of sequencing error in the variant pool.
From Equation S1, the relative abundance ratio between a d-th mutant and a (d+1)-th mutant is Assuming a constant sequencing error rate per nucleotide, and considering the effect of sequencing error only by substitution, the probability that a sequence will be misidentified as one of its one-mutation neighbors is .
The fraction of abundance that originates from its neighbors due to sequencing error is      , where ( ) is the fraction of sequence and is the number of unique sequences) for sequences with various minimum count thresholds in the unreacted samples. The variant pool is more even (higher normalized entropy) than the enriched pool.