Machine-driven parameter screen of biochemical reactions

Abstract The development of complex methods in molecular biology is a laborious, costly, iterative and often intuition-bound process where optima are sought in a multidimensional parameter space through step-by-step optimizations. The difficulty of miniaturizing reactions under the microliter volumes usually handled in multiwell plates by robots, plus the cost of the experiments, limit the number of parameters and the dynamic ranges that can be explored. Nevertheless, because of non-linearities of the response of biochemical systems to their reagent concentrations, broad dynamic ranges are necessary. Here we use a high-performance nanoliter handling platform and computer generation of liquid transfer programs to explore in quadruplicates 648 combinations of 4 parameters of a biochemical reaction, the reverse-transcription, which lead us to uncover non-linear responses, parameter interactions and novel mechanistic insights. With the increased availability of computer-driven laboratory platforms for biotechnology, our results demonstrate the feasibility and advantage of methods development based on reproducible, computer-aided exhaustive characterization of biochemical systems.


INTRODUCTION
Systematic explorations of reaction parameters have been driven by automation and miniaturization of laboratory experiments, and sub-microliter liquid handling systems hold the biggest promises in terms of throughput together with reducing the cost of reagents. For instance, microfluidics technologies were used to generate 64 different combination of salt and DNA concentrations in a hybridization assay, for five different salts (1). More recently thousands of unique combinations of different concentrations of reagents and their reaction products were measured using microfluidic droplets (2). However, such approaches rely on merging physical streams of reagents, and thus are limited in the number of parameters that can be explored simultaneously and are also hard to apply to categorical parameters. On the other hand, several nanoliter-handling platforms have appeared on the market and empower researchers to design digitalized analysis with an arbitrary number of reagents. Among these platforms we chose acoustic droplet ejection technology because it combines several advantages. First, there is no contact between the machine and the liquid, which eliminates the cost of disposable plastic pipette tips. Second, source reagents can be provided in microplates with hundreds or even thousands of wells, allowing for the use of molecular barcodes. Third, each liquid transfer is fast.
As a pilot reaction for optimizing, we chose reversetranscription, which converts messenger RNA (mRNA) molecules to complementary DNA (cDNA), a suitable substrate for quantitative DNA sequencing technologies (3). The reverse-transcription reaction is central to biological experiments that aim at quantifying the activity of genes. In experiments where the amount of biological substrate is limited in quantity, the performance of the reverse-transcription reaction becomes a limiting factor. In particular, analysis of single cells requires a highly efficient conversion from mRNA to cDNA (4). The reversetranscriptase enzyme needs a short DNA oligonucleotide to prime the reaction. In addition, many methods for singlecell analysis use an additional 'template-switching' oligonucleotide in order to extend the cDNA sequence in preparation for sequencing (see Figure 1 and for review, Picelli, 2017 (5)). While increasing concentrations of the reversetranscription primer and the template-switching oligonucleotide tend to increase the efficiency of the reaction, a limit is reached at high values (see for instance Zajac et al. (6)). The proliferation of protocols using clearly different reaction parameters ( Table 1) suggests rather that oligonu- Figure 1. Method overview. This conceptual drawing shows a 384-well plate where each individual well is recognized by a combination of 64 different 'barcodes' and six different 'indexes'. The studied reaction is a reverse-transcription where a mRNA is converted to cDNA by a reverse transcriptase (RT) using a primer. In addition, a template-switching (T-S) oligonucleotide is present in the reaction to extend the sequence of the mRNA. After incubation in the plates, reactions prepared with six different starting amounts of RNA are amplified together in six different polymerase chain reactions using indexed primers, pooled in a multiplexed sequencing reaction and then demultiplexed in silico using the unique combination of barcode and index sequences. Multifactorial quantitative analysis of the sequencing reads then follows, according to the initial reaction parameters decided for each well in the experiment design.
cleotide concentrations seen as optimal by protocol developers depend on other parameters such as enzyme type or substrate amounts.
In the past decade, we have developed a method for quantitative high-throughput gene expression analysis, nanoCAGE (7,15), in which template-switching oligonucleotides and random reverse-transcription primers are used to introduce adapters at both ends of the cDNAs (Figure 1) for amplification and sequencing. The key features of the nanoCAGE method are that its reverse-transcription primers have random ends to cover non-polyadenlyated RNAs such as histone mRNAs and non-coding RNAs, and that the riboguanosine tails of the template-switching oligonucleotides are used to increase the specificity for capped mRNA ends (15,16). While these two oligonucleotides are essential to the reaction, they are also a source of noise as for instance they can prime each other and create artificial templates for the PCR reaction that follows. Thus, the choice of their concentrations was a trade-off between reaction efficiency and reaction specificity.
We originally developed nanoCAGE for samples yielding nanogram amounts of total RNA input. In the latest version of the nanoCAGE protocol (7), the PCR-amplified cDNAs are fragmented and prepared for sequencing by tagmentation with the Tn5 transposase (17). As a barcode sequence is introduced at the 5 end of cDNA molecules by the template-switching oligonucleotide during the reversetranscription step and an index sequence is added at the 3 end of sequencing products during the tagmentation step, several samples can be multiplexed and sequenced together. Additionally, Unique Molecular Identifiers (UMI) are also included in the sequence of template-switching oligonucleotides, which allows single molecule counting by removal of PCR duplicates in sequencing data. Since nanoCAGE is focused on 5 ends and does not aim at full transcript coverage, libraries can be completely sequenced even on an Illumina MiSeq sequencer (see for instance Tauran et al. (18)).
In pilot experiments, we obtained evidence that the latest nanoCAGE protocol might be miniaturized for picogram amounts of total RNA, but it remained unclear whether the amount of the other reagents in the reverse-transcription would need to be scaled down. We routinely use six quality control metrics ( Table 2) to assess the performance of nanoCAGE reactions. Importantly, these metrics require less than a thousand of reads per experiment to converge quickly to stable values, therefore allowing us to massively multiplex nanoCAGE reactions in small-scale, cost-efficient sequencing runs. Using nanoCAGE quality metrics as a proxy for the quality of reverse-transcription reactions, we developed the method presented here, to search for optima in the 4D parameter space representing enzyme type, RNA starting amount and molarity of the template-switching oligonucleotide and the reverse-transcription primer.

METHODS
Nanoliter-scale liquid transfers were performed on an Echo 525 instrument (Labcyte) using transfer sheets generated RTP: reverse-transcription primer. The other reagents (desoxynucleotide triphoshates (dNTP), magnesium (Mg) and manganese (Mn) ions and dithiothreitol (DTT)) are listed for the sake of completeness but were not investigated in this manuscript. Richness score (Hurlbert 1971), representing the expected number of molecules that would align in known genes if the library were downsampled to 10 reads. from event logs produced by simulations of instrument runs in the R programming language. All the scripts are archived as supplemental material on Zenodo and available for display at https://github.com/oist/labcyte-rt-optimisation/tree/ 1.0.0. Each 384-well plate was organized in six zones interleaved to match the geometry of standard multichannel pipettes in order to collect the contents of their wells and pool them in microtubes (Figures 1 and 2). Within a zone, each of the 64 wells was recognized with a unique barcode sequence inserted in the template-switching oligonucleotide. The barcoded oligonucleotides were chosen randomly from a list of 70 (out of 96) that passed quality controls (see below). Each zone received 9 × 6 combinations of molarities of template-switching oligonucleotides and reverse-transcription primers, plus nine controls with no reverse-transcription primers, plus 1 control with no RNA ( Figure 2). Wells from each zone received a different amount of total RNA. Four plates were prepared with the SuperScript III enzyme (Invitrogen) and four with Super-Script IV (Invitrogen). We used a total RNA prepared from mouse liver. An RNA Integrity Number (RIN (19)) of 9 was measured on an Agilent Bioanalyzer with an RNA 6000 pico kit. As molarity measurement were not possible with this kit, RNA amounts are measured as a mass in this work.
The template-switching oligonucleotides were synthesized by Integrated DNA Technologies, Inc. (IDT), at a scale of 10 nmol each in a 96-well plate format and resuspended at a stock concentration of 1 mM upon reception. To assess the quality of oligonucleotide synthesis, we ran a control experiment in which the parameters of the reverse-transcription reactions were kept constant (see supplemental experiment six on https://github.com/oist/labcyte-rt-optimisation/blob/1. 0.0/Labcyte-RT Data Analysis 6.md for details). For the main experiment, we selected 70 oligonucleotides among those who gave sequence yields closest to the mean. RT reaction master mixes (containing 0.0528 M sorbitol, 0.264 M trehalose, 0.75 M betain, 0.01 M DTT, 0.625 mM dNTPs, 20 U/l SuperScript enzyme III or IV and 1× SuperScript buffer III or IV), template-switching oligonucleotides, reverse-transcription primers, RNA and ultrapure water were transferred from independent wells of a source plate (Greiner Bio-One) to the different wells of the destination plates (Applied Biosystems) using specific liquid transfer patterns for each plate. After the acoustic transfer of RT reagents (total RT reaction volume equal to 500 nl), the 384-well destination plates were directly sealed, centrifuged at 4 • C, and deposited in a 7900HT Fast Real-Time PCR system (Applied Biosystems) to perform first , RNA masses are shown on a picogram scale (6 different RNA quantities ranging from 1 pg to 100 ng) and reagent transfer volumes are indicated on a nanoliter scale. Wells displayed in gray correspond to 60 negative control reactions performed without RT primers (see C and J) or without RNA (see E and L). There were thus 324 reagent combinations studied on each plate using two different enzymes (SSIII and SSIV), which represents a total of 648 data points used to explore the parameter space of the reaction. (O-P) Example of reverse transcription barcode ID distribution before (O) and after randomization (P) for a set of 64 reactions corresponding to one RNA mass (wells indicated in yellow on E and L). The 64 barcoded cDNA samples obtained after the reverse transcription were pooled, amplified and tagged by a specific index sequence (Q) representative of the corresponding group of samples in the sequencing library. Six different indexes characteristic of the six RNA masses were used for each plate. Randomized plates were prepared in quadruplicate for each reverse transcriptase using 24 different indexes, making a total of 1536 samples (64 barcoded samples x 24 indexes) that were pooled and sequenced together for each enzyme. strand cDNA synthesis (temperature program: 22 • C for 10 min; 50 • C for 30 min; and 70 • C for 15 min). RT products from each zone (see above and Figure 2) were further collected with a multichannel pipette, pooled and purified using AMPure XP beads (Beckman Coulter). Purified pools were then processed as in the nanoCAGE protocol (7) applying 18, 21, 24, 27, 30 and 33 PCR cycles during the first amplification reaction, respectively for pools of RT re-actions corresponding to template RNA amounts ranging from 100 ng to 1 pg. Libraries prepared for each starting RNA amount were tagged by specific index sequences (Nextera XT DNA Library Preparation Kit, Illumina), mixed equimolarly and sequenced on a MiSeq system using a reagent kit v2 (Illumina) with 46 cycles for the first read, 8 cycles for the index and 21 cycles for the second read. One run was performed for SuperScript III and one for Super- Script IV, reusing the same indexes. Sequencing data were processed with the MOIRAI workflow system (20) (Figure 3) to produce alignment files (BED), which were loaded into R using the CAGEr Bioconductor package ((21), https: //bioconductor.org/packages/CAGEr). Sequencing quality metrics ( Table 2) were calculated using custom R scripts (see https://github.com/oist/labcyte-rt-optimisation/blob/1. 0.0/Labcyte-RT Data Analysis 8.md for details). Richness indexes were calculated on gene expression levels (mapped counts per gene) using the 'rarefy' function of the 'vegan' R package (22). Variability of the replicates was assessed by calculating the mean average difference (MAD) and normalizing it by dividing it by the mean. Its value was lower than 0.2 in more than half of the replicates. Contour plots of the normalized MAD, as well as standard plots displaying all replicates of each parameter combination are available as supplemental figures. . RT products were harvested with a multichannel pipette, pooled and purified using Ampure XP beads (Beckman Coulter). Purified pools of single cell RT products were processed for nanoCAGE library preparation (7). Pools of reactions performed using standard primer concentrations with SuperScript III or SuperScript IV; and using alternate primer concentrations with SuperScript IV were subsequently tagged by specific index sequences (Nextera XT DNA Library Preparation Kit, Illumina), pooled and sequenced on a MiSeq instrument. Sequence data were subsequently processed with MOIRAI and custom R scripts as described above.

RESULTS
Using a Labcyte Echo 525 instrument, we assembled 500 nl reverse-transcription reactions in 384-well plates, by dispensing droplets of 25 nl from a source plate containing the reagent stocks to the destination plates. After incubating the reactions, we assessed their yield and quality by quantitative DNA sequencing using the nanoCAGE method. We sequenced two multiplexed pools of reactions using a selected set of 64 different template-switching oligonucleotides carrying sample identifiers ('barcodes'), combined with a second set of 24 sample identifiers ('indexes') ( Figure 2). Each pool contained 1536 reactions, prepared in four plates. As we focused on quality controls, the sequencing was kept at a small scale: ∼5000 sequence read pairs per reaction.
We explored one categorical and three continuous parameters spanning multiple orders of magnitude with data points regularly spaced on a logarithmic scale. The broadest range was for RNA amounts, with 6 points ranging between single-cell amounts (1-10 pg) and starting amounts typical for bulk RNA libraries (10-100 ng, or a few g in larger reaction volumes). The second broadest range was the molarity of the template-switching oligonucleotide (nine points between 0.6 and 160 M), followed by the reversetranscription primer's molarity (six points between 1 and 24 M). Combined with a categorical dimension encoding two different reverse-transcription enzymes: SuperScript III (SSIII) and SuperScript IV (SSIV), we thus generated 648 different combinations, to which were added 120 negative controls. We prepared four replicates, each of which using two plates (one per enzyme).
To prevent position biases in the reaction microwell plates, such as drying on the edges or inhomogeneous heating during reaction, we randomized the coordinates of the reactions in each replicate ( Figure 2). This randomization also has the effect of making each replicate plate unique, which prevents from accidental swapping. In particular, the different positions of the negative controls in each plate act as an identifying fingerprint. Lastly, since we used barcoded oligonucleotides, the randomization prevented that barcode sequences, individual reagent quality, or position-specific variability in the reagent transfer would get confounded with any reaction parameter. Randomizations were made entirely reproducible by the computer-aided generation of machine-readable volume transfer instructions and were reproducibly repeated for each replicate with a different random seed.
To ensure equal sequence coverage, the PCR amplifications that followed the reverse-transcriptions were scaled according to the starting quantities of RNA. Each PCR amplification was multiplexed with a different index (Figure 1). For each sample amplified in the same PCR reaction, we then calculated relative sequence yields, and observed a positive correlation with the molarity of the template-switching oligonucleotide ( Figure 4A), and little effect of the reversetranscription primer, except at the highest concentrations of both oligonucleotides and the RNA ( Figure 4B). To better represent the effect of the interaction between the two oligonucleotides, we calculated the medians of each quadruplicate, and represented them as a contour plot ( Figure 4C) on surfaces where the molarity of each oligonucleotide is one dimension. We use the same graphical representation for other quality control statistics in the remaining figures of the manuscript. As these plots do not show directly the variability between replicates, categorical plots versions in the same style of Figure 4B are available in the supplementary material.
To assess the quality and specificity of the reactions, we then calculated metrics reflecting the suitability of the reactions for transcriptome analysis (Table 2, Figure 5). With RNA inputs lower than 100 pg, the SSIII enzyme started to produce large amounts of oligonucleotide artefacts, at a scale that would compromise its use in transcriptome analy-ses. The SSIV enzyme performed comparatively well on one order of magnitude lower RNA input (10 pg). For both enzymes, increasing the molarity of the reverse-transcription primer lead to an increase of the oligonucleotide artefacts, but this could be compensated by a proportional increase of the molarity of the template-switching oligonucleotide, as shown by the diagonal patterns on the contour plots ( Figure  5A). A similar diagonal pattern can be observed in the contour plots for the ribosomal RNA rate (Figure 5B), showing again a positive effect in terms of quality for the reduction of the reverse-transcription primer's molarity and the increase of the template-switching oligonucleotide, especially for the SSIV enzyme. Another interesting trend was that overall, ribosomal RNA rates were lower for two extreme amounts of starting RNA material (100 ng and 1 pg). For SSIII at all RNA amounts and for SSIV at low RNA amounts, the highest molarities of the template-switching oligonucleotide were also increasing the ribosomal RNA rate. We then calculated the mapping rate, in which the ribosomal RNA sequences are considered unmapped. The contour plots (Figure 5C) were essentially the reverse of Figure 5B, showing that most non-ribosomal reads mapped correctly.
Within the set of all mapped reads, we calculated the promoter rate ( Figure 5D). Both enzymes showed similar trends except that SSIV had a lower baseline. Increases of the molarity of the template-switching oligonucleotide again increased reaction quality except at the highest molarities. Strikingly, the best promoter rates were obtained at high molarity of the reverse-transcription primer, which is the opposite trend in comparison with the other quality metrics calculated above. All RNA amounts benefited from high oligonucleotide molarities.
We then calculated the proportion of mapped reads that could be strand-invasion artefacts ( Figure 5E), which form when the template-switching oligonucleotide prematurely hybridizes with the nascent cDNA (23). Contrarily to the intuition that higher concentrations of this oligonucleotide would lead to higher frequencies of artefacts at higher molarity of the template-switching oligonucleotide, our results show a more complex relationship. For instance, the most extreme strand-invasion rates were reached with SSIII at high RNA amounts and low template-switching oligonucleotide molarity. This suggests that under these conditions, the artefacts may be more easily created than the desired products, and that higher concentrations of templateswitching oligonucleotide and reverse-transcription primer are necessary to repress artefact formation. Thus, our approach gave us not only a fine-grained mapping of the optimal reaction conditions in the parameter space, but also mechanistic insights.While our strategy of shallow sequencing allows us to calculate accurate quality metrics despite a very low coverage of each reaction, it is impossible to compare the expression profiles with each other, or to know the total number of genes that would be detected if we had orders of magnitude deeper coverage. To estimate the potential for gene detection, we calculated richness indexes for each library following the method of Hurlbert (24), on a scale of 10 ( Figure 5F). These indexes are the expected value (in the statistical sense) of the number of detected genes if the libraries were down-sampled to 10 mapped reads. In our libraries, surprisingly, richness indexes were higher at lower template-switching oligonucleotide molarities and higher reverse-transcription primer molarities. While intuitively it seems desirable to have higher richness indexes, it remains to be determined if they might be the reflection of a compression of the dynamic range.
Our results confirm that bulk and single-cell reactions have different optima. At high amounts of starting RNA (100 ng), the standard nanoCAGE protocol (SSIII enzyme, 10 M template-switching oligonucleotide, 1 M reversetranscription primer) is close to the optimal results, minimizing the amount of oligonucleotide and ribosomal RNA artefacts and maximizing the amount of data that correctly aligned to promoter regions. On the other hand, our results suggest that a protocol can be designed for single cells by using the SSIV enzyme, increasing the molarity of templateswitching oligonucleotides and reducing the molarity of the reverse-transcription primers. Indeed, we validated this observation during the development of a single-cell version of the nanoCAGE protocol, where we tested the use of SSIV with the template-switching oligonucleotide molarity increased to 45 M and the reverse-transcription primer molarity decreased to 0.4 M, on single cells isolated by flow cytometry in 4 L reverse-transcriptions reactions. As the amount of RNA per cell is usually estimated to be in the order of magnitude of 10 pg, the reaction concentration is in the order of the 1 pg / 500 nL condition in our parameter screen. In line with the screen's results, the change of enzyme and oligonucleotide concentrations dramatically reduced the amount of oligonucleotide artefacts, although at the expense of a reduced promoter rate and an increased rRNA rate (Figure 6).

DISCUSSION
To our knowledge, we are the first to report the exploration of a four-dimension parameter space for a biochemical reaction using 648 data points. Our experimental design described here follows a systematic and exhaustive approach testing all combinations ('grid search') between a few parameters. Nevertheless, such a design cannot be extended to a larger number of parameters, because their total number of combinations would not fit into 384-well plates (although it would take advantage of smaller formats such as 1536-well plates or miniaturized array), and because highdimension spaces are typically sparse. More efficient search strategies, such as random search (25), may enhance the power of the method. On the side of evaluation metrics, high dimensionality is also a problem. One possible workaround would be to reduce to two dimensions, for instance by calculating one arbitrary score representing undesirable products (oligonucleotides artifacts, ribosomal RNA sequences and strand invasion artifacts) and one arbitrary score representing efficient discovery of transcription start sites (mapping rate, promoter rate and richness), and then search for 'Pareto efficient' pairs representing states where the score on one dimension can not be ameliorated without deteriorating the score on the other dimension.  We observed a dramatic difference between the two reverse-transcriptases. According to the maker, SSIV has better thermostability, processivity and yield than SSIII. However, it is important to note that their reaction buffers also differ: SSIV's buffer is proprietary and is advertised to perform better in the presence of inhibitors. Swap experiments would be needed for determining whether the enzyme or the buffer are the key factor driving our observations. Another simple explanation could be that the unit definition, which is 'the amount of enzyme required to incorporate 1 nmole of deoxyribonucleotide into acid-precipitable material in 10 min at 37 • C using poly(A) oligo(dT) 12-18 as a template/primer' may correspond to different actual quantities of enzyme, given their different thermal optima. Here, we compared two enzymes, but the number of reversetranscriptases on the market is much larger (4). We expect differences at least as large as the ones observed between SSIII and SSIV. Our approach for a quantitative exploration of a reaction's parameter space may provide a method to better understand the impact of protein sequence differences between these enzymes, in particular between the engineered derivatives of the Moloney murine leukaemia virus reverse-transcriptase, provided that this information is disclosed by the vendors. Beyond addressing such questions, the approach that we have developed here will also provide the opportunity to search for ad-hoc mixtures of enzymes, and possibly buffers, that may maximize the output of the reaction (conversion to cDNA) and its quality (efficiency and specificity of template switching).
The reverse-transcription reaction that we studied here uses so-called 'random' primers that end with six random bases, for an even coverage of the transcriptome. Their drawback is the creation of artefacts when they hybridize to other oligonucleotides. In line with this, we observed that increasing the molarity of these reverse-transcription primers was increasing the amount of artefacts in the sequencing libraries ( Figure 5A). Unfortunately, an opposite trend was seen on other quality statistics, such as the promoter rate ( Figure 5D). This conflict may be hard to resolve. There are alternatives to random primers, such as oligo-dT primers that target the poly-A tail of the mRNAs, not-so-random primers (26) or pseudo-random primers (27). While the discussion of their pros and cons is beyond the scope of this manuscript, it is important to note that their optimal molarities are likely to be different.
Methods in molecular biology are made of a large number of serial steps, and the approach that we followed here is not limited to the reverse-transcription reaction. Here, we used quantitative sequencing for the readout, and this strategy can also apply to other reactions. For instance, the activity of a DNA-cutting enzyme (for instance Cas9) can be assayed by the degradation of a sequencing template. Conversely, the activity of a DNA-joining enzyme (for instance T4 DNA ligase) can be assayed by the assemblage of a template. Other platforms for the readout can be used, such as fluorescence detection, obviously for quantitative polymerase chain reaction, but also for any other reaction that can be designed to produce or degrade a fluorescent reporter.
Beyond the example presented here as a proof of principle, we believe that parameter space screening may become a routine experiment in the future. In particular, it will greatly benefit from computer-aided experiment design and robotic automation of experiment execution (28,29). The combination of laboratory automation and systematic parameter screening will ease the way to cross-replication studies in independent laboratories, as a strategy for costsharing, removal of implementation bias, and detection of human errors or data tampering.