RNAblueprint: flexible multiple target nucleic acid sequence design

Abstract

Motivation: Realizing the value of synthetic biology in biotechnology and medicine requires the design of molecules with specialized functions. Due to its close structure-to-function relationship and the availability of good structure prediction methods and energy models, RNA is perfectly suited to be synthetically engineered with predefined properties. However, currently available RNA design tools cannot be easily adapted to accommodate new design specifications. Furthermore, complicated sampling and optimization methods are often developed to suit a specific RNA design goal, adding to their inflexibility.

Results: We developed a C++ library implementing a graph coloring approach to stochastically sample sequences compatible with structural and sequence constraints from the typically very large solution space. The approach allows one to specify and explore the solution space in a well-defined way. Our library also guarantees uniform sampling, which makes optimization runs performant not only by avoiding re-evaluation of already found solutions, but also by raising the probability of finding better solutions in long optimization runs. We show that our software can be combined with any other software package to enable diverse RNA design applications. Scripting interfaces allow easy adaptation of existing code to accommodate new scenarios, making the whole design process very flexible. We implemented example design approaches written in Python to demonstrate these advantages.

Availability and implementation: RNAblueprint, Python implementations and benchmark datasets are available at GitHub: https://github.com/ViennaRNA.

Supplementary information: Supplementary data are available at Bioinformatics online.

To get a better understanding of the solution landscape induced by the introduced move steps, we analyzed the local neighborhood of three small examples with dependency graphs of varying complexity, shown in Supplementary Figure 2. Using one of the introduced sampling methods (global, C-local and P-local, see section 2 in the main text), the local neighborhood was explored by stochastic sampling. The analysis includes the actual Hamming distance to the start sequence (Supplementary Figure 3) and the cost change (Supplementary Figure 5) for the two parts of the multi-state objective function (formula (2) in the main text). Additionally, the random move, where one of the four sampling methods is chosen at random at each step, was investigated. For C-local, 85% of the reachable neighborhood, i.e. 3506 neighbors, was generated for each of the 100 sequences. The same absolute number was used for the global and random approaches. With the P-local move, only very few neighbors can be reached; we therefore sampled as many solutions as possible using an exit condition.
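Exhaustively sampling a small neighborhood "using an exit condition" can be sketched as follows. This is a hypothetical stand-in, not the RNAblueprint API: `collect_neighborhood`, `sample_neighbor` and the toy sampler are illustrative names, and the patience-based stopping rule is one plausible exit condition.

```python
# Sketch of exhausting a small neighborhood with an exit condition:
# stop once `patience` consecutive samples yield no new sequence.
# `sample_neighbor` stands in for one move (e.g. P-local) from the
# start sequence; all names here are illustrative, not RNAblueprint API.

import random

def collect_neighborhood(sample_neighbor, patience=1000, seed=0):
    rng = random.Random(seed)
    seen, misses = set(), 0
    while misses < patience:
        seq = sample_neighbor(rng)
        if seq in seen:
            misses += 1     # no new neighbor found
        else:
            seen.add(seq)
            misses = 0      # reset on every new discovery
    return seen

# Toy stand-in: a "neighborhood" of 20 distinct sequences drawn uniformly.
toy = lambda rng: "seq%02d" % rng.randrange(20)
print(len(collect_neighborhood(toy)))
```

With a small neighborhood and a sufficiently large patience, the loop recovers essentially all reachable neighbors before terminating.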
The Hamming distance describes the size of a move step in terms of actually changed nucleotides, see Supplementary Figure 3. The distribution of these distances strongly depended on the structure of the dependency graph. For the four- and five-structure examples, the C-local move always showed a flat distribution at smaller Hamming distances and one defined peak at a certain distance. This results from the fact that the corresponding dependency graphs consisted of one bigger connected component in addition to several small ones (see Supplementary Figure 2). Each connected component can be divided into two disjoint sets due to the bipartite property of the base assignments. If the coloring pattern of the disjoint sets of the bigger component was changed such that the coloring switches, all nucleotides of this big connected component change, resulting in the defined peak at a Hamming distance of exactly the size of this component. If the sets maintained the coloring pattern, we obtained a flat distribution of several smaller distances. Global sampling of the full sequence resulted in similar peaks, however with a shift towards higher distances, as all the smaller connected components are also resampled at every move. The peaks at higher distances show a more even distribution for the same reason. In the analyzed examples, no decomposed path was longer than three nucleotides, excluding special vertices. Therefore, we only obtained Hamming distances between 0 and 3 with the P-local approach. Sampling with a randomly picked move step resulted in a superposition of all the individual Hamming distance distributions, see Supplementary Figure 3.
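The Hamming distance used throughout this analysis is the plain position-wise count of differing nucleotides between two equal-length sequences; a minimal sketch:

```python
def hamming(seq_a: str, seq_b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

# A move that recolors a connected component of size 3 changes exactly
# those 3 nucleotides (positions 3-5 here):
print(hamming("GCAUGCAU", "GCAACGAU"))  # -> 3
```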
We further investigated the cost change from a start sequence to the local neighborhood reachable by applying the described sampling methods, Supplementary Figure 5. This is depicted in two-dimensional density plots, with the cost changes of the two parts of the multi-state objective function (objective 1 and objective 2) on the x- and y-axes. The weighted overall cost change can be read along the inclined lines; the purple line indicates neighbors with unchanged overall cost, and the scale of the actual improvement or decline can be read from the x-axis. From left to right, increasingly optimized sequences taken from different time steps of Figure 4 were used as start sequences for the analysis. The degree of optimization is therefore measured as "number of sampled sequences".
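The inclined "constant cost" lines follow directly from a weighted objective. As a sketch only, we assume the overall cost is the weighted sum obj1 + ξ·obj2 as a stand-in for formula (2) of the main text (the exact form is defined there); `weighted_delta` and the example numbers are hypothetical:

```python
# Sketch of how neighbors separate into "better" and "worse" in the
# 2-D cost-change plots. We assume the weighted sum
#     cost = obj1 + xi * obj2
# as a stand-in for formula (2) of the main text (see there for the
# exact definition); xi is the weighting parameter.

def weighted_delta(d_obj1: float, d_obj2: float, xi: float = 0.5) -> float:
    """Overall cost change of a neighbor relative to the start sequence."""
    return d_obj1 + xi * d_obj2

# Neighbors on the purple line satisfy weighted_delta(...) == 0;
# negative values are improvements. Example (made-up) neighbors:
neighbors = [(-1.2, 0.4), (0.3, 0.1), (0.5, -1.0)]
better = [n for n in neighbors if weighted_delta(*n) < 0]
print(len(better))
```

A line of constant weighted cost in the (Δobj1, Δobj2) plane has slope -1/ξ, which is why all such lines in the density plots are parallel.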
In the leftmost plot (number of sampled sequences = 0), the local neighborhood showed quite similar distributions of cost improvements on objective 1 and objective 2 for all sampling methods. After 100 iterations of optimization, C-local showed the highest number of neighbors with better costs (number in the purple box); furthermore, the cost improvement possible for individual neighbors was highest for the C-local approach. Both might result from the low quality of its optimized sequences compared to the other approaches. When analyzing even more optimized sequences after 1000 steps, the C-local move still showed the highest number of better neighbors; however, the possible cost improvement was quite similar across methods. Only with the P-local approach could the cost not be substantially improved, as the local minimum had almost been reached. After 5 · 10^5 iterations, no better solutions could be obtained for the C-local and P-local approaches, as the optimization appeared to have reached a local minimum. Overall, P-local sampling behaved similarly to C-local sampling, but reached the local minimum of the optimization much earlier due to the smaller size of the reachable neighborhood. In a local minimum, no better solution can be obtained with the same move step. Only the global approach could not reach an optimization minimum, as it samples from the whole solution space. However, better solutions could only rarely be found (see Supplementary Figure 5).

Supplementary Figure 1: The dependency graph G can be generated as the union of the circle plot representations of the given structural constraints. The three input structures in dot-bracket notation are first converted to the circle representation depicted on the left-hand side, and then the union is formed as shown on the right-hand side. Vertices represent bases in a given order along the backbone of the molecule.
Edges in different shades of gray represent the base pairs of the three input structures. Colors on the vertices show the different connected components into which the graph can be decomposed. The further decomposition and graph coloring of this example is shown in Figure 1.

[Dot-bracket structure strings from the figure panels; only fragments survived extraction.]

For C-local, 85% of the reachable neighborhood was sampled, while for global and random the same absolute number was used (3str: 1.3 · 10^4, 4str: ≈ 3.5 · 10^5, 5str: ≈ 1.6 · 10^7 sequences). The P-local approach had a much smaller neighborhood, which was sampled with an exit condition to reach most of the neighbors (3str: 7108, 4str: 6076, 5str: 1.3 · 10^4 sequences).
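The union construction described in the caption can be sketched in plain Python, independent of the RNAblueprint API: parse each dot-bracket structure into base-pair edges, take the union of all edges, and read off the connected components of the resulting dependency graph.

```python
# Sketch: build the union dependency graph from several dot-bracket
# structures and decompose it into connected components.

from collections import defaultdict

def pairs(db: str):
    """Base pairs (i, j) of a dot-bracket string."""
    stack, result = [], []
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            result.append((stack.pop(), i))
    return result

def connected_components(structures):
    """Union the base-pair edges of all structures and return the
    connected components of the resulting dependency graph."""
    n = len(structures[0])
    adj = defaultdict(set)
    for db in structures:
        for i, j in pairs(db):
            adj[i].add(j)
            adj[j].add(i)
    seen, comps = set(), []
    for v in range(n):
        if v in seen:
            continue
        comp, todo = set(), [v]
        while todo:            # depth-first flood fill from v
            u = todo.pop()
            if u in comp:
                continue
            comp.add(u)
            todo.extend(adj[u] - comp)
        seen |= comp
        comps.append(sorted(comp))
    return comps

# Two toy structures: vertex 0 pairs with 5, vertex 1 pairs with 4.
print(connected_components(["((..))", ".(..)."]))
```

Positions that pair in at least one structure end up in the same component and must therefore be colored (assigned nucleotides) together, which is exactly why move steps change whole components at once.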

Supplementary Figures
Supplementary Figure 4: Cost change during the optimization procedure using simulated annealing (SA) or an adaptive walk. We minimized function (2) from the main text with ξ = 0.5 to calculate the cost. P-local as well as C-local moves were used for comparison. The shaded area depicts the confidence interval (±σ). The x-axis shows the number of sampled sequences, while the y-axis shows the mean cost of 100 optimization runs, normalized to the mean cost of the initial randomly chosen sequences. We used structural inputs of varying complexity listed in Supplementary Figure 2B.

Supplementary Figure 5: Each row corresponds to a different sampling method; columns represent the neighborhoods of differently optimized sequences obtained from Figure 4 in the main text. The degree of optimization is measured in "number of sampled sequences" during the optimization procedure. The density plots depict the cost change to local neighbors reachable with the used move step. The cost difference is split into changes of the two parts, objective 1 and objective 2, on the x-axis and y-axis, respectively. The weighted overall cost change relative to the start sequence can be read along the inclined lines, with the purple line indicating unchanged cost. The size of the neighborhood varies: for C-local we sampled 85% of unique neighbors and used the same absolute number for global and random. For the P-local move we sampled as many unique sequences as possible in a reasonable time using an exit condition, as this neighborhood is very small. The numbers in the boxes display the count of solutions in the corresponding quadrant; the purple box shows the absolute number of neighbors with a cost change smaller than zero, i.e. better solutions than the initial sequence.
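The two optimization schemes compared in Supplementary Figure 4 differ only in their acceptance rule. A minimal sketch, with `propose` and `cost` as placeholders for a move step (e.g. C-local resampling) and the objective; the toy integer example is illustrative, not the paper's benchmark:

```python
# Sketch: adaptive walk (accept only improvements) versus simulated
# annealing (also accept worse solutions with temperature-dependent
# probability). `propose` and `cost` stand in for a move step and the
# objective function.

import math
import random

def optimize(start, cost, propose, steps=1000, temperature=None, seed=0):
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    for k in range(steps):
        candidate = propose(current, rng)
        delta = cost(candidate) - current_cost
        if temperature is None:                     # adaptive walk
            accept = delta < 0
        else:                                       # simulated annealing
            t = temperature(k)
            accept = delta < 0 or rng.random() < math.exp(-delta / t)
        if accept:
            current, current_cost = candidate, current_cost + delta
    return current, current_cost

# Toy objective on integers instead of sequences, just to exercise the loop:
best, best_cost = optimize(
    start=50,
    cost=abs,
    propose=lambda x, rng: x + rng.choice([-1, 1]),
    steps=2000,
)
print(best_cost)
```

Passing a cooling schedule such as `temperature=lambda k: 1.0 / (1 + k)` switches the same loop to simulated annealing, which can escape the local minima where the adaptive walk stalls.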

Supplementary Tables
Supplementary Table 1: Published software to solve the inverse folding problem with single-, two- and multi-target structural input.

The following tables show the benchmark results summarized in Table 1 of the manuscript. The benchmarks were adapted from Taneda [24] and calculated using the multi-stable design optimization approach and the weighted objective function. For the two-, three- and four-structure inputs (see Methods section in the main text) we generated 100 independent solutions using the ViennaRNA package with the stop condition set to 1000. For the pseudoknotted structure data sets, only 30 solutions were generated, using the NUPACK package with the stop condition set to 100. We used the same measures as in [24] and expanded the table by a probability value. δe1 and δe2 are the minimal and maximal energy differences between the evaluated energies of the target structures and the minimum free energy. The values shown in each row of the table are for the solution with the lowest δe2 and, if multiple such solutions existed, also with the lowest δe1. Furthermore, n1, n2, n3, ... nM are the numbers of solutions such that 1, 2, 3, ... M of the target structures have the lowest free energy. We also introduced a new measure called "prob", which is the sum of the probabilities of all target structures in the Boltzmann ensemble for the solution picked using the δe1 and δe2 values. We furthermore list the time it took to construct the dynamic programming tables and to sample and optimize the sequences, in seconds (calculated on VSC3: Intel Xeon E5-2650v2, 2.6 GHz, Ivy Bridge-EP family). Note that the latter includes the energy calculations by ViennaRNA or NUPACK. The column named "max dim" shows the maximal number of dimensions over all dynamic programming tables. This number is the main measure for the complexity and memory requirements of the specific problem (see main text for the calculation of the complexity).
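Given a sequence's minimum free energy, the evaluated energies of its M target structures and their ensemble probabilities (obtained e.g. from ViennaRNA or NUPACK), the table's quality measures reduce to simple arithmetic. A sketch with made-up numbers; no claim is made about the benchmark's actual values:

```python
# Sketch of the table's quality measures for one designed sequence.
# mfe, target_energies and target_probs would come from an energy
# evaluation package; the values below are made up for illustration.

mfe = -20.0
target_energies = [-20.0, -19.4, -18.9]      # E(target_i), M = 3 targets
target_probs = [0.31, 0.12, 0.05]            # p(target_i) in the ensemble

deltas = [e - mfe for e in target_energies]
delta_e1 = min(deltas)    # minimal energy difference to the MFE
delta_e2 = max(deltas)    # maximal energy difference to the MFE
prob = sum(target_probs)  # "prob" column: summed target probabilities

print(delta_e1, round(delta_e2, 1), round(prob, 2))
```

A perfect multi-stable design would have δe1 = δe2 = 0 (all targets degenerate at the MFE) and "prob" approaching the total ensemble weight the targets can jointly carry.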