Design of RNAs: comparing programs for inverse RNA folding

Abstract

Computational programs for predicting RNA sequences with desired folding properties have been extensively developed and expanded in the past several years. Given a secondary structure, these programs aim to predict sequences that fold into a target minimum free energy secondary structure, while considering various constraints. This procedure is called inverse RNA folding. Inverse RNA folding has traditionally been used to design optimized RNAs with favorable properties, an application that is expected to grow considerably in light of advances in the expanding fields of synthetic biology and RNA nanostructures. Moreover, it was recently demonstrated that inverse RNA folding can be used successfully as a preprocessing step in the computational detection of novel noncoding RNAs. This review describes the most popular freeware programs that have been developed for such purposes, starting from RNAinverse, which was devised when the inverse RNA folding problem was formulated. The most recently published programs that take an RNA secondary structure as input are antaRNA, RNAiFold and incaRNAfbinv, each having different features that could be beneficial to specific biological problems in practice. The various programs also use distinct approaches, ranging from ant colony optimization to constraint programming, in addition to adaptive walk, simulated annealing and Boltzmann sampling. This review compares the various programs and provides a simple description of the available possibilities to help practitioners select the most suitable program. It is geared toward specific tasks requiring RNA design based on an input secondary structure, with an outlook toward the future of RNA design programs.

1 Command Line Interfaces

1.1 RNAinverse
The command line interface of RNAinverse allows subtle optimizations, while the main inputs are entered when the software prompts for them. More advanced options exist for custom alphabets, energy parameters and base pairing; they are not discussed here, as they are very rare use cases. For the average user, the following options will be the most used.
-T Rescale energy parameters for a given temperature.
-F Select the minimization algorithm. m for energy minimization or p for partition function.
-R The number of sequences to output for the same structure. A negative number forces the software to continue until a perfect match is found.
--noGU Do not allow GU pairs.
--noClosingGU Do not allow GU pairs at the end of helices.
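Because RNAinverse reads the structure and the starting sequence interactively, it can help to check both strings before launching a run. The following is a minimal sketch, not part of RNAinverse itself: the helper names `check_dot_bracket` and `rnainverse_job` are hypothetical, and the sketch only assembles the argument list and the two input lines from the options listed above (an all-N start sequence is assumed when none is given).

```python
def check_dot_bracket(structure: str) -> None:
    """Raise ValueError if a dot-bracket string is malformed or unbalanced."""
    depth = 0
    for i, ch in enumerate(structure):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched ')' at position {i}")
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if depth != 0:
        raise ValueError(f"{depth} unmatched '('")


def rnainverse_job(structure, start=None, repeats=1, mode="m", temperature=37):
    """Return (argv, stdin_text) for one RNAinverse run.

    Hypothetical helper: it only combines the -R, -F and -T options
    described in the text with the two lines typed at the prompt.
    """
    check_dot_bracket(structure)
    if mode not in ("m", "p", "mp"):
        raise ValueError("mode must be 'm', 'p' or 'mp'")
    if start is None:
        # Assumption: a fully unconstrained start sequence of N's.
        start = "N" * len(structure)
    if len(start) != len(structure):
        raise ValueError("structure and start sequence must have the same length")
    argv = ["RNAinverse", f"-R{repeats}", f"-F{mode}", f"-T{temperature:g}"]
    stdin_text = f"{structure}\n{start}\n"
    return argv, stdin_text
```

The returned pair can then be handed to `subprocess.run(argv, input=stdin_text, text=True, capture_output=True)` on a machine where RNAinverse is installed and on the PATH.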
For example, to find a maximum of 50 solutions, using both the partition function and energy minimization algorithms, for the structure ((((...(((....)))...((((....))))...)))), allowing any sequence with a mandatory GC base pair between the first and last nucleotides, at 25 °C, the command would be as follows:

./RNAinverse -R50 -Fmp -T25

Once the software starts, it requests an input structure and a starting sequence. Lowercase letters are forced into the sequence, while uppercase letters are treated as a starting sequence. If no sequence is entered, a random seed sequence is used.

((((...(((....)))...((((....))))...))))
gNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNc

1.2 incaRNAfbinv
incaRNAfbinv is a combination of two separate programs, each with its own command line interface. Most users are advised to use the webserver, as it already combines the two interfaces.

incaRNAtion generates the seed sequences that are later passed to RNAfbinv. To run it, a Python distribution must be installed. The command line interface includes many fine-tuning parameters, while the main structure input must be supplied in a file. The input file must contain a target structure. In addition to the structure, a multiple sequence alignment (MSA) may be added to provide sequence information. For the average user, the following options will be the most used.
-d The path to an input file containing the secondary structure and an optional MSA.
-a A number between 0 and 1 used by the algorithm as a weight: 1 takes into account only the structure, while 0 considers only the MSA.
-m Maximum penalty for an invalid pair.
-s_gc This is followed by two numbers. The first, between 0 and 1, forces a given GC content, while the second sets the minimal number of output sequences required.
-gc_max_err A number between 0 and 1 giving the maximal GC difference between the output sequences and the requested value. 0.1 by default.
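The interplay between the GC target given to -s_gc and the tolerance given to -gc_max_err can be illustrated with a short check on candidate sequences. This is a sketch of one plausible reading of the tolerance (absolute difference in GC fraction), not incaRNAtion's own code; the function names `gc_content` and `within_gc_tolerance` are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in a sequence (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def within_gc_tolerance(seq: str, target: float, max_err: float = 0.1) -> bool:
    """True if the GC fraction of seq is within max_err of target.

    Assumption: the tolerance is an absolute difference between GC
    fractions; 0.1 mirrors the default quoted for -gc_max_err.
    """
    return abs(gc_content(seq) - target) <= max_err
```

Under this reading, a sequence with a GC fraction of 0.5 is accepted for a target of 0.55 with the default tolerance, whereas a sequence containing no G or C is rejected for a target of 0.5.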
To use the incaRNAtion seeds, download the RNAfbinv extended version. The package includes a Java GUI. The command line offers the same options as the GUI version. For the average user, the following options will be the most used.
-i The number of simulated annealing iterations for a single sequence design.
-t Look-ahead depth: the maximum number of consecutive mutations yielding a lower-scoring sequence that are allowed before a single simulated annealing iteration ends.

1.3 RNAiFold
The command line interface of RNAiFold has over 50 options, allowing extremely fine tuning of the desired output. For the average user, the options listed below will be the most used. There are two ways to set them. Usually, an option is simply given as an argument on the command line. Alternatively, options can be supplied through a file, in which the option name appears on a line preceded by a # instead of a -, followed on the next line by the desired value.
-RNAscdstr The target structure. Multiple targets can be set; they must be on the same line, separated by the pipe symbol |. The structures must have the same length.
-RNAseqcon The admissible sequences, in IUPAC format. It must be a single string of the same length as the structure.
-maxGCcont The maximal GC content admissible in the sequences.
-minGCcont The minimal GC content admissible in the sequences.
-TimeLimit The amount of time allowed to run (default 600 seconds).
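The file-based form described above (option name on a line starting with #, value on the next line) is easy to generate from a dictionary. The sketch below follows only that description; the function name `rnaifold_input_text` is hypothetical, and whether RNAiFold expects a particular option order or additional lines has not been verified here.

```python
def rnaifold_input_text(options: dict) -> str:
    """Render options as '#name' / 'value' line pairs.

    Hypothetical helper based on the file layout described in the text.
    It also applies the two length rules stated for -RNAscdstr and
    -RNAseqcon before rendering.
    """
    structures = str(options.get("RNAscdstr", "")).split("|")
    if not structures[0]:
        raise ValueError("RNAscdstr (target structure) is required")
    length = len(structures[0])
    if any(len(s) != length for s in structures):
        raise ValueError("all target structures must have the same length")
    constraint = options.get("RNAseqcon")
    if constraint is not None and len(constraint) != length:
        raise ValueError("RNAseqcon must be as long as the structure")

    lines = []
    for name, value in options.items():
        lines.append(f"#{name}")   # option name, '#' replacing the leading '-'
        lines.append(str(value))   # value on the following line
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file then gives an input equivalent to passing the same options as command line arguments, under the assumptions stated above.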

1.4 antaRNA
-Cstr The target structure in dot-bracket notation. A fuzzy notation can be used to define blocks that are allowed to base pair together, using any lowercase and uppercase letters.
-Cseq The admissible sequences, in IUPAC format. It must be a single string of the same length as the structure.
-tGC Target GC content, in [0, 1], which also serves as a minimum.
-tGCmax Maximal GC content admissible in the sequences.
-tGCvar Variance (σ²) in the case of a normal distribution; -tGC then serves as the expected value µ.
-t The amount of time allowed to run (default 600 seconds).
-n Number of solutions to be produced.
In addition, all parameters of the ant colony search algorithm can be modified directly through the command line, from the random seed used to initiate the search (-s) to the number of ants exploring (-aps, default 10), the pheromone evaporation rate (-er, default 0.2), and a wealth of others.
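The combination of -tGC and -tGCvar can be pictured as drawing a GC target for each requested solution from a normal distribution with mean µ and variance σ². The sketch below illustrates only that sampling idea; the function name `sample_gc_targets` is hypothetical, the clipping to [0, 1] is an assumption, and how antaRNA itself draws its values is not specified in this text.

```python
import math
import random


def sample_gc_targets(mu, variance, n, seed=None):
    """Draw n GC targets from N(mu, variance), clipped to [0, 1].

    Illustration only: mu plays the role of -tGC and variance the role
    of -tGCvar. Clipping is an assumption made so that every target is
    a valid GC fraction.
    """
    if variance < 0:
        raise ValueError("variance must be non-negative")
    rng = random.Random(seed)        # a fixed seed makes the draw repeatable
    sigma = math.sqrt(variance)      # random.gauss expects a standard deviation
    targets = []
    for _ in range(n):
        value = rng.gauss(mu, sigma)
        targets.append(min(1.0, max(0.0, value)))
    return targets
```

Note that the variance is converted to a standard deviation before sampling; passing σ² directly to `random.gauss` would draw from a different distribution.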

1.5 NUPACK
The NUPACK package provides a suite of tools, design being the application for inverse folding. It has fewer options than the previous programs, but, with its focus on designing long sequences that are viable in vitro, it can extrapolate the energy parameters for given concentrations of sodium and magnesium.
The program loads the target structure and the admissible sequences, in IUPAC format, from a file PREFIX.fold. The PREFIX can be any name chosen by the user, but the extension .fold is required. Additional parameters are:
-material Can be set to rna1995 to use the Turner95 energy parameters or to rna1999 for the Mathews99 energy parameters.
-sodium The sodium concentration.
-magnesium The magnesium concentration.
-prevent The name of a file containing one forbidden subsequence per line; these subsequences are excluded from the design.
-loadseed PREFIX.init A file containing one number, the random seed to be used. By default, each execution of the software chooses a different random seed, but the program is deterministic and will always return the same output for a given seed. Note that the file must have the same PREFIX as the .fold file, followed by the extension .init.
A design run is launched as follows:

./design -material rna1995 PREFIX

Note that the suffix .fold is not given. To generate a different sequence, launch the program again.
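Preparing the input for design can be scripted in a few lines. The sketch below is an illustration under stated assumptions, not NUPACK code: the helper names `nupack_fold_text` and `nupack_design_argv` are hypothetical, and the layout of PREFIX.fold (structure on the first line, optional sequence constraints on the second) is an assumption that should be checked against the NUPACK documentation.

```python
def nupack_fold_text(structure, constraints=None):
    """Text for PREFIX.fold.

    Assumed layout: target structure on the first line, optional
    IUPAC sequence constraints on the second.
    """
    if constraints is not None and len(constraints) != len(structure):
        raise ValueError("constraints must be as long as the structure")
    lines = [structure]
    if constraints is not None:
        lines.append(constraints)
    return "\n".join(lines) + "\n"


def nupack_design_argv(prefix, material="rna1995", sodium=None, magnesium=None):
    """Argument list for the design executable, using the flags listed above."""
    if prefix.endswith(".fold"):
        # The text notes that the .fold suffix is not given on the command line.
        raise ValueError("pass the prefix without the .fold suffix")
    argv = ["./design", "-material", material]
    if sodium is not None:
        argv += ["-sodium", str(sodium)]
    if magnesium is not None:
        argv += ["-magnesium", str(magnesium)]
    argv.append(prefix)
    return argv
```

After writing the output of `nupack_fold_text` to PREFIX.fold, the list returned by `nupack_design_argv` can be passed to `subprocess.run` from the directory that contains the design executable.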