TISIGNER.com: web services for improving recombinant protein production

Abstract Experiments that are planned using accurate prediction algorithms will mitigate failures in recombinant protein production. We have developed TISIGNER (https://tisigner.com) with the aim of addressing technical challenges to recombinant protein production. We offer three web services, TIsigner (Translation Initiation coding region designer), SoDoPE (Soluble Domain for Protein Expression) and Razor, which are specialised in synonymous optimisation of recombinant protein expression, solubility and signal peptide analysis, respectively. Importantly, TIsigner, SoDoPE and Razor are linked, which allows users to switch between the tools when optimising genes of interest.


INTRODUCTION
Recombinant protein production is a key process for life science research and the development of biotherapeutics. However, low protein expression and aggregation are the two major bottlenecks of recombinant protein production (1)(2)(3)(4)(5)(6)(7). Since mRNA abundance alone is insufficient to explain protein abundance (8)(9)(10)(11)(12), several features of mRNA sequence have been proposed to affect protein expression. These features are mostly related to codon usage, such as the codon adaptation index and tRNA adaptation index (13)(14)(15)(16)(17), or measures of mRNA secondary structure, such as G+C content, minimum free energy (MFE) of RNA secondary structure, and mRNA:ncRNA interaction avoidance (18)(19)(20)(21)(22)(23). Many of these features are not independent, making it challenging to distinguish the impacts of individual features (24). This, in turn, hinders the development of accurate prediction/optimisation tools. Recent systematic studies suggest that MFE is the most important feature in protein expression (24,25). However, more recent work shows that the mRNA accessibility of translation initiation sites outperforms MFE in predicting relative protein levels from mRNA sequences (26,27). Accessibility is computed by considering all possible structures for a region, weighted by free energy, not just the single structure with the MFE (28).
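The distinction between MFE and accessibility can be illustrated with a toy Boltzmann ensemble. The sketch below is not the RNAplfold implementation; the structure energies are hypothetical, and the function name `opening_energy` is ours. It assumes that the opening energy of a region equals −RT ln(P), where P is the ensemble probability that the region is unpaired.

```python
import math

# Gas constant (kcal/mol/K) multiplied by 37 degrees Celsius in kelvin.
RT = 0.0019872 * 310.15


def opening_energy(structures):
    """Return the opening energy (kcal/mol) of a region.

    `structures` is a list of (free_energy, region_is_unpaired) tuples,
    one per alternative secondary structure (hypothetical toy input).
    """
    weights = [math.exp(-energy / RT) for energy, _ in structures]
    partition = sum(weights)
    unpaired = sum(w for w, (_, free) in zip(weights, structures) if free)
    return -RT * math.log(unpaired / partition)


# Toy ensemble: the MFE structure (-5.0) pairs the region, but two
# slightly less stable structures leave it open.
toy_ensemble = [(-5.0, False), (-4.5, True), (-4.0, True)]
print(round(opening_energy(toy_ensemble), 3))
```

In this toy case, the MFE structure alone would suggest a blocked region, whereas the ensemble shows that the region is open in a substantial fraction of structures.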
In addition to a high protein expression level, high solubility is preferable for the purification and long-term storage of recombinant proteins. However, almost half of the successfully expressed proteins are insoluble (http://targetdb.rcsb.org/metrics), which makes the recombinant protein production process challenging. A number of methods have been suggested to improve protein solubility, for example, truncation, mutagenesis, and the use of solubility-enhancing tags (2,29,30,31). Nevertheless, accurate solubility prediction could save resources and aid in designing soluble proteins before the experiments. With these in mind, we have recently formulated the solubility-weighted index (SWI), which outperforms recent solubility prediction tools based on machine-learning algorithms (32).
Many recombinant proteins of interest are also secretory. The intracellular accumulation of heterologous secretory proteins may be toxic to the host cells. Therefore, the translocation efficiency of these proteins plays an important role in the quantity and quality of the yield. Secretory proteins usually have a short peptide at the N-terminus called a signal peptide (SP), which is responsible for the translocation of secretory proteins via the Sec, signal recognition particle (SRP) or twin arginine transport (Tat) pathways (33)(34)(35)(36). Detection of SPs or fusion of a suitable SP at the N-terminus is useful for optimising protein production (37)(38)(39)(40). Different pathways have different advantages; for example, the SRP-dependent pathway can be used for rapidly folding proteins (41). However, the Sec-dependent pathway, which is common across all forms of life, has been widely used for recombinant protein expression because of its higher protein production capacity and quality (41,42). In addition, the presence of SPs should almost always be checked when planning expression experiments for uncharacterised proteins.
Existing web tools predict or optimise either protein expression or solubility alone (43)(44)(45)(46)(47)(48)(49)(50)(51). Several web tools exist for predicting SPs (52)(53)(54)(55)(56). Only very few tools can detect toxic proteins, for example, SpiderP, ClanTox and ToxinPred (57)(58)(59). These tools are either limited to predicting the venoms of certain organisms, such as spiders, or are designed to predict the toxicity of mature peptides rather than the signal peptides of toxins. Moreover, these tools are offered through different independent services. We reasoned that these functionalities should be integrated in order to assist not only in choosing appropriate expression systems, but also in optimising the expression and solubility levels of recombinant proteins. Here we present TISIGNER.com, which integrates the optimisation tools TIsigner (translation initiation coding region designer) and SoDoPE (soluble domain for protein expression) for protein expression and solubility, respectively, and Razor for detecting SPs (26,32,60). Our web application provides easy, fast and interactive ways to assist users in planning and designing their experiments.

TIsigner
TIsigner offers tunable protein expression by optimising the mRNA accessibility of translation initiation sites (26). Accessibility (opening energy) is calculated using RNAplfold (28,61,62), and the regions used for this calculation are specific to the expression host. For Escherichia coli, Saccharomyces cerevisiae, and Mus musculus expression hosts, the optimal regions relative to the start codon for optimisation are −24:24, −7:89 and −8:11, respectively. For other expression hosts, we provide an option 'Other', which optimises the accessibility of the region −24:89. Since E. coli is the most popular expression host, the default settings aim to optimise protein expression in E. coli with the T7 lac promoter system (see below). In this case, only the protein coding sequence is required as input, and the 5′ UTR (5′ untranslated region) sequence used by default is the most popular, truncated version of the T7 promoter (63) (Figure 1). Otherwise, the 5′ UTR sequence is also required. For 5′ UTRs shorter than 71 nucleotides, upstream sequences can be used to extend the UTRs.
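As an illustration of the host-specific regions, the following minimal sketch extracts the window around the start codon from a 5′ UTR and a coding sequence. It is not taken from the TIsigner source code. The function name `initiation_region` is hypothetical, and we assume that a region such as −24:24 denotes the 24 nucleotides upstream of the start codon plus the first 24 nucleotides of the coding sequence; the exact coordinate convention used by TIsigner may differ.

```python
# Regions relative to the start codon, as listed in the text.
HOST_REGIONS = {
    "Escherichia coli": (-24, 24),
    "Saccharomyces cerevisiae": (-7, 89),
    "Mus musculus": (-8, 11),
    "Other": (-24, 89),
}


def initiation_region(utr5, cds, host):
    """Return the subsequence whose accessibility would be scored.

    Assumption: the region spans `-start` nucleotides of the 5' UTR
    and the first `end` nucleotides of the coding sequence.
    """
    start, end = HOST_REGIONS[host]
    upstream = -start
    if len(utr5) < upstream:
        raise ValueError("5' UTR is shorter than the upstream window")
    if len(cds) < end:
        raise ValueError("coding sequence is shorter than the downstream window")
    return utr5[len(utr5) - upstream:] + cds[:end]
```

For example, `initiation_region(utr, cds, "Escherichia coli")` returns a 48-nucleotide window under this convention.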
The settings for TIsigner are grouped by complexity (i.e. general, extra, and advanced). The general settings include the options to modify the expression host, promoter and target expression score. The target expression score ranges from 0 to 100 (i.e. from the minimum to maximum predicted level) and is derived from a logistic regression of the opening energy distribution of 11 430 expression experiments in Escherichia coli from the 'Protein Structure Initiative: Biology' (PSI:Biology) (64,65). Hence, this scoring system is only applicable to the E. coli T7 lac promoter system. Since there is a non-linear relationship between opening energy and expression score, an interactive plot is also displayed along with the slider to set the target expression score. For other expression hosts and promoters, the target expression level can be either maximised or minimised (i.e. binary). The extra settings have the options to optimise the sequence within the translation initiation region or the full-length sequence. The AarI, BsaI and BsmBI restriction modification sites are filtered by default, whereas other sites can be manually supplied (e.g. a Shine-Dalgarno motif or terminator U-tract). The advanced settings allow users to tweak the random seed and sampling options (i.e. quick or deep, which use different numbers of iterations and parallel processes). Here users can also customise the region for optimisation or disable the terminator checks.
Once the input sequence passes a sanity check, the optimisation task is rapid [O(n) time using RNAplfold v2.4.11 (using parameters -W 210 -u 210)] with our simulated annealing algorithm. A list of optimised sequences is returned after checking for terminators using cmsearch (Infernal v1.1.2) (66) with RMfam models (67,68). If terminators are found, users are prompted with an option to use the full-length sequence for optimisation. In the default case (E. coli T7 lac promoter system), the optimised sequence closest to the chosen expression level is selected as the first solution (Figure 2). For other expression hosts and/or promoters, the optimised sequence with the fewest nucleotide changes is selected as the first solution. The altered nucleotides are highlighted (Figure 2). The accessibility of translation initiation sites for both the input and optimised sequences is shown as opening energy (kcal/mol). The results can be exported as a PDF or CSV file. When the default settings are used, the opening energy for each sequence is indicated on the distributions of the opening energy of 8780 'success' and 2650 'failure' groups of the PSI:Biology target genes. Furthermore, options for solubility and SP analyses using SoDoPE and Razor, respectively, are available for each sequence on the same results page (Figure 2).
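The general idea of optimising a coding sequence by simulated annealing over synonymous codons can be sketched as follows. This is a generic illustration under stated assumptions, not the TIsigner algorithm. The cost function is supplied by the caller; in TIsigner it would be based on opening energy, whereas the `gc_count` function below is only a placeholder. The parameters `n_codons`, `steps`, `t_start` and `t_end` are hypothetical, and the input is assumed to be an upper-case DNA sequence whose length is a multiple of three.

```python
import math
import random
from collections import defaultdict

# Standard genetic code in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    SYNONYMS[aa].append(codon)


def translate(cds):
    """Translate an in-frame DNA sequence into a protein string."""
    return "".join(CODON_TO_AA[cds[i:i + 3]] for i in range(0, len(cds), 3))


def gc_count(seq):
    """Placeholder cost: G+C count of the first 27 nucleotides."""
    return sum(base in "GC" for base in seq[:27])


def anneal(cds, cost, n_codons=9, steps=500, t_start=1.0, t_end=0.01, seed=0):
    """Minimise `cost` by synonymous changes within the first `n_codons`."""
    rng = random.Random(seed)
    current = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    current_cost = cost("".join(current))
    best, best_cost = current[:], current_cost
    for step in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        candidate = current[:]
        pos = rng.randrange(1, min(n_codons, len(candidate)))  # keep start codon
        candidate[pos] = rng.choice(SYNONYMS[CODON_TO_AA[candidate[pos]]])
        candidate_cost = cost("".join(candidate))
        delta = candidate_cost - current_cost
        # Accept improvements always, and worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
    return "".join(best), best_cost
```

A call such as `anneal(cds, gc_count)` returns a sequence that encodes the same protein. Fixing the seed makes the run reproducible, which mirrors the random seed option described above.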

SoDoPE
SoDoPE is our interactive solubility analysis and optimisation tool based on the SWI (32). SoDoPE accepts either a nucleotide or protein sequence (Figure 1). Upon submission, a query is sent to the HMMER web service for domain annotation (69). Successful annotations are displayed as interactive graphics, in which the annotated domains are represented as discorectangles, above a grey band that represents the input protein sequence (Figure 3). Information about a protein domain is shown upon a mouse hover. The domains can be selected for solubility analysis. For a complete domain annotation report, a link to the HMMER results page is also provided.
In addition, a two-way slider is available for navigation through any region of interest (Figure 3). The probability of solubility, flexibility and GRAVY (grand average of hydropathicity) are shown in real-time according to the user-selected region. The selected region is optimised for higher solubility using simulated annealing. Only the regions with extended boundaries and a higher probability of solubility are returned. SP analysis can also be done using Razor (see below).
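Of the three values reported for a selected region, GRAVY is the simplest to reproduce: it is the mean Kyte-Doolittle hydropathy of the residues. The sketch below illustrates the calculation for a slider-selected region; the function names and the zero-based, end-exclusive coordinates are our assumptions and not part of SoDoPE.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Grand average of hydropathicity of a protein sequence."""
    return sum(KYTE_DOOLITTLE[residue] for residue in protein) / len(protein)


def gravy_for_region(protein, start, end):
    """GRAVY of the residues in protein[start:end] (hypothetical helper)."""
    return gravy(protein[start:end])
```

More negative values indicate a more hydrophilic region.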
A profile plot of flexibility and/or hydrophilicity corresponding to the user-selected region is generated (Figure 3). This allows an estimation of rigid/flexible regions and possible helices, which may be helpful for mutagenesis experiments. The sequence of the selected region is shown, with the option of converting between nucleotide and amino acid sequence formats. In particular, the nucleotide sequence can be redirected to TIsigner for optimising protein expression (Figures 1 and 3, through the 'view DNA | optimise expression' button).
The contributions of several solubility-enhancing tags to user-selected regions can be compared in a bar plot, including thioredoxin (TRX), maltose binding protein (MBP), small ubiquitin-related modifier (SUMO) and glutathione-S-transferase (GST) tags (Figure 3). Users can also input a fusion sequence of interest in either a nucleotide or protein sequence format.

Razor
Razor is our SP prediction tool, which is based upon random forest models of protein features from the eukaryotic SP sequences of the SignalP 5.0 dataset and the animal toxin annotation project (52,60,70). Razor accepts either a protein or a nucleotide sequence (Figure 1). After validation, the N-terminal region is checked for the presence of a SP using five random forest models. This gives five SP scores (S-scores) for a given sequence. For detecting the cleavage site, we use a sliding window of 30 residues and our optimised weight matrix for residues around the cleavage site. The resulting subsequences are then scored by five additional random forest models to give the cleavage site scores (C-scores) along the sequence, which are displayed as a step plot (Figure 4). For each model, the Y-score, which is the geometric mean of the S-score and the maximum of the C-scores, is used to infer whether the given sequence has a SP or not. The median of these five Y-scores is displayed as the final score. The cleavage site from the model with the median of the maximum C-scores is used to annotate the predicted region.
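The score aggregation described above can be sketched as follows. The example is illustrative and uses hypothetical scores; it assumes that each of the five models contributes one S-score and one track of C-scores, and that the Y-score of a model is the square root of the product of its S-score and its maximum C-score.

```python
import math
import statistics


def y_score(s_score, c_scores):
    """Geometric mean of the S-score and the maximum C-score of one model."""
    return math.sqrt(s_score * max(c_scores))


def final_score(s_scores, c_score_tracks):
    """Median of the per-model Y-scores."""
    y_scores = [y_score(s, c) for s, c in zip(s_scores, c_score_tracks)]
    return statistics.median(y_scores)


# Hypothetical output of five models for one sequence.
s_scores = [0.9, 0.8, 0.85, 0.7, 0.95]
c_score_tracks = [
    [0.1, 0.9, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.85, 0.3],
    [0.3, 0.7, 0.2],
    [0.1, 0.95, 0.1],
]
print(final_score(s_scores, c_score_tracks))
```

Taking the median rather than the mean makes the final score less sensitive to a single model that disagrees with the others.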
If any of the models detects a SP in the input sequence, we further check whether the SP belongs to toxins, using five random forests trained on toxin SPs. The final toxin score is the median of scores from those random forest models. Furthermore, since we noticed a lack of tools specialising in predicting SPs from fungi, any detected signal peptide is checked for such origin. Similarly, we use five random forests for detecting fungal SPs, with the final fungal score being the median score of these models. Since we have five random forest models in each step (eukaryotic, toxin and fungal SP detection steps), stars are displayed as an indication of the number of models that agree on the sequence falling into each category (Figure 4).
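The median score and the star display can be summarised in a few lines. This is a minimal sketch: the decision threshold of 0.5 is an assumption made for illustration and may differ from the thresholds used by Razor, and the function name is hypothetical.

```python
import statistics


def summarise_models(scores, threshold=0.5):
    """Return (median score, number of models at or above the threshold).

    The threshold is an assumed value, used here only for illustration.
    """
    stars = sum(score >= threshold for score in scores)
    return statistics.median(scores), stars


# Hypothetical toxin scores from five random forest models.
median_score, stars = summarise_models([0.9, 0.8, 0.4, 0.7, 0.2])
print(median_score, "*" * stars)
```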
Razor is linked with SoDoPE for checking and optimising protein solubility (Figure 4). If a nucleotide sequence was submitted, this sequence can also be optimised for protein expression using TIsigner (Figure 1).

DISCUSSION
Low protein expression and solubility are the major hindrances to successful recombinant protein production. Based on our comprehensive studies on these two problems, we have developed novel tools to optimise protein expression (TIsigner) and solubility (SoDoPE), and assessed their predictive performance using independent datasets (Supplementary Table S1). Our tools offer some unique features in an interactive way. TIsigner allows tuning of protein expression from low to high levels, whereas SoDoPE allows easy navigation of protein sequence/domains with real-time solubility prediction. Based on our assessment of similar tools, none of the publicly available tools provides these features.
Our third tool, Razor, is designed to check the presence of SPs. Compared to other related tools, Razor also predicts toxin and fungal SPs (Supplementary Table S2). These would be helpful for users in choosing the expression and purification systems that prevent the harmful intracellular accumulation of recombinant secretory proteins/toxins.
Our tools are interactive, fast, and accurate. Importantly, our tools are highly integrated, allowing a seamless transition between the optimisation tools. To make such a transition intuitive, our web services accept only one input sequence at a time, and we aim to remove this limitation in the future. For optimising a large number of sequences, we provide a command-line version of each of our tools (see below).

GENERAL INFORMATION
Demo input and results are available for new users to get started. A list of frequently asked questions is also available for each tool. The frontend is written in React and uses responsive web design principles. The backend is written in Flask and Python v3.6. The website is hosted on a virtual machine (Red Hat Enterprise Linux 8) running on Intel Xeon (8 × 2.60 GHz) with 4 GiB RAM, by the Information Technology Services at the University of Otago.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.