Protein–Sol: a web tool for predicting protein solubility from sequence

Abstract

Motivation: Protein solubility is an important property in industrial and therapeutic applications. Prediction is a challenge, despite a growing understanding of the relevant physicochemical properties.

Results: Protein–Sol is a web server for predicting protein solubility. Using available data for Escherichia coli protein solubility in a cell-free expression system, 35 sequence-based properties are calculated. Feature weights are determined from separation of low and high solubility subsets. The model returns a predicted solubility and an indication of the features which deviate most from average values. Two other properties are profiled in windowed calculation along the sequence: fold propensity, and net segment charge. The utility of these additional features is demonstrated with the example of thioredoxin.

Availability and implementation: The Protein–Sol webserver is available at http://protein-sol.manchester.ac.uk.


Introduction
Protein solubility is an important property, from recombinant protein production to the development of biotherapeutics. A number of methods have been used to predict aggregation (Agrawal et al., 2011) and solubility, based on factors such as propensity to form inclusion bodies (Wilkinson and Harrison, 1991) and β-strands (Tartaglia and Vendruscolo, 2008), structural genomics studies (Magnan et al., 2009), and physicochemical properties (Agostini et al., 2014). A web server, Protein-Sol, is presented for predicting protein solubility, based on the observation of a bimodal distribution of protein solubilities for E. coli proteins in cell-free expression (Niwa et al., 2009). These measurements report the amount of a protein that is soluble (in the supernatant subsequent to centrifugation) compared with the total amount of that protein, rather than a thermodynamic property. A wider significance is apparent from two factors. First, proteins tend to evolve to a point at which their solubility matches that required for their natural abundance (Tartaglia et al., 2007). Second, the properties seen in the current work that associate with more soluble proteins are those seen previously, such as fewer amino acids with aromatic sidechains, favouring negative charge, and a preference for lysine over arginine (Warwicker et al., 2014).

The Protein-Sol server
Protein-Sol is available at http://protein-sol.manchester.ac.uk without account registration or licence. It processes an amino acid sequence and calculates predicted solubility and other properties, which are returned in a graphical format and as a text file. Thirty-five features are considered in the algorithm: 20 amino acid compositions; 7 composites (K-R, D-E, K+R, D+E, K+R-D-E, K+R+D+E, F+W+Y); and 8 further predicted features (length, pI, hydropathy (Kyte and Doolittle, 1982), absolute charge at pH 7, fold propensity (Uversky et al., 2000), disorder (Linding et al., 2003), sequence entropy, and β-strand propensity (Costantini et al., 2006)). A linear model combining the 35 features gave an initial fit to the solubility data (Niwa et al., 2009). Weights were then derived from differences between the lower and higher 5% tails of the solubility distribution, recorded as z-scores. Proteins predicted to have a transmembrane (TM) segment (hydropathy > 1.6 in any 21 amino acid segment) were excluded.
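As an illustration of the composition-based features and the TM filter described above, the following Python sketch computes the 20 amino acid compositions, the 7 composites, and a hydropathy window test. It is not the server's code. It assumes that compositions are expressed as fractions and that "hydropathy > 1.6" refers to the mean Kyte-Doolittle value over the 21-residue window; the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fractional composition of the 20 standard amino acids (assumed units)."""
    counts = Counter(seq)
    n = len(seq)
    return {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}


def composites(comp):
    """The 7 composite features named in the text, built from compositions."""
    k, r, d, e = comp["K"], comp["R"], comp["D"], comp["E"]
    return {
        "K-R": k - r,
        "D-E": d - e,
        "K+R": k + r,
        "D+E": d + e,
        "K+R-D-E": k + r - d - e,
        "K+R+D+E": k + r + d + e,
        "F+W+Y": comp["F"] + comp["W"] + comp["Y"],
    }


def has_predicted_tm(seq, window=21, cutoff=1.6):
    """True if any window has a mean hydropathy above the cutoff.

    Assumption: the threshold applies to the window average.
    """
    if len(seq) < window:
        return False
    values = [KD[aa] for aa in seq]
    total = sum(values[:window])
    if total / window > cutoff:
        return True
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        if total / window > cutoff:
            return True
    return False
```

A sequence for which `has_predicted_tm` returns True would be excluded from model development under this reading of the filter.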
For a query sequence, the contribution of each feature to predicted solubility is a linear scaling between its corresponding values averaged within each of the lower and higher subsets, multiplied by feature weight, with feature weights normalized to sum to 1. As there are many correlations between features, and because some features do not contribute to the prediction, the overall correlation between prediction and experimental solubility for 2395 proteins (without predicted TM regions) was used to assess combinations of features. Features with the least weighting were eliminated first, and elimination continued until model performance fell. The final prediction scheme consists of 10 features (H, L, V, K-R, D+E, F+W+Y, length, absolute charge, fold propensity, sequence entropy), with a correlation coefficient of 0.621 between calculated and experimental values. A predicted solubility of 58% gives the best threshold for separating the lowest and highest 5% subsets in a receiver operating characteristic (ROC) analysis. In addition to charge-based features, non-polar features are also present in the model. For example, aromatic (F+W+Y) composition weights predicted solubility down, whilst valine weights solubility up. In addition, predicted fold propensity and sequence entropy have a negative influence on predicted solubility. Our interpretation is that, in addition to a charged protein surface being favourable for solubility, there may also be a subset of more soluble proteins that have reduced sequence complexity, perhaps similar to intrinsically disordered proteins. Display of the extent to which each feature deviates from population average allows the user to select features that could be targeted to improve solubility. Net charge and fold propensity over a sliding window are displayed as profiles, providing additional information with which to interpret protein behaviour.
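The weighted linear scaling can be sketched as follows. This is one possible reading of the description rather than the published implementation: each feature value is placed on a line running from 0 at the low-solubility subset mean to 1 at the high-solubility subset mean, and the scaled values are combined using weights normalized to sum to 1. It assumes that the weights are non-negative magnitudes and that the direction of a feature's influence is carried by the order of the two subset means. The function names and all numbers in the usage note are hypothetical.

```python
def normalise_weights(weights):
    """Rescale non-negative feature weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def scaled_position(value, low_mean, high_mean):
    """Linear position of a feature value between the subset means.

    Returns 0 at the low-solubility mean and 1 at the high-solubility mean.
    Values outside the two means extrapolate below 0 or above 1.
    """
    return (value - low_mean) / (high_mean - low_mean)


def predict(features, low_means, high_means, weights):
    """Weighted sum of scaled feature positions (assumed form of the model)."""
    w = normalise_weights(weights)
    return sum(
        w[name] * scaled_position(features[name], low_means[name], high_means[name])
        for name in w
    )
```

For example, with two made-up features weighted 1 and 3, subset means of (0, 10) and (10, 0), and query values of 5 and 2.5, the scaled positions are 0.5 and 0.75 and the combined score is 0.25 × 0.5 + 0.75 × 0.75 = 0.6875. A feature whose high-solubility mean is lower than its low-solubility mean, as for the second feature here, lowers the score as its value rises.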
Prediction of solubility from sequence is a single step process for the user. Each sequence for calculation is assigned a unique id number, formatted, and stored temporarily on the server. No calculation occurs if the input is invalid and the user is informed of the mismatch. The algorithm generates a text file that is processed using shell scripts and R to produce a graphical interpretation of the results. The predicted protein solubility is not valid for membrane proteins, but the results will be presented, with a warning, if a predicted transmembrane region is identified.
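The input check can be pictured with a short sketch. The server's actual validation rules are not described here, so the following assumes that a valid input is a single sequence of the 20 standard amino acid letters, optionally preceded by a FASTA header line; both function names are hypothetical.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Drop a FASTA header line and whitespace, and return upper-case text."""
    lines = [line.strip() for line in raw.strip().splitlines()]
    body = [line for line in lines if line and not line.startswith(">")]
    return "".join(body).replace(" ", "").upper()


def invalid_positions(seq):
    """List (1-based position, character) for every non-standard residue."""
    return [(i + 1, ch) for i, ch in enumerate(seq) if ch not in VALID_RESIDUES]
```

An empty list from `invalid_positions` would allow the calculation to proceed; otherwise the listed positions could be reported back to the user.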
Several tests have been made of the server. Protein expression data from structural genomics projects is often aggregated and heterogeneous. The first test set consists of 679 strongly expressed and well-behaved proteins from a single pipeline, which were used to derive a model for crystallization propensity (Price et al., 2009).
We predict an average solubility of 70.6% for these 679 proteins, with 70.3% of the set above the 58% threshold. A further set of 200 proteins used to test the crystallization model (Price et al., 2009) gives an average of 76.1% predicted solubility with 82.5% of the set above the 58% threshold. Thermophile proteins have evolved to counter particularly stringent tests on solubility (Greaves and Warwicker, 2007). Methanopyrus kandleri is a sequenced archaeon with one of the highest known growth temperatures (80–110 °C; Slesarev et al., 2002). Excluding those containing a predicted TM segment, solubility predictions for 1294 proteins from UniProt (UniProt Consortium, 2017) averaged 78.6%, with 93.6% of these above 58%.
A link between protein aggregation rates and gene expression levels (Tartaglia et al., 2007) has been reinforced by comparison of the abundant proteins serum albumin and myoglobin with their less abundant paralogues (Warwicker et al., 2014). Quantitative proteomics allows comparison of (log scale) protein abundance and predicted protein solubility, with ROC plot analysis using low and high abundance subsets from the 5% tails. Calculations have been made with whole proteome integrated sets for Escherichia coli, Saccharomyces cerevisiae and Homo sapiens retrieved from PaxDb (Wang et al., 2015). Results are reported in Table 1 (excluding proteins containing predicted TM segments), with the original development set of E. coli protein solubility added for reference. With membrane proteins included (not shown), the measures of agreement increase, an outcome of the importance of charge for protein solubility. Accuracy for the ROC analysis is listed at 58% solubility prediction, since this gives the highest accuracy for the development set. ROC plots are shown in Figure 1.
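The tail-based ROC analysis can be outlined in a few lines of Python. This is a minimal sketch, not the analysis code behind Table 1: it assumes that the 5% tails are taken by rank on abundance, that a prediction above the threshold is called "high", and that the area under the ROC curve is computed as the pairwise ranking probability. All function names are hypothetical.

```python
def tail_subsets(abundance, predicted, fraction=0.05):
    """Predicted solubilities for the lowest and highest abundance tails."""
    order = sorted(range(len(abundance)), key=lambda i: abundance[i])
    n_tail = max(1, int(len(order) * fraction))
    low = [predicted[i] for i in order[:n_tail]]
    high = [predicted[i] for i in order[-n_tail:]]
    return low, high


def accuracy_at(low, high, threshold=58.0):
    """Fraction of tail members classified correctly at one threshold."""
    correct = sum(p <= threshold for p in low) + sum(p > threshold for p in high)
    return correct / (len(low) + len(high))


def roc_auc(low, high):
    """Probability that a high-tail member outscores a low-tail member.

    Ties count as half, which equals the area under the ROC curve.
    """
    wins = 0.0
    for h in high:
        for l in low:
            if h > l:
                wins += 1.0
            elif h == l:
                wins += 0.5
    return wins / (len(high) * len(low))
```

Because the tails are selected by rank, the result is the same whether abundance is supplied on a linear or a log scale.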
Across these varied tests (a structural genomics pipeline, the proteome of a hyperthermophile, and protein abundance in organisms across the tree of life), the model consistently demonstrates correlations with the expected trends.

Discussion
Protein-Sol is demonstrated with E. coli thioredoxin, known to enhance solubility of co-produced proteins in E. coli (Yasukawa et al., 1995). Predicted solubility (scaled from 0 to 1) is plotted (Fig. 2A) alongside the population average for the experimental dataset (Niwa et al., 2009). Thioredoxin at 0.76 is well above the average of 0.45, consistent with its wider use in co-expression or as a fusion partner. Solubility prediction on the server is given in the 0-1 range for ease of user interpretation. Percentage values, which were used in training and testing, can exceed 100% in the experimental dataset. For reference, thioredoxin predicts at 88% against a population average of 53%. The predicted pI is also displayed. Next, a plot shows deviations from population averages for the 35 features. Although only 10 of these contribute to the prediction, the signed deviations show the characteristics of the input sequence. For example, KmR (meaning K-R) is prominent for thioredoxin and contributes to a prediction of high solubility. To improve solubility, K-R is perhaps more useful than the other 9 features in the final model, since lysine and arginine can generally be swapped with little consequence for protein function or fold. The plot of windowed fold propensity (Fig. 2A) shows two subdomains, consistent with experimental characterization of thioredoxin folding (Katti et al., 1990). The subdomain structure is also apparent in a novel representation of windowed net charge, with negatively charged N-terminal and positively charged C-terminal subdomains (Fig. 2B). Whilst the windowed net charge does not indicate a complete separation of charge between subdomains, it shows the possibility for interactions dependent on the opposite sign of net charges, exemplified by the two salt-bridges shown in Figure 2B.
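A windowed net charge profile of the kind discussed above might be computed as in the sketch below. The window length is not stated in this text, so the default of 21 residues is an assumption, as is the simple charge model (K and R as +1, D and E as -1, histidine and the termini ignored). The function name is hypothetical.

```python
# Assumed charge model: K/R = +1, D/E = -1, everything else neutral.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def windowed_net_charge(seq, window=21):
    """Net charge of every full-length window along the sequence.

    Entry i is the net charge of residues i .. i + window - 1 (0-based).
    Returns an empty list if the sequence is shorter than the window.
    """
    charges = [CHARGE.get(aa, 0) for aa in seq]
    if len(charges) < window:
        return []
    total = sum(charges[:window])
    profile = [total]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(charges)):
        total += charges[i] - charges[i - window]
        profile.append(total)
    return profile
```

With a window of 3, the toy sequence DDDKKK gives the profile [-3, -1, 1, 3], a change of sign along the sequence of the kind described for the N-terminal and C-terminal subdomains of thioredoxin.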
Protein-Sol provides a fast sequence-based method for predicting protein solubility. Lysine and arginine content are highlighted with regard to modifying protein solubility, as K/R swaps are likely to be structurally and functionally neutral. A case study with thioredoxin shows that additional features of the server can be used to interpret subdomain structures, and introduces the novel feature of windowed net charge, which may inform on charge-charge interactions between subdomains.

Fig. 2 (caption fragment): … Katti et al. (1990)) is shown colour-coded by subdomain (1-67 and 68-108), with salt-bridges E44-K96 and E48-K100 displayed between the subdomains. Drawn with PyMOL (http://pymol.org).