Protein solubility is controlled by global structural flexibility

Summary Recombinant protein production is a widely used technique in the biotechnology industry and biomedical research, yet only a quarter of target proteins are soluble and can be purified. Failures are largely due to low protein expression and solubility. We have discovered that global structural flexibility, which can be modeled by normalised B-factors, accurately predicts the solubility of 12,216 recombinant proteins expressed in Escherichia coli . We have optimised B-factors, and derived a new set of values for solubility scoring that further improves the prediction accuracy. We call this new predictor the ‘Solubility-Weighted Index’ (SWI). Importantly, SWI outperforms many existing protein solubility prediction tools. We have developed ‘SoDoPE’ (Soluble Domain for Protein Expression), a web interface that allows users to choose a protein region of interest for predicting and maximising both protein expression and solubility. Availability The SoDoPE web server and source code are freely available at https://tisigner.com/sodope and https://github.com/Gardner-BinfLab/TIsigner , respectively. The code and data for reproducing our analysis can be found at https://github.com/Gardner-BinfLab/SoDoPE_paper_2019 . INTRODUCTION High levels of protein expression and solubility are two major requirements of successful recombinant protein production (Esposito and Chatterjee 2006) . However, recombinant protein production is a challenging process because almost half of the proteins fail to be expressed and half of the successfully expressed proteins are insoluble ( http://targetdb.rcsb.org/metrics/ ). These failures hamper protein research, with particular implications for structural, functional and pharmaceutical studies, that require soluble and concentrated protein samples (Kramer et al. 2012, Hou et al. 2018) . Therefore, predicting solubility, and engineering protein sequences for enhanced solubility is an active area of research. Notable protein engineering approaches include mutagenesis, truncation (i.e., expression of partial protein sequences), or fusion with a solubility-enhancing tag (Waldo


INTRODUCTION
High levels of protein expression and solubility are two major requirements of successful recombinant protein production (Esposito and Chatterjee 2006) . However, recombinant protein production is a challenging process because almost half of the proteins fail to be expressed and half of the successfully expressed proteins are insoluble ( http://targetdb.rcsb.org/metrics/ ). These failures hamper protein research, with particular implications for structural, functional and pharmaceutical studies, that require soluble and concentrated protein samples (Kramer et al. 2012, Hou et al. 2018 . Therefore, predicting solubility, and engineering protein sequences for enhanced solubility is an active area of research. Notable protein engineering approaches include mutagenesis, truncation (i.e., expression of partial protein sequences), or fusion with a solubility-enhancing tag (Waldo Protein solubility depends on extrinsic factors such as ionic strength, temperature and pH, as well as intrinsic factors-the physicochemical properties of the protein sequence and structure-molecular weight, amino acid composition, hydrophobicity, aromaticity, isoelectric point, structural propensities and the polarity of surface residues (Wilkinson and Harrison 1991, Chiti et al. 2003, Tartaglia et al. 2004, Diaz et al. 2010 . Many solubility prediction tools have been developed around these features, ranging from the use of simple statistical models (e.g., linear and logistic regressions) to sophisticated machine learning models (e.g., support vector machines and neural networks) (Hirose and Noguchi 2013, Habibi et al. 2014, Hebditch et al. 2017, Sormanni et al. 2017, Heckmann et al. 2018, Yang et al. 2019 . In this study, we investigated the experimental outcomes of 12,216 recombinant proteins expressed in Escherichia coli from the 'Protein Structure Initiative:Biology' (PSI:Biology) (Chen et al. 2004, Acton et al. 2005 . We showed that protein structural flexibility is more accurate than other protein sequence properties in predicting solubility (Vihinen et al. 1994, Craveur et al. 2015 . Flexibility is a standard feature that has previously been overlooked in solubility prediction. On this basis, we derived a set of 20 values for the standard amino acid residues and used them to predict solubility. We call this new predictor the 'Solubility-Weighted Index' (SWI). SWI is a powerful predictor of solubility, and a good proxy for global structural flexibility. In addition, SWI outperforms many protein solubility prediction tools.

Global structural flexibility performs well at predicting protein solubility
To determine which protein sequence properties can accurately predict protein solubility, we examined the experimental outcomes of 12,216 recombinant proteins expressed in E. coli (the PSI:Biology dataset; see Supplementary Table S1A) (Chen et al. 2004, Acton et al. 2005 . These proteins were expressed either with a C-terminal or N-terminal 6xHis fusion tag (pET21_NESG and pET15_NESG expression vectors, N=8,780 and 3,436, respectively). They were previously curated and labeled as 'Protein_Soluble' or 'Tested_Not_Soluble' (Seiler et al. 2014) , based on the soluble analysis of cell lysate using SDS-PAGE (Xiao et al. 2010) . A total of 8,238 recombinant proteins were found to be soluble, in which 6,432 of them belong to the pET21_NESG dataset.
We first computed the standard protein sequence properties, namely molecular weight, isoelectric point, secondary structure composition (sheet, turn, and helix), aromaticity, Grand Average of Hydropathy (GRAVY), global structural flexibility and instability index using the ProtParam module of Biopython (Kyte and Doolittle 1982, Guruprasad et al. 1990, Bjellqvist et al. 1993, Lobry and Gautier 1994, Vihinen et al. 1994, Cock et al. 2009) . We compared the prediction accuracy of these features using Receiver Operating Characteristic (ROC) analysis. To our surprise, flexibility outperformed other features in predicting protein is the normalised B-factor of the amino acid residue at the position , and is the B i i L sequence length. Among these sets of B-factors, solubility scoring using the most recently published set of normalised B-factors produced the highest AUC score ( To improve the prediction accuracy, we initialised an iterative refinement method with the most recently published set of normalised B-factors. This was done by maximising AUC scores with the Nelder-Mead optimisation algorithm (Nelder and Mead 1965) . In order to account for phylogenetic relationships between proteins we clustered all 12,216 PSI:Biology protein sequences by 10% similarity using USEARCH (Fig 2A and Supplementary Fig S3). Cross-validations were conducted in a way that ensures training and testing is performed on unrelated sequences. We calculated the solubility scores for the optimised weights using Equation 1 and the AUC scores for each cross-validation step. Our training and test AUC scores were 0.72 ± 0.00 and 0.71 ± 0.03, respectively, showing an improvement over flexibility in solubility prediction (mean ± standard deviation; Fig 2B and Supplementary Table S3).
The final weights were derived from the arithmetic means of the weights for individual amino acid residues obtained from the cross-validation step (Supplementary Table S4). Interestingly, we observed over a 20% change on the weights for cysteine (C) and histidine (H) residues (Fig 2C and Supplementary Table S4). These results were in agreement with the contributions of cysteine and histidine residues as shown by the AUC scores of the amphiphilic pseudo-amino acid compositions for cysteine and histidine residues (Supplementary Fig S1B). To ensure that these results are not artifacts, in particular due to the presence of polyhistidine-tags in all the sequences, we repeated the iterative refinement method using the same cross-validation sets without His tag sequences. The final weights with and without His tags are nearly identical, suggesting that the approach is not confounded by tag use (Supplementary Table S4, Spearman's rho = 1).

Fig 2. Derivation of the Solubility-Weighted Index (SWI). (A)
Flow chart shows an iterative refinement of the most recently published set of normalised B-factors for solubility prediction (Smith et al. 2003) . The solubility score of a protein sequence was calculated based on an arithmetic mean of the optimised weights as Equation 1 (using instead of W B ). These scores were used to compute the AUC scores for training and test datasets. (B) Training and test performance of solubility prediction using the optimised weights for 20 amino acid residues in a 10-fold cross-validation (mean AUC ± standard deviation). Related data and figures are available as Supplementary Table S3 and Supplementary Fig S3. (C) Comparison between the 20 initial and final weights for amino acid residues. The final weights are derived from the arithmetic mean of the optimised weights from the cross-validation step. These weights are used to calculate SWI, the solubility score of a protein sequence, in the subsequent analyses. Filled circles, which represent amino acid residues, are colored by hydrophobicity (Kyte and Doolittle 1982) . Solid black circles denote aromatic amino acid residues phenylalanine (F), tyrosine (Y), tryptophan (W). Dotted diagonal line represents no change in weight. Related data is available as Supplementary  Table S4. AUC, Area Under the ROC Curve; ROC, Receiver Operating Characteristic; , W arithmetic mean of the weights of an amino acid residue optimised from 1,000 bootstrap samples in a cross-validation step. To validate the cross-validation results, we used an independent dataset known as eSOL (Niwa et al. 2009) . This dataset consists of the solubility percentages of E. coli proteins determined using an E. coli cell-free system (N = 3,198). Solubility scoring using the final weights showed a significant improvement in correlation with E. coli protein solubility over the initial weights (normalised B-factors) [Spearman's rho of 0.50 (P = 9.46 x 10 -206 ) vs 0.40 (P = 4.57 ✕ 10 -120 )]. We call the solubility score of a protein sequence calculated using the final weights as the Solubility-Weighted Index (SWI).
We performed Spearman's correlation analysis for both the PSI:Biology and eSOL datasets. SWI shows the strongest correlation with solubility compared to the standard and 9,920 protein sequence properties (Fig 3 and Supplementary Fig S1). SWI also strongly correlates with flexibility, suggesting that SWI is still a good proxy for global structural flexibility.  Table S5. (B) Correlation matrix plot of the solubility percentages of E. coli proteins and their standard protein sequence properties and SWI. The solubility percentages were previously determined using an E. coli cell-free system (eSOL,N=3,198). Related data is available as Supplementary Table S6. GRAVY, Grand Average of Hydropathy; PSI:Biology, Protein Structure Initiative:Biology; R s , Spearman's rho; SWI, Solubility-Weighted Index.
Next, we asked whether protein solubility can be predicted by surface amino acid residues. To address this question, we examined a previously published dataset for the protein surface 'stickiness' of 397 E. coli proteins (Levy et al. 2012) . This dataset has the annotation for surface residues based on the protein crystal structures. Interestingly, we observed no correlation between the protein surface 'stickiness' and the solubility data from eSOL (Spearman's rho = 0.05, P = 0.34). Optimising weights for surface residues as above led to no further improvements (i.e., the approach used to derive SWI; Spearman's rho = 0.05, P = 0.31). In contrast, the SWI for these sequences has a significant correlation with solubility (Spearman's rho = 0.45, P = 3.88 ✕ 10 -19 ). These results suggest that full-length sequence should be taken into account when predicting protein solubility.
To understand the properties of soluble and insoluble proteins, we determined the enrichment of amino acid residues in the PSI:Biology targets relative to the eSOL sequences (see Methods). We observed that the PSI:Biology targets are enriched in charged residues lysine (K), glutamate (E) and aspartate (D), and depleted in aromatic residues tryptophan (W), albeit to a lesser extend for insoluble proteins ( Fig 4A). As expected, cysteine residues (C) are enriched in the PSI:Biology insoluble proteins, supporting previous findings that cysteine residues contribute to poor solubility in the E. coli expression system (Wilkinson andHarrison 1991, Diaz et al. 2010) .
In addition, we compared the SWI for random sequences with the PSI:Biology and eSOL sequences. In general, soluble proteins have higher SWI than insoluble proteins ( Fig 4B). Interestingly, true biological sequences tend to have higher SWI than random sequences, highlighting a clear evolutionary selection for solubility.  Table S1B). Random sequences were generated from a length of 50 to 6,000 amino acid residues, with an increment of 50 residues. A total of 12,000 random sequences were generated, 100 sequences for each length. PSI:Biology, Protein Structure Initiative:Biology; SWI, Solubility-Weighted Index.

Fig 5. Runtime of protein solubility prediction tools per sequence.
All the tools were run three times using 10 sequences selected from the PSI:Biology and eSOL datasets. A pseudocount of 0.001 s was used because the runtime of our SWI C program is 0.00 s per sequence, which is determined by machine precision. Related data is available as Supplementary Table S7. SWI, Solubility-Weighted Index; s, seconds.
To demonstrate a use case for SWI, we developed the Soluble Domain for Protein Expression (SoDoPE) web server (see Methods and https://tisigner.com/sodope ). Upon sequence submission, the SoDoPE web server enables users to navigate the protein sequence and its domains for predicting and maximising protein expression and solubility.

DISCUSSION
The B-factor or temperature factor of the atoms in a crystalline structure is the measure of vibration around their mean position that reflects the uncertainty in X-ray scattering u) ( structure determination (Schlessinger and Rost 2005, Bramer and Wei 2018, Carugo 2018 . The profile of normalised B-factors along a protein sequence can be used to infer the flexibility and dynamics of the protein structure (Karplus and Schulz 1985, Vihinen et al.  1994) . Protein structural flexibility has been associated with conformal variations, functions, thermal stability, ligand binding and disordered regions (Vihinen 1987, Teague 2003, Radivojac 2004, Ma 2005, Schlessinger and Rost 2005, Yuan et al. 2005, Yin et al. 2011) . However, the use of flexibility in solubility prediction has been overlooked although their relationship has previously been proposed (Tsumoto et al. 2003) . In this study, we have shown that flexibility strongly correlates with solubility (Fig 3). Based on the normalised B-factors used to compute flexibility, we have derived a new position and length independent weights to score the solubility of a given protein sequence. We call this protein solubility score as SWI.
Upon further inspection, we observe some interesting properties in SWI. SWI anti-correlates with helix propensity, GRAVY, aromaticity and isoelectric point (Fig 2C and 3). Amino acid residues with a lower aromaticity or hydrophilic are known to improve protein solubility (Han et al. n.d., Wilkinson and Harrison 1991, Trevino et al. 2007, Niwa et al. 2009, Kramer et al. 2012, Warwicker et al. 2014 . Consistent with previous studies, the charged residues aspartate (D), glutamate (E) and lysine (K) are associated with high solubility, whereas the aromatic residues phenylalanine (F), tryptophan (W) and tyrosine (Y) are associated with low solubility (Fig 2C and 4A). Interestingly, histidine residue (H) appears as one of the heavily weighted residues in scoring solubility, which might be due to its positive charge. In contrast, cysteine residue (C) has been strongly downweighted, probably because disulfide bonds couldn't be properly formed in the E. coli expression hosts (Stewart et al. 1998, Aslund and Beckwith 1999, Rosano and Ceccarelli 2014, Jia and Jeon 2016 . The weights are likely different if the solubility analysis was done using the reductase-deficient, E. coli Origami host strains, or eukaryotic hosts. Higher helix propensity has been reported to increase solubility (Idicula- Thomas andBalaji 2005, Huang et al. 2012) . However, our analysis has shown that helical and turn propensities anti-correlate with solubility, whereas sheet propensity lacks correlation with solubility, suggesting that disordered regions may tend to be more soluble (Fig 3). In accordance with these, SWI has stronger negative correlations with helix and turn propensities. These findings also suggest that protein solubility can be largely explained by overall amino acid composition, not just the surface amino acid residues. This idea aligns with our understanding that protein solubility and folding are closely linked, and folding occurs cotranscriptionally, a complex process that is driven various intrinsic and extrinsic factors (Wilkinson and Harrison 1991, Chiti et al. 2003, Tartaglia et al. 2004, Diaz et al. 2010 . However, it is unclear why sheet propensity has little contribution to solubility because β-sheets have been shown to link closely with protein aggregation (Idicula-Thomas and Balaji 2005) .
We conclude that SWI is a well-balanced index that is relatively simple and easy to use. To demonstrate the usefulness of SWI, we developed the SoDoPE web server for predicting solubility and designing protein sequences (see Methods and https://tisigner.com/sodope ). In addition, SoDoPE is integrated with TIsigner, our gene optimisation web server for protein expression. This pipeline provides a holistic approach to improve the outcome of recombinant protein expression.

Protein sequence properties
The standard protein sequence properties were calculated using the Bio.SeqUtils.ProtParam module of Biopython v1.73 (Cock et al. 2009) . All miscellaneous protein sequence properties were computed using the R package protr v1.6-2 (Xiao et al. 2015) .

Protein solubility prediction
We used the standard and miscellaneous protein sequence properties to predict the solubility of the PSI:Biology and eSOL targets (N=12,216 and 3,198, respectively) (Niwa et al. 2009, Seiler et al. 2014 . For method comparison, we chose the protein solubility prediction tools that are scalable (Table 1). Default configurations were used for running the command line tools.
To benchmark the runtime of these solubility prediction tools, we selected 10 sequences with a large range of lengths from the PSI:Biology and eSOL datasets (from 36 to 2389 residues). All the tools were run and timed using a single process without using GPU on a high performance computer [/usr/bin/time <command>; CentOS Linux 7 (Core) operating system, 72 cores in 2× Broadwell nodes (E5-2695v4, 2.1 GHz, dual socket 18 cores per socket), 528 GiB memory]. Single sequence fasta files were used as input files.

SWI
To improve protein solubility prediction, we optimised the most recently published set of normalised B-factors using the PSI:Biology dataset (Smith et al. 2003) (Fig 2). To avoid bias due to protein sequence homology, we first clustered the PSI:Biology targets using USEARCH v11.0.667, 32-bit (Edgar 2010) . His tag sequences were removed from all sequences before clustering to minimise bias. We obtained 4,368 clusters using the parameters: -cluster_fast <input_file> -id 0.1 -msaout <output_file> -threads 4. These clusters were divided into 10 groups with approximately 1,200 sequences per group. The subsequent steps were done with or without His tag sequences. We used the normalised B-factors as the initial weights to maximise AUC using these 10 groups with a 10-fold cross-validation. Since AUC is non-differentiable, we used the Nelder-Mead optimisation method (implemented in SciPy v1.2.0), which is a derivative-free, heuristic, simplex-based optimisation (Nelder and Mead 1965, Oliphant 2007, Millman and Aivazis 2011 . For each step in cross-validation, we did bootstrap resampling for 1,000 times with each sample containing 1,000 soluble and 1,000 insoluble proteins. Optimisation was done for each sample, giving 1,000 sets of weights. The arithmetic mean of these weights was used to determine the training and test AUC for the cross-validation step (Fig 2A).

Bit score
To compute the bit scores for each amino acid residue in the PSI:Biology soluble and insoluble groups (Fig 4A), we normalised the count of each residue in each group by the x) ( total number of residues in that group. We used the normalised count of amino acid residues using the eSOL sequences as the background. The bit score of residue for soluble or x) ( insoluble group is then given by the following equation: where is the normalised count of residue in the PSI:Biology soluble or insoluble (x) f i x) ( group and is the normalised count in the eSOL sequences. (x) f eSOL For control, random protein sequences were generated by incrementing the length of sequence, starting from a length of 50 residues to 6,000 residues with a step size of 50 residues. A hundred of random sequences were generated for each length, giving a total of 12,000 unique random sequences.

The SoDoPE web server
To estimate the probability of solubility using SWI, we fitted the following logistic regression to the PSI:Biology dataset: (4) robability of solubility where, is the SWI of a given protein sequence, and . The x 81.1496 a = 2.8379 b = − 6 P-value of log-likelihood ratio test was less than machine precision. Equation 4 can be used to predict the solubility of a protein sequence given that the protein is successfully expressed in E. coli .
On this basis, we developed a solubility prediction webservice called the Soluble Domain for Protein Expression (SoDoPE). Our web server accepts either a nucleotide or amino acid sequence. Upon sequence submission, a query is sent to the HMMER web server to annotate protein domains ( https://www.ebi.ac.uk/Tools/hmmer/ ) (Potter et al. 2018) . Once the protein domains are identified, users can choose a domain or any custom region (including full-length sequence) to examine the probability of solubility, flexibility and GRAVY. This functionality enables protein biochemists to plan their experiments and opt for the domains or regions with high probability of solubility. Furthermore, we implemented a simulated annealing algorithm that maximised the probability of solubility for a given region by generating a list of regions with extended boundaries. Users can also predict the improvement in solubility by selecting a commonly used solubility tag or a custom tag.
We linked SoDoPE with TIsigner, which is our existing web server for maximising the accessibility of translation initiation site (Bhandari et al. 2019) . This pipeline allows users to predict and optimise both protein expression and solubility for a gene of interest. The SoDoPE web server is freely available at https://tisigner.com/sodope .