LocTree3 prediction of localization

The prediction of protein sub-cellular localization is an important step toward elucidating protein function. For each query protein sequence, LocTree2 applies machine learning (profile kernel SVM) to predict the native sub-cellular localization in 18 classes for eukaryotes, in six for bacteria and in three for archaea. The method outputs a score that reflects the reliability of each prediction. LocTree2 has performed on par with or better than any other state-of-the-art method. Here, we report the availability of LocTree3 as a public web server. The server includes the machine learning-based LocTree2 and improves over it through the addition of homology-based inference. Assessed on sequence-unique data, LocTree3 reached an 18-state accuracy Q18 = 80 ± 3% for eukaryotes and a six-state accuracy Q6 = 89 ± 4% for bacteria. The server accepts submissions ranging from single protein sequences to entire proteomes. Response time of the unloaded server is about 90 s for a 300-residue eukaryotic protein and a few hours for an entire eukaryotic proteome not considering the generation of the alignments. For over 1000 entirely sequenced organisms, the predictions are directly available as downloads. The web server is available at http://www.rostlab.org/services/loctree3.

Section S1: LocTree3 assessment on multi-localized proteins
Figure S1: PSI-BLAST sequence identities to LocTree3 reliability scores
Figure S2: E-value thresholds for the homology-based inference
Table S4: Strategies for annotation transfer by homology
Figure S3: Class-wise performance comparison of LocTree3 to its sources
Table S5: LocTree3 assessment on sequence-unique sets
Table S6: Performance comparison on LocTree3's development data
Table S7: Performance comparison on a human protein set
Table S8: Proteome-wide localization predictions using PSI-BLAST
Section S2: LocTree3 is much more reliable than blind homology-inference
Section S3: Possible sources for PSI-BLAST mis-predictions

Material
Appendix p. 2   Data: number of sequences per localization class in the sets of SWISS-PROT proteins used for the independent/additional testing of LocTree3.
*1 "New2013_hval0" set: redundancy-reduced sets (HVAL≤0) of 273 eukaryotic and 57 bacterial proteins, i.e. sets containing no protein pair with >20% pairwise sequence identity over 250 aligned residues. The redundancy-reduced set of archaeal proteins was too small (18 proteins) to provide meaningful performance estimates and was therefore excluded from the analysis.
"New2014" set: all eukaryotic proteins added to SWISS-PROT between releases 2013_11 and 2014_03, not redundancy reduced. The corresponding bacterial proteins were too few (10 proteins) and were therefore excluded from the analysis.

Section S1: LocTree3 assessment on multi-localized proteins
LocTree2 and LocTree3 were developed on proteins from SWISS-PROT release 2011_04. The number of multi-localized proteins in this release was 48 for bacteria (all annotated with two localization classes) and 4556 for eukaryota (4376 with two localization classes, the others with ≥3). Because of the small number, we dropped bacteria. Reducing redundancy at HVAL≤0 on the 4556 eukaryotic proteins left us with 72 sequence-unique proteins. We applied LocTree3 to these and considered a prediction correct if one of the experimentally observed classes had been predicted. Result: Q18=65±12%. Although similar to the performance of LocTree2 on the 1682 cross-validated proteins, this compared less favourably to the 80±3% for LocTree3. Why did performance drop on those proteins? The random expectation was the opposite: because a prediction counts as correct if it matches either of the annotated classes, random performance is higher. Picking the single right class out of 18 is tougher than picking either of two right classes. In short, we suspect that today's double annotations, taken as a whole set, are not good enough.
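The random-expectation argument above can be made concrete with a minimal sketch. It assumes a single, uniformly random guess among the 18 eukaryotic classes, which is an illustrative baseline and not a model stated in the text; the function name is hypothetical.

```python
from fractions import Fraction


def random_hit_probability(n_classes: int, n_annotated: int) -> Fraction:
    """Chance that one uniformly random class guess matches any annotated class.

    Assumption (not from the source): guesses are uniform over all classes.
    """
    if not 0 <= n_annotated <= n_classes:
        raise ValueError("n_annotated must lie between 0 and n_classes")
    return Fraction(n_annotated, n_classes)


# One annotated class out of 18 versus two annotated classes out of 18.
single = random_hit_probability(18, 1)
double = random_hit_probability(18, 2)
print(f"single-localized: {float(single):.3f}, double-localized: {float(double):.3f}")
```

Under this assumption the random baseline doubles for double-annotated proteins, which is why a drop in measured accuracy on these proteins is unexpected.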
We looked at the LocTree3 predictions for the five misclassified proteins (i.e. proteins for which LocTree3 predicted none of the experimentally annotated localization classes) with the highest reliability scores (RIs). One of the five (YG4O_YEAST, RI=38) was an uncharacterized protein, while for the remaining four we found experimental evidence for the predicted localization class in sources other than SWISS-PROT: (1) ZYM1_SCHPO is a metallothionein annotated in SWISS-PROT as localized to the nucleus and the cytoplasm. LocTree3 predicts this protein to be secreted with RI=98; we found experimental evidence for secreted metallothioneins in Moltedo et al. (8); (2) GPX41_MOUSE is annotated as localized to the mitochondrion and the cytoplasm, while LocTree3 predicts nucleus with RI=93, which is confirmed by Yant et al. (9); (3) NPC2_ASPOR is annotated as a cytoplasmic and Golgi apparatus protein; LocTree3, however, predicts it to be vacuolar with RI=43, which is true for the protein's ortholog NPC2_YEAST; (4) PEN2_CAEEL is annotated as localized to the ER membrane and the Golgi membrane; LocTree3 predicts mitochondrial membrane with RI=36, which is true according to Hansson et al. (10). Interestingly, for the protein with the lowest reliability index (CSN4_BRAOL, RI=6) and the predicted localization class chloroplast, we found evidence in Xiangjun et al. (11) stating that the protein acts as a suppressor of chloroplast development. SWISS-PROT annotates the protein as nuclear and cytoplasmic.
From these findings we conclude that the number of sequence-unique multi-localized proteins available in SWISS-PROT today is rather small and that the annotations of multiple localizations may be fuzzy and incomplete. Therefore, assessing prediction methods on these proteins may underestimate performance and lead to incorrect conclusions.
Figure S1: PSI-BLAST sequence identities to LocTree3 reliability scores. Localization annotation from sequence homologs is more accurate at higher PSI-BLAST pairwise sequence identity (PIDE) values. Shown is the percentage Accuracy/Coverage (Methods) at the given sequence identity thresholds for 995 eukaryotic and 202 bacterial proteins that had a PSI-BLAST hit with E-value≤10^-3 (6, 7). Since the method's performance did not change for PIDE<20, we formed LocTree3's reliability index by normalizing the sequence identity values according to (PIDE-20)*10/8. Note: the slight decrease of the Accuracy curves as PIDE approaches 100% results from annotations that changed in SWISS-PROT between releases 2011_04 and 2013_11. Although these proteins are predicted correctly with respect to release 2013_11, they are counted as false predictions in the current evaluation (Eukaryota: AIM37_YEAST, ECP_MACFA; Bacteria: ESPR_MYCTU).
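The normalization (PIDE-20)*10/8 maps PIDE=20 to 0 and PIDE=100 to 100. A minimal sketch follows. The function name and the optional clipping of values below PIDE=20 to zero are assumptions for illustration and are not described in the text.

```python
def reliability_index(pide: float, clip: bool = True) -> float:
    """Map a PSI-BLAST pairwise sequence identity (in percent) to a reliability index.

    Implements (PIDE - 20) * 10 / 8 as given in the text.
    Assumption (not from the source): with clip=True, results are limited to
    the range 0..100, so that PIDE < 20 yields 0.
    """
    ri = (pide - 20.0) * 10.0 / 8.0
    if clip:
        ri = max(0.0, min(100.0, ri))
    return ri


for pide in (20.0, 60.0, 88.0, 100.0):
    print(pide, reliability_index(pide))
```

For example, a homolog found at PIDE=60 would receive a reliability index of 50 under this formula.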
Figure S2: E-value thresholds for the homology-based inference. The performance of PSI-BLAST alone peaked at a permissive threshold (E-value≤10). However, to determine the threshold that decides when to use PSI-BLAST and when to use LocTree2, we also need to consider the performance of the final merger, LocTree3, at the same threshold. The optimal threshold for LocTree3 seemed to be much more conservative, namely E-value≤10^-3.
Data sets and the LocTree3 performance estimation as in Figure S3. Abbreviations used: Nprot, the number of proteins with known localization; Acc, accuracy; Cov, coverage; gAv, geometric average of Acc and Cov; Q(n), overall prediction accuracy. Standard errors were estimated by bootstrapping (Methods). Note 1: Q(n) is a six-state value for bacteria, i.e. the overall accuracy for classification into one of six localization classes, and an eighteen-state value for Eukaryota (Methods). Note 2: Only performances for localization classes containing more than ten proteins are reported. * = unrealistic upper or lower bound given by the standard error due to the small data set size.
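The two summary quantities in this legend can be sketched as follows. This assumes that gAv denotes the geometric mean sqrt(Acc*Cov) and that the bootstrap resamples proteins with replacement and reports the standard deviation of the resampled accuracies. The exact procedure is defined in Methods, which is not reproduced here, so the function names, the number of resamples and the seed are illustrative assumptions.

```python
import math
import random
import statistics
from typing import Sequence


def geometric_average(acc: float, cov: float) -> float:
    """Geometric mean of accuracy and coverage (both in percent)."""
    return math.sqrt(acc * cov)


def bootstrap_standard_error(correct: Sequence[bool],
                             n_resamples: int = 1000,
                             seed: int = 0) -> float:
    """Standard error of the overall accuracy Q (in percent) by bootstrapping.

    Assumption (not from the source): proteins are resampled with replacement
    and the standard deviation over resampled Q values is reported.
    """
    rng = random.Random(seed)
    n = len(correct)
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)  # draw n proteins with replacement
        estimates.append(100.0 * sum(sample) / n)
    return statistics.stdev(estimates)


# Toy data: 80 correct and 20 incorrect predictions (hypothetical numbers).
toy = [True] * 80 + [False] * 20
print(geometric_average(80.0, 45.0), bootstrap_standard_error(toy))
```

Small sets give wide error estimates under such a scheme, which is consistent with the note that some bounds become unrealistic for small data sets.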
*1 Cello 2.5: employs a system of Support Vector Machines to classify eukaryotic proteins in 12 and bacterial proteins in 5 classes using sequence-derived features (13)
*2 PSORTb 3.0: predicts four classes for Gram-positive and five classes for Gram-negative bacteria through a combination of several classifiers into a Bayesian network (14)
*3 WoLF PSORT: k-nearest neighbour classifier that predicts 12 localization classes for eukaryotes from sequence-derived features (15)
*4 YLoc: uses sequence-derived features together with GO terms to classify eukaryotic proteins in 11 localization classes through Naïve Bayes (16)
*5 LocTree2: de novo machine learning-based method, results valid for cross-validation
*6 LocTree3: combines de novo (LocTree2) and homology-based (PSI-BLAST) searches; it uses PSI-BLAST predictions (lookup at E-value≤10^-3 in a database of experimentally annotated proteins) if available, and LocTree2 (results from the cross-validation setting) otherwise
*7 data set Eukaryota: 1682 sequence-unique eukaryotic proteins in SWISS-PROT release 2011_04; for 995 of those we found PSI-BLAST hits, for 687 we did not
*8 data set Bacteria: 479 sequence-unique bacterial proteins in SWISS-PROT release 2011_04; for 202 of those we found PSI-BLAST hits, for 277 we did not
Note: Q is the overall prediction accuracy (Eqn. 3, Methods); "±" values refer to standard errors (Eqn. 4, Methods); bold face: "winner in each column"

Methods as in Table S6
*5 data set "Human proteins": 5016 human proteins with an experimental annotation of exactly one localization class in SWISS-PROT release 2014_03. The vast majority of these proteins are part of the training sets of the methods tested.
Note: "±" values refer to standard errors (Eqn. 4, Methods); bold face: "winner in each column"

Section S2: LocTree3 is much more reliable than blind homology-inference
Two recent advances in molecular biology make it impossible to blindly trust annotations.
The first are high-throughput experiments that typically change the value of an annotation from, e.g. "protein Q is native in the Golgi" to "protein Q has been detected to have entered the secretory pathway with a probability of 0.7". Clearly, using the second statement to annotate Q as extra-cellular would be very wrong. But what if we added "secretory pathway" as a new "class", should we then annotate it as in that class, or should we maintain the probability? If we maintained the probability: should this be counted as "localization annotated"? What about a protein Q2 that is sequence similar to Q: should we annotate its localization also to be "secretory pathway with 70% chance"? One simple experimental data point generates so many questions that cannot be answered without generating new problems! Thousands of such data points are being created by modern molecular biology every month.
The second problem is contained in the first, but is much more prevalent in today's databases, which are still heavily based upon detailed biochemical experiments. Assume that we have a reliable annotation for Q as Golgi: how should we treat proteins that are related to Q, for instance those related in terms of sequence similarity? This brings up the argument of Imai & Nakai (17), namely that PSI-BLAST predicts localization more accurately than de novo methods. Here we showed that this is true to some extent (Table 1: for some proteins PSI-BLAST is better than LocTree2), but that if predictions are forced, the opposite becomes true (Table 1: averaged over all proteins PSI-BLAST is much less accurate than LocTree2). Clearly, the tool we make available now, LocTree3, settles the discussion. Even if we were right that LocTree3 is the best method currently available to predict protein localization, should we apply it to annotate localization in databases that are exclusively based on experiments, such as SWISS-PROT (1)? We suggest a negative answer: leave experimental annotations as clean as possible. Should we then remove the almost 90% of all localization annotations in SWISS-PROT (as of Feb. 2014) that are based on non-experimental findings? What about a database that pulls in automated annotations, such as UniProt and/or GO (18)? Naïve users querying UniProt might get the impression that over 5m (million) proteins have annotations for localization, when the best we can do to develop prediction methods is to dig out a list of maybe 25k (thousand), i.e. 200 times fewer than suggested by that naïve sieve through UniProt. We argue that it would be better to remove the inferred annotations (5m minus 25k), to replace them by LocTree3 predictions marked as predictions, and possibly to augment this with predictions for all other 45m proteins in today's UniProt (total 52m in Feb. 2014).

Section S3: Possible sources for PSI-BLAST mis-predictions
The idea behind LocTree3 is to use PSI-BLAST if it finds hits and LocTree2, otherwise. Thus if a prediction of the sub-cellular localization is incorrect and is derived from PSI-BLAST, it cannot be 'corrected' by LocTree2 anymore.
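This decision rule can be sketched as follows, using the E-value≤10^-3 threshold reported for the homology lookup. All names are hypothetical, and the sketch assumes that the caller has already selected one PSI-BLAST hit against experimentally annotated proteins; how that hit is chosen is not modelled here.

```python
from typing import NamedTuple, Optional, Tuple

E_VALUE_THRESHOLD = 1e-3  # threshold reported for the homology-based lookup


class Prediction(NamedTuple):
    localization: str
    source: str  # "PSI-BLAST" or "LocTree2"


def loctree3_like_prediction(homolog_hit: Optional[Tuple[str, float]],
                             loctree2_class: str) -> Prediction:
    """Combine homology-based inference with a de novo prediction.

    homolog_hit: (annotated localization of the hit, E-value), or None if
    PSI-BLAST found no annotated homolog. Hypothetical interface.
    """
    if homolog_hit is not None:
        localization, e_value = homolog_hit
        if e_value <= E_VALUE_THRESHOLD:
            # A sufficiently significant hit wins; LocTree2 is not consulted.
            return Prediction(localization, "PSI-BLAST")
    # No hit, or hit not significant enough: fall back to the de novo method.
    return Prediction(loctree2_class, "LocTree2")


print(loctree3_like_prediction(("nucleus", 1e-10), "cytoplasm"))
print(loctree3_like_prediction(None, "cytoplasm"))
```

The first branch shows why an incorrect homology transfer cannot be repaired afterwards: once a hit passes the threshold, the de novo prediction is ignored.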
Nevertheless, we looked into the cases for which PSI-BLAST annotated proteins incorrectly. In our development set of 1682 eukaryotic proteins, 995 proteins were classified by PSI-BLAST; for the remaining 687 proteins it failed to identify a homolog in the data set of all experimentally annotated proteins. Of the 995 predicted proteins, 69 were misclassified. The most commonly misclassified pairs of classes (one observed, the other looked up from the homolog) were: mitochondria and chloroplast (9 times), plastid and chloroplast (8 times), cytoplasm and nucleus (8 times), cytoplasm and secreted (6 times), cytoplasm and mitochondria (5 times).
These pairs involve compartments that are close in space (e.g. cytoplasm and nucleus), closely related (chloroplasts are one of the three types of plastid) or very similar in structure (chloroplast and mitochondria). Therefore, the PSI-BLAST misclassifications may originate from incorrect experimental annotations, as well as from similarity in translocation signals. About 33% of the mistakes originated from "honest orthologs" (e.g. RK32_EUGGR, annotated as chloroplast but predicted as plastid like its ortholog RK32_ASTLO). The misclassification with the highest score (PIDE=88%) was made for ECP_MACFA, a protein for which the SWISS-PROT annotation has changed since LocTree3 development from cytosol to secreted, the latter correctly identified by PSI-BLAST. In other words, this mistake was based on an incorrect earlier annotation.