DeepLoc 2.1: multi-label membrane protein type prediction using protein language models

Abstract DeepLoc 2.0 is a popular web server for the prediction of protein subcellular localization and sorting signals. Here, we introduce DeepLoc 2.1, which additionally classifies the input proteins into the membrane protein types Transmembrane, Peripheral, Lipid-anchored and Soluble. Leveraging pre-trained transformer-based protein language models, the server utilizes a three-stage architecture for sequence-based, multi-label predictions. Comparative evaluations with other established tools on a test set of 4933 eukaryotic protein sequences, constructed following stringent homology partitioning, demonstrate state-of-the-art performance. Notably, DeepLoc 2.1 outperforms existing models, with the larger ProtT5 model exhibiting a marginal advantage over the ESM-1B model. The web server is available at https://services.healthtech.dtu.dk/services/DeepLoc-2.1.


DATA CURATION AND PARTITIONING
To obtain a high-quality dataset, we performed several curation steps and arrived at a final dataset of 25,240 eukaryotic protein samples (see Supplementary Table 2). From the raw dataset, we first removed duplicates arising from overlaps between the search criteria at UniProtKB. From the unique samples, we then removed positive labels whose subcellular location referred to isoforms (as only canonical sequences were retained in the dataset). Furthermore, all multi-label (>1 positive label) samples were screened and excluded if their annotation for subcellular localization contained any of the following terms: isoform, cleav[ed, ing], secreted form, process[ed] and shed[ded, ding] (stemmed words used for screening). This screening led to the deletion of 349 positive labels and a total loss of 16 samples. Five partitions were constructed by homology partitioning using GraphPart (1). A maximum pairwise cross-partition identity of 30% was employed, which led to the removal of 387 samples. At the time of homology partitioning, we were not aware of the problems posed by multi-chain samples. As it proved laborious to apply proper heuristics for cutting these proteins while retaining each chain's true membrane association, we removed them from the partitions to avoid introducing artificial multi-label samples.
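The term screening described above can be sketched as a simple stemmed-pattern match. This is a hypothetical re-implementation for illustration; the pattern list mirrors the stemmed terms in the text, but the actual curation scripts may differ.

```python
import re

# Stemmed terms used to flag suspect subcellular-location annotations
# (illustrative re-implementation of the screening described above).
SUSPECT_PATTERNS = [
    r"isoform",
    r"cleav(?:ed|ing)?",
    r"secreted form",
    r"process(?:ed)?",
    r"shed(?:ded|ding)?",
]
SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def is_suspect(annotation: str) -> bool:
    """Return True if a subcellular-location annotation matches any screened term."""
    return bool(SUSPECT_RE.search(annotation))
```

A multi-label sample whose annotation matches any of these stems would be excluded from the dataset.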
Moreover, sequences containing ambiguous amino acids [B, U, Z, X] were also removed after homology partitioning.
Lastly, prokaryotic samples were removed. These had initially been included for other experiments during training, but DeepLoc 2.1 was trained solely on eukaryotic data.

Transformer models
We use two publicly available transformer models: the 33-layer ESM-1B model with 650M parameters (2) and the 3B-parameter ProtT5-XL-UniRef50 model (3), referred to as ESM-1B and ProtT5. As the ESM-1B model cannot generate embeddings for sequences longer than 1022 residues, we cut longer sequences into shorter segments and merged the segment embeddings afterwards, retaining long sequences for the training of this model. To avoid cuts near the ends of the sequences, which could introduce artifacts in regions harbouring potential sorting or signal sequences, we ensured that sequences were cut closer to the middle of the protein.
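The mid-sequence cutting strategy can be illustrated as follows. This is a minimal sketch assuming near-equal segments bounded by ESM-1B's 1022-residue limit; the exact cut heuristic and embedding merge used for DeepLoc 2.1 may differ, and `embed_fn` stands in for a real embedding call.

```python
def cut_points(length: int, max_len: int = 1022):
    """Split a sequence of `length` residues into segments no longer than
    max_len, with cut points spread evenly so no cut lands near the
    sequence ends (sketch of the idea described above)."""
    if length <= max_len:
        return [(0, length)]
    n_segments = -(-length // max_len)  # ceiling division
    bounds = [round(i * length / n_segments) for i in range(n_segments + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def embed_long(sequence, embed_fn, max_len: int = 1022):
    """Embed a long sequence segment-wise and concatenate the per-residue
    embeddings back into one list, one entry per residue."""
    parts = [embed_fn(sequence[s:e]) for s, e in cut_points(len(sequence), max_len)]
    return [vec for part in parts for vec in part]
```

For a 2000-residue protein this yields two segments of 1000 residues each, cut at the middle rather than 1022/978.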

Implementation details
For details on the implementation and theory behind the focal loss function and the DCT-prior-based regularization employed in the architecture of DeepLoc 2.1, please refer to the Supplementary Material of DeepLoc 2.0 (4); the implementation used in DeepLoc 2.1 is identical to that of DeepLoc 2.0.
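For reference, the standard per-label binary focal loss of Lin et al., on which such multi-label setups build, can be written as follows. The gamma and alpha defaults shown here are the common ones from the original focal loss paper, not necessarily the values used by DeepLoc 2.1.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Standard binary focal loss for one label: easy, well-classified
    examples are down-weighted by the (1 - p_t)^gamma factor."""
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
```

With gamma = 0 and alpha = 0.5, this reduces to half the ordinary binary cross-entropy, which is a convenient sanity check.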

Training details
Every model was trained for a maximum of 30 epochs unless training was stopped earlier by an early stopping criterion with a patience of 5 epochs. The focal loss (see Section 2.2) was minimized using the AdamW optimizer. A learning rate scheduler halved the learning rate if no improvement in the focal loss was observed for four epochs. Training was performed using 4-fold cross-validation, leaving out the fifth partition solely for testing and benchmarking against external tools. The PyTorch Lightning (5) library was used for training and testing of the models, carried out on an HPC cluster using Nvidia Tesla V100 GPUs with 16 GB of VRAM.
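The early-stopping and learning-rate-halving logic described above can be sketched in plain Python. The patience values come from the text; the actual implementation relies on PyTorch Lightning callbacks, so this class is only an illustration of the control flow.

```python
class PlateauControl:
    """Sketch of the training control described above: halve the learning
    rate after `lr_patience` epochs without improvement, and stop after
    `stop_patience` epochs without improvement (the paper uses 4 and 5)."""

    def __init__(self, lr: float, lr_patience: int = 4, stop_patience: int = 5):
        self.lr = lr
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.best = float("inf")
        self.lr_bad = 0      # epochs without improvement since last LR cut
        self.stop_bad = 0    # epochs without improvement since last best
        self.stop = False

    def step(self, val_loss: float) -> None:
        if val_loss < self.best:
            self.best = val_loss
            self.lr_bad = 0
            self.stop_bad = 0
            return
        self.lr_bad += 1
        self.stop_bad += 1
        if self.lr_bad >= self.lr_patience:
            self.lr *= 0.5   # halve the learning rate on plateau
            self.lr_bad = 0
        if self.stop_bad >= self.stop_patience:
            self.stop = True  # early stopping triggered
```

After two improving epochs followed by five flat epochs, the learning rate has been halved once and training stops.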

Model architecture and hyperparameter optimization
The models developed for DeepLoc 2.1 generally follow the same methodology and architecture as those of DeepLoc 2.0. A schematic overview of the models can be seen in Figure 1 in the Appendix Section 5.1. For hyperparameter optimization we used the Optuna framework (6) in combination with PyTorch Lightning.
Optuna employs a Tree-structured Parzen Estimator (TPE) algorithm to sample new hyperparameters between trials, efficiently narrowing down the search space towards optimal hyperparameters. The hyperparameters we included can be seen in Table 5 along with their search spaces. A pruner was also implemented in the objective function to interrupt unpromising trials caused by poor hyperparameter choices, reducing training time (7). Moreover, dropout was applied to the outputs of the attention head and the other hidden layers between the attention head and the final output layer, to improve generalization and reduce the risk of overfitting.
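An Optuna-style objective with epoch-wise pruning typically has the following shape. The hyperparameter names and ranges, and the placeholder training function, are illustrative only and are not the actual Table 5 search space; the trial interface mirrors Optuna's.

```python
def train_one_epoch(lr, dropout, hidden_dim, epoch):
    """Placeholder validation loss; in reality this would run one epoch of
    cross-validated training with the sampled hyperparameters."""
    return 1.0 / (epoch + 1)


def objective(trial):
    """Objective function: sample hyperparameters, train, report intermediate
    losses so the pruner can interrupt unpromising trials early."""
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    hidden_dim = trial.suggest_categorical("hidden_dim", [128, 256, 512])

    best = float("inf")
    for epoch in range(5):
        val_loss = train_one_epoch(lr, dropout, hidden_dim, epoch)
        trial.report(val_loss, epoch)     # expose intermediate value to the pruner
        if trial.should_prune():          # pruner interrupts unpromising trials
            raise RuntimeError("pruned")  # optuna.TrialPruned() in real code
        best = min(best, val_loss)
    return best
```

In real use, `objective` would be passed to `study.optimize`, with the TPE sampler and a pruner configured on the study.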

BASELINE EVALUATION
Baseline assessment was carried out using MMseqs2 (8) for sequence alignment along with an alignment-based classifier, in which positive labels were inferred from the true labels of the highest-scoring hit from another partition. The results of this analysis can be seen in Supplementary Table 6 and compared with Table 2 in the main article.

BENCHMARKING EXISTING METHODS
To benchmark DeepLoc 2.1 against other established tools for membrane protein type prediction, we had to modify the test set (partition V in Table 4) to allow a fair comparison. This was done to avoid reporting artificially inflated performance of the homology-based models caused by homology leakage between the test set and the search database used by the model. MemPype (9) also has a differently structured output format, as it does not distinguish peripheral membrane proteins from soluble proteins. Additionally, the MemPype and MemType-2L (10) servers are single-label predictors, which the comparison also had to account for.
For external model comparison, we used the Selenium library in Python to automate submitting sequences to servers and extracting predictions.

Mem-ADSVM
Mem-ADSVM is a two-layer multi-label homology-based model that uses a support vector machine (SVM) to make its membrane-type predictions (11). It does so based on the frequencies of occurrence of the GO terms associated with the sequence. The GO terms are retrieved from a compact database, referred to as the ProSeq-GO database, by searching for homologous sequences. We did not have access to this database (11) and therefore could not assess which samples from the independent test set might be present in it, which would give the search engine a considerable advantage in the performance comparison. Furthermore, at the time of submitting this paper, we were no longer able to access the server, and it does not appear to be available anymore. Evaluating the models on the entire test set showed Mem-ADSVM outperforming our models. To assess whether this high performance was due to homology overlap between the test set and the ProSeq-GO database, we removed all test-set samples present in UniProt before 2015 (assuming the database had not been updated since the release of the paper). This left us with a subset of 803 independent samples. The new comparison yielded markedly different results, providing a more truthful measure of the generalization capability of Mem-ADSVM. The output of the Mem-ADSVM server is reported in a format that is conveniently translated into the four membrane protein types that DeepLoc 2.1 distinguishes between. The translation between outputs used for the comparison can be seen in Table 7.

MemPype
The MemPype server is a single-label predictor developed for membrane-type predictions of eukaryotic proteins (9). The model uses two pipelines for its predictions. The main pipeline, to which we compare our performance, includes a multi-step prediction stage that utilizes other available prediction tools for various tasks, e.g. signal peptide prediction with the SPEPlip server (12) and prediction of GPI-anchor propeptides with PredGPI (13). For comparison we use the specific output under the prediction summary of the server. The output distinguishes different types of transmembrane and lipid-anchored proteins, but these outputs are straightforwardly converted to the classes used in this project. The translation between the MemPype and DeepLoc outputs can be seen in Table 8. As the server is designed specifically for single-label predictions and accommodates only eukaryotic sequences, we isolated eukaryotic accessions from the test set featuring single-label annotations for peripheral, transmembrane and lipid-anchored proteins. In addition to these server specifications, the MemPype pipeline cannot differentiate between non-membrane proteins and peripheral membrane proteins; consequently, the server always attempts to infer some membrane type for the protein. For a protein that is exclusively positive for the soluble class, the MemPype model will generate an output containing phrases such as "cell membrane", "internal membrane" or "organelle membrane", concluding with the term "globular". Due to this limitation, we decided to also include multi-label samples positive for the peripheral and soluble labels, and to merge these with the single-label peripheral and soluble classes, resulting in a multi-class classification encompassing three membrane types for comparative analysis. This approach yielded a test set comprising 4431 samples. As a consequence of this class construction, we merged the multi-label sigmoid predictions of samples with a true label for the soluble and peripheral classes, and inferred a correct prediction if our models had predicted either of the two positive classes.
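The merged scoring rule can be sketched as follows. Label names and the 0.5 sigmoid threshold are illustrative assumptions, not confirmed details of the evaluation code.

```python
def merged_correct(true_labels, sigmoid_scores, threshold=0.5):
    """Sketch of the merged evaluation described above: for samples whose
    true labels are exactly {Peripheral, Soluble}, a prediction counts as
    correct if the model predicted either class; otherwise exact-set match."""
    predicted = {label for label, p in sigmoid_scores.items() if p >= threshold}
    if true_labels == {"Peripheral", "Soluble"}:
        return bool(predicted & true_labels)  # either class suffices
    return predicted == true_labels
```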

MemType-2L
MemType-2L is a two-layer single-label predictor that employs a pseudo position-specific scoring matrix (Pse-PSSM) and an optimized evidence-theoretic K-nearest-neighbors ensemble classifier (OET-KNN) (10). Like the other tools assessed for comparison, this model distinguishes between different types of transmembrane and lipid-anchored membrane proteins, which are conveniently translated into the membrane classes of this project. The translation between the MemType-2L and DeepLoc outputs can be seen in Table 9. To allow a fair comparison, we removed all multi-label samples from the independent test set along with sequences shorter than 50 AA, the shortest sequence length allowed for submission to the server. This resulted in a test set of 4414 sequences. Finally, we applied softmax to the raw outputs of the final layer of DeepLoc 2.1 to obtain multi-class predictions.
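The softmax conversion from raw final-layer outputs to a single multi-class prediction can be sketched as below. The class ordering is an assumption for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw final-layer outputs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits, classes):
    """Return the single class with the highest softmax probability."""
    probs = softmax(logits)
    return classes[max(range(len(probs)), key=probs.__getitem__)]
```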

Non-available models
Various other models have been described in the literature and stated to be, or to become, publicly available. However, most of these models are currently inaccessible due to expired or non-working server links. Additionally, some models were not relevant for comparison, as they only infer membrane-bound or non-membrane-bound. Where servers did not appear available or links seemed broken, we attempted to contact the authors for access to the model. Despite our efforts we did not receive any responses, and we therefore excluded those models from our analysis. The models we tried to access were:
• iMem-Seq (14) (link not working),
• iMem-2SLAAC (15) (establishment of a web server mentioned as future work),
• PMMBF (16) (establishment of a web server mentioned as future work),
• BinMemPredict (17) (link not working),
• ProtLoc (18) (from 1997, deemed outdated based on the results of other studies (15)),
• Toot-M (19) (not included as it only infers membrane-bound or not),
• Ali and Hayat (20) (establishment of a web server mentioned as future work), and

Table 1.
Estimated time usage on the web server per sequence. The time for prediction and plotting increases proportionally with the number of sequences, while the model load time is constant for any number of sequences.

Table 2.
Steps of data curation.

Table 3.
UniProt dataset: number of proteins and translation between membrane protein types and UniProt sublocations.
Supplementary Table 4.
Distribution of data between partitions. Partition V is the held-out test set, while the other partitions were used for cross-validation during training. µ is the average.

Table 5.
Overview of hyperparameters that were tuned during training along with their search space.