PractiCPP: a deep learning approach tailored for extremely imbalanced datasets in cell-penetrating peptide prediction

Abstract
Motivation: Effective drug delivery systems are paramount in enhancing pharmaceutical outcomes, particularly through the use of cell-penetrating peptides (CPPs). These peptides are gaining prominence due to their ability to penetrate eukaryotic cells efficiently without inflicting significant damage to the cellular membrane, thereby ensuring optimal drug delivery. Yet the identification and characterization of CPPs remain challenging because conventional methods are laborious and time-consuming, despite advances in proteomics. Current computational models, moreover, are predominantly tailored for balanced datasets, an approach that falls short in real-world applications characterized by a scarcity of known positive CPP instances.
Results: To address this shortfall, we introduce PractiCPP, a novel deep-learning framework tailored for CPP prediction in highly imbalanced data scenarios. By integrating hard negative sampling with a sophisticated feature extraction and prediction module, PractiCPP learns effectively from imbalanced data. Our extensive computational validations highlight PractiCPP’s ability to outperform existing state-of-the-art methods, even on datasets with an extreme positive-to-negative ratio of 1:1000. Furthermore, through methodical embedding visualizations, we show that models trained on balanced datasets are not conducive to practical, large-scale CPP identification, as they do not accurately reflect real-world complexities. In summary, PractiCPP potentially offers new perspectives in CPP prediction methodologies. Its design and validation, informed by real-world dataset constraints, suggest its utility as a valuable tool in supporting the acceleration of drug delivery advancements.
Availability and implementation: The source code of PractiCPP is available on Figshare at https://doi.org/10.6084/m9.figshare.25053878.v1.


S1 Evaluation metrics
For experiments where the ratio of positives to negatives is 1:1, we adopt four metrics to evaluate models: accuracy, sensitivity (or recall), specificity, and the Matthews correlation coefficient (MCC). Note that sensitivity is crucial for evaluating how well the model captures positive occurrences, and specificity indicates the model's ability to identify true negative instances accurately.
In contrast, for experiments where the ratio of positives to negatives is 1:1000, metrics such as accuracy can be misleading [1], so we use metrics that account for this data imbalance: recall (or sensitivity), precision, F1 score, and FP per correct. In addition, AUPR is used to capture the trade-off between precision and recall. Note that precision is crucial for monitoring what fraction of the predicted positives are correct: given the large number of negative instances, even a small percentage of misclassifications can translate into a large absolute number of false positives.
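The effect of imbalance on these metrics can be illustrated with a small sketch. The counts below are hypothetical and are not results from the paper, the helper name `confusion_metrics` is ours, and "FP per correct" is assumed here to mean the number of false positives per true positive (FP/TP).

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute imbalance-aware metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Assumption: FP per correct = false positives per true positive.
        "fp_per_correct": fp / tp if tp else float("inf"),
    }


# Hypothetical 1:1000 test set: 100 positives, 100,000 negatives.
# Hypothetical classifier: recovers 80 positives, mislabels 1% of negatives.
m = confusion_metrics(tp=80, fp=1000, tn=99000, fn=20)

# Trivial classifier that predicts "negative" for every peptide.
trivial = confusion_metrics(tp=0, fp=0, tn=100000, fn=100)

for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

In this hypothetical setting, accuracy is about 0.99 even though precision is only about 0.074 and there are 12.5 false positives for every correctly identified positive. The trivial all-negative classifier exceeds 0.999 accuracy with zero recall, which is why accuracy is not informative at this ratio.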
The evaluation metrics used in the paper are computed as follows:
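As a reference, the standard confusion-matrix definitions of these metrics are given below, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives. The definition of FP per correct as FP/TP is an assumption based on the metric's name; the other formulas are the conventional ones. AUPR is the area under the precision-recall curve obtained by sweeping the decision threshold.

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Sensitivity} = \text{Recall} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
  {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}
  {\text{Precision} + \text{Recall}} \\
\text{FP per correct} &= \frac{FP}{TP} \quad \text{(assumed definition)}
\end{align*}
```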

S2 Hardness study
To examine hard negative sampling in PractiCPP, we conduct a hardness study that takes a closer look at how hard negatives contribute to the model's performance improvements. Notably, a larger K generally means that harder negatives are selected. Specifically, a larger K means a larger negative candidate pool drawn from the overall negative set, raising the likelihood that the pool includes truly hard negatives. Thus, the top 3 × |positives in batch| negatives scored by the model are more likely to be truly challenging. Note that at K = 3, the hard negative sampling strategy degrades to uniform sampling.
In Figure S1a and Figure S1b, we show AUPR and AUROC values under various K settings (K ∈ {3, 9, 15, 21, 30}). In our evaluations, the optimal performance for PractiCPP is observed at K = 9, where the AUPR peaks at 0.6400 and the AUROC reaches 0.9600. As K increases further, AUPR drops to around 0.6200, and AUROC falls from 0.96 to 0.93 at K = 30. Harder negatives therefore do not necessarily yield better results; selecting an appropriate K is critical. One potential explanation is that excessively challenging negatives could include potential positive samples, thus misleading the model training.
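The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PractiCPP implementation: we assume the candidate pool contains K × |positives in batch| negatives drawn uniformly, that the model assigns each candidate a score where higher means more CPP-like, and that the top 3 × |positives in batch| candidates are kept. The function name, the `score_fn` callable, and the toy data are all hypothetical.

```python
import random


def sample_hard_negatives(negatives, score_fn, num_positives, k,
                          keep_factor=3, rng=None):
    """Pick the highest-scoring negatives from a uniformly drawn pool.

    negatives     : list of negative samples
    score_fn      : hypothetical callable, higher score = harder negative
    num_positives : number of positives in the current batch
    k             : pool size multiplier (pool = k * num_positives)
    keep_factor   : negatives kept per positive (3 in the text above)
    """
    rng = rng or random.Random()
    pool_size = min(k * num_positives, len(negatives))
    keep = min(keep_factor * num_positives, pool_size)
    # Step 1: draw the candidate pool uniformly from all negatives.
    pool = rng.sample(negatives, pool_size)
    # Step 2: rank candidates by model score, hardest first.
    pool.sort(key=score_fn, reverse=True)
    # Step 3: keep the top-scoring candidates as hard negatives.
    return pool[:keep]


# Toy data: integer ids as negatives, with a stand-in scoring function.
negatives = list(range(1000))
score = lambda i: i / 1000

hard = sample_hard_negatives(negatives, score, num_positives=4, k=9,
                             rng=random.Random(0))
uniform = sample_hard_negatives(negatives, score, num_positives=4, k=3,
                                rng=random.Random(0))
print(len(hard), len(uniform))
```

With `k=3` the pool and the kept set have the same size, so every sampled candidate is kept and the procedure reduces to uniform sampling, matching the note above. Larger `k` values widen the pool, so the kept set is drawn from the high-score tail of more candidates.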

S3 The importance of Morgan fingerprint
The observations from Fig. 3 in the manuscript are as follows:
• Peptides are grouped into distinct clusters. This could be due to specific chemical functional groups or substructures within the peptides, causing them to have similar fingerprint representations and thus cluster together.
• The fingerprint distributions of CPPs and non-CPPs are close but exhibit a shift (Fig. 3a). This may be because the non-CPPs sourced from CPP924 are peptides that resemble CPPs in their attributes but demonstrate low cell-penetrating capability in wet-lab experiments. The shift in their fingerprint distributions may provide insights into distinguishing CPPs from non-CPPs.
• Unlabeled peptides are spread over a wider range of clusters, whereas CPPs and non-CPPs sourced from CPP924 concentrate in a few clusters (Fig. 3b). This suggests that the extensive set of natural peptides has greater structural and attribute diversity, and that Morgan fingerprint information is beneficial in distinguishing CPPs from the vast array of unlabeled peptides.
From a biological perspective, peptides typically penetrate cellular membranes through various mechanisms, such as passive penetration, translocation, and endocytosis [2]. After peptides enter the cell via endocytosis, an important subsequent process, namely endosomal release, might also occur [3]. The occurrence and efficiency of these processes depend largely on the specific functional groups present in the peptides. For instance, charged amino acids may engage in charge-charge interactions with glycoproteins on the surface of the cellular membrane [4], facilitating subsequent membrane penetration processes, while hydrophobic groups aid in the penetration of the hydrophobic cell membrane. Overall, the presence of specific functional groups in peptides is a critical factor in assessing their ability to penetrate cell membranes. The Morgan fingerprint, a descriptor widely used in molecular description, inherently includes detailed information about functional groups and is thus extensively employed in predicting the properties of molecules [5].

Figure S1: (a) The influence of the selected negatives' hardness level on the AUPR performance of PractiCPP. (b) The influence of the selected negatives' hardness level on the AUROC performance of PractiCPP. A larger K indicates harder negative selection.