A two-stage computational framework for identifying antiviral peptides and their functional types based on contrastive learning and multi-feature fusion strategy

Abstract Antiviral peptides (AVPs) have shown potential in inhibiting viral attachment, preventing viral fusion with host cells and disrupting viral replication due to their unique action mechanisms. They have now become a promising broad-spectrum antiviral therapy. However, identifying effective AVPs is traditionally slow and costly. This study proposes a new two-stage computational framework for AVP identification. The first stage identifies AVPs from a wide range of peptides, and the second stage recognizes AVPs targeting specific families or viruses. This method integrates contrastive learning and a multi-feature fusion strategy, focusing on sequence information and peptide characteristics, significantly enhancing predictive ability and interpretability. The evaluation results show excellent performance, with an accuracy of 0.9240 and a Matthews correlation coefficient (MCC) of 0.8482 on the non-AVP independent dataset, and an accuracy of 0.9934 and an MCC of 0.9869 on the non-AMP independent dataset. Furthermore, our model can predict antiviral activities of AVPs against six key viral families (Coronaviridae, Retroviridae, Herpesviridae, Paramyxoviridae, Orthomyxoviridae, Flaviviridae) and eight viruses (FIV, HCV, HIV, HPIV3, HSV1, INFVA, RSV, SARS-CoV). Finally, to facilitate user accessibility, we built a user-friendly web interface deployed at https://awi.cuhk.edu.cn/~dbAMP/AVP/.


DistancePair
The DistancePair encoding incorporates the amino acid distance pair coupling information and the amino acid reduced alphabet profile into the general pseudo amino acid composition vector. The reduced alphabet profiles are cp(13), cp(14), and cp(15), defined as follows:

cp(13) = {MF; IL; V; A; C; WYQHP; G; T; S; N; RK; D; E}
cp(14) = {EIMV; L; F; WY; G; P; C; A; S; T; N; HRKQ; E; D}
cp(15) = {P; G; E; K; R; Q; D; S; N; T; H; C; I; V; W; YF; A; L; M}

Letters that are not separated by a semicolon (;) belong to the same cluster.
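
As an illustration of the reduced alphabet step only (not the full DistancePair encoding), the following minimal Python sketch maps each residue of a peptide to the index of its cluster in cp(13) as listed above. The function names `build_cluster_map` and `reduce_sequence` are hypothetical and chosen for this example.

```python
# Reduced alphabet cp(13) as listed in the text; clusters are separated by ';'.
CP13 = "MF;IL;V;A;C;WYQHP;G;T;S;N;RK;D;E"


def build_cluster_map(profile):
    """Map each residue letter to the index of its cluster in the profile string."""
    mapping = {}
    for index, cluster in enumerate(profile.split(";")):
        for residue in cluster.strip():
            mapping[residue] = index
    return mapping


def reduce_sequence(sequence, mapping):
    """Replace each residue by its cluster index (None for unknown letters)."""
    return [mapping.get(residue) for residue in sequence.upper()]


cp13_map = build_cluster_map(CP13)
# M and F share cluster 0; I and L share cluster 1.
print(reduce_sequence("MFIL", cp13_map))
```

Residues in the same cluster receive the same index, so a sequence over 20 letters is rewritten over 13 symbols before any pair statistics are computed.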

CKSAAGP
The Composition of k-Spaced Amino Acid Group Pairs (CKSAAGP) encoding is a variation of the CKSAAP encoding, which is our own proposal. It calculates the frequency of amino acid group pairs separated by any k residues (the default maximum value of k is set as 5). Taking k = 0 as an example, there are 25 0-spaced group pairs (i.e., g1g1, g1g2, g1g3, … g5g5). Thus, a feature vector of CKSAAGP can be defined as:

$$\left(\frac{N_{g1g1}}{N_{total}}, \frac{N_{g1g2}}{N_{total}}, \frac{N_{g1g3}}{N_{total}}, \ldots, \frac{N_{g5g5}}{N_{total}}\right)_{25}$$

The value of each descriptor denotes the composition of the corresponding residue group pair in a protein or peptide sequence. For instance, if the residue group pair g1g1 appears m times in the protein, the composition of the residue pair g1g1 is equal to m divided by the total number of 0-spaced residue pairs (Ntotal) in the protein.
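
The counting procedure described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this study. The five groups g1 to g5 are not defined in this section, so the grouping below (aliphatic, aromatic, positively charged, negatively charged, uncharged) is an assumption, as is the function name `cksaagp`. Pairs containing a residue outside the 20 standard amino acids are skipped.

```python
from itertools import product

# ASSUMPTION: this five-group partition is a commonly used physicochemical
# grouping; the groups g1..g5 are not defined in this section.
GROUPS = {
    "g1": "GAVLMI",  # aliphatic
    "g2": "FYW",     # aromatic
    "g3": "KRH",     # positively charged
    "g4": "DE",      # negatively charged
    "g5": "STCPNQ",  # uncharged
}
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}
GROUP_PAIRS = [a + b for a, b in product(GROUPS, repeat=2)]  # 25 pairs, g1g1..g5g5


def cksaagp(sequence, k_max=5):
    """Return 25 * (k_max + 1) group-pair compositions for k = 0..k_max."""
    sequence = sequence.upper()
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(GROUP_PAIRS, 0)
        total = 0
        # Residues i and i + k + 1 are separated by exactly k residues.
        for i in range(len(sequence) - k - 1):
            first = RESIDUE_TO_GROUP.get(sequence[i])
            second = RESIDUE_TO_GROUP.get(sequence[i + k + 1])
            if first is None or second is None:
                continue  # skip pairs with non-standard residues
            counts[first + second] += 1
            total += 1
        # Composition = count / Ntotal; all zeros if no pair exists for this k.
        features.extend(counts[p] / total if total else 0.0 for p in GROUP_PAIRS)
    return features


vector = cksaagp("GAFK", k_max=0)
print(len(vector), vector[:2])
```

With the default k_max = 5 the vector has 25 × 6 = 150 dimensions; for peptides shorter than k + 2 residues the block for that k is all zeros.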

QSOrder
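For each amino acid type, a quasi-sequence-order descriptor can be defined as:

$$X_r = \frac{f_r}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{nlag} \tau_j}, \quad r = 1, 2, \ldots, 20$$

where f_r is the normalized occurrence of amino acid type r and w is a weighting factor (w = 0.1); nlag and τ_d have the same definitions as described above. These are the first 20 quasi-sequence-order descriptors. The other 30 quasi-sequence-order descriptors are defined as:

$$X_d = \frac{w \, \tau_{d-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{nlag} \tau_j}, \quad d = 21, 22, \ldots, 20 + nlag$$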
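
A minimal Python sketch of the quasi-sequence-order computation is given below. It assumes the standard form of the descriptor: each normalized occurrence f_r, and each weighted coupling number w·τ_d, is divided by the common denominator Σf + w·Στ. The coupling number τ_d is assumed here to be the sum of squared distances between residues d positions apart. The physicochemical distance matrix is not given in this section, so `toy_distance` is a placeholder (0 for identical residues, 1 otherwise) and must be replaced by a real distance matrix in practice. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_distance(a, b):
    # PLACEHOLDER ASSUMPTION: not a physicochemical distance matrix.
    return 0.0 if a == b else 1.0


def qsorder(sequence, nlag=30, w=0.1, distance=toy_distance):
    """Return 20 + nlag quasi-sequence-order descriptors for a peptide."""
    sequence = sequence.upper()
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    # tau[d-1]: sum of squared distances between residues d positions apart
    # (zero when the peptide is shorter than d + 1 residues).
    tau = [
        sum(distance(sequence[i], sequence[i + d]) ** 2 for i in range(n - d))
        for d in range(1, nlag + 1)
    ]
    # f_r: normalized occurrence of each amino acid type.
    freq = [sequence.count(aa) / n for aa in AMINO_ACIDS]
    denominator = sum(freq) + w * sum(tau)
    composition_part = [f / denominator for f in freq]
    order_part = [w * t / denominator for t in tau]
    return composition_part + order_part


descriptors = qsorder("ACDKKLLA")
print(len(descriptors))
```

Because both parts share one denominator, the 20 + nlag descriptors sum to 1 for any sequence of standard residues, which is a convenient sanity check.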

DDE
The Dipeptide Deviation from Expected Mean (DDE) encoding offers a statistical perspective on the dipeptide distribution in proteins. It measures the deviation between the observed frequency of a dipeptide and its theoretically expected occurrence based on codon frequencies. DDE can reveal patterns or biases in protein composition. If a specific dipeptide appears more or less frequently than expected, it suggests that there might be evolutionary, functional, or structural reasons for this deviation. The DDE encoding is formulated by three parameters: dipeptide composition (Dc), theoretical mean (Tm), and theoretical variance (Tv). These three parameters and the DDE are computed as follows. Dc(r,s), the dipeptide composition measure for the dipeptide 'rs', is given as

$$D_c(r,s) = \frac{N_{rs}}{N-1}$$

where Nrs is the number of dipeptides represented by amino acid types r and s and N is the length of the protein or peptide. Tm(r,s), the theoretical mean, is given by

$$T_m(r,s) = \frac{C_r}{C_N} \times \frac{C_s}{C_N}$$

where Cr is the number of codons that code for the first amino acid and Cs is the number of codons that code for the second amino acid in the given dipeptide 'rs'. CN is the total number of possible codons, excluding the three stop codons (i.e., 61). Tv(r,s), the theoretical variance of the dipeptide 'rs', is given by:

$$T_v(r,s) = \frac{T_m(r,s)\,\bigl(1 - T_m(r,s)\bigr)}{N-1}$$

Finally, DDE(r,s) is calculated as:

$$DDE(r,s) = \frac{D_c(r,s) - T_m(r,s)}{\sqrt{T_v(r,s)}}$$

In this study, we also trained models without reducing the number of negative samples, which total 5116 for the non-AVP-unbalanced dataset and 4979 for the non-AMP-unbalanced dataset.
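
The DDE computation described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this study: the number of dipeptides in a sequence of length N is taken as N − 1, the codon counts come from the standard genetic code, and the function name `dde` is hypothetical. Dipeptides containing non-standard residues do not contribute to any feature.

```python
from math import sqrt

# Number of codons per amino acid in the standard genetic code (sums to 61).
CODONS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4, "H": 2,
    "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6, "T": 4, "W": 1,
    "Y": 2, "V": 4,
}
CN = 61  # total codons excluding the three stop codons


def dde(sequence):
    """Return a dict mapping each of the 400 dipeptides to its DDE value."""
    sequence = sequence.upper()
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    # Count the overlapping dipeptides observed in the sequence.
    counts = {}
    for i in range(n - 1):
        pair = sequence[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    features = {}
    for r in CODONS:
        for s in CODONS:
            dc = counts.get(r + s, 0) / (n - 1)          # dipeptide composition
            tm = (CODONS[r] / CN) * (CODONS[s] / CN)     # theoretical mean
            tv = tm * (1 - tm) / (n - 1)                 # theoretical variance
            features[r + s] = (dc - tm) / sqrt(tv)       # deviation from mean
    return features


values = dde("MKLLLA")
print(len(values), round(values["LL"], 3))
```

Positive values mark dipeptides that occur more often than the codon-based expectation, negative values those that occur less often.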

Figure S3. Visualization of positive and negative samples in different modules of the model on the non-AVP dataset. The visualized modules include (A) and (B) input features, (C) the contrastive learning module, (D) the feature-enhanced transformer module, and (E) the prediction module. The blue dots denote AVPs and the yellow dots denote non-AVPs.

Figure S4. Contribution of different encodings in the feature-enhanced transformer module to first-stage identification. (A) Ratio of the total impact of different peptide encodings on the prediction. (B) The normalized average feature importance associated with the dimension of that peptide encoding.

Figure S5. The top 30 important features in the first stage of identification of the feature-enhanced transformer module. The left bar plot represents the feature importance calculated by the averaged absolute Shapley value. The right beeswarm plot gives the effect of different feature values on the prediction.

Figure S6. Diagram of the web interface using example sequences. The workflow includes the following steps: sequence input, model selection, function type selection and prediction. The prediction results page is then displayed, including the submission summary, results statistics and results display.