Protein–DNA binding sites prediction based on pre-trained protein language model and contrastive learning

Abstract Protein–DNA interaction is critical for life activities such as replication, transcription and splicing. Identifying protein–DNA binding residues is essential for modeling their interaction and downstream studies. However, developing accurate and efficient computational methods for this task remains challenging. Improvements in this area have the potential to drive novel applications in biotechnology and drug design. In this study, we propose a novel approach called Contrastive Learning And Pre-trained Encoder (CLAPE), which combines a pre-trained protein language model and the contrastive learning method to predict DNA binding residues. We trained the CLAPE-DB model on the protein–DNA binding sites dataset and evaluated the model performance and generalization ability through various experiments. The results showed that the area under ROC curve values of the CLAPE-DB model on the two benchmark datasets reached 0.871 and 0.881, respectively, indicating superior performance compared to other existing models. CLAPE-DB showed better generalization ability and was specific to DNA-binding sites. In addition, we trained CLAPE on different protein–ligand binding sites datasets, demonstrating that CLAPE is a general framework for binding sites prediction. To facilitate the scientific community, the benchmark datasets and codes are freely available at https://github.com/YAndrewL/clape.


Introduction
Interactions between proteins and ligands, including protein-protein, protein-small molecule, and protein-nucleic acid interactions, underlie almost all life activities in organisms. As carriers of genetic information, DNA molecules bind proteins and thereby play a crucial role in many biological processes, including DNA transcription, replication, expression, signal transduction, and metabolism 1,2. In prokaryotic and eukaryotic species, approximately 3% and 7% of the genome encodes DNA-binding proteins, respectively 3. Transcription factors (TFs) are a representative group of DNA-binding proteins that regulate transcription by binding to specific DNA sequences known as motifs. TFs are involved in various biological processes, including immune response 4 and maintenance of stem cell pluripotency 5, and TF dysfunction is related to numerous human diseases, such as various types of cancer and neurodegenerative diseases 6,7. Additionally, other DNA-binding proteins, such as histones, DNA polymerases, and DNA topoisomerases, also play critical roles in biological activities and are associated with human diseases 8,9.
Identifying the DNA-binding sites of a protein is the initial step for modeling protein-DNA binding properties. Several experimental approaches have been developed for identifying protein-DNA interactions in vivo or in vitro, such as systematic evolution of ligands by exponential enrichment (SELEX) and chromatin immunoprecipitation (ChIP) 10,11. In addition, structural biology approaches, including X-ray crystallography and nuclear magnetic resonance (NMR), have been applied to determine DNA-binding residues and regions. Although experimental methods based on molecular biology have made significant contributions over the past few decades, these methods are time-consuming and resource-intensive. Therefore, computationally predicting DNA-binding residues with machine learning methods is attractive.
The vital step in building a predictor is representation learning, where discriminative features play a crucial role in improving model performance. Typically, models utilize features extracted from a collection of protein sequences to fully leverage evolutionary information. The commonly used methods involve PSI-BLAST 12 and HHblits 13, which produce a multiple sequence alignment (MSA) summarized as a position-specific scoring matrix (PSSM). Extensive studies show that evolutionary information leads to significant improvement in DNA-binding prediction tasks 14,15. The secondary structure information of the given protein can also be used as an initial feature, which can be generated by DSSP 16 from the protein structure or by PSIPRED 17 from the protein sequence. A number of models have been developed for this task and can be roughly divided into sequence-based and structure-based models. Sequence-based models extract features from protein sequences alone, while structure-based models use features of protein crystal structures. BindN 18 used several amino acid properties as sequence features and applied a support vector machine (SVM) model to classify the DNA-binding residues, and BindN+ 14 improved the model performance by adding the PSSM feature.
Currently, advanced predictors are focused on deep learning methods, with DeepDISE 19 and DBPred 15 using a convolutional neural network (CNN) as the classifier, EL_LSTM 20 applying a recurrent neural network (RNN) as the backbone network, and ProNA2020 21 using a multi-layer perceptron (MLP). A few models start with predicted protein structures or experimentally solved structures. NucBind 22 predicted protein structures by template-based models and then used an SVM to complete the downstream prediction. GraphBind integrated sequence-based and structure-based features, employing a graph neural network (GNN) as the classifier.
Protein structures contain all the necessary information derived from the protein sequence. Hence, in general, structure-based models demonstrate better performance than sequence-based models. However, to ensure model performance, structure-based models require accurate protein structures as input. Consequently, the prediction of DNA-binding sites based on protein sequences remains an important and pressing research problem. Currently, the performance of existing sequence-based models is still unsatisfactory for practical application, and the feature extraction process often relies on manual design, which fails to generate a refined initial representation 23. As a result, there is a pressing need to develop an end-to-end model that without using handcrafted

The model architecture of CLAPE
The existing models for identifying protein-DNA binding sites can be divided into two categories. The first category combines handcrafted features and classification models (Figure 1a). Handcrafted features may include amino acid physicochemical properties and protein structural information, while the models may include machine learning models such as support vector machines and random forests. The second category aims to predict DNA-binding sites in an end-to-end fashion (Figure 1a) and often employs large-scale deep learning models. One may either train a classification model from scratch or fine-tune a pre-trained protein language model, such as ProtBert, with a simple downstream neural network such as linear layers. However, the first approach typically necessitates laborious manual feature extraction, while the second approach demands substantial computational resources and training time.
We took advantage of both approaches to propose CLAPE, a protein-ligand binding sites prediction framework that generates the binding probabilities for each residue of a given protein sequence. The overall architecture of CLAPE is depicted in Figure 1b. CLAPE is a highly flexible framework, allowing for the customization of each basic component. In the loss computation module, one may employ different contrastive loss functions, such as that proposed in DrLIM 26, triplet loss 27, or lifted structure loss 28.
While our experiments indicated that 1DCNN was the most effective backbone network for CLAPE, other backbone models, such as MLP (multi-layer perceptron) and RNN (recurrent neural network), were also suitable for use with CLAPE.
The pre-trained model was used as a feature extractor to avoid tedious manual feature extraction procedures. However, researchers may choose to fine-tune the pre-trained model, which has been shown to produce better performance but requires more computation and training time 29, resembling the training scheme described in Figure 1a. Furthermore, multiple pre-trained protein language models can be applied in the sequence embedding module 30.

CLAPE-DB accurately predicted the DNA-binding sites with a better generalization ability
We evaluated the performance of the proposed CLAPE-DB (CLAPE DNA-binding) model on two protein-DNA datasets, as described in Table 1. To assess the performance of CLAPE-DB, we conducted experiments on both Dataset1 and Dataset2 using the independent testing sets TE46 and TE129, respectively. We compared the results with existing DNA-binding sites prediction tools based on protein sequence input. CLAPE-DB outperformed other methods on both datasets (Table 2 and Table 3). Specifically, on TE46, CLAPE-DB trained on TR646 outperformed the second-best model, DBPred 15, by a large margin, achieving a specificity of 0.835, a recall of 0.747, a precision of 0.306, an F1-score of 0.434, an MCC of 0.401, and an AUC of 0.871 (Table 2), corresponding to improvements over DBPred of 6.5%, 5.5%, 25.9%, 19.9%, 25.3%, and 9.6%, respectively. Notably, DBPred used a manual feature extraction process and a CNN model similar to that of CLAPE-DB, highlighting the advantages of using pre-trained models over handcrafted features.
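The threshold-dependent metrics reported above can all be derived from a confusion matrix over per-residue labels. The following is a minimal sketch of those standard definitions, not the evaluation script used in this study; the function name `binary_metrics` and the toy labels are illustrative assumptions.

```python
import math

def binary_metrics(y_true, y_pred):
    """Compute specificity, recall, precision, F1 and MCC for 0/1 labels.

    Hypothetical helper for illustration; 1 marks a binding residue.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    specificity = ratio(tn, tn + fp)
    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * recall, precision + recall)
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "specificity": specificity,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }

# Toy example: 2 binding residues among 10, imitating class imbalance.
toy_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_result = binary_metrics(toy_true, toy_pred)
```

As a consistency check on the reported values, inserting the precision (0.306) and recall (0.747) into the F1 definition gives approximately 0.434, in agreement with the F1-score stated above.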
Moreover, we trained and evaluated CLAPE-DB on Dataset2 and compared the performance with other existing tools (Table 3). CLAPE-DB also achieved a better predictive capability on this dataset, particularly in the recall metric, indicating the ability to accurately identify true DNA-binding sites. Furthermore, our proposed model demonstrated a significant improvement in the MCC metric over the previous models, suggesting a superior ability to handle imbalanced data. Dataset2 was proposed as a benchmark dataset for structure-based models, and we compared the metrics of several structure-based models (Supplementary Table 1). Although CLAPE-DB did not incorporate any structure information, it outperformed structure-based models such as COACH-D, NucBind, and DNAbind. Notably, the GraphBind model using predicted protein structures exhibited poor performance, with an AUC of 0.816, lower than that of CLAPE-DB. The results suggested that structure-based models required accurate protein structures to achieve acceptable prediction results. Moreover, compared to the structure-based models, CLAPE-DB used only a pre-trained language model and a simple backbone network to process the data, which reduced the model complexity and enhanced the inference speed while maintaining accuracy. Our results also implied that pre-trained language models could capture some structural information from sequence inputs alone.
To test the generalization ability of CLAPE-DB, we trained a model on Dataset1 and tested it on Dataset2 (Figure 2a-b), as the protein sequence embeddings of Dataset1 and Dataset2 should have similar data distributions. DBPred was also tested for comparison.
CLAPE-DB surpassed DBPred by a large margin: the AUC and AUPR of CLAPE-DB were 0.865 and 0.394, respectively, whereas those of the DBPred standalone version were 0.526 and 0.068, only slightly better than random guessing. Moreover, the result of CLAPE-DB was only slightly lower than that of CLAPE-DB trained on TR573 (0.871 and 0.881 respectively), showing that our proposed model had a superior generalization ability compared to the second-best sequence-based model, DBPred. To further examine the generalization ability of CLAPE-DB, we selected the dataset TE181 (Supplementary Table 2) created by Yuan et al. 31 and tested the performance of CLAPE-DB trained on TR573. CLAPE-DB still showed better performance than other sequence-based models and most structure-based models (Supplementary Table 3, Supplementary Table 4). Notably, COACH-D and NucBind did not perform well on the TE181 dataset, showing that their performance was restricted by the limited accuracy of protein structure prediction based on homology modeling and molecular docking.

Backbone network comparison and feature visualization of CLAPE-DB
We conducted experiments to compare the performance of different mainstream backbone networks, including MLP, RNN, and CNN, in predicting DNA-binding sites, and we used an LSTM model to represent the RNN model. The 1DCNN model achieved the best performance among the three commonly used models (Figure 2c). Although RNNs were specifically designed for sequence modeling tasks, our finding suggested that the CNN was more suitable for predicting DNA-binding sites. This might be because RNN models process sequential data from left to right, whereas DNA-binding residues are predominantly determined by spatial structures rather than simple sequential order.
While CNN models the protein sequences using sliding windows, which incorporate relative positional information of amino acids inherently, amino acids are treated as independent tokens in RNN models 32 .
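The sliding-window behavior of a 1D convolution can be illustrated with a small sketch. This is a pure-Python toy with a single output channel, zero padding, and an odd kernel size, all of which are assumptions made for illustration; it is not the CLAPE-DB backbone, whose layer sizes are not restated here.

```python
def conv1d_same(features, kernel, bias=0.0):
    """Slide a window over per-residue feature vectors.

    features: list of L vectors (one per residue), each of dimension d.
    kernel:   list of k weight vectors of dimension d, with k odd.
    Returns L scalars, one per residue, so the output stays aligned
    with the input sequence (zero padding at both ends).
    """
    k = len(kernel)
    half = k // 2
    dim = len(features[0])
    pad = [[0.0] * dim for _ in range(half)]
    padded = pad + features + pad
    outputs = []
    for i in range(len(features)):
        window = padded[i:i + k]  # residue i plus its sequence neighbors
        value = bias
        for weights, vector in zip(kernel, window):
            value += sum(w * x for w, x in zip(weights, vector))
        outputs.append(value)
    return outputs

# Toy sequence of four residues with one feature each and a window of three.
toy_features = [[1.0], [2.0], [3.0], [4.0]]
toy_kernel = [[1.0], [1.0], [1.0]]
toy_output = conv1d_same(toy_features, toy_kernel)
```

Each output position depends only on a residue and its immediate neighbors in the sequence, which is how the window encodes relative position without an explicit positional input.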
To visualize the embedding space, we compared the embeddings generated by the CLAPE-DB model and by an untrained, randomly initialized 1DCNN model, using the t-SNE (t-distributed Stochastic Neighbor Embedding) dimension reduction method. Our results showed that CLAPE-DB learned a discriminative embedding space, while the data points were randomly distributed in the space after being processed by the untrained model (Figure 2d-e). Moreover, CLAPE-DB was able to effectively distinguish the DNA-binding and non-binding samples in the embedding space of each layer, with the distinction becoming more pronounced as the convolutional layer approached the output layer (Supplementary Figure 1). Additionally, we plotted the dimension reduction result of the raw features generated by ProtBert, which showed that the raw features were not well separated before model processing. Our results showed that CLAPE-DB was effective at distinguishing data samples from different classes (Figure 2f).

Contrastive learning improved the model performance
In the loss computation module, CLAPE-DB utilized a combination of triplet center loss (TCL) 34 and class-balanced focal loss 35,36. To analyze the effectiveness of the loss functions, we performed ablation studies. TCL and focal loss generated discriminative embeddings in high-dimensional space, and both loss functions led to better performance than the commonly used cross-entropy loss (Table 4). Furthermore, the AUC improvement gained by adding TCL to the focal loss was slightly smaller than that gained by adding it to the cross-entropy loss (0.006 and 0.012, respectively), which might be attributed to the focal loss already achieving better performance in modeling imbalanced datasets. The improvement in the AUPR value indicated that class-balanced focal loss and contrastive learning methods showed a better ability to cope with imbalanced datasets.
We also visualized the 1024-dimensional embeddings generated by the first layer when using focal loss only and when jointly using focal loss and TCL. As expected, although the embeddings of DNA-binding sites and non-binding sites were separated to a certain extent with focal loss alone, the embeddings generated with the joint loss functions showed a single clustering center, and the positive and negative samples were more discriminative (Supplementary Figure 2).
The single and uniform cluster center could benefit the classification performance according to the previous studies 34,37 .
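To make the role of the class centers concrete, the following sketch computes a triplet center loss for a small batch: each embedding is pulled toward the center of its own class and pushed at least a margin further from the nearest center of any other class. The use of plain Euclidean distance, the averaging over the batch, and the fixed (non-learned) centers are simplifying assumptions for illustration, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_center_loss(embeddings, labels, centers, margin):
    """Mean hinge loss over a batch.

    For each sample: max(0, d(own center) + margin - d(nearest other center)).
    centers maps a class label to its center vector; in training the
    centers would be learnable parameters, here they are fixed.
    """
    total = 0.0
    for emb, label in zip(embeddings, labels):
        d_own = euclidean(emb, centers[label])
        d_other = min(
            euclidean(emb, center)
            for cls, center in centers.items()
            if cls != label
        )
        total += max(0.0, d_own + margin - d_other)
    return total / len(embeddings)

# Two class centers 10 units apart; both toy samples belong to class 0.
toy_centers = {0: [0.0, 0.0], 1: [10.0, 0.0]}
toy_embeddings = [[1.0, 0.0], [4.0, 0.0]]
toy_labels = [0, 0]
toy_loss = triplet_center_loss(toy_embeddings, toy_labels, toy_centers, margin=5.0)
```

In this toy batch the first sample already satisfies the margin and contributes nothing, while the second sample lies too close to the other center and is penalized, which is the mechanism that encourages compact, well-separated clusters.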

Parameter impact of loss functions
The hyperparameters of TCL and the class-balanced focal loss affect model training and inference; therefore, we analyzed and tuned the hyperparameters of the loss functions. In the class-balanced focal loss, we used the effective number 36 to reweight samples from different classes and then adjusted the hyperparameter γ. As γ is an exponent, an increase in its value leads to a simultaneous reduction in the loss values of both hard and easy samples. Furthermore, this reduction would be more obvious in the case of hard samples 35. We tested values of γ from 1 to 10 and observed that the AUC and AUPR values remained relatively stable within a specific range, but with a further increase of γ, both metrics displayed a significant decline (Figure 3a). To verify our findings, we conducted further tests with γ values of 0.5 and 20. Finally, we adopted a γ value of 5. It is worth noting that as the total loss value decreased, a lower learning rate should be specified to ensure convergence.
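The interplay of the effective-number reweighting and the exponent γ can be sketched as follows. The formulas follow the published definitions of focal loss and the effective number of samples; the value of β, the class counts, and the function names are assumptions for illustration, since they are not stated in this section.

```python
import math

def class_balanced_weight(n_samples, beta):
    """Inverse of the effective number (1 - beta**n) / (1 - beta)."""
    return (1.0 - beta) / (1.0 - beta ** n_samples)

def focal_term(p_true, gamma):
    """Focal loss for one sample, given the probability of its true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

def class_balanced_focal_loss(probs, labels, class_counts, beta=0.999, gamma=5.0):
    """Mean class-balanced focal loss for binary labels.

    probs:        predicted probability of the positive class per residue.
    class_counts: mapping {0: n_negative, 1: n_positive} in the training set.
    beta is an assumed value; gamma=5 is the value adopted in the text.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p_true = p if y == 1 else 1.0 - p
        weight = class_balanced_weight(class_counts[y], beta)
        total += weight * focal_term(p_true, gamma)
    return total / len(probs)

# Hard sample (p_true = 0.3) versus easy sample (p_true = 0.9).
hard_gamma1, hard_gamma5 = focal_term(0.3, 1.0), focal_term(0.3, 5.0)
easy_gamma1, easy_gamma5 = focal_term(0.9, 1.0), focal_term(0.9, 5.0)
toy_loss = class_balanced_focal_loss(
    probs=[0.3, 0.1], labels=[1, 0], class_counts={0: 10000, 1: 100}
)
```

Raising γ lowers the loss of both samples. The absolute drop is larger for the hard sample, whereas the easy sample shrinks by a larger factor, so the overall loss scale decreases, which is in line with the remark that a lower learning rate may then be needed.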
During the optimization process of TCL, the cluster centers were randomly initialized, and we tested the model performance by adjusting the center learning rate and the margin (m). Previous studies suggested that the learning rate for optimizing the cluster centers should be relatively large 37. However, we found that the AUC value was the highest when the learning rate was set to a relatively small value of 0.01 (Figure 3b). Margin was another crucial hyperparameter in TCL. If the margin was too large, the model might fail to recognize subtle differences between samples from different classes, and the convergence time might be prolonged. On the other hand, the loss of many samples would be 0 when the margin was too small. We therefore visualized the distance distribution of TR646 to guide our choice of the parameter m. The distances from negative to positive and positive to negative were distributed from 7 to 12 (Figure 3c). Thus, we adjusted the margin value based on the distribution plot. We found that the AUC was maximized when the margin was set to 9, which was consistent with our expectations (Figure 3d). It is worth noting that the appropriate margin value was largely influenced by the data. Several attempts were made to adjust the margin according to the data distribution. For instance, Zhao et al 38

CLAPE-DB captured the properties distribution of amino acids
Based on previous studies, it is widely acknowledged that protein-DNA binding preferences are reflected in the sequences and structures of proteins and DNA 40. For instance, proteins can bind DNA molecules via hydrogen bonds and hydrophobic interactions. Such biological phenomena are related to the amino acid composition and properties of proteins. Therefore, in this study, we evaluated the predictive ability of CLAPE-DB using the TE129 dataset.
To this end, we performed a statistical analysis of the amino acid composition of DNA-binding sites and non-binding sites. Lysine, arginine, and tyrosine were the predominant amino acid types in the DNA-binding sites, while alanine and leucine were the primary amino acid types in the non-binding sites (Figure 4a). Furthermore, we compared the amino acid type distributions of the predicted results and the ground truth (Figure 4b) and used the Kullback-Leibler (KL) divergence to measure the distance between the discrete distributions. The shapes of the predicted and ground-truth distributions were quite similar, and the forward and reverse KL divergences were 0.024 and 0.028, respectively, which were close to 0, indicating that the two distributions were similar. Our results demonstrated that CLAPE-DB could accurately capture the amino acid composition features of DNA-binding residues.
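The forward and reverse KL divergences between two discrete amino acid distributions can be computed as in the sketch below. The small smoothing constant, the use of the natural logarithm, and the toy counts are assumptions for illustration; the actual residue counts are those summarized in Figure 4b and are not reproduced here.

```python
import math

def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    eps is an assumed smoothing constant that avoids division by zero
    when q assigns zero probability to a category that p uses.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Toy counts over three residue types (hypothetical numbers).
true_dist = normalize([30, 50, 20])
pred_dist = normalize([28, 49, 23])
forward_kl = kl_divergence(true_dist, pred_dist)
reverse_kl = kl_divergence(pred_dist, true_dist)
```

Because the KL divergence is asymmetric, reporting both directions, as done above, guards against a small value in one direction hiding a mismatch in the other.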
Additionally, we analyzed the physicochemical properties of amino acids by extracting features from protein sequence and structure, and subsequently tested several selected properties, such as hydrophobicity, charge, secondary structure, and solvent accessibility. The t-SNE dimension reduction visualization revealed that different types of amino acid physicochemical properties were segregated into various clusters (Supplementary Figure 3a-d). Our results illustrated that the large-scale pre-trained protein language model ProtBert was capable of effectively learning the properties of amino acids. Such models were identified as appropriate feature extractors to replace handcrafted descriptors, which is consistent with previous studies 23.
Moreover, CLAPE-DB proved successful in predicting not only the distribution of amino acids but also their properties. The residues predicted by CLAPE-DB showed a similar distribution of the different property types; for example, the majority of amino acids were polar and positively charged. Furthermore, the binding sites predicted by CLAPE-DB exhibited a composition of the different properties similar to that of the real DNA-binding sites (Figure 4c-f). Taken together, CLAPE-DB accurately captured amino acid information analogous to that of real binding sites.

Comparative and empirical case study
To intuitively visualize and compare the prediction performance of CLAPE-DB on DNA-binding residues, we selected two protein structures for illustration purposes: a multiple antibiotic resistance regulator (MarR) family protein (PDB ID: 5H3R, chain A, denoted as 5H3R_A) and the transcription repressor protein CouR (PDB ID: 6C2S, chain A, denoted as 6C2S_A). CLAPE-DB made an accurate prediction of DNA-binding sites, while DBPred only captured a limited number of true positive sites, highlighting the superior prediction ability of CLAPE-DB. In addition, the majority of false positive sites were located in close proximity to binding sites (Figure 5a-f). Our results suggested that CLAPE-DB effectively learned the properties of spatially adjacent amino acids and the structural information without relying on protein structures.
DNA molecules are negatively charged and tend to bind the positively charged regions of proteins. The structure of the protein-DNA binding area can be divided into several domains with specific patterns 41. Empirical observations and computational properties can be utilized to infer the DNA-binding sites from the protein structure.
However, such methods have significant limitations. Firstly, some proteins, such as intrinsically disordered proteins (IDPs), are unstructured when not bound by ligands like DNA 42. Therefore, the DNA-binding sites cannot be inferred from the structure of such proteins. Secondly, the probable DNA-binding sites inferred from the surface charge distribution and protein structure are often quite different from the real binding sites. To illustrate the limitations of empirical analysis, we selected two protein structures: the transcription regulatory protein FadR (PDB ID: 5GPC, chain A, denoted as 5GPC_A) and the bacterial quorum-sensing repressor protein RsaL (PDB ID: 5J2Y, chain A, denoted as 5J2Y_A). In both protein structures, multiple possible binding sites were identified based on the charge distribution (Figure 5g and Figure 5j), and it was difficult to determine which part of the protein would bind the major or minor groove of DNA.
However, CLAPE-DB precisely distinguished the binding sites, and the false positive sites were not influenced by the other positively charged locations (Figure 5h-i and Figure 5k-l). It should be noted that empirical binding site identification relied on experimental structures, which limited its use when protein structures were lacking or inaccurately predicted. Furthermore, some DNA-binding proteins, such as transcription activator-like effector nuclease (TALEN), are not typical in common empirical analyses. Therefore, precise DNA-binding site prediction using CLAPE-DB is necessary instead of relying on empirical inference.

CLAPE was a general ligand-binding sites prediction framework
CLAPE could serve as a general framework for predicting other ligand-binding sites, including protein-RNA and antibody-antigen binding sites (Figure 6a-b). To evaluate the prediction ability of CLAPE for these types of binding sites, we collected benchmark datasets of protein-RNA and antibody-antigen binding sites (Supplementary Table 5) and trained CLAPE on these datasets. The resulting models were denoted as CLAPE-RB (CLAPE RNA-binding) and CLAPE-AB (CLAPE-Antibody). Both CLAPE-RB and CLAPE-AB performed well on the testing sets, with CLAPE-AB achieving an AUC of 0.920 (Supplementary Table 6), which was relatively high, so the model could be applied to accurately predict the paratope of a given antibody sequence. It should be noted that the prediction capability of antigen-agnostic paratope prediction was limited and could be improved by adding epitope information 43. The prediction of RNA-binding sites was complicated by the flexibility of the RNA structure, and the metrics of CLAPE-RB were relatively low compared to those of CLAPE-AB. Nevertheless, the AUC of CLAPE-RB evaluated on TE161 was 0.830 (Supplementary Table 6), which surpassed the existing sequence-based RNA-binding sites models 44,45. We also plotted the ROC and AUC curves to visualize the overall model performance of CLAPE-RB and CLAPE-AB (Figure 6c-d).
To further evaluate the performance of our model, we trained CLAPE-RB on a separate protein-RNA dataset comprising TR495 and TE117, which were widely used benchmarks for structure-based models. CLAPE-RB outperformed existing sequence-based models in predicting RNA-binding sites on TE117. While the performance of CLAPE-RB was marginally lower than that of the structure-based model GraphBind, it performed better than Nucleic, a CNN model predicting RNA-binding sites based on grids of the protein surface (Supplementary Table 8). Similarly, CLAPE-RB outperformed GraphBind based on inaccurately predicted protein structures, which highlighted the potential of CLAPE to overcome the limitations of structure-based models. Our results indicated that CLAPE was a versatile framework that could predict ligand-binding sites of a given protein sequence for a range of ligands. Furthermore, our experiments demonstrated that CLAPE, based on a large-scale pre-trained language model, was an effective predictor of ligand-binding sites, even in the absence of structural information, achieving relatively high performance.

CLAPE-DB exclusively predicted DNA-binding sites
Previous studies demonstrated that different ligands tend to bind to different sites on proteins 46, implying that a CLAPE model trained on a specific ligand, such as DNA, may perform worse on other types of ligands. To validate this hypothesis, we evaluated the performance of CLAPE-DB on RNA-binding sites prediction (using TE117) and antibody-antigen binding sites prediction (using TE259) tasks.
The various metrics of CLAPE-DB on DNA-binding sites, including precision, recall, F1-score, and MCC, were much higher than those on other ligand-binding sites (Supplementary Figure 4a), indicating that CLAPE-DB was a specific predictor for DNA-binding sites. Notably, CLAPE-DB showed a poor ability to predict paratopes, which was consistent with the unique characteristics of antibody protein sequences. Furthermore, according to the predictive results (Supplementary Table 6), CLAPE-AB achieved an AUC value greater than 0.9, providing further evidence that CLAPE-DB learned discriminative features of DNA-binding protein sequences and that CLAPE was a general predictor for various ligand-binding sites.
We also plotted the ROC curve of the CLAPE-DB model on different ligands (Supplementary Figure 4b), and the AUC value of CLAPE-DB on DNA was still higher than on other binding sites. Notably, the AUC of CLAPE-DB on RNA reached 0.775, which was comparable to the performance of existing models trained on RNA datasets, such as RNABindPlus, SVMnuc, and GraphBind based on predicted protein structures.
Our results suggested that RNA and DNA had similarities in terms of their protein binding patterns, given their nucleic-acid nature. However, RNA molecules are typically single-stranded and have more complicated conformations, which increases the difficulty of prediction.

Discussion
Protein-DNA binding plays an important role in many life activities, and studies on the binding properties contribute to the understanding of genome transcription and regulation. Accurate identification of DNA-binding sites of proteins is a crucial step in modeling protein-DNA interactions. Various models have been developed using machine learning and deep learning techniques to identify DNA-binding sites from protein sequence or structure 15,47. However, current tools rely on tedious manual feature extraction, which is time-consuming and redundant. Additionally, the accuracy of sequence-based models still needs to be increased, and the performance of structure-based models is largely affected by the accuracy of the protein structure, restricting their widespread application. Given these limitations, it is imperative to develop a satisfactory sequence-based model that utilizes protein sequence information alone to predict DNA-binding sites. To address the existing challenges and improve the performance of sequence-based models, we proposed CLAPE, a deep learning framework that combines a large-scale pre-trained protein language model and a contrastive learning technique to accurately predict DNA-binding sites of a given protein sequence. We performed multiple experiments to evaluate the performance and effectiveness of our proposed model.
In this study, we presented the overall architecture of CLAPE, which comprised three main components. Firstly, we utilized a pre-trained model, ProtBert, without fine-tuning, to conduct feature extraction. Secondly, we employed a 1DCNN to process the sequence features and generate the classification score. Finally, we jointly optimized a class-balanced focal loss and a contrastive triplet center loss to address the issue of imbalanced data, which resulted in a more discriminative embedding space with a single cluster center.
The proposed CLAPE-DB model for predicting DNA-binding sites demonstrated superior performance compared to existing sequence-based models on two benchmark datasets, as indicated by all metrics, with AUCs of 0.871 and 0.881, respectively. Furthermore, in cases where accurate protein crystal structures were unavailable, CLAPE-DB outperformed structure-based models by a large margin. Additionally, we evaluated the generalization ability of the CLAPE-DB model on independent datasets and found that CLAPE-DB exhibited better generalization performance than the second-best model, DBPred. These results suggested that CLAPE-DB effectively learned the underlying latent distribution of DNA-binding sites.
To mitigate the effects of imbalanced data, we implemented the class-balanced focal loss in our proposed CLAPE model.There were several augmentation approaches from the aspect of the dataset, such as SMOTE 48 (synthetic minority over-sampling technique) to interpolate new data in the embedding space.The sequence and structure alignment methods could be used to transfer the annotation of the known DNA-binding sites of the matched proteins.Furthermore, incorporating the newly solved protein-DNA complexes into the dataset could enhance the prediction performance and generalization ability of the model.Additionally, various contrastive loss functions, such as the lifted structure loss and N-pair loss, could be employed for these tasks.The lifted structure loss considered all negative samples of the batch in a single optimization procedure, while N-pair loss included pairs from all classes as negative samples.CLAPE-DB showed a more discriminative embedding space via the visualization of dimension reduction of the hidden layers.In addition, the feature generated by ProtBert could capture the amino acid physicochemical properties and distributions.Our study demonstrated that a large-scale pre-trained protein language model could extract protein sequence features effectively, eliminating the need for designing handcrafted features.In this study, we only evaluated the ProtBert as the feature extractor, but other pre-trained protein models such as RITA 49 and ESM-2 50 , as reviewed in detail by Hu et al 30 .could also be used for feature generation.Here, we tested the performance of CLAPE-DB applying a larger protein language model ESM-2 as the feature extractor, which contained more parameters than ProtBert, and the model performance of CLAPE-DB was improved using ESM-2 which was consistent with our expectation (Supplementary Table 9).
In the computer vision and natural language processing fields, there is a trend toward utilizing a unified large model for addressing multiple downstream tasks 51, namely artificial general intelligence (AGI), which could also be applied to generate embeddings of protein sequences. Although CLAPE-DB was designed for sequence-based prediction tasks, it is possible to use the features generated by the pre-trained model in a structure-based model, as demonstrated in related studies 52.
Based on the results of our experiments, we conclude that CLAPE is a general prediction framework for identifying ligand-binding sites of a given protein sequence.
In addition, CLAPE-RB and CLAPE-AB demonstrated satisfactory performance on their respective datasets. Moreover, we showed that CLAPE-DB could exclusively predict the DNA-binding sites, which was not generally influenced by the information from other ligand-binding sites. The simple 1DCNN model used in all CLAPE series models effectively captured the neighboring information of the targeted residue. Given the flexibility of CLAPE, various backbone models could be applied, such as attention-based models for modeling the long-range relationship of residues 53.
Overall, the deep learning model CLAPE proposed in our study achieved high performance in predicting both DNA- and ligand-binding sites by combining pre-trained models with contrastive learning methods. This promising and general framework can be applied in future studies to facilitate protein function annotation, protein engineering, and drug discovery.

Datasets description
In this study, we evaluated and compared the performance of our proposed model, CLAPE, with existing classifiers using two widely used benchmark datasets, denoted as Dataset1 and Dataset2. The training and testing datasets were denoted as TR and TE, respectively. Both datasets were preprocessed by similar procedures to improve the robustness of models and avoid bias due to the imbalanced data distribution, such as reducing the sequence similarity using a cutoff of 30% with CD-HIT 54. The binding sites were defined similarly in both datasets: a residue was labeled as binding if the distance between its nearest atom and the nearest atom of the nucleic acid molecule was less than 0.5 Å plus the sum of the van der Waals radii of those two atoms. Table 1 provides a summary of the benchmark datasets, and the details of both datasets are described below.
Dataset1 was introduced by the study of the DBPred model, a sequence-based deep learning method for predicting DNA-binding residues 15 .The dataset was collected from hybridNAP 55 and ProNA2020 21 and was composed of 646 proteins as the training set (TR646) with 15636 DNA-binding sites and 298503 non-binding sites, and 46 proteins as the testing set (TE46) with 956 DNA-binding sites and 9911 non-binding sites.
Dataset2 was originally proposed by the study of GraphBind, a structure-based graph neural network (GNN) model for identifying nucleic-acid-binding residues 47. This dataset consisted of protein-DNA complex structural data extracted from the BioLiP database 56, with 573 proteins as a training set (TR573) with 14479 DNA-binding residues and 145404 non-binding residues, and 129 proteins as a testing set (TE129) with 2240 DNA-binding residues and 35275 non-binding residues. GraphBind employed a data augmentation approach on the training set to alleviate the impact of the data imbalance issue; hence, we used the same augmented data annotations as GraphBind.
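The degree of class imbalance follows directly from the residue counts quoted for the four splits. The short sketch below only restates that arithmetic; the dictionary layout and the function name `positive_fraction` are illustrative choices, not part of the released code.

```python
# Residue counts quoted in the text: (binding, non-binding) per split.
DATASETS = {
    "TR646": (15636, 298503),
    "TE46": (956, 9911),
    "TR573": (14479, 145404),
    "TE129": (2240, 35275),
}


def positive_fraction(n_binding: int, n_non_binding: int) -> float:
    """Fraction of residues annotated as DNA-binding."""
    return n_binding / (n_binding + n_non_binding)


if __name__ == "__main__":
    for name, (n_pos, n_neg) in DATASETS.items():
        frac = positive_fraction(n_pos, n_neg)
        print(f"{name}: {frac:.2%} binding, {n_neg / n_pos:.1f} non-binding per binding residue")
```

In every split the binding class stays below roughly one residue in ten, which is the imbalance that the class-balanced focal loss is meant to address.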
To assess the prediction capability of our proposed model CLAPE on diverse ligand-binding sites, we gathered three different datasets comprising protein-RNA and antibody-antigen interactions. The protein-RNA datasets were created by Xia et al., based on the GraphBind model, and by Patiyal et al., based on the pprint2 model 45. The antibody-antigen dataset was collected from the SAbDab database 57. To ensure a fair comparison with existing models, we applied the same data preprocessing procedure as used for defining DNA-binding sites.
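The distance criterion used to define binding sites in this section can be sketched as follows. This is a minimal illustration rather than the preprocessing code of either benchmark: the atom representation (element symbol plus coordinates in Å), the function name `is_binding_residue`, and the use of Bondi van der Waals radii are all assumptions.

```python
import math

# Bondi van der Waals radii in angstroms (assumed here; the benchmarks may use another table).
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "P": 1.80, "S": 1.80}


def is_binding_residue(residue_atoms, ligand_atoms, tolerance=0.5):
    """Return True if the closest residue/ligand atom pair lies within
    tolerance + r_vdw(atom1) + r_vdw(atom2).

    Each atom is a tuple (element, (x, y, z)) with coordinates in angstroms.
    """
    closest = None  # (distance, residue element, ligand element)
    for elem_r, xyz_r in residue_atoms:
        for elem_l, xyz_l in ligand_atoms:
            d = math.dist(xyz_r, xyz_l)
            if closest is None or d < closest[0]:
                closest = (d, elem_r, elem_l)
    if closest is None:  # no atoms supplied
        return False
    d, elem_r, elem_l = closest
    return d < tolerance + VDW_RADII[elem_r] + VDW_RADII[elem_l]


if __name__ == "__main__":
    lysine_nz = [("N", (0.0, 0.0, 0.0))]
    phosphate_o = [("O", (3.5, 0.0, 0.0))]
    print(is_binding_residue(lysine_nz, phosphate_o))
```

Looping over all atom pairs is quadratic and only suitable for illustration; a spatial index would be the usual choice for whole complexes.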

Protein sequence embedding
The protein sequences were first input into ProtBert 25, a pre-trained model, to generate high-dimensional embeddings. ProtBert is a member of the ProtTrans family of pre-trained models and is based on the BERT architecture. The ProtTrans models were trained on large-scale protein sequences and have been commonly used for predicting protein structure and properties. BERT employed a masked language modeling strategy to train a Transformer encoder 58,59, which could effectively embed target tokens with contextual information. This approach is often used in token-level NLP tasks such as named entity recognition (NER). Since the task of predicting DNA-binding residues in protein sequences was also a token-level classification task, we utilized ProtBert, which has 400 million parameters, as a feature extractor. The dimension of the protein embedding generated by ProtBert was 1024. It is important to note that ProtBert was not fine-tuned during subsequent training steps, and the sequence embedding process was performed using HuggingFace's Transformers Python package 60.
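The embedding step described above might look roughly like the sketch below. The text does not name a checkpoint, so the Hugging Face identifier `Rostlab/prot_bert` is an assumption, as are the helper names; mapping the rare residues U, Z, O and B to X and separating residues with spaces follows the convention documented for that checkpoint. The heavy imports are deferred so that the formatting helper can be used on its own.

```python
import re


def prepare_sequence(seq: str) -> str:
    """Format a raw amino-acid string for a ProtBert-style tokenizer:
    upper-case, rare residues mapped to X, residues separated by spaces."""
    cleaned = re.sub(r"[UZOB]", "X", seq.upper())
    return " ".join(cleaned)


def embed_residues(seq: str, model_name: str = "Rostlab/prot_bert"):
    """Return a [len(seq), 1024] tensor of frozen per-residue embeddings.

    Requires torch and transformers and downloads the checkpoint on first use.
    """
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=False)
    model = BertModel.from_pretrained(model_name).eval()
    inputs = tokenizer(prepare_sequence(seq), return_tensors="pt")
    with torch.no_grad():  # the language model is used as a fixed feature extractor
        hidden = model(**inputs).last_hidden_state
    return hidden[0, 1:-1]  # drop the [CLS] and [SEP] positions


if __name__ == "__main__":
    print(prepare_sequence("MKTAYIAKQR"))
```

Running the model under `torch.no_grad()` mirrors the statement that ProtBert was not fine-tuned during the subsequent training steps.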

Backbone 1DCNN model and classification head
We utilized a 1DCNN (one-dimensional convolutional neural network) as our backbone model to obtain a residue-level classification score. The convolutional neural network used a convolutional kernel to capture neighboring information and used operations like max pooling for down-sampling. In a single step of the 1DCNN operation, the shapes of the input and output features were $[B, L, D_{l-1}]$ and $[B, L, D_l]$, respectively, where $B$ stood for the batch size of the input data, $L$ was the maximum length of the protein sequence, and $D_{l-1}$ and $D_l$ were the dimensions of the previous layer and the current layer, respectively. To maintain the same length of input and output protein sequence and obtain a unified token-level classification result, we applied padding for different convolutional kernel sizes. The stride of every layer was set to 1, and we utilized ReLU

Binary classification loss function
The classification loss function is crucial in neural network design as it measures the difference between predicted and true labels. Cross entropy is a commonly used loss function in binary classification tasks. However, protein-DNA binding data suffer from a data imbalance issue, as shown in Table 1; thus, we applied a class-balanced focal loss to address this problem.
The focal loss was introduced by Lin et al. 35 and places more emphasis on classes with fewer samples in the loss function. It also considers the difficulty of samples based on the classification probability provided by the classifier. Specifically, if the classification probability was high enough, the sample would be defined as an easy sample. The focal loss is formulated as follows:

$$\mathrm{FL}(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t) \tag{1}$$

where $p_t$ is the classification probability of a particular class, $1 - p_t$ is the modulator, and $\gamma$ is a hyperparameter to adjust the weight of hard and easy samples. In the original paper, $\alpha$ is also a parameter that weights minority and majority samples, which is influenced by $\gamma$. We applied the effective number proposed by Cui et al. 36 to reweight the focal loss. The basic hypothesis behind this idea is that, as the number of samples increases, the overlapping of embeddings leads to information redundancy; the effective number was therefore proposed to model the real space covered by all samples, and it can be used as a weight for imbalanced data. The class-balanced focal loss can be formulated as:

$$\mathrm{CB}_{\mathrm{focal}}(p_t) = -\frac{1}{E_n} (1 - p_t)^{\gamma} \log(p_t) \tag{2}$$

where $E_n = (1 - \beta^{n})/(1 - \beta)$ refers to the effective number of the class, with $n$ the number of samples in that class; we set $\beta$ to 0.999 in our study according to Cui et al. 36. The class-balanced focal loss was jointly optimized with contrastive loss, as described in the following parts.
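The class-balanced focal loss for a single residue can be written out in a few lines. This is a scalar sketch under stated assumptions, not the training implementation: the function names are illustrative, the loss is evaluated for one sample at a time, and the value of `gamma` used in the demonstration is arbitrary.

```python
import math


def effective_number(n: int, beta: float = 0.999) -> float:
    """Effective number of samples E_n = (1 - beta^n) / (1 - beta)."""
    return (1.0 - beta ** n) / (1.0 - beta)


def class_balanced_focal_loss(p_t: float, n_class: int, gamma: float, beta: float = 0.999) -> float:
    """Focal loss for one sample, reweighted by 1 / E_n of its true class.

    p_t is the predicted probability of the true class and must lie in (0, 1].
    """
    weight = 1.0 / effective_number(n_class, beta)
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # Illustrative values only: a confidently and a poorly classified residue.
    for p_t in (0.9, 0.1):
        print(p_t, class_balanced_focal_loss(p_t, n_class=500, gamma=2.0))
```

The modulator $(1 - p_t)^{\gamma}$ shrinks the contribution of easy samples, while $1/E_n$ gives more weight to classes with fewer samples. As $n$ grows, $E_n$ saturates at $1/(1 - \beta)$, so $\beta$ controls how large a class must be before additional samples stop changing its weight.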

Contrastive learning loss
Contrastive learning aims to identify an embedding space where similar samples are positioned close to each other, while dissimilar ones are far apart. Contrastive learning techniques have been extensively used in computer vision and natural language processing, and several models have shown promising results in representation learning, such as MoCo 62 and SimSiam 61. In our study, we applied a contrastive loss, namely triplet center loss (TCL) 34, which is a supervised approach that takes into account the labels of the training data. TCL is a combination of center loss 37 and triplet loss 27, which could be described as:

$$L_{c} = \frac{1}{2} \sum_{i} \lVert x_i - c_{y_i} \rVert_2^2$$

$$L_{tri} = \sum \max\left(D(x_a, x^{+}) + m - D(x_a, x^{-}),\ 0\right)$$
The backpropagation stops at the embedding generated by ProtBert, which means we did not fine-tune the pre-trained language model.
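The triplet center loss used in this section can be sketched in plain Python. The sketch assumes a plain Euclidean distance, as stated in the text, and fixed class centers passed in as a dictionary; in training the centers would normally be learnable parameters, and the names below are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_center_loss(embeddings, labels, centers, margin):
    """Sum over samples of max(D(x, c_own) + margin - min_other D(x, c_other), 0).

    embeddings: list of vectors; labels: class label per vector;
    centers: dict mapping class label -> center vector.
    """
    total = 0.0
    for x, y in zip(embeddings, labels):
        d_own = euclidean(x, centers[y])
        d_nearest_other = min(
            euclidean(x, c) for label, c in centers.items() if label != y
        )
        total += max(d_own + margin - d_nearest_other, 0.0)
    return total


if __name__ == "__main__":
    centers = {0: [0.0, 0.0], 1: [10.0, 0.0]}  # toy non-binding / binding centers
    samples = [[0.0, 0.0], [5.0, 0.0]]
    print(triplet_center_loss(samples, [0, 0], centers, margin=9.0))
```

A sample contributes nothing once it is closer to its own center than to the nearest other center by at least the margin, so only samples near the class boundary drive the update.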

Evaluation metrics
In this study, we employed several classification evaluation metrics to ensure consistency with the previous studies. The metrics included specificity (Spe), precision
features. Pre-training and contrastive learning are two widely used representation learning techniques. Pre-training utilizes the information of a large scale of unlabeled data to train the model in an unsupervised manner and transfers the model parameters to downstream tasks for fine-tuning or feature extraction 24. Contrastive learning aims to discover a representation space where samples from the same class are close to each other, while those from different classes are distant. In this study, we integrated pre-training and contrastive learning techniques to devise CLAPE (Contrastive Learning And Pre-trained Encoder), which enabled the prediction of ligand-binding sites of a protein sequence. Specifically, we trained CLAPE-DB on DNA-binding datasets and demonstrated that it surpassed current sequence-based models by learning a discriminative embedding space. Additionally, we illustrated that CLAPE could serve as a general framework for predicting ligand-binding sites exclusively based on protein sequence information, thereby improving the comprehension of the feature extraction process and the development of the model architecture for future research.
, which comprised three main modules: the sequence embedding module, the backbone network module, and the loss computation module. The sequence embedding module utilized ProtBert 25, a pre-trained protein language model, to encode protein sequences in FASTA format and generate features with a dimensionality of 1024. The features were then passed through the backbone network, which in CLAPE was a 4-layer 1DCNN. The backbone network module generated a 2-dimensional matrix. The loss computation module employed a contrastive loss function, guided by binary classification loss, to optimize the model parameters. Finally, the classification head utilized a Softmax function to transform the prediction scores of the backbone network into classification probabilities, as is commonly done in current approaches.
used the true distance to model the margin and Cheng et al 39 used a self-adaptive margin by a Gaussian prior distribution.

(rectified linear unit) as an activation function to introduce nonlinearity to the model. We applied dropout and batch normalization techniques to enhance the robustness and generalization ability of the model. Our CLAPE-DB model consisted of four 1DCNN layers as the backbone model. The raw dimension was 1024, and the output dimensions of the four layers were 1024, 128, 64, and 2, respectively. The classification head contained a Softmax function to scale the output values between 0 and 1 as mutually exclusive prediction scores, representing the classification probability of DNA-binding sites.
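The layer layout described here (four stride-1 convolutions with length-preserving padding, ReLU between layers, and a Softmax head) can be illustrated with a small dependency-free sketch. The kernel size of 3, the toy channel widths standing in for 1024, 1024, 128, 64 and 2, and all function names are assumptions made for illustration; dropout and batch normalization are omitted, and the weights are random rather than trained.

```python
import math
import random


def same_padding(kernel_size: int) -> int:
    """Zero padding per side that preserves sequence length at stride 1."""
    if kernel_size % 2 == 0:
        raise ValueError("an odd kernel size is needed for symmetric padding")
    return (kernel_size - 1) // 2


def conv1d_same(x, weight, bias):
    """x: [L][C_in], weight: [K][C_in][C_out], bias: [C_out] -> [L][C_out]."""
    k, c_in = len(weight), len(weight[0])
    pad = same_padding(k)
    zeros = [0.0] * c_in
    padded = [zeros] * pad + list(x) + [zeros] * pad
    out = []
    for pos in range(len(x)):
        row = list(bias)
        for dk in range(k):  # slide the kernel over neighboring residues
            for i, xi in enumerate(padded[pos + dk]):
                for o, w in enumerate(weight[dk][i]):
                    row[o] += xi * w
        out.append(row)
    return out


def relu(x):
    return [[max(v, 0.0) for v in row] for row in x]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(k, c_in, c_out, rng):
    weight = [[[rng.uniform(-0.1, 0.1) for _ in range(c_out)]
               for _ in range(c_in)] for _ in range(k)]
    return weight, [0.0] * c_out


def forward(x, layers):
    """Apply the convolution stack and return per-residue class probabilities."""
    h = x
    for idx, (weight, bias) in enumerate(layers):
        h = conv1d_same(h, weight, bias)
        if idx < len(layers) - 1:  # no ReLU before the Softmax head
            h = relu(h)
    return [softmax(row) for row in h]


rng = random.Random(0)
dims = [8, 8, 4, 4, 2]  # toy stand-in for 1024 -> 1024 -> 128 -> 64 -> 2
layers = [random_layer(3, dims[i], dims[i + 1], rng) for i in range(4)]
sequence = [[rng.gauss(0.0, 1.0) for _ in range(dims[0])] for _ in range(10)]
probs = forward(sequence, layers)
```

Because the padding is `(kernel_size - 1) // 2` at stride 1, `probs` holds one two-class probability vector per input residue, which is what a token-level classifier needs.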
Pre), recall (Rec), F1-score, and Matthews correlation coefficient (MCC). The metrics can be formulated as follows:

$$\mathrm{Spe} = \frac{TN}{TN + FP}, \quad \mathrm{Pre} = \frac{TP}{TP + FP}, \quad \mathrm{Rec} = \frac{TP}{TP + FN}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Pre} \times \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, FP, TN, and FN stand for true positive (number of residues that are correctly classified as DNA-binding sites), false positive (number of residues that are incorrectly classified as DNA-binding sites), true negative (number of residues that are correctly classified as non-binding sites) and false negative (number of residues that are incorrectly classified as non-binding sites), respectively. Specifically, specificity indicates the proportion of correctly predicted non-binding sites, precision measures the accuracy of residues predicted as DNA-binding sites, recall measures the proportion of DNA-binding residues successfully discovered by the model, and F1-score is the harmonic mean of precision and recall. MCC evaluates the prediction ability of the model on both the positive and negative classes and is commonly used for imbalanced data. In addition, we plotted the ROC (receiver operating characteristic) curve and the precision-recall curve to illustrate the overall performance of a model and used two threshold-agnostic metrics, AUC (area under ROC curve) and AUPR (area under PR curve), as numerical evaluations of both curves.
Cui, Y., Jia, M., Lin, T.-Y., Song, Y. & Belongie, S. J. Class-Balanced Loss Based on Effective Number of Samples. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9260-9269 (2019).
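The threshold-dependent metrics described in this section can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
import math


def _safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Specificity, precision, recall, F1-score and MCC from confusion counts."""
    spe = _safe_div(tn, tn + fp)
    pre = _safe_div(tp, tp + fp)
    rec = _safe_div(tp, tp + fn)
    f1 = _safe_div(2 * pre * rec, pre + rec)
    mcc = _safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"Spe": spe, "Pre": pre, "Rec": rec, "F1": f1, "MCC": mcc}


if __name__ == "__main__":
    # Made-up counts for an imbalanced test set of 1000 residues.
    for name, value in binary_metrics(tp=50, fp=10, tn=930, fn=10).items():
        print(f"{name}: {value:.3f}")
```

MCC uses all four counts, which is why it remains informative when non-binding residues dominate and specificity alone would look high.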

Figure 2: Evaluation of CLAPE-DB model performance. (a) Receiver operating characteristic (ROC) curves of DBPred and CLAPE-DB models. CLAPE-DB showed a larger area under the ROC curve (AUC) than DBPred, indicating a better generalization ability. (b) Precision-recall (PR) curves of DBPred and CLAPE-DB models. (c) Comparison of different backbone models, where we used an LSTM model to represent RNN. (d) t-SNE dimension reduction plot of the first layer output of a randomly initialized 1DCNN model. (e) t-SNE dimension reduction plot of the first layer output of CLAPE-DB. (f) t-SNE dimension reduction plot of the original sequence features generated by ProtBert. All of (c-f) were tested and plotted using TE46, with cream-colored and red data points indicating non-binding sites and DNA-binding sites, respectively.

Figure 3: Hyperparameter optimization of loss functions. (a) Trends in the AUC and AUPR metrics with varying parameter $\gamma$. Both metrics reached their maximum values when $\gamma$ was set to 5. (b) Trend in the AUC metric with varying learning rate of the triplet center loss (TCL). The AUC reached its maximum value when the learning rate was set to 0.01. (c) Distance distribution of negative and positive samples, where the distance was defined as the maximum Euclidean distance between a given sample and the sample from the opposite class. The embedding used to calculate the distance was the raw sequence embedding generated from ProtBert. (d) Trend in the AUC metric with varying margin of TCL. The AUC reached its maximum value when the margin was set to 9.

Figure 4: Analysis of amino acid composition and properties. (a) Distribution of amino acid composition in DNA-binding sites and non-binding sites. (b) Comparison of the distribution of experimental DNA-binding sites with predicted binding sites. (c-f) Comparison of the distribution of amino acid physicochemical properties and structural properties of real DNA-binding sites with predicted binding sites. (c-f) represent hydrophobicity, secondary structure, charge, and solvent accessibility, respectively.

Figure 5: Comparative and empirical case studies. (a-c) Analysis of the DNA-binding sites for protein 5H3R_A, where (a) represents the experimental result, and (b) and (c) represent the results predicted by CLAPE-DB and DBPred, respectively. (d-f) Analysis of the DNA-binding sites for protein 6C2S_A, where (d) represents the experimental result, and (e) and (f) represent the results predicted by CLAPE-DB and DBPred, respectively. Magenta residues indicate the experimental binding sites and true

Figure 6: General binding sites prediction ability of CLAPE. (a-b) Binding diagrams of protein-RNA (PDB ID: 5GAN) and antibody-antigen (PDB ID: 1OAY), demonstrating the ability of CLAPE to predict protein-ligand binding sites. (c-d) ROC and PR curves of CLAPE-RB and CLAPE-AB models. CLAPE-RB achieved an AUC of 0.830 and an AUPR of 0.511, while CLAPE-AB achieved an AUC of 0.920 and an AUPR of 0.568.
where $m$ is the margin value, i.e., the expected distance between a given positive and negative sample, $c_{y_i}$ is the center of the given class $y_i$, and $x_a$ is an anchor point, while $x^{+}$ and $x^{-}$ are positive and negative data samples, respectively. TCL pushes the positive and negative samples far away from each other and forces the samples of each class to be close to their respective cluster centers. The formulation of TCL can be mathematically expressed as follows:

$$L_{TCL} = \sum_{i} \max\left(D(x_i, c_{y_i}) + m - \min_{j \neq y_i} D(x_i, c_j),\ 0\right)$$

$p_t$ refers to the classification probability predicted by the model. $D$ indicates the Euclidean distance between data points: $D(x_i, c_{y_i}) =$

Table 1: Summary of benchmark protein-DNA binding datasets

Table 2: Comparison of CLAPE-DB with other sequence-based methods on TE46

Table 3: Comparison of CLAPE-DB with other sequence-based methods on TE129

Table 4: Model performance using different loss functions