Abstract

Allergies have become an emerging public health problem worldwide. The most effective way to prevent allergies is to find the causative allergen at the source and avoid re-exposure. However, most of the current computational methods used to identify allergens were based on homology or conventional machine learning methods, which were inefficient and still had room to be improved for the detection of allergens with low homology. In addition, few methods based on deep learning were reported, although deep learning has been successfully applied to several tasks in protein sequence analysis. In the present work, a deep neural network-based model, called DeepAlgPro, was proposed to identify allergens. We showed its great accuracy and applicability to large-scale forecasts by comparing it to other available tools. Additionally, we used ablation experiments to demonstrate the critical importance of the convolutional module in our model. Moreover, further analyses showed that epitope features contributed to model decision-making, thus improving the model’s interpretability. Finally, we found that DeepAlgPro was capable of detecting potential new allergens. Overall, DeepAlgPro can serve as powerful software for identifying allergens.

INTRODUCTION

Allergy is a chronic inflammatory disease that refers to an abnormal immune response to certain chemicals, namely allergens. Common allergens include chemical allergens (e.g. dyes and skincare products), aeroallergens (e.g. dust mites and pollens) and food allergens (e.g. milk and eggs), which are responsible for 90% of allergic diseases [1]. Allergies are estimated to affect approximately 20% of the world’s population [2]. Typical symptoms of allergies include respiratory issues, skin reactions, gastrointestinal problems and cardiovascular disease. Unintentional exposure of an allergic individual to an allergen can result in fatal anaphylaxis [3]. The most common allergic reaction, immunoglobulin E (IgE)-mediated type I hypersensitivity, occurs within minutes when a sensitized individual is re-exposed to the same allergen, which then binds to specific IgE via epitopes. Currently available diagnostic and therapeutic approaches aim to relieve symptoms; however, medication does not provide long-term relief from allergic diseases [4]. Thus, the principle for preventing IgE-mediated type I allergic reactions is to identify allergens, avoid re-contact and cut off or interfere with links in the hypersensitivity process so as to terminate the follow-up reaction [5, 6]. However, if the causative allergen is rare or unknown, exposure to it may be difficult to avoid. Therefore, it is critical to evaluate potential causative allergens, especially now that the number of modified proteins in food, therapeutic drugs and biopharmaceuticals is increasing rapidly [7].

To date, many computational methods for identifying potential allergens have emerged. In 2001, the United Nations Food and Agriculture Organization (FAO) and the World Health Organization (WHO) proposed two criteria for evaluating the structural similarity between new food proteins and known allergens: the first is a sliding-window search with a window of 80 amino acids, in which a similarity of 35% is taken as the criterion for identity; the second evaluates short peptides and checks whether the new protein contains 6–8 consecutive amino acids that are identical to those of a known allergen, which is called a 6–8 mers hit [8]. However, the false positive rate of this method, which is based only on homologous alignment, is high, and many resources may be wasted on incorrect allergenicity evaluations [9]. Since then, allergen prediction tools based on bioinformatics and machine learning have been widely explored. For instance, AllerHunter took pairwise sequence similarity as a feature and used support vector machines for classification [10]. Both AllergenFP [11] and AllerTOP v.2 [12] used amino acid E-descriptors to represent proteins; the former used a fingerprint approach for classification, while the latter compared a variety of machine learning methods and found that K-nearest neighbor (KNN) performed best. In addition, AllerCatPro [13] was developed in 2019 by combining the k-mer hit principle and epitope information to predict allergens and has been upgraded to version 2.0 to achieve more accurate prediction [14]. Moreover, AlgPred 2.0 predicted the allergenicity of proteins using a random forest (RF) model when BLAST search, motif enrichment, and Motif EmeRging and with Classes Identification (MERCI) failed to hit the epitope datasets [15]. Nedyalkova et al. [16] proposed a new chemometric approach to explore the allergenicity of food proteins and found the support vector machine (SVM) to be the best classifier. However, existing approaches suffer from several limitations.

First, new allergens with little similarity to known allergens in existing databases are likely to be missed when only a sequence-similarity-based approach is used. Second, similarity-based methods incorrectly label many proteins as allergens, leading to a waste of resources. For example, only one of the 200 probable allergens predicted by the WHO/FAO proposed principles was actually an allergen [9]. In addition, today’s machine learning methods for identifying allergens are relatively shallow and were developed on small datasets. Therefore, these methods may have difficulty generalizing to the large number of allergens in the wild. Furthermore, the majority of these approaches are not appropriate for analyzing proteins at a large scale because most of them are only available online and accept a very limited number of input sequences (e.g. AllergenFP, AllerTOP v.2 and AllerCatPro 2.0).

Deep learning enables computational models made up of several nonlinear modules to learn data representations with different degrees of abstraction [17]. Due to its powerful feature extraction ability, deep learning has been widely applied in the study of biological sequences, for example in AlphaFold [18, 19], a ground-breaking tool that predicts the structure of proteins. Also, a protein sequence design method, ProteinMPNN, has shown excellent performance in computational and experimental testing [20]. Meanwhile, deep learning is widely employed to predict proteins with certain characteristics and functions, such as antimicrobial peptides [21]. These studies show the great potential of deep learning for protein data mining by characterizing the intrinsic, abstract and complicated patterns of proteins [22, 23]. There is therefore strong motivation to explore the application of deep learning to allergen identification, where accurately extracting the intrinsic characteristics of allergens should yield significantly lower prediction error rates, even for allergens with low homology.

In fact, several studies using deep learning to identify allergens have emerged in recent years. Wang et al. conducted a comparative analysis using transformer-based deep learning and ensemble learning models to identify food allergens, comparing their respective strengths [24]. Shanthappa et al. proposed ProAll-D [25], a deep neural network that used E-descriptors to describe proteins and leveraged long short-term memory (LSTM) to identify allergens. The former, however, only targeted food allergens, and the latter was built on a small dataset.

Here, we propose a software package called DeepAlgPro that implements a model that combines a convolutional neural network (CNN) with multi-headed self-attention (MHSA) and is suitable for large-scale prediction of allergens. In addition to its high prediction accuracy, DeepAlgPro is superior to existing methods for predicting allergens in terms of large-scale prediction. Meanwhile, DeepAlgPro is biologically interpretable and capable of capturing linear epitope features. Furthermore, we found that DeepAlgPro has an excellent ability to identify potential novel allergens. For researchers to implement DeepAlgPro, an open-source package is available at https://github.com/chun-he-316/DeepAlgPro. We expect that DeepAlgPro will contribute to the identification of allergens.

MATERIALS AND METHODS

Datasets

The datasets used in the present study consisted of 3550 allergens and 3550 non-allergens. Allergen datasets were extracted from Structural Database of Allergenic Proteins (SDAP) (https://fermi.utmb.edu/SDAP/), IUIS Allergen Nomenclature (http://allergen.org/), AllergenOnline (https://www.allergome.org/) and COMPARE (https://comparedatabase.org/), coupled with allergens from UniProt release 2022_2 (https://www.uniprot.org/) using the keyword ‘allergen AND reviewed: yes AND Protein Existence: Evidence at protein level’. Besides, we obtained the datasets used in previous studies: 2427 and 10 075 allergens from AllerTOP v.2 and AlgPred 2.0, respectively. Then, we removed 17 071 duplicate allergen sequences with CD-HIT [26], conditioned on a sequence identity threshold of 1 and an alignment coverage of 0.8 for the shorter sequence. In addition, 37 allergens with a length of more than 1000 amino acids (AAs) were removed. The non-allergens were collected from UniProt using the query ‘Protein Existence: Evidence at protein level NOT allergen NOT allergenic NOT allergy NOT cancer NOT antigen AND reviewed: yes’. In order to avoid the learning bias caused by redundant proteins, the non-allergens with a mutual sequence similarity higher than 40% and more than 40% similarity to allergens were deleted using CD-HIT. Finally, we randomly selected 3550 non-allergens for further studies. Note that the proteins that contain non-standard characters (i.e. ‘BJOUXZ’) were removed from the datasets.
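The two simplest filters described above (dropping sequences longer than 1000 amino acids and sequences containing any of the characters ‘BJOUXZ’) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the dictionary input format are assumptions, and the CD-HIT deduplication step is not reproduced.

```python
# Hypothetical helper names; only the two rules stated in the text are applied.
NON_STANDARD = set("BJOUXZ")   # characters named in the text
MAX_LEN = 1000                 # sequences longer than this were removed


def keep_sequence(seq, max_len=MAX_LEN):
    """Return True if a sequence passes the length and alphabet filters."""
    seq = seq.upper()
    if not seq or len(seq) > max_len:
        return False
    # Reject the sequence if it contains any non-standard character.
    return not (set(seq) & NON_STANDARD)


def filter_records(records):
    """Filter a {name: sequence} mapping, keeping only acceptable sequences."""
    return {name: seq for name, seq in records.items() if keep_sequence(seq)}


if __name__ == "__main__":
    demo = {"ok": "MKVLA", "has_x": "MKXLA", "too_long": "A" * 1001}
    print(sorted(filter_records(demo)))
```

Redundancy removal at a given identity threshold would still have to be done with a dedicated tool such as CD-HIT, as in the text.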

To train the model and perform 10-fold cross validation, 80% of allergen and non-allergen data were randomly selected and used as training data. The rest of the data was used as test data to evaluate the final model. Overall, the training data and test data consisted of 5680 and 1420 samples, respectively (Figure 1A).
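The split sizes can be checked with a short sketch. It assumes the 80/20 split is drawn separately within each class and that the 10 folds are formed by simple interleaving of the shuffled training data; the seed and function name are illustrative, not taken from the original code.

```python
import random


def split_and_folds(items, train_frac=0.8, n_folds=10, seed=0):
    """Shuffle items, split into train/test, and cut the training part into folds."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_frac))
    train, test = shuffled[:n_train], shuffled[n_train:]
    # Interleaved slicing gives folds whose sizes differ by at most one.
    folds = [train[i::n_folds] for i in range(n_folds)]
    return train, test, folds


if __name__ == "__main__":
    allergens = [("allergen", i) for i in range(3550)]
    non_allergens = [("non-allergen", i) for i in range(3550)]
    a_train, a_test, _ = split_and_folds(allergens)
    n_train, n_test, _ = split_and_folds(non_allergens)
    # 2840 + 2840 training samples and 710 + 710 test samples
    print(len(a_train) + len(n_train), len(a_test) + len(n_test))
```

With 3550 sequences per class, this gives 2840 training and 710 test sequences per class, which matches the 5680 and 1420 samples reported above.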

Figure 1. An overview of DeepAlgPro. (A) The details of data processing. The collected allergens and non-allergens shorter than 1000 amino acids were filled with N in front. The final dataset contains 3550 allergens and 3550 non-allergens. 80% of the positive and negative samples were used for training and 10-fold cross-validation, and 20% were used for testing. (B) Framework of DeepAlgPro. DeepAlgPro mainly consists of a convolutional module and an attention layer. The input is a protein sequence of length 1000. The output is a prediction score ranging from 0 to 1. If the predicted score is higher than 0.5, the protein is predicted to be an allergen; otherwise, it is predicted to be a non-allergen.

Model architectures

Our model was implemented in PyTorch 1.12.1 and trained on an NVIDIA GeForce RTX 3090. The first layer was a one-hot encoding layer that converts protein sequences into a matrix of |$N\times L$| dimensions, where |$L$| is the maximum sequence length of 1000 and |$N=21$| covers the 20 standard amino acids plus 0 for padding. Then, the matrices were passed into a convolutional layer (in_channels: 21, out_channels: 16, stride: 1, kernel_size: 5) with a non-linear activation function (ReLU) and a maxpool layer (pool_size: 5, stride: 5), which takes the largest value in the local receptive field. A dropout layer was then used to prevent overfitting. The next part was the MHSA mechanism [27] with 8 heads and 24 hidden layers. In the final layer, we used a fully connected layer with the sigmoid function to transform outputs into values between 0 and 1 (Figure 1B). The convolutional layer can be represented as the following:
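A standard form of this operation, consistent with the symbols defined in the text and given here as an assumption rather than as the authors' exact expression, is shown below. The filter index |$j$| is written as a superscript on |$\mathrm{W}$|, the encoding matrix is indexed as position by encoding dimension, and any bias term is omitted.

```latex
\mathrm{conv}(\mathrm{E})_{ij} = \mathrm{ReLU}\left( \sum_{s=0}^{k-1} \sum_{n=0}^{e-1} \mathrm{W}^{j}_{sn}\, \mathrm{E}_{i+s,\,n} \right)
```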

where |$i$| is the index of the output location and |$j$| is the index of the filter. |$\mathrm{E}$| indicates the encoding matrix, and each convolution filter |$\mathrm{W}$| is a matrix of size |$k\times e$|, where |$k$| is the kernel size and |$e$| is the encoding dimension. The |$\mathrm{ReLU}$| can be expressed as the following:
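The rectified linear unit has a single standard definition, reproduced here for completeness:

```latex
\mathrm{ReLU}(x) = \max(0, x)
```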

The self-attention mechanism can be represented as the following:
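The scaled dot-product form from the cited work [27] is sketched below. The input symbol |$\mathrm{X}$| and the key dimension |$d_k$| are notation introduced here for illustration and may differ from the authors' own.

```latex
\mathrm{Attention}(\mathrm{Q}, \mathrm{K}, \mathrm{V}) = \mathrm{softmax}\left( \frac{\mathrm{Q}\mathrm{K}^{\top}}{\sqrt{d_k}} \right) \mathrm{V},
\qquad \mathrm{Q} = \mathrm{X}\mathrm{W}^{\mathrm{Q}},\quad \mathrm{K} = \mathrm{X}\mathrm{W}^{\mathrm{K}},\quad \mathrm{V} = \mathrm{X}\mathrm{W}^{\mathrm{V}}
```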

where |${\mathrm{W}}^{\mathrm{Q}}$|, |${\mathrm{W}}^{\mathrm{K}}$| and |${\mathrm{W}}^{\mathrm{V}}$| indicate the weight matrices for learning the query, key and value, respectively [27]. The MHSA can then be represented as the following:
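Following the multi-head formulation in [27], a standard expression is sketched below. The per-head subscript |$h$|, the input |$\mathrm{X}$| and the output projection |${\mathrm{W}}^{\mathrm{O}}$| are notational assumptions, and Attention denotes the self-attention operation described above.

```latex
\mathrm{MHSA}(\mathrm{X}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_n)\, \mathrm{W}^{\mathrm{O}},
\qquad \mathrm{head}_h = \mathrm{Attention}\left( \mathrm{X}\mathrm{W}^{\mathrm{Q}}_h,\ \mathrm{X}\mathrm{W}^{\mathrm{K}}_h,\ \mathrm{X}\mathrm{W}^{\mathrm{V}}_h \right)
```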

where |$n$| indicates the number of heads.
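The layer sequence described in this section can be approximated in PyTorch as below. This is a sketch under stated assumptions, not the released DeepAlgPro code: the class name, the dropout rate of 0.5, the use of `nn.MultiheadAttention` with an embedding size equal to the 16 convolutional channels, and the flattening before the fully connected layer are all choices made for illustration. The "24 hidden layers" setting mentioned in the text is not modelled because its exact role is not specified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AllergenNetSketch(nn.Module):
    """Illustrative CNN + multi-head self-attention binary classifier."""

    def __init__(self, n_symbols=21, seq_len=1000, dropout=0.5):
        super().__init__()
        self.n_symbols = n_symbols
        # Convolution and pooling settings are those stated in the text.
        self.conv = nn.Conv1d(n_symbols, 16, kernel_size=5, stride=1)
        self.pool = nn.MaxPool1d(kernel_size=5, stride=5)
        self.drop = nn.Dropout(dropout)          # rate is an assumption
        self.attn = nn.MultiheadAttention(embed_dim=16, num_heads=8,
                                          batch_first=True)
        # Length after an unpadded convolution and pooling: 1000 -> 996 -> 199.
        pooled_len = ((seq_len - 5 + 1) - 5) // 5 + 1
        self.fc = nn.Linear(16 * pooled_len, 1)

    def forward(self, idx):
        # idx: (batch, seq_len) integer codes, 0 = padding, 1..20 = amino acids
        x = F.one_hot(idx, num_classes=self.n_symbols).float()  # (B, L, 21)
        x = x.permute(0, 2, 1)                                  # (B, 21, L)
        x = self.drop(self.pool(F.relu(self.conv(x))))          # (B, 16, 199)
        x = x.permute(0, 2, 1)                                  # (B, 199, 16)
        x, _ = self.attn(x, x, x)                               # self-attention
        return torch.sigmoid(self.fc(x.flatten(1))).squeeze(1)  # (B,)


if __name__ == "__main__":
    model = AllergenNetSketch().eval()
    scores = model(torch.randint(0, 21, (2, 1000)))
    print(scores.shape)
```

Each output is a score between 0 and 1, and a score above 0.5 would be read as a predicted allergen.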

During training, we used BCELoss as the loss function and Adam as the optimizer (default parameters apart from the learning rate), with batch_size = 72, learning rate = 0.0001 and epochs = 120. A protein with a prediction score greater than 0.5 was considered a candidate allergen. The model used for evaluation was the one saved when the F1 score reached its largest value during validation.
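A training loop with these settings, keeping the weights from the epoch with the best validation F1, might look like the following sketch. The function names are hypothetical, the batches are assumed to be re-iterable collections of `(inputs, labels)` pairs (for example a `DataLoader` with a batch size of 72), and the model is assumed to return one score in [0, 1] per sample.

```python
import copy

import torch
import torch.nn as nn


def f1_from_scores(scores, labels, threshold=0.5):
    """F1 score for probability scores against 0/1 labels."""
    pred = scores > threshold
    truth = labels > 0.5
    tp = (pred & truth).sum().item()
    fp = (pred & ~truth).sum().item()
    fn = (~pred & truth).sum().item()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def train_keep_best_f1(model, train_batches, val_batches, epochs=120, lr=1e-4):
    """Train with BCELoss and Adam; return the best validation F1 and weights."""
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_f1, best_state = -1.0, None
    for _ in range(epochs):
        model.train()
        for x, y in train_batches:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            scores = torch.cat([model(x) for x, _ in val_batches])
            labels = torch.cat([y for _, y in val_batches])
        f1 = f1_from_scores(scores, labels)
        if f1 > best_f1:  # keep a copy of the best-performing weights
            best_f1, best_state = f1, copy.deepcopy(model.state_dict())
    return best_f1, best_state
```

The returned state dictionary can be written to disk with `torch.save` and restored later with `load_state_dict`.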

Evaluation of performance

To comprehensively evaluate the performance of the different methods for predicting allergens, we analyzed several metrics, including accuracy, precision, recall and F1 score. They can be defined as follows:
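The standard definitions of these four metrics are given below; they are assumed to match the authors' usage.

```latex
\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}, \qquad
\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad
\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad
\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```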

where |$\mathrm{TP}$|, |$\mathrm{TN}$|, |$\mathrm{FP}$| and |$\mathrm{FN}$| denote the numbers of true positive, true negative, false positive and false negative samples, respectively. In addition, we used the precision_recall_curve and roc_curve functions from the sklearn.metrics module (Python 3) to evaluate models with different architectures.
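As a worked example, the four metrics can be computed from confusion-matrix counts as follows. The counts used here are not reported in the paper: they are back-calculated under the assumption that the test set holds 710 allergens and 710 non-allergens, and are chosen because they reproduce the DeepAlgPro row of Table 1.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative counts (assumption): 710 positives and 710 negatives.
    m = classification_metrics(tp=641, tn=660, fp=50, fn=69)
    print(round(100 * m["accuracy"], 2), round(100 * m["precision"], 2),
          round(100 * m["recall"], 2), round(m["f1"], 3))
```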

Finding key regions

We used Gradient-weighted Class Activation Mapping (Grad-CAM) [28] to explore the key regions that affect the final decision of the neural network during the convolutional operation. For each position whose Grad-CAM weight was higher than 80% of the maximum value, we extracted the seven amino acids upstream and downstream and used MEME 5.5.0 (https://meme-suite.org/meme/tools/meme) to find widely occurring motifs (-nmotifs 200). At the same time, we collected linear epitope sequences from AllerBase [29], IgPred [30], and the Immune Epitope Database (IEDB) [31] and found motifs by using MEME 5.5.0 (-evt 0.05). Then Tomtom 5.5.0 (https://meme-suite.org/meme/tools/tomtom) was employed to compare the two sets of motifs.
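The window-selection step can be sketched as below. The sketch assumes one Grad-CAM weight per residue and merges windows that overlap; both are simplifications made for illustration, since the text does not specify how weights are mapped back to residues or how overlapping windows are handled. The function name is hypothetical.

```python
def high_weight_windows(weights, flank=7, frac=0.8):
    """Return (start, end) half-open spans around positions whose weight
    exceeds frac * max(weights), extended by `flank` residues on each side."""
    if not weights:
        return []
    cutoff = frac * max(weights)
    spans = []
    for pos, w in enumerate(weights):
        if w > cutoff:
            start = max(0, pos - flank)
            end = min(len(weights), pos + flank + 1)
            if spans and start <= spans[-1][1]:
                # Merge with the previous span when the windows overlap or touch.
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return spans


if __name__ == "__main__":
    sequence = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
    weights = [0.0] * len(sequence)
    weights[10], weights[12], weights[25] = 1.0, 0.9, 0.5
    for start, end in high_weight_windows(weights):
        print(start, end, sequence[start:end])
```

The extracted subsequences would then be written to a FASTA file and passed to MEME for motif discovery.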

RESULTS

Overview of the DeepAlgPro framework

DeepAlgPro is a model that supports prediction of protein allergenicity. Considering the role of convolutional kernels as motif detectors, which is similar to the 6–8 mers hit principle for allergen identification, we constructed the model based on convolutional neural networks [32–34]. The integrated model combines encoding, feature extraction and binary classification modules (Figure 1B). In brief, the first layer used one-hot encoding to transform protein sequences into matrices. Then, the encoding result was passed to the convolutional and pooling layers for feature extraction. In order to extract key features, the MHSA mechanism was then used to capture connections between amino acids. Finally, we used a fully connected layer to determine whether a protein is allergenic. In general, given a set of allergens and non-allergens, the binary classification model learns a mapping between their key sequence features and categories. Once this relationship is learned, allergenicity predictions for an unknown protein can be made.
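The encoding step and the resulting feature-map length can be illustrated with plain Python. The alphabetical ordering of amino acids, the use of index 0 for front padding and the helper names are assumptions; the convolution is assumed to use no padding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                            # assumed ordering
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}   # 0 = padding
MAX_LEN = 1000


def encode(seq, max_len=MAX_LEN):
    """Map a sequence to integer codes, padded at the front to max_len."""
    idx = [AA_TO_INDEX[aa] for aa in seq]
    if len(idx) > max_len:
        raise ValueError("sequence longer than max_len")
    return [0] * (max_len - len(idx)) + idx


def one_hot(indices, n_symbols=21):
    """Build an N x L one-hot matrix (rows = symbols, columns = positions)."""
    return [[1 if i == row else 0 for i in indices] for row in range(n_symbols)]


def conv_pool_length(length, kernel=5, stride=1, pool=5, pool_stride=5):
    """Output length after an unpadded convolution followed by max pooling."""
    after_conv = (length - kernel) // stride + 1
    return (after_conv - pool) // pool_stride + 1


if __name__ == "__main__":
    matrix = one_hot(encode("MKV"))
    print(len(matrix), len(matrix[0]), conv_pool_length(MAX_LEN))
```

Under these assumptions, a length-1000 input yields 996 positions after the convolution and 199 after pooling, so the attention layer operates on 199 positions with 16 channels each.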

Performance comparison

The methods were compared in terms of accuracy, precision, recall and F1 score (Table 1, Supplementary Tables 1–5 available online at http://bib.oxfordjournals.org/). As shown, the accuracy, precision, recall and F1 of DeepAlgPro reached 91.62%, 92.76%, 90.28% and 0.915, respectively. Its accuracy, precision and F1 were all second only to those of AllerCatPro 2.0, a workflow that combines k-mer matches, gluten-like Q-repeats and 3D epitope similarity, which was only about 2% higher than DeepAlgPro. DeepAlgPro clearly outperformed the LSTM-based method ProAll-D [25], whose metrics were only about 85%. AlgPred 2.0 achieved the highest recall, reaching 97.61%, but its precision was the lowest, at only 85.14%. Among the methods that represent proteins by physicochemical properties, AllerTOP v.2 achieved 91.24% precision using KNN to classify allergens and non-allergens, in contrast to AllergenFP, which classified by Tanimoto coefficient similarity and only achieved 86.58%. In summary, DeepAlgPro, AllerCatPro 2.0 and AlgPred 2.0 performed well overall, with DeepAlgPro ranking second in accuracy, precision and F1.

Table 1. The test performance of DeepAlgPro and reported methods

Method             Accuracy (%)   Precision (%)   Recall (%)   F1
DeepAlgPro         91.62          92.76           90.28        0.915
AlgPred 2.0        90.28          85.14           97.61        0.909
AllerCatPro 2.0    93.80          93.31           94.36        0.938
AllerTOP v.2       89.79          91.24           88.03        0.896
AllergenFP         87.61          86.58           89.01        0.878
ProAll-D           85.99          86.76           84.93        0.858
Figure 2. The results of the ablation experiment. Without MHSA indicates a model that removes the MHSA layer from DeepAlgPro. Without CNN indicates a model that removes the convolutional layer. (A) The performance of Area Under the Receiver Operating Characteristic Curves (AUROCs) on the ablation analysis of DeepAlgPro. (B) The performance of Area Under the Precision-Recall Curves (AUPRCs) on the ablation analysis of DeepAlgPro.

Ablation experiments

To investigate the contributions of different parts of our proposed model, we conducted ablation experiments. Starting from the original model, we removed the attention mechanism and the convolutional layer in turn and tested the resulting models. Compared with the full DeepAlgPro (CNN + MHSA) pipeline, in which the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision Recall Curve (AUPRC) reached 0.963 and 0.969, respectively, AUROC and AUPRC were only 0.857 and 0.848 without the convolutional layer (Figure 2). In contrast, without the attention mechanism, the two indices only dropped by 0.012 and 0.009, respectively. In a similar vein, when the convolutional layer was removed from the model, the other metrics were significantly lower, particularly accuracy and precision, which decreased by 14.44 and 18.08 percentage points, respectively. However, accuracy and recall only dropped by 3.03 and 1.41 percentage points, respectively, after the MHSA was removed, and the F1 score only dropped by 0.029, still reaching 0.886. The largest change, in precision, was also less than 5 percentage points (Table 2). These results show that the convolutional layer played a vital role in our predictive model and may have learned key features of allergens. This may be due to the similarity of the convolution operation to the k-mer principle, which FAO/WHO proposed for predicting allergens [32–34]. Therefore, we conclude that the combination of CNN and MHSA performed best in our study.
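The differences quoted above follow directly from the values in Table 2, as this short check illustrates. The dictionary layout and function name are illustrative; the numbers are copied from Table 2.

```python
TABLE_2 = {
    "DeepAlgPro":   {"accuracy": 91.62, "precision": 92.76, "recall": 90.28, "f1": 0.915},
    "Without MHSA": {"accuracy": 88.59, "precision": 88.38, "recall": 88.87, "f1": 0.886},
    "Without CNN":  {"accuracy": 77.18, "precision": 74.68, "recall": 82.25, "f1": 0.783},
}


def drops(variant, table=TABLE_2):
    """Absolute drop of each metric relative to the full model."""
    full = table["DeepAlgPro"]
    return {m: round(full[m] - table[variant][m], 3) for m in full}


if __name__ == "__main__":
    for variant in ("Without MHSA", "Without CNN"):
        print(variant, drops(variant))
```

Accuracy, precision and recall differences are in percentage points, while the F1 difference is on the 0 to 1 scale.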

Model visualization

To better reflect the classification performance of the model, we extracted and visualized the input matrices, the outputs of the convolutional layer, the outputs of the attention layer and the outputs of the fully connected layer. We used UMAP [35] to reduce dimensions and visualize the distribution of positive and negative samples. The original features were clearly disordered (Figure 3A). After the mapping of the convolutional and attention layers, the distributions of allergens and non-allergens tended to separate (Figure 3B and C). The outputs of the fully connected layer clearly separated allergens from non-allergens (Figure 3D). At the same time, however, some overlaps resulting in incomplete boundaries could be observed in Figure 3D. We suspect that these heterogeneous samples located close together in the feature space may imply crosstalk points and some potentially undetected allergens. These results indicate that our model indeed extracted latent features of proteins that suffice to aid decision-making when identifying allergens.

Table 2. The results of the ablation experiments

Model               Accuracy (%)   Precision (%)   Recall (%)   F1
DeepAlgPro          91.62          92.76           90.28        0.915
Without MHSA [a]    88.59          88.38           88.87        0.886
Without CNN [b]     77.18          74.68           82.25        0.783

[a] Without MHSA indicates a model that removes the MHSA mechanism from DeepAlgPro.

[b] Without CNN indicates a model that removes the convolutional layer.

Figure 3. Distribution of allergens and non-allergens in two-dimensional feature space. Red and blue dots represent allergens and non-allergens, respectively. (A) Visualization of input features. (B) Feature visualization of outputs from the convolutional layer. (C) Feature visualization of outputs from the attention layer. (D) Feature visualization of outputs from the fully connected layer.

Biological interpretability of the convolutional layer

Epitopes are chemical groups present on the surface of allergens with a specific structure and immunological activity that determine the antigen’s specificity [36]. IgE epitopes are responsible for inducing specific antibody production and thus allergic reactions in animals, which means that a protein is considered an allergen if it contains an IgE epitope or has excellent sequence similarity to an IgE epitope [15, 37]. Since our framework can effectively identify allergens, we hypothesized that this framework would learn epitope information, which is the key feature for telling allergens apart from non-allergens. Therefore, Grad-CAM, a convolutional network visualization method based on gradient localization, which translates the neural network’s classification basis in the form of a heat map, was employed to verify this hypothesis.

First of all, the motif comparison software Tomtom was used to compare the motifs extracted from regions with high Grad-CAM weights with those extracted from the collected epitopes. There were 11 pairs of matched motifs with statistical significance (p-value < 1e−03), and the alignments of the five best-matched pairs are shown in Figure 4A–E (Supplementary Table 7 available online at http://bib.oxfordjournals.org/). In each case, the top panel is the motif extracted from epitopes, and the bottom panel is the motif extracted from regions with high Grad-CAM weights. The motifs extracted from the regions that led the model to judge proteins as allergens were very similar to those in epitopes. Among them, the Tomtom p-values of the matches in Figure 4A and B were 1.20e-08 and 2.60e-08, respectively, representing the most similar motifs in our study.

Figure 4. Interpretability of convolutional layers. (A–E) The alignment for the five most matched pairs of motifs in epitopes and regions with high Grad-CAM weights. In each case, the top panel is the motif extracted from epitopes, and the bottom panel is the motif extracted from regions with high Grad-CAM weights. The Tomtom p-value (lower means better match) of each match is shown at the bottom. (F) Grad-CAM results of Gad m 1. Epitope sequences are marked below. The numbers on the upper left and right of the sequence represent its start and end sites in the allergen. The same applies to the panels below. (G) Grad-CAM results of Fel d 1 (chain 2). (H) Grad-CAM results of Ara h 1.

Next, we explored the details of the epitopes in individual allergens. Gad m 1 is an allergenic parvalbumin from Atlantic cod (Gadus morhua). Testing sera from 13 fish-allergic patients against synthetic peptides of Gad m 1 has demonstrated that peptide 94–109 (GDGKIGVDEFGAMIKA) is a major IgE-binding site in Gad m 1 [38]. We found that the peptides with the highest Grad-CAM weight were in this epitope region (Figure 4F). Also, the peptide 32–45 (FAVANGNELLLDLS) in Fel d 1 (chain 2), an allergen from cat (Felis domesticus) saliva, showed specific binding to human IgE (Figure 4G) [39]. In our study, peptide 32–45 clearly contains the region that caused our framework to classify Fel d 1 as an allergen. Similarly, the three linear epitope sequences (AKSSPYQKKT, EQEERGQRRW and QEPDDLKQKA) of Ara h 1, a major allergen in peanut hypersensitivity, were also successfully captured by our model (Figure 4H) [40]. In addition, we found many highly weighted sequence locations that have not yet been proven to be epitopes, and we assume that they may be as yet unverified epitope regions.

In sum, by utilizing Grad-CAM to uncover motifs and produce visual explanations for the model’s judgments, we demonstrated that epitope information has an important impact on the final decision when our framework performed allergen prediction, consistent with the idea that epitope information can be utilized as a differentiating criterion between allergens and non-allergens. Therefore, in addition to predicting allergens, our framework is also able to identify epitope sequences. Additionally, by increasing the model’s transparency in this way, we demonstrated the biological interpretability of deep learning applied to protein sequences.

Identification of novel allergens

Limiting the use of potentially allergenic proteins has become especially crucial with the rising usage of modified proteins in daily life. However, new allergens that are poorly homologous to known proteins would inevitably go undetected if only homology-based approaches were used to identify allergenic proteins. To comprehensively evaluate our proposed framework, we assessed the capacity of DeepAlgPro and existing tools to recognize novel allergens. We collected 24 novel allergens that were recently reported or submitted to IUIS Allergen Nomenclature but were not included in the training and test datasets, then used DeepAlgPro and other methods to predict their allergenicity [41–48]. As shown in Supplementary Table 6 available online at http://bib.oxfordjournals.org/, DeepAlgPro, AllerCatPro 2.0 and AlgPred 2.0 each successfully identified 21–22 of the allergens. In contrast, AllerTOP v.2, AllergenFP and ProAll-D only considered 11–13 proteins to be allergenic. It is noteworthy that among the 24 allergens, ten have less than 40% similarity with the 3550 positive samples used for training and testing, and DeepAlgPro and AlgPred 2.0 successfully identified eight of them as allergens, followed by AllerCatPro 2.0 with seven. AlgPred 2.0 performed well, possibly due to the application of machine learning, while AllerCatPro 2.0 performed well, possibly due to the use of 3D epitope mapping. It should be highlighted that the allergens missed by the three best tools above did not overlap; therefore, in practical applications, we recommend combining the strengths of these tools. In general, DeepAlgPro’s ability to identify new allergens is among the best of the existing tools.

DISCUSSION

Due to its widespread prevalence, enormous burden on patients’ quality of life, and socioeconomic impact, allergy has become a major health problem. As a result, it is critical to identify allergens and prevent contact with them at the source. Here, we propose DeepAlgPro, a deep learning framework that uses a convolutional neural network combined with the MHSA mechanism to achieve successful allergenicity prediction of proteins by learning the characteristics of allergens.

By testing with the same data, we found that the overall performance of DeepAlgPro in identifying allergens ranked second among the tested approaches, only slightly below AllerCatPro 2.0. We also noticed that the other methods, although offered as web pages that are user-friendly for practitioners, only support the analysis of a small amount of data, which is not conducive to large-scale analysis. For example, AllerTOP v.2, AllergenFP and ProAll-D can only analyze one protein sequence at a time, and AllerCatPro 2.0 can only take 50 proteins at a time. Thus, they may not be suitable for large-scale prediction, which is required for tasks such as allergen identification at the genome level. DeepAlgPro, in contrast, rapidly predicts the allergenicity of a large number of proteins, with a GPU runtime of 0.287 s for 1000 proteins on our device and network state. In addition, the result reports a prediction score between 0 and 1 rather than only a direct classification, so users can adjust the threshold for identifying allergens according to the score. In summary, our study illustrates that DeepAlgPro is convenient software with high accuracy that is suitable for predicting the allergenicity of proteins at a large scale.

FAO/WHO proposed two guidelines based on sequence similarity and amino acid hits within windows, respectively; however, their false-positive rate is too high. Many tools proposed afterwards, such as AllerCatPro and AlgPred 2.0, also mostly used the BLAST-based sequence similarity approach and the k-mer hit principle. But not all potential allergens have a high similarity to known allergens. Although some methods using machine learning were proposed afterwards, they were still unable to achieve good predictions because of their shallow models [10, 12, 15, 49]. In the present work, we propose a deep neural network framework to overcome the limitations of conventional machine learning methods. We tested it with the newly collected allergens, including those with less than 40% similarity to known allergens, and the results showed that DeepAlgPro is clearly more effective than ProAll-D, a recently reported method that is also based on deep learning [25]. In the test of identifying new allergens, we found that the general performances of DeepAlgPro, AllerCatPro 2.0 and AlgPred 2.0 were comparable, while the allergens each tool identified correctly differed. Of course, although our dataset already contains allergens from a variety of sources, including mites, insects, edible plants, fungi, food animals, venom and saliva, we cannot completely rule out the possibility that future predictions for new proteins will be incorrect due to possible imbalances in the data [23]. Therefore, for practical applications, we suggest that different approaches be used in combination.

One of the main drawbacks of deep learning models is that they are harder to interpret than simpler regression models, particularly in biology, because it is challenging to decipher what each neuron means and which factors are crucial for modeling success [50]. Although these models are often regarded as a ‘black box’, the deep learning community is working hard to develop explanations for them [51]. In our study, because epitopes can be used to distinguish allergens from non-allergens, we hypothesized that the network may have learned epitope information. Indeed, by using Grad-CAM to identify the regions that most affect the model's decisions, we found that epitopes contributed to correct predictions. These results suggest that DeepAlgPro may focus on learning epitope information, which illustrates how artificial intelligence applied to biological sequences can be made interpretable. However, in addition to linear epitopes, allergens can also bind to immune cells through conformational epitopes composed of discontinuous amino acids, which were not considered in our model [52–54]. If protein structure information from AlphaFold2 [19] were also used as input, so that the model could learn the features of conformational epitopes, prediction accuracy might improve further, offering a better foundation for allergy-related clinical treatment, prediction and prevention [13, 52–54].
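The Grad-CAM computation for a one-dimensional (sequence) input can be sketched as below. This is a minimal illustration under stated assumptions: the activations and gradients of a convolutional layer are taken as already extracted, the function names are hypothetical, and the numbers are toy values rather than outputs of our model.

```python
from typing import List


def grad_cam_1d(activations: List[List[float]], gradients: List[List[float]]) -> List[float]:
    """Grad-CAM importance for each sequence position.

    activations[k][i]: output of channel k at position i of a convolutional layer.
    gradients[k][i]:   gradient of the class score with respect to that output.
    """
    n_positions = len(activations[0])
    # Channel weight: gradient averaged over positions (global average pooling).
    weights = [sum(g) / len(g) for g in gradients]
    cam = []
    for i in range(n_positions):
        value = sum(w * a[i] for w, a in zip(weights, activations))
        cam.append(max(value, 0.0))  # ReLU keeps only evidence in favor of the class
    return cam


def top_positions(cam: List[float], n: int) -> List[int]:
    """Indices of the n positions with the highest importance."""
    return sorted(range(len(cam)), key=lambda i: cam[i], reverse=True)[:n]


# Toy values: two channels, four sequence positions.
toy_activations = [[0.0, 1.0, 2.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
toy_gradients = [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]]

cam = grad_cam_1d(toy_activations, toy_gradients)
important = set(top_positions(cam, 2))
hypothetical_epitope = {1, 2}  # assumed epitope positions, for illustration
overlap = important & hypothetical_epitope
```

Comparing the highest-scoring positions with annotated epitope coordinates, as in the last line, is one simple way to check whether the model attends to epitope regions.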

Key Points
  • DeepAlgPro is a deep learning-based tool for predicting allergenic proteins.

  • DeepAlgPro displays great precision and accuracy and is suitable for large-scale prediction.

  • Epitope features contribute to model decision-making, making it biologically interpretable.

  • DeepAlgPro is capable of mining novel allergens with low homology to known allergens.

ACKNOWLEDGEMENTS

We thank Dr Qinglu Zhong at Shanghai Institute for Advanced Study (Zhejiang University) and Dr Yu Sun at University of Rochester for comments on this research.

FUNDING

Key Program of National Natural Science Foundation of China (NSFC) (Grant no. 31830074 to G.Y.Y.), the Program of NSFC (Grant no. 32202376 to X.H.Y.), the Program for Chinese Innovation Team in Key Areas of Science and Technology of Ministry of Science and Technology of the People’s Republic of China (Grant no. 2016RA4008 to G.Y.Y.), the China Postdoctoral Science Foundation (Grant no. 2021M700125 to X.H.Y.) and the Young Elite Scientists Sponsorship Program by China Association for Science and Technology (Grant no. 2022QNRC001 to X.H.Y.).

AUTHOR CONTRIBUTIONS

G.Y.Y., F.W. and X.H.Y. designed and managed the project. C.H. collected the data for training, built the model and performed the main analysis. L.Y.H. and Y.X.S. participated in building the model and interpretability analysis. Y.Y., X.X.Z., and L.F.C. participated in discussions. C.H., X.H.Y., Y.Y., Y.W., Q.F., F.W., and G.Y.Y. wrote and revised the manuscript. All authors read and approved the manuscript.

DATA AVAILABILITY

The source codes and data are available at https://github.com/chun-he-316/DeepAlgPro.

Author Biographies

Chun He is a master's student at the College of Agriculture and Biotechnology, Zhejiang University. Her research interests include deep learning and omics analysis.

Xinhai Ye is a Postdoctoral Fellow at the Shanghai Institute for Advanced Study and the Institute of Artificial Intelligence at Zhejiang University. His research focuses on omics analysis, molecular evolution and deep learning.

Yi Yang is a Postdoctoral Fellow at the College of Agriculture and Biotechnology, Zhejiang University. His research interests lie in omics analysis and molecular evolution.

Liya Hu is a master's student at the College of Computer Science and Technology, Zhejiang University. Her research focuses on deep learning and computational pedagogy.

Yuxuan Si is a PhD candidate jointly trained by the Zhejiang University School of Medicine and the College of Computer Science and Technology. Her research interests include single-cell computational biology and gene regulatory network inference.

Xianxin Zhao is a Postdoctoral Fellow at the College of Agriculture and Biotechnology, Zhejiang University. Her research focuses on omics analysis and molecular evolution.

Longfei Chen is a PhD candidate at the College of Agriculture and Biotechnology, Zhejiang University. His research focuses on biological data analysis and deep learning.

Qi Fang is an associate professor at the Institute of Insect Sciences, Zhejiang University. His research interests include agricultural insects and pest control, biological control of plant pests, insect physiology, biochemistry and molecular biology.

Ying Wei is an assistant professor at the Department of Computer Science, City University of Hong Kong. She is interested in developing algorithms that equip machines with more general intelligence via knowledge transfer.

Fei Wu is “Qiu-Shi” Distinguished Professor at Zhejiang University and Director of the Institute of Artificial Intelligence in the College of Computer Science and Technology. His research areas include artificial intelligence, multimedia analysis and retrieval, and statistical learning theory.

Gongyin Ye is “Qiu-Shi” Distinguished Professor at Zhejiang University and Vice Director of the State Key Laboratory of Rice Biology and Breeding. His research interests include insect physiology, biochemistry, molecular biology, insect functional omics, transgenic organism safety and plant pest biocontrol.

REFERENCES

1. Aldakheel FM. Allergic diseases: a comprehensive review on risk factors, immunological mechanisms, link with COVID-19, potential treatments, and role of allergen bioinformatics. Int J Environ Res Public Health 2021;18:12105.

2. Singh B, Karnwal A, Tripathi A, et al. Food allergens and related computational biology approaches: a requisite for a healthy life. In: Upadhyay AK, Sowdhamini R, Patil VU (eds). Bioinformatics for Agriculture: High-throughput Approaches. Singapore: Springer, 2021, 145–60.

3. Turner PJ, Jerschow E, Umasunthar T, et al. Fatal anaphylaxis: mortality rate and risk factors. J Allergy Clin Immunol Pract 2017;5:1169–78.

4. Pramod SN. Immunological basis for the development of allergic diseases-prevalence, diagnosis and treatment strategies. In: Singh B (ed). Cell Interaction - Molecular and Immunological Basis for Disease Management. London, UK: IntechOpen, 2021.

5. Umetsu DT, Rachid R, Schneider LC. Oral immunotherapy and anti-IgE antibody treatment for food allergy. World Allergy Organ J 2015;8:20.

6. Sicherer SH, Sampson HA. Food allergy: epidemiology, pathogenesis, diagnosis, and treatment. J Allergy Clin Immunol 2014;133:291–+.

7. Fernandez A, Mills ENC, Koning F, Moreno FJ. Allergenicity assessment of novel food proteins: what should be improved. Trends Biotechnol 2021;39:4–8.

8. FAO/WHO. Evaluation of Allergenicity of Genetically Modified Foods. Report of a Joint FAO/WHO Expert Consultation on Allergenicity of Foods Derived from Biotechnology. Rome, Italy: Food and Agriculture Organization of the United Nations (FAO) and World Health Organization (WHO), 2001, 22–25.

9. Stadler MB, Stadler BM. Allergenicity prediction by protein sequence. FASEB J 2003;17:1141–3.

10. Muh HC, Tong JC, Tammi MT. AllerHunter: a SVM-pairwise system for assessment of allergenicity and allergic cross-reactivity in proteins. PloS One 2009;4:e5861.

11. Dimitrov I, Naneva L, Doytchinova I, Bangov I. AllergenFP: allergenicity prediction by descriptor fingerprints. Bioinformatics 2014;30:846–51.

12. Dimitrov I, Bangov I, Flower DR, Doytchinova I. AllerTOP v.2 - a server for in silico prediction of allergens. J Mol Model 2014;20:2278.

13. Maurer-Stroh S, Krutz NL, Kern PS, et al. AllerCatPro - prediction of protein allergenicity potential from the protein sequence. Bioinformatics 2019;35:3020–7.

14. Nguyen MN, Krutz NL, Limviphuvadh V, et al. AllerCatPro 2.0: a web server for predicting protein allergenicity potential. Nucleic Acids Res 2022;50:W36–43.

15. Sharma N, Patiyal S, Dhall A, et al. AlgPred 2.0: an improved method for predicting allergenic proteins and mapping of IgE epitopes. Brief Bioinform 2021;22:bbaa294.

16. Nedyalkova M, Vasighi M, Azmoon A, et al. Sequence-based prediction of plant allergenic proteins: machine learning classification approach. ACS Omega 2023;8:3698–704.

17. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–44.

18. Hiranuma N, Park H, Baek M, et al. Improved protein structure refinement guided by deep learning based accuracy estimation. Nat Commun 2021;12:1340.

19. Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.

20. Dauparas J, Anishchenko I, Bennett N, et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 2022;378:49–56.

21. Ma Y, Guo Z, Xia B, et al. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat Biotechnol 2022;40:921–31.

22. Angermueller C, Pärnamaa T, Parts L, et al. Deep learning for computational biology. Mol Syst Biol 2016;12:878.

23. Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinform 2017;18:851–69.

24. Wang L, Niu D, Zhao X, et al. A comparative analysis of novel deep learning and ensemble learning models to predict the allergenicity of food proteins. Foods 2021;10:809.

25. Shanthappa PM, Kumar R. ProAll-D: protein allergen detection using long short term memory - a deep learning approach. ADMET DMPK 2022;10:231–40.

26. Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2.

27. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Red Hook, NY, USA: Curran Associates Inc., 2017, 6000–10.

28. Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis 2020;128:336–59.

29. Kadam K, Karbhal R, Jayaraman VK, et al. AllerBase: a comprehensive allergen knowledgebase. Database (Oxford) 2017;2017:bax066.

30. Gupta S, Ansari HR, Gautam A, et al. Identification of B-cell epitopes in an antigen for inducing specific class of antibodies. Biol Direct 2013;8:27.

31. Vita R, Mahajan S, Overton JA, et al. The immune epitope database (IEDB): 2018 update. Nucleic Acids Res 2019;47:D339–43.

32. Alipanahi B, Delong A, Weirauch MT, et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 2015;33:831–8.

33. Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res 2016;44:e107.

34. Trabelsi A, Chaabane M, Ben-Hur A. Comprehensive evaluation of deep learning architectures for prediction of DNA/RNA sequence binding specificities. Bioinformatics 2019;35:i269–77.

35. McInnes L, Healy J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

36. Kapingidza AB, Kowal K, Chruszcz M. Antigen-antibody complexes. Subcell Biochem 2020;94:465–97.

37. Fu Z, Lin J. An overview of bioinformatics tools and resources in allergy. Methods Mol Biol 2017;1592:223–45.

38. Perez-Gordo M, Pastor-Vargas C, Lin J, et al. Epitope mapping of the major allergen from Atlantic cod in Spanish population reveals different IgE-binding patterns. Mol Nutr Food Res 2013;57:1283–90.

39. van Milligen J, Van W. IgE epitopes on the cat (Felis domesticus) major allergen Fel d I: a study with overlapping synthetic peptides. J Allergy Clin Immunol 1994;93:34–43.

40. Burks AW, Shin D, Cockrell G, et al. Mapping and mutational analysis of the IgE-binding epitopes on Ara h 1, a legume vicilin protein and a major allergen in peanut hypersensitivity. Eur J Biochem 1997;245:334–9.

41. González Mahave I, Lobera T, López-Matas MA, et al. Sensitization to vitis vinifera pollen in a wine production area. Identification of the allergens involved. J Investig Allergol Clin Immunol 2022;33:0.

42. Ling X-J, Zhou Y-J, Yang Y-S, et al. A new cysteine protease allergen from Ambrosia trifida pollen: proforms and mature forms. Mol Immunol 2022;147:170–9.

43. Ortega-Martín L, Sastre B, Rodrigo-Muñoz J, et al. Anaphylaxis after mango fruit intake: identification of new allergens. J Investig Allergol Clin Immunol 2022;32:401–3.

44. Wang Y, Zhang Y, Lou H, et al. Hexamerin-2 protein of locust as a novel allergen in occupational allergy. JAA 2022;15:145–55.

45. Xu Z-Q, Zhu L-X, Lu C, et al. Identification of Per a 13 as a novel allergen in American cockroach. Mol Immunol 2022;143:41–9.

46. Yang Y-S, Xu Z-Q, Zhu W, et al. Molecular and immunochemical characterization of profilin as major allergen from Platanus acerifolia pollen. Int Immunopharmacol 2022;106:108601.

47. Brassea-Estardante HA, Martínez-Cruz O, Cárdenas-López JL, et al. Identification of arginine kinase as an allergen of brown crab, Callinectes bellicosus, and in silico analysis of IgE-binding epitopes. Mol Immunol 2022;143:147–56.

48. Zhu C, Wang C, Zhou J, et al. Purification and identification of globulin-1 S allele as a novel allergen with N-glycans in wheat (Triticum aestivum). Food Chem 2022;390:133189.

49. Dimitrov I, Flower DR, Doytchinova I. AllerTOP - a server for in silico prediction of allergens. BMC Bioinform 2013;14:S4.

50. Sapoval N, Aghazadeh A, Nute MG, et al. Current progress and open challenges for applying deep learning across the biosciences. Nat Commun 2022;13:1728.

51. Montavon G, Samek W, Müller K-R. Methods for interpreting and understanding deep neural networks. Digit Signal Process 2018;73:1–15.

52. Bragin AO, Demenkov PS, Kolchanov NA, et al. Accuracy of protein allergenicity prediction can be improved by taking into account data on allergenic protein discontinuous peptides. J Biomol Struct Dyn 2013;31:59–64.

53. Barlow DJ, Edwards MS, Thornton JM. Continuous and discontinuous protein antigenic determinants. Nature 1986;322:747–8.

54. Scheurer S, Toda M, Vieths S. What makes an allergen? Clin Exp Allergy 2015;45:1150–61.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights)
