Evidential meta-model for molecular property prediction

Abstract
Motivation: The usefulness of supervised molecular property prediction (MPP) is well recognized in many applications. However, the insufficiency and imbalance of labeled data make the learning problem difficult. Moreover, the reliability of the predictions is also a hurdle to the deployment of MPP models in safety-critical fields.
Results: We propose the Evidential Meta-model for Molecular Property Prediction (EM3P2), a method that returns uncertainty estimates along with its predictions. EM3P2 trains an evidential graph isomorphism network classifier on multi-task molecular property datasets under the model-agnostic meta-learning (MAML) framework while addressing the problem of data imbalance. Our results show better prediction performance than existing meta-MPP models. Furthermore, we show that the uncertainty estimates returned by EM3P2 can be used to reject uncertain predictions in applications that require higher confidence.
Availability and implementation: Source code is available at https://github.com/Ajou-DILab/EM3P2.

• MAML [1] optimizes the initial parameters of a model so that it can quickly adapt to new tasks. MAML enables the model to perform well in multi-task, few-shot scenarios by fine-tuning these parameters with a few gradient steps on each new task.
• Pre-GNN [2] has emerged as a notable technique for optimization-based few-shot and meta-learning on graph-structured data. It pre-trains a graph neural network on various tasks under the MAML framework, thereby enhancing its ability to generalize to new tasks involving graph data.
• Property-aware Relation Network (PAR) [7] uses negative and positive examples to obtain a property-aware embedding function that transforms generic molecular embeddings into a substructure-aware space. An adaptive relation graph learning module then estimates the relation graph between the few-shot examples. The whole process is trained under a meta-learning approach.
• Siamese Networks [3] have been extensively studied for similarity learning in one-shot settings. During training, two inputs are processed through shared weights, allowing the network to learn a distance metric that measures the similarity between pairs of instances. If both inputs are positive, the relationship is considered positive (similar); otherwise it is negative.
• Prototypical Networks [6] learn a metric space in which instances of the same class lie closer together, enabling efficient classification with limited labeled data. The model uses prototypes, the averages of the embeddings in each class, and measures the similarity between the prototypes and a new sample to determine which class the sample belongs to.
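The MAML inner/outer update described above can be sketched as follows. This is a minimal first-order illustration on toy quadratic losses, not the EM3P2 training setup; the learning rates, step counts, and loss functions are illustrative assumptions.

```python
import numpy as np

def inner_adapt(theta, task_grad, inner_lr=0.1, steps=3):
    """Adapt meta-parameters to one task with a few gradient steps (MAML inner loop)."""
    phi = theta.copy()
    for _ in range(steps):
        phi -= inner_lr * task_grad(phi)
    return phi

def maml_step(theta, tasks, meta_lr=0.01, inner_lr=0.1):
    """First-order MAML outer update: average the post-adaptation gradients across tasks
    (ignores second-order terms, as in the common FOMAML approximation)."""
    meta_grad = np.zeros_like(theta)
    for task_grad in tasks:
        phi = inner_adapt(theta, task_grad, inner_lr)
        meta_grad += task_grad(phi)
    return theta - meta_lr * meta_grad / len(tasks)

# Toy tasks: quadratic losses (theta - c)^2 with gradients 2 * (theta - c)
tasks = [lambda p, c=c: 2.0 * (p - c) for c in (1.0, -1.0, 0.5)]
theta = np.zeros(1)
for _ in range(100):
    theta = maml_step(theta, tasks)
# theta drifts toward an initialization from which every task is reachable in a few steps
```

For these quadratic tasks the meta-optimum pulls the initialization toward the mean of the task optima; with a real model the inner loop would run gradient descent on a few labeled support examples per task instead.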

Learning Curve and Uncertainty Quantification in Training
In the context of model training, several factors play a crucial role in determining the effectiveness of the training process. These factors include the belief vector, vacuity, dissonance, and wrong belief [5]. As shown in Fig. 1, the ultimate goal of model training is to minimize dissonance (dis), wrong belief (wbv), and vacuity (vac), while simultaneously promoting the emergence of correct belief (cbv). When these dynamics are accurately depicted in a graph, it signifies that the model has been trained correctly or is well regularized. By systematically following this training trend, the model becomes more reliable and capable of providing accurate predictions and evidence.
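Under the subjective-logic formulation used in evidential learning [5], belief, vacuity, and the correct/wrong belief masses follow directly from the per-class Dirichlet evidence. The sketch below assumes the standard parameterization (alpha = evidence + 1); the evidence values are toy inputs, not model outputs.

```python
import numpy as np

def evidential_quantities(evidence, true_class):
    """Belief vector, vacuity, and correct/wrong belief mass from Dirichlet evidence.

    evidence: non-negative per-class evidence (e.g. a network's non-negative output).
    """
    evidence = np.asarray(evidence, dtype=float)
    K = evidence.size
    S = evidence.sum() + K          # Dirichlet strength, with alpha = evidence + 1
    belief = evidence / S           # per-class belief mass
    vacuity = K / S                 # uncertainty mass (vac); beliefs + vacuity sum to 1
    cbv = belief[true_class]        # correct belief (cbv)
    wbv = belief.sum() - cbv        # wrong belief (wbv)
    return belief, vacuity, cbv, wbv

# Strong evidence for class 0 -> low vacuity, high correct belief
b, vac, cbv, wbv = evidential_quantities([9.0, 1.0], true_class=0)
```

Training that follows the trend described above drives cbv up while pushing vac and wbv down; dissonance additionally penalizes conflicting belief masses of similar magnitude.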

Full Ablation Study Result Including Calibration Errors
Table 4 shows the performance of each variant of our EM3P2.
In addition to the area under the receiver operating characteristic curve (ROC-AUC), we compared the performance of the methods using accuracy (ACC). We also measured the calibration error based on test results binned by confidence values. The expected calibration error (ECE) is defined as follows [4]:

ECE = Σ_{m=1}^{M} (|B_m| / n) · |acc(B_m) − conf(B_m)|,

where acc(B_m) and conf(B_m) are the average accuracy and average confidence for the test data in the m-th bin B_m, |B_m| is the number of test samples in that bin, and n is the total number of test samples. The ideal result has both a high accuracy value and a low calibration error; a low calibration error is not meaningful when accuracy is low, and vice versa.
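The ECE computation above can be sketched directly from the definition. The bin count of 10 is a common default, not a value stated in the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted |acc(B_m) - conf(B_m)| over equal-width confidence bins [4]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()        # acc(B_m): fraction correct in the bin
            conf = confidences[in_bin].mean()   # conf(B_m): mean confidence in the bin
            ece += in_bin.sum() / n * abs(acc - conf)
    return ece
```

A perfectly calibrated model (e.g. 90% confidence with 90% empirical accuracy) yields an ECE of 0; an overconfident one accumulates the per-bin accuracy/confidence gaps.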

Effect of Positively Biased and Negatively Biased Tasks on Accuracy vs Calibration Curve
In testing the SIDER dataset, we found that the accuracy of test task 23 increased as uncertainty increased. To investigate whether this anomaly was due to a mixture of negatively and positively biased tasks, we conducted the following two experiments. First, we trained the meta-model with balanced and positively biased tasks and tested it with the negatively biased tasks. The left plot in Fig. 2 shows that testing a meta-model with tasks whose bias differs from that of its training tasks confuses the uncertainty values.
Second, we reversed the labels of the negatively biased tasks so that all tasks were positively biased. The right plot in Fig. 2 shows the results. We can see that making all the tasks one-sided helps accuracy improve as uncertainty decreases. However, task 23 (T23) still has low overall accuracy. We suspect this is because task 23 is the side-effect category "pregnancy, puerperium & perinatal conditions", a significantly different categorization from most of the other tasks, which map side effects to system organ classes.

Additional Empirical Studies
Structure of captafol (left) and oxymetholone (right), the compounds whose MPP predictions are evaluated in Table 5 of the main text and in Tables 5 and 6 of this supplement.
Table 5 shows the prediction results for oxymetholone in the SIDER dataset. The true label was negative for all six test tasks. The MLP-based model predicted incorrectly with very high class probability, while our EM3P2 using EMLP predicted 'I don't know (?)'.
Table 6 shows the prediction results for Compound CID:659783 in the MUV dataset. The true label was positive for all three test tasks. Again, the MLP-based model predicted incorrectly with very high class probability, while our EM3P2 using EMLP predicted 'I don't know (?)'.
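The 'I don't know (?)' behavior above is a rejection rule: a prediction is abstained from when its vacuity is too high. A minimal sketch, assuming a vacuity threshold (the 0.5 value here is illustrative, not the paper's setting):

```python
import numpy as np

def predict_or_reject(evidence, threshold=0.5):
    """Return the predicted class, or None ('I don't know') when vacuity is too high.

    evidence: non-negative per-class Dirichlet evidence; vacuity = K / (sum(e) + K).
    """
    evidence = np.asarray(evidence, dtype=float)
    K = evidence.size
    vacuity = K / (evidence.sum() + K)
    if vacuity > threshold:
        return None                  # abstain: not enough evidence for either class
    return int(np.argmax(evidence))

# Confident case: evidence [9, 1] -> vacuity 1/6 -> predict class 0
# Uncertain case: evidence [0.5, 0.5] -> vacuity 2/3 -> abstain
```

Sweeping the threshold trades coverage against accuracy, which is how uncertain predictions can be rejected in applications requiring higher confidence.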

Table 1 .
Data and Task Description of Tox21 dataset.

Table 2 .
Task Description of SIDER Dataset.

Table 3 .
Data Description of MUV Dataset.

Table 4 .
Ablation studies. Accuracy (ACC) and expected calibration error (ECE) are computed for our EM3P2 variants. QB is query balancing, BR is the belief regularizer, and AvUC is the accuracy-versus-uncertainty curve regularizer.