Variation benchmark datasets: update, criteria, quality and applications

Abstract Development of new computational methods and testing their performance has to be carried out using experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcome are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets used mainly for variation interpretation. They have been used for training and benchmarking predictors for various types of variations and their effects. VariBench was updated with 419 new datasets from 109 papers containing altogether 329 014 152 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding region, structure mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property for aggregation, binding free energy, disorder and stability. Then there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level including relevant cross references and variant descriptions. The updated VariBench facilitates development and testing of new methods and comparison of obtained performances to previously published methods. We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies, and show that such comparisons are feasible and useful, however, there may be limitations due to lack of provided details and shared data. Database URL: http://structure.bmc.lu.se/VariBench


Introduction
Development and testing of computational methods are dependent on experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcome are needed. During the last few years, such datasets have been collected for a number of applications in the field of variation interpretation. VariBench (1) and VariSNP (2) are the two existing databases for variation benchmark datasets for variation interpretation. VariBench contains all kinds of datasets while VariSNP is a dedicated resource for variation sets from dbSNP database for short variations (3).
Benchmark datasets are used both for method training and testing. We can divide testing approaches into three categories ( Figure 1). The most reliable are systematic benchmark studies. Quite often the initial method performance assessment is done on somewhat limited test data or does not report all necessary measures. The third group includes studies for initial method and hypothesis testing typically with a limited amount of data. An example for this kind of testing is Critical Assessment of Genome Interpretation (CAGI, https://genomeinterpretation.org/), which has organized several challenges for method developers. These contests with blind data, when the participants do not know the true answer, have been important e.g. for testing new ideas and methods, as well for tackling novel application areas.
High-quality benchmark datasets are valuable and may be difficult, laborious and time consuming to generate. Already from the point of view of reasonable use of resources it is important to share such datasets. Secondly, comparison of method performance is reliable only when using the same test dataset. According to the FAIR principles (4), research data should be made findable, accessible, interoperable and reusable. VariBench and VariSNP provide variation data according to these principles and include relevant metadata.
It is still quite common that authors collect and use extensive datasets for their published papers, but do not share and make the datasets available. This practice prevents others from comparing additional tools to those used in the paper. Even when the data are made available, it may be in a format that makes reuse practically impossible. An example is the datasets used for testing the MutationTaster2 tolerance predictor (5). They were published as figures and at very low resolution. Now, these datasets are available in VariBench.

Criteria for benchmarks
We defined criteria for a benchmark when the VariBench database was first published (1). These criteria were more extensive than previously used and have been found very useful and still form the basis for inclusion of data and for their representation in VariBench. The criteria are as follows.
Relevance. The dataset has to capture the characteristics of the investigated property. Not all available data may be relevant for the phenomenon or may be only indirectly related to it. The collected cases have to be for the specific effect or mechanism under study.
Representativeness. The datasets should cover the event space as well as possible, thus preferably containing examples from all the regions relevant to the effect. The actual number of cases for achieving this coverage may vary widely depending on the effect. The dataset should be of sufficient size to allow statistical studies but may not need to include all known instances.
Non-redundancy. This means excluding overlapping cases within each dataset.
Experimentally verified cases. Method performance comparisons have to be based on experimental data, not on predictions, otherwise the comparison will be about the congruence of methods, not about their true performance.
Positive and negative cases. Comprehensive assessment has to be based both on positive (showing the investigated feature) and negative (not having effect) cases.
Scalability. It should be possible to test systems of different sizes.
Reusability. As datasets are expensive to generate, they should be shared in such a way that they can be used for other investigations. This may mean similar applications or usage in new areas.
Most of the criteria are rather easy to fulfill, but some others are more difficult to take into account. We recently investigated the representativeness of 24 tolerance datasets from VariBench in the human protein universe by analyzing the distribution and coverage of cases in chromosomes, protein structures, CATH domains and classes, Pfam families, Enzyme Commission (EC) categories and Gene Ontology annotations (6). The outcome was that none of the datasets were well representative. When correlating the training data representativeness to the performance of predictors based on them, no clear correlation was found. However, it is apparent that representative training data would allow training of methods that have good performance for cases distributed throughout the event space.
To test the relevance of the tolerance datasets, we investigated how many disease-causing variations could be found from neutral training data. A small number of such variants were found, 1.13-1.77% (6). These numbers are so small that they do not have a major impact on method performances. VariBench datasets are reusable and scalable, contain experimental cases and are typically non-redundant. However, how redundancy should be defined may depend on the application. For example, when using domain features in variant predictors, variants even in related domain members would be redundant.

Dataset quality
The quality of benchmark datasets is of utmost significance. This is naturally dependent on the quality of the data sources. There are not many quality schemes in this field. For locus-specific variation databases (LSDBs) there is a quality scheme that contains close to 50 criteria in four main areas including database quality, technical quality, accessibility and timeliness (23). However, these guidelines are not yet widely followed and similar criteria are missing for other types of variation data resources.
Systematics within datasets and databases can significantly improve their quality and usability. For variation data there are a number of systematics solutions available. These include systematic gene names available for human from the HUGO Gene Nomenclature Committee (HGNC) (24), Human Genome Variation Society (HGVS) variation nomenclature (25), Locus Reference Genomic (LRG) and (26) RefSeq reference sequences (27), and Variation Ontology (VariO) variation type, effect and mechanism annotations (28).
Quality relates to numerous aspects in the datasets, the correctness of variation and gene/protein and disease information, relevance of references, etc. We recently selected cases from ProTherm (29) to build an unbiased dataset for the protein variant stability predictor PON-tstab (30). We were aware that the database had some problems, however, were surprised with the extent of problematic cases. While making the selection, we noticed numerous issues, such as cases of two-stage denaturation pathways where values for all the steps and then the total value were provided; there were errors in sequences, variants, recorded measuring temperatures, G values and their signs and units, and in indicated PDB structures; and so on. The uncorrected and wrong data have been used for development of tens of prediction methods. This is probably an extreme exception (ProTherm was taken away from the internet after our paper was published); however, this indicates that one has to be careful even when using popular data. When including datasets to VariBench we performed several quality controls, however, we also list datasets that may contain problems e.g. numerous ProTherm sub-selections that have been published and sometimes used in several papers. They have been included for comparative purposes.

How to test predictor performance
The use of a benchmark dataset is just one of the requirements for systematic method performance assessment. Proper measures are needed to find out the qualities of performance. Most of the currently available prediction methods are binary, distributing cases into two categories. There are guidelines for how to test and report method performance (31)(32)(33). There is also a checklist what to report when using such methods in publications.
Results for binary methods are presented in a contingency (also called for confusion) table out of which different measures can be calculated. The most important ones are the following six, which according to the guidelines (32) have to be provided for comprehensive assessments. Specificity, sensitivity, positive and negative predictive values (PPV and NPV) use half of the data in the matrix, while accuracy and Matthews correlation coefficient (MCC) use data from all the four data cells. Additional useful measures include area under curve when presenting Receiver Operating Characteristic curves and Overall Performance Measure. Good methods display a balanced performance and their values for measures differ only slightly.
In case there is an imbalance in the number of cases in the classes, it has to be mitigated (31). Several approaches are available for that. Cases used for testing method performance should not have been used for training them, otherwise there is circularity that overinflates performance measures (14). A scheme has been presented on how datasets should be split for training and testing as well as for blind testing (34). When there are more than two predicted classes additional measures are available (31,32). In addition to these measures, method assessment can contain other factors such as time required for predictions, as well as user friendliness and clarity of the service and results.
Datasets used for assessment have to be of sufficient size. There are a number of reasons for this requirement. Widely used machine learning methods are statistical by nature and require a relatively large number of cases for reliable testing. If we think the event space, in the case of human proteins, there are 380 different amino acid substitution types, 150 of which are more likely due to emerging because of a single nucleotide substitution within the coding region for a codon. These substitutions can appear in numerous different contexts, thus too small test datasets should be avoided. There are several performance assessments, especially for variants in a single protein or a small number of genes/proteins that do not have any statistical power. The smallest dataset we have seen contained just nine substitutions, based on which a detailed analysis was performed to recommend the best performing tools! Variation interpretation is often carried out in relation to human diseases. It is important to note that diseases are not binary states (benign/disease) instead there is a continuum and certain disease state can appear due to numerous different combinations of disease components, see the pathogenicity model (35). This aspect has not been taken into account in benchmark datasets apart from the training data for PON-PS (36) and clinical data for cystic fibrosis (37).

Variation datasets
We have collected from literature, websites and databases datasets, which have been used for training and benchmarking various types of variations and their effects ( Table 1). The new datasets come from 109 papers. There are 419 new separate datasets containing altogether 329 014 152 variants. One paper can contain more than one dataset. The number of unique variants is smaller as many of the datasets are different subsets of commonly used datasets such as ClinVar or ProTherm or VariBench itself. The total number is dominated by VariSNP cases. The original VariBench version contained 17 datasets from 10 articles representing five variation categories, thus the growth in the database size has been substantial.
VariBench datasets are freely available at http:// structure.bmc.lu.se/VariBench/ and can be downloaded separately. The website contains basic information about the datasets, their origin and for what purpose they were initially used for. There is also information about in how many genes, transcripts or proteins the variants appear. Datasets are categorized similar to Table 1 for easy access. The contents of the datasets vary depending on information in the original source. We have enriched many of them e.g. by mapping to reference sequences or PDB structures and some contain VariO annotations. Columns in the original sources irrelevant for VariBench were removed.
The available datasets have been categorized into 20 groups and subgroups as indicated in Figure 2. The figure shows also the relationships of the datasets in different categories. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at protein structural level, including relevant cross references and variant descriptions. VariBench utilizes and follows a number of standards and systematics including HGVS variation nomenclature, HGNC gene names (not in all databases due to mapping problems) and VariO annotations in some datasets.
Links are available to data in some external databases, including AmyLoad (38) and WALTZ-DB (39) for protein aggregation, DBASS3 and DBASS5 (40,41) for splicing variants, SKEMPI (42), cancer datasets in KinMutBase (43), Kin-Driver (44), dbCPM (45), DoCM (46), OncoKB (47) and tolerance predictor training set in DANN (48). The latter has a link due to its huge size, the others since    they are databases and as such easy to use directly and updated by third parties. We excluded datasets used in CAGI experiments, since they are available for registered participants only. LSDBs were excluded because data from these sources usually have to be manually selected before using as benchmark. Most of the time, they do not contain clear information for variant relevance to disease(s). Datasets for structural genomic variants were excluded, because they usually lack information about exact variation positions. Unfortunately, many papers, even those reporting on benchmarking, do not contain and share the data, which does not allow others to extend the analyses and reuse the datasets.

Variation type datasets
Variation types include insertions and deletions, coding and non-coding region substitutions, which are divided into training and test datasets, structure mapped variants, as well as synonymous, and benign variants. There are now data from four amino acid insertion effect predictors, mainly for short alterations. Only datasets added after the release of the first version of VariBench are discussed here. In Table 1 is shown how many datasets and publications in each category appeared in the first edition.
Training datasets have mainly been used for development of machine learning predictors, there are 17 new datasets. They typically also contain test sets. Six test datasets have been specifically designed for method performance assessments. These include a set for addressing circularity (14) and pathogenicity/tolerance method performance assessment (49). The American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) has published guidelines for variant interpretation (50). These include instructions for use of prediction methods. A dataset was obtained for addressing concordance of prediction methods (51). Another study addressed discordant cases (52). Protein sequences of even closely related organisms contain differences and some of these are compensated variants where a disease-related variant in human is normal in another organism due to additional alteration(s) at other site(s). A dataset has been collected for such variants (53). Unfortunately, only the benign variants were made available. Analysis of the dataset representativeness, how well the datasets represent the variation space, was investigated for 24 datasets in VariBench and VariSNP (6). These cases were mapped to reference sequence and are now available in the database.
Variations are mapped into protein 3D structures in several datasets. Dedicated datasets contain those used for developing a method for predicting side-chain clashes because of residue substitutions (54), analysis of effects on structures and functions of substitutions (55) and investigation of variations in membrane proteins (19).
There are two datasets for synonymous variants as well as two for benign ones.

Effect-specific datasets
These datasets are for various types of effects. On DNA level there are eight sets for DNA regulatory elements, and on RNA level 14 datasets for splicing. Most of the splicing datasets are very small, but there are a few with substantially larger numbers. In the first version of VariBench, there were only protein stability datasets in this category, totally six datasets from four studies.
Many more sets are available for effects on protein level. Protein aggregation (two datasets), binding free energy (2), disorder (1), solubility (1) and stability are the currently available categories. Among protein stability datasets, there are 22 new datasets for single variants, almost all originating from ProTherm, and one dataset for double variants.

Molecule-specific datasets
There are in VariBench 18 specific datasets for certain molecules. There is a set of variants used to train PONmt-tRNA for substitutions affecting mitochondrial transfer RNA (tRNA) molecules (56). This is of special interest as there are 22 unique mitochondrial tRNAs that are implicated in a number of diseases.
Single amino acid substitutions were collected in 82 proteins to test whether there is a difference in performance for protein specific and generic predictors (12). All the datasets contain at least ∼100 variants. The results indicated vast differences in performances, the best generic predictors outperforming the specific predictors in most but not all cases.
The remaining datasets in this category are for variants in individual genes/proteins.

Disease-specific datasets
This category contains totally nine datasets, six of which are for cancer, one for long QT syndrome (62) and another for hypertrophic cardiomyopathy (63).
Although there are numerous studies of cancer variations, the functional verification of the relevance of those variants for the disease is usually missing. VariBench contains three datasets for variants in cancer, which have been experimentally tested (64)(65)(66), and links to three other sources, namely dbCPM (45), DoCM (46) and OncoKB (47). In addition, there is the FASMIC dataset for variants that are largely cancer related (67).

Phenotype dataset
One dataset contains information for disease phenotype, whether there is mild/moderate or severe disease due to substitutions. This dataset was used to train disease severity predictor called PON-PS (36).

Benchmark use case
VariBench datasets have mainly been used for prediction method development and testing. As the benchmark studies typically have not contained all the best performing tools, we compared the performance of the variant tolerance/pathogenicity predictor PON-P2, since this tool has been the best or among the best performing methods in a number of previous investigations (10,12,18,19,52). The setup was similar in all these studies to test the outcome of a spectrum of methods. We extended the published benchmark studies by repeating the original analyses with PON-P2. To avoid circularity, we first excluded from the datasets all cases that had been used for training PON-P2. The results  Table 2 and are reported according to the published guidelines (32) and including some additional measures.
The exercise indicated that reproducibility and reusability could not be achieved in a number of cases due to problems in reporting. We had to exclude some published benchmark studies. The dataset for pharmacogenetics variants (68) was too small for reliable estimation. The paper for compensated variants (53) did not share the disease-related variants, and thus could not be evaluated. Of the dataset used by Qian et al. (69) only 36 cases were not included to the PON-P2 training set, and therefore the benchmark had to be excluded because of too small size.
We were able to perform the analysis for six studies and we analyzed altogether 17 datasets. Full comparison was not possible in all cases as some details were not available. Therefore, we discuss and compare the performances based on the information in the original papers, but list all the details from our study in Table 2.
For MutationTaster2 the published test data has not been previously available due to being in a format that prevents reuse of the data. MutationTaster 2 was originally compared to five tools and versions (MutationTaster1, PolyPhen humdiv and humvar, PROVEAN and SIFT) (5). The accuracy and specificity are better for PON-P2 than the scores for the six tested tools and sensitivity is the second best. Only the measures given in the original article are discussed in here.
The study of circularity problems in variant testing was conducted on predictSNPSelected and SwissVarSelected datasets (14). The performance of PON-P2 is superior compared to the eight tested predictors (Muta-tionTaster2, PolyPhen, MutationAssessor, CADD, SIFT, LRT, FatHMM-U, FatHMM-W, Gerp++ and phyloP). In the test for predictSNPSelected dataset, NPV, PPV, sensitivity, accuracy and MCC are the best for PON-P2. Only for specificity, it is the second best predictor with a margin of 1%. In the data for SwissVarSelected, PON-P2 has the best score for PPV, accuracy and MCC.
It is the second best for NPV and specificity, by 1-2% margin to the best, and for sensitivity. On both datasets, PON-P2 showed the most balanced performances.
Twenty-five tools were tested according to ACMG/AMP guidelines using several datasets (51). The compared methods were REVEL, VEST3, MetaSVM, MetaLR, hEAt, Condel, MutPred, Mcap, Eigen, CADD, PolyPhen2, PROVEAN, SIFT, EA, MutationAssessor, MutationTaster, phyloP100way, FATHMM, DANN, LRT, SiPhy, phast-Const100way, GenoCanyon, GERP and Integrated_fitCons. Unfortunately, the results were not comprehensively reported. The paper contains data for AUC scores but they are presented as figures. The exact values were difficult to estimate, especially when results for 18 datasets were combined into single figures. In the end, we performed the test for eight of these datasets. In the ClinVar balanced data the AUC of PON-P2 is either shared first or second, and in VariBenchselected data it has the best performance. Comparison for the six other datasets is not as reliable, but we can summarize that the PON-P2 performance is among the best if not the best for all of these. It is unfortunate that exact numbers were not provided by the authors.
The performances of 23 methods (FATHMM, fit-Cons, LRT, MutationAssessor, MutationTaster, PlyPhen humdiv and humvar versions, PROVEAN, SIFT, VEST3, GERP++, phastCons, phyloP, SiPhy, CADD, DANN, Eigen, FATHMM-MKL, GenoCanyon, M-CAP, MetaLR, MetaSVM and REVEL) were tested on three datasets: ClinVar and two protein-specific sets for TP53 and PPARG (49). They had also a fourth set for autism spectrum diseases, but since there is no experimental evidence for the relation of these variations to the disease, that set was excluded. Although the study was well performed and described, it seems that the authors have not corrected for class imbalance. For the methods to be comparable the measures should be calculated based on the same data and have equal numbers of positive and negative cases. If that is not the case, the imbalance has to be mitigated with one of the available solutions. Some of the other benchmark studies may suffer from the same problem, but we are not sure due to incomplete descriptions of the studies. None of the tools can predict all possible variations and thus they have predictions for different numbers. Therefore we present the results both for non-normalized and normalized data. We believe that the former was used by the authors. In the case of ClinVar data, PON-P2 has better PPV, accuracy and MCC than the other methods tested in the paper.
In the case of TP53 data, the PON-P2 accuracy is second best when the data are not normalized; on other measures, PON-P2 is ranked the fourth or worse. All cancer variants, such as those in TP53, were excluded from the PON-P2 training data. This was done because the effects of variations in cancers usually have not been experimentally verified. A variant in TP53 is not 'pathogenic' alone, several variants in different proteins are needed for cancer.
All the predictors are known to have variable performance depending on the tested protein, see the study of protein-specific predictors (12); this study showed that PON-P2 had better performance for 85% of proteins, being the best of the five tested tools (PolyPhen-2, SIFT, PON-P2, MutationTaster2, CADD). PPARG seems to be another example for which PON-P2 has poor performance (49). An additional reason for poor performance may be that the PPARG data is not for pathogenicity, instead it is a 'function score' that is based on the distribution of FACS sorted cells (70). The same applies to the TP53 test data which is based on the protein function, not pathogenicity. Depending on a protein, the threshold for phenotype can be anything between 1% and 85% of the wild type activity (Vihinen, in preparation). We have previously tested PON-P2 in protein function prediction but with poor (71) or mixed (72) outcome. This is because the method has not been trained and intended for this task. These results indicate the importance of applying computational tools to their intended purpose or at least testing the performance carefully before applied to new tasks.
Another study tested the performance of 14 tools (SEQ + DYN, SEQ, DYN, MutationTaster2, PolyPhen2, MutationAssessor, CADD, SIFT, LRT, FATHMM-U, Gerp++, phyloP, Condel and Logit) in relation to structural dynamics, which was used as a proxy for functional significance of amino acid substitutions (73). PON-P2 has the best sensitivity, specificity, NPV and MMC, it is the second best for accuracy but only 13th for PPV. The explanation for the latter observation is that many of the tested tools are severely biased, having very high PPV but very low NPV, whereas the performance of PON-P2 was again balanced over all the measures.
The exercise indicated that it is possible to compare predictors to published results based on exactly the same datasets. The new performance results for PON-P2 are in line with several previously published studies that have indicated the method to be a top performer on different benchmarks (10,12,18,19,52). When choosing a method(s), one should look at consistent performance over several benchmarks. Full comparisons were not always possible because of incomplete performance assessments. Therefore, authors should meticulously describe all details and procedures in the data analysis as well as share the datasets used. Even if the data is taken from public sources, it is not possible for others to obtain exactly the same dataset as used in the papers even when applying the same selection criteria, as some important aspects seem always to be missing. In summary, it was possible to compare performances for methods not included into original studies. This is important in many ways and contributes toward increased reproducibility and comparability. Good datasets are difficult to obtain, therefore VariBench will serve as a hub for sharing these important data.