Comparison of vocabularies, representations and ranking algorithms for gene prioritization by text mining.

MOTIVATION
Computational gene prioritization methods help to identify susceptibility genes that may be involved in genetic disease. Recently, text mining techniques have been applied to extract prior knowledge from text-based genomic information sources, and this knowledge can be used to improve the prioritization process. However, the effect of different vocabularies, representations and ranking algorithms on text mining for gene prioritization has not yet been studied systematically and comparatively. This article therefore presents a benchmark study of the vocabularies, representations and ranking algorithms used in gene prioritization by text mining.


RESULTS
We investigated 5 different domain vocabularies, 2 text representation schemes and 27 linear ranking algorithms for disease gene prioritization by text mining. We indexed 288 177 MEDLINE titles and abstracts with the TXTGate text profiling system and adapted the benchmark dataset of the Endeavour gene prioritization system, which consists of 618 disease-causing genes. Textual gene profiles were created, and their performance for prioritization was evaluated and discussed in a comparative manner. The results show that the inverse document frequency-based representation of gene term vectors performs better than the term-frequency inverse document-frequency representation. The eVOC and MeSH domain vocabularies perform better than Gene Ontology (GO), Online Mendelian Inheritance in Man (OMIM) and the London Dysmorphology Database (LDDB). The ranking algorithms based on 1-SVM, standard correlation and the Ward linkage method provide the best performance.


AVAILABILITY
The MATLAB code of the algorithms and the benchmark datasets are available on request.


SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.


INTRODUCTION
Genome-wide experimental methods to identify disease-causing genes, such as linkage analysis and association studies, often produce large sets of candidate genes, and the low-throughput validation of these candidates is time-consuming and expensive (Risch, 2000). Computational prioritization methods can rank the candidate disease genes in these sets according to their likelihood of being involved in a certain disease. Moreover, a systematic gene prioritization approach that integrates multiple genomic datasets provides a comprehensive in silico analysis on the basis of multiple sources of existing knowledge. Several computational gene prioritization applications have been previously described.

Previous approaches
Freudenberg and Propping prioritize disease-relevant human genes by measuring similarities among GO annotations and validate the results in the OMIM database (Freudenberg and Propping, 2002). GeneSeeker (Van Driel et al., 2005) provides a web interface that filters candidate disease genes on the basis of cytogenetic location, phenotypes and expression patterns. DGP (disease gene prediction) (Lopez-Bigas and Ouzounis, 2004) assigns probabilities to genes based on sequence properties that indicate how closely they match the patterns of pathogenic mutations in certain monogenic hereditary diseases. PROSPECTR (Adie et al., 2005) also classifies disease genes by sequence information but uses a decision tree model. SUSPECTS (Adie et al., 2006) integrates the results of PROSPECTR with annotation data from Gene Ontology (GO), InterPro and expression libraries to rank genes according to the likelihood that they are involved in a particular disorder. G2D (candidate genes to inherited diseases) (Perez-Iratxeta et al., 2005) scores all concepts in GO according to their relevance to each disease via text mining. Candidate genes are then scored through a BLASTX search on the reference sequence. POCUS (Turner et al., 2003) exploits the tendency for genes involved in the same disease to share identifiable similarities, such as shared GO annotations, shared InterPro domains or a similar expression profile. eVOC annotation (Tiffin et al., 2005) is a text mining approach that performs candidate gene selection using the eVOC ontology as a controlled vocabulary. It first associates eVOC terms and disease names according to their co-occurrence in MEDLINE abstracts, and then ranks the identified terms and selects the genes annotated with the top-ranking terms. Franke et al. (2006) developed a functional human genetic network that integrates information from KEGG, BIND, Reactome, the Human Protein Reference Database, GO, predicted protein interactions, human yeast two-hybrid interactions and microarray coexpression. Gene prioritization is performed by assessing whether genes are close together within the connected gene network. Endeavour (Aerts et al., 2006) takes a machine learning approach: it builds a model on a training set and then uses that model to rank the test set of candidate genes according to their similarity to the model. The similarity is computed as the correlation for vector space data and as the BLAST score for sequence data. Endeavour incorporates multiple genomic data sources (microarray, InterPro, BIND, sequence, GO annotation, Motif, KEGG, EST and text mining) and builds a model on each source, yielding individual prioritization results. Finally, these results are combined through order statistics into a final score that indicates how related a candidate gene is to the training genes on the basis of information from multiple knowledge sources. More recently, CAESAR (Gaulton et al., 2007) has been developed as a text mining-based gene prioritization tool for complex traits. CAESAR ranks genes by comparing the standard correlation of term-frequency vectors (TF profiles) of annotated terms in different ontological descriptions and integrates multiple ranking results by arithmetical (min, max and average) and parametric integrations.

Gene prioritization in imbalanced datasets
The performance of the training-testing approach to gene prioritization can be evaluated by checking the positions of the truly relevant genes in the ranking of a test set. A perfect prioritization should rank the gene with the strongest causal link to the biomedical concept, represented by the training set, at the highest position (at the top). The interval between the actual position of that gene and the top is regarded as the error. For a prioritization model, minimizing this error is equivalent to improving the ranking position of the most relevant gene, which in turn reduces the number of irrelevant genes to be investigated in biological experimental validation. A model with a smaller error is therefore more efficient and accurate in finding disease-relevant genes, and the error is also used as a performance indicator for model comparison.
A potential problem for this training-testing approach is that ranking candidate genes in the whole genome is a class-imbalanced problem, because the majority of genes are not related to the biomedical concept represented by the training set. In a class-imbalanced dataset, standard discriminant algorithms are often biased towards the majority class. Hence, they are more likely to cause a high false positive rate when the majority is labeled as negative samples. For this imbalance problem, a strategy of one-class classification is often proposed to reduce the error rate on the majority class (Estabrooks et al., 2004; Tax, 2002). The problem of one-class classification can be easily transformed into one-class prioritization as an information retrieval problem, since classification is often based on ranking the distances to the density of the class samples. A simple one-class prioritization model ranks the candidate genes by their distances to the center of the training genes, which is equivalent to the similarity value obtained by standard correlation on data with the same norm. A more complex model looks for a small coherent subset of genes, which can be achieved by finding a small-radius ball that covers as many training genes as possible (Tax and Duin, 1999). The genes lying within the ball are more likely to be relevant than those lying outside. Thus, prioritization is performed by ranking the distance of candidate genes to the center of the ball. In a similar formulation, the one-class Support Vector Machine (De Bie et al., 2007; Scholkopf et al., 2001) is applied to separate most of the training genes from the origin using a hyperplane, and prioritization can be achieved by ranking the distance to the hyperplane. The prioritization model can also be extended by clustering methods and varied by using different clustering criteria and distance measures. Most of these formulations are similar in that they assign a convex score function on the basis of Euclidean distance. The global minimum of this score function is at the center of the training samples (or the ball), and the score increases linearly towards the outside. If the number of training genes is large, the score function can be further regularized by penalizing outliers among the training genes. After regularization, some outliers in the training set are regarded as irrelevant samples. Hence, a ball with a smaller radius is obtained, which might improve the precision of prioritization. In this article, we regard gene prioritization as an imbalanced learning problem, employ several one-class prioritization algorithms and compare their performance.
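The simplest one-class model described in this section, ranking candidates by their distance to the center of the training genes, can be illustrated with a minimal sketch. The gene names, the three-term toy profiles and the helper functions are hypothetical and are not taken from the MATLAB implementation used in this study.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prioritize_by_centroid(training, candidates):
    """Return candidate names sorted by increasing distance to the training centroid."""
    center = centroid([normalize(v) for v in training])
    scores = {name: euclidean(normalize(vec), center)
              for name, vec in candidates.items()}
    return sorted(scores, key=scores.get)

# Toy profiles over three hypothetical vocabulary terms.
training_genes = [[1.0, 0.9, 0.0], [0.8, 1.0, 0.1]]
candidate_genes = {
    "geneA": [0.9, 1.0, 0.0],  # resembles the training genes
    "geneB": [0.0, 0.1, 1.0],  # dominated by an unrelated term
    "geneC": [0.5, 0.5, 0.5],
}
ranking = prioritize_by_centroid(training_genes, candidate_genes)
```

In this toy example the candidate that resembles the training genes is ranked first and the candidate dominated by an unrelated term is ranked last.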

Gene prioritization in high dimensional datasets
Current genomic datasets are usually high dimensional. It is well known that high-dimensional data is a double-edged sword for statistical analysis (Donoho, 2000). For the task of gene prioritization, the high dimensionality of the dataset has two consequences. First, discriminating relevant genes from irrelevant ones is more likely to be a linear problem, because it is often easier to find a separating hyperplane in a higher dimension. Second, processing high-dimensional data with parametric methods is difficult, because these methods require an appropriate ratio of samples to variables. Moreover, the complexity of estimation, optimization and integration of these methods grows exponentially with the dimension. The second problem is also known as the curse of dimensionality (Bellman, 1961). For these reasons, in this article we focus on several non-parametric ranking methods for high-dimensional data.

Approach and motivation
We adopted a high-dimensional benchmarking dataset generated by the biomedical literature mining system TXTGate. TXTGate indexes titles and abstracts of MEDLINE with different vocabularies and weighting schemes. The documents × terms matrix is then transformed into a genes × terms matrix according to the curated gene-to-document mapping in EntrezGene. These gene-by-term vectors, denoted as textual profiles, represent existing expert knowledge about genes from free text and have been successfully applied in text-based gene clustering and gene prioritization (Aerts et al., 2006) applications. We could also use other non-textual profiles, such as microarray data. In Endeavour (Aerts et al., 2006), the similarity of genes is measured by standard correlation, and the prioritization performance on textual gene profiles is higher than for other data sources [Supplementary Fig. 1 of (Aerts et al., 2006)]. This is partly because results on textual profiles are biased towards existing knowledge, since the evaluation of prioritization is obtained by benchmarking disease-related genes that are already known. On the other hand, the low performance on some other datasets might be caused by several factors, for example, the pre-processing methods applied to the original data or the influence of normalization methods, so these datasets are not suitable as benchmark datasets for our problem. In text mining approaches, the effect of different vocabularies and representations is still an open question, and they have mostly been selected empirically in previous approaches. The importance of text mining in gene prioritization makes its optimization an important issue. In this article, we will focus on these implied problems: (1)

Textual profiles of genes
We created 10 groups of textual gene profiles on the text mining platform TXTGate. Various literature indices were created based on the title and abstract text of MEDLINE publications and the linked MEDLINE information presented in EntrezGene. Five vocabularies (Table 1) derived from public resources act as perspectives on the textual information with different levels of detail. The first vocabulary is derived from GO. The names of all GO terms are retrieved from the online repository and then processed by several kinds of filters. Through these filters, the terms are stemmed (Porter, 1980) and the stopwords and punctuation are removed. After this treatment, we obtained a GO domain vocabulary of 23 857 terms.
The second vocabulary is based on the Medical Subject Headings (MeSH), the National Library of Medicine's controlled vocabulary thesaurus. After the same pre-processing procedures as for the GO vocabulary, we obtained 30 136 terms for MeSH vocabulary.
The third vocabulary is retrieved from the Morbid Map of Online Mendelian Inheritance in Man (OMIM), which lists the cytogenetic locations of all disease genes present in OMIM and their associated diseases, and consists of 5576 terms.
The fourth vocabulary is based on the London Dysmorphology Database (LDDB), which contains information on dysmorphic and neurogenetic syndromes. We extracted dysmorphology concepts as vocabulary terms and 935 terms were obtained after pre-processing.
The fifth domain vocabulary is drawn from eVOC, an ontology consisting of four orthogonal controlled vocabularies (anatomical system, cell type, pathology and developmental stage) subsuming the domain of human gene expression data. After filtering, we obtained 1788 eVOC terms.
Among these vocabularies, four are also used in the TXTGate system. Using these controlled vocabularies, we indexed 288 177 MEDLINE titles and abstracts with reference to the mapping of EntrezGene. The terms from the domain vocabulary are regarded as a bag-of-words; hence, the indexed documents are represented as vectors in the space spanned by these terms. Based on the gene-to-document mappings in EntrezGene, the multiple documents linked to the same gene were combined into a single averaged gene profile, and all gene profiles were normalized to obtain gene vectors of unit norm. For each domain vocabulary we investigated two representation schemes to calculate the value of the terms in the vectors: inverse document frequency (IDF) and term-frequency × inverse document frequency (TFIDF). Apart from these, we also implemented a binary scheme as the simplest baseline representation. However, the performance of the binary scheme is not comparable with that of the IDF and TFIDF schemes, so it is not presented in this article. After combining the different vocabularies and representations, we obtained 10 groups of textual profiles. An overview of the size and overlapping terms of the vocabularies after indexing is presented in Table 1. In Table 2, some of the highest ranking and lowest ranking terms are listed as examples. To compare the effect of vocabularies in text-based gene prioritization, we also created a group of special profiles that uses no controlled vocabulary in the text mining procedure, denoted as the no-voc profile. When no vocabulary is used, all the terms appearing at least once in the MEDLINE titles and abstracts referenced in EntrezGene are regarded as useful annotations for text mining. The conceptual overview of obtaining textual gene profiles and the formulations for computing the IDF and TFIDF representations are available in the Supplementary Material. The details of profiling genes using textual information are presented in the TXTGate paper.
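The construction of a gene profile from its linked documents can be sketched as follows. The exact IDF and TFIDF formulations used in this study are given in the Supplementary Material; the sketch assumes the common definitions idf(t) = log(N / df(t)) and tfidf(t) = count(t) × idf(t), and the corpus, vocabulary and function name are hypothetical.

```python
import math
from collections import Counter

def gene_profile(gene_docs, corpus, vocabulary, scheme="idf"):
    """Average the term vectors of a gene's documents and scale to unit norm.

    Assumed weights (not taken from the paper): idf(t) = log(N / df(t));
    an "idf" entry is idf(t) when the term occurs in the document,
    a "tfidf" entry is count(t) * idf(t).
    """
    terms = sorted(vocabulary)
    n_docs = len(corpus)
    # Document frequency of every vocabulary term over the whole corpus.
    df = Counter(t for doc in corpus for t in set(doc) if t in vocabulary)
    idf = {t: math.log(n_docs / df[t]) if df[t] else 0.0 for t in terms}

    profile = [0.0] * len(terms)
    for doc in gene_docs:
        counts = Counter(doc)
        for j, t in enumerate(terms):
            if counts[t]:
                tf = counts[t] if scheme == "tfidf" else 1
                profile[j] += tf * idf[t] / len(gene_docs)

    norm = math.sqrt(sum(x * x for x in profile))
    if norm > 0:
        profile = [x / norm for x in profile]
    return dict(zip(terms, profile))

# Toy corpus of tokenized abstracts; "glioma" is rare, "brain" is common.
corpus = [
    ["brain", "brain", "brain", "glioma"],
    ["brain", "heart"],
    ["brain", "muscle"],
    ["heart", "muscle"],
]
vocabulary = {"brain", "glioma", "heart", "muscle"}
linked_docs = [corpus[0]]  # documents linked to one hypothetical gene

idf_profile = gene_profile(linked_docs, corpus, vocabulary, scheme="idf")
tfidf_profile = gene_profile(linked_docs, corpus, vocabulary, scheme="tfidf")
```

Under these assumed weights, the rare term receives a larger share of the unit-norm profile with IDF than with TFIDF, because multiplying by the term frequency raises the weight of the frequent but common term.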

Benchmark dataset of disease relevant genes
We used the benchmark dataset of Endeavour (Aerts et al., 2006), which consists of 618 relevant genes from 29 diseases. The genes of each disease were assembled into a disease-specific training set used to benchmark the prioritization performance. The names of the diseases and the number of genes related to each disease are shown in Table 1 of the Supplementary Material.

Prioritization algorithms
We implemented 27 models of non-parametric prioritization algorithms, categorized into three types: regularized one-class Support Vector Machines, the k-nearest neighbor method and clustering methods, which are implemented as K-means clustering and hierarchical clustering.

One-class SVM
The one-class SVM method, suggested by Scholkopf et al. (2001), extends the binary SVM classification scheme to one-class learning by mapping the training data, which contain just one class, into a high-dimensional Hilbert space via a kernel function. The algorithm iteratively finds the maximal margin hyperplane that best separates the training data from the origin. In the present article, we only use linear kernels because the dimensionality of the data is very high. For the prioritization task, the decision function of the one-class SVM in (Scholkopf et al., 2001) is extended to a prioritization function by dropping the sign function and the constant value ρ obtained from the one-class optimization.
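For reference, this step can be written out with the standard one-class SVM decision function. The symbols α_i (dual coefficients), x_i (training gene profiles), k (kernel) and s (prioritization score) are notation assumed here for illustration and may differ from the notation of the original formulation.

```latex
% Decision function of the one-class SVM (Scholkopf et al., 2001)
f(x) = \operatorname{sgn}\Big( \sum_{i} \alpha_i \, k(x_i, x) - \rho \Big)

% Prioritization score: drop the sign function and the offset \rho
s(x) = \sum_{i} \alpha_i \, k(x_i, x)

% With a linear kernel k(x_i, x) = x_i^{\top} x, the score is a linear function
s(x) = w^{\top} x, \qquad w = \sum_{i} \alpha_i \, x_i
```

Because ρ is the same constant for every candidate gene, dropping it does not change the order of the candidates; dropping the sign function is what turns the binary decision into a continuous score that can be ranked.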

k-nearest neighbor
The nearest neighbor methods used in this article were proposed by Tax (2002). We tried three different k values (k = 1, 2, 3). When k ≥ 2, three varieties of the nearest neighbor algorithm were implemented, denoted as κ, δ and γ, which differ in how the distances from the test data to the k nearest neighbors are averaged.

K-means clustering
K-means clustering partitions the training genes into K clusters by minimizing the sum of squared distances between each training gene and the centroid of its cluster. The prioritization is achieved by ranking the distance of the test gene to the centroid(s). In this article we tried three different K values (K = 1, 2, 3). Notice that when K = 1 and all data have the same norm, the K-means algorithm is equivalent to the standard correlation (Pearson correlation) method, which directly measures the angular separation, around the origin, between the candidate gene and the averaged vector of the training genes. If the data are grouped into more than one cluster, the maximum, minimum or average distance of a test gene to the multiple centroids can be selected as the prioritization score.
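The equivalence for K = 1 follows from a short expansion. The notation is assumed here for illustration: x is a candidate gene profile with unit norm, c is the centroid of the training genes and θ is the angle between x and c.

```latex
\| x - c \|^2 = \| x \|^2 - 2\, x^{\top} c + \| c \|^2
              = 1 + \| c \|^2 - 2\, \| c \| \cos\theta
```

Since the norm of c is identical for all candidates, ranking the candidates by increasing distance to the centroid gives the same order as ranking them by decreasing cos θ, that is, by decreasing angular similarity to the averaged training vector.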

Leave one out (LOO) validation
The performance of the algorithms was evaluated by LOO prioritization. In each experimental test on a disease gene set containing K genes, one gene, termed the 'defector' gene, was deleted from the set of training genes and added to M randomly selected test genes, which together form the test set. We used the remaining K − 1 genes, denoted as the training set, to train our prioritization model. Then, we prioritized the test set, which contains M + 1 genes, with the trained model and determined the ranking of the defector gene in the test data. The prioritization performance was evaluated by the error between the perfect ranking and the combined ranking positions of all defector genes in the disease set. In this error measure, r_i is the ranking position of the i-th gene in the disease set, K is the number of genes in the disease set and M/(M − 1) is a normalization term that makes the perfect ranking equal to 1 and thus leads to an Error of 0. In order to benchmark the algorithms on a class-imbalanced dataset, we set the number of random genes M to 9999.
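The LOO procedure can be sketched as follows. The error equation itself is not reproduced in this text, so the sketch assumes one form that agrees with the description of the normalization term, Error = 1 − M/(M − 1) · (1/K) Σ (1 − r_i/M); this form, the centroid-distance scoring function and the toy data are assumptions, not the benchmark implementation.

```python
import math
import random

def centroid_distance(training, vec):
    """Score a gene by its Euclidean distance to the mean of the training vectors."""
    center = [sum(col) / len(training) for col in zip(*training)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, center)))

def loo_error(disease, background, score_fn, m, seed=0):
    """Leave-one-out prioritization on one disease set.

    disease and background map gene names to profiles (names assumed disjoint).
    Returns (error, ranks) with one rank per defector gene; rank 1 is the top.
    """
    rng = random.Random(seed)
    genes = list(disease)
    ranks = []
    for defector in genes:
        training = [disease[g] for g in genes if g != defector]
        # The defector gene is mixed with m randomly selected genes.
        test = {name: background[name] for name in rng.sample(sorted(background), m)}
        test[defector] = disease[defector]
        scores = {name: score_fn(training, vec) for name, vec in test.items()}
        ordered = sorted(scores, key=scores.get)  # smaller distance ranks higher
        ranks.append(ordered.index(defector) + 1)
    k = len(genes)
    # Assumed error form: equals 0 when every defector gene is ranked first.
    error = 1.0 - (m / (m - 1)) * sum(1.0 - r / m for r in ranks) / k
    return error, ranks

# Toy data: disease genes cluster near (1, 0), background genes near (0, 1).
disease_set = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.95, 0.05]}
background_set = {f"b{i}": [0.01 * i, 1.0] for i in range(8)}
error, ranks = loo_error(disease_set, background_set, centroid_distance, m=5)
```

In the benchmark described above, M is 9999 rather than 5, and the scoring function is replaced by each of the 27 ranking models in turn.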

Similarity of prioritization
We used Spearman's rank correlation to compare the ranking order of two prioritization results P_1 and P_2 obtained on the same n genes, $\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$, where d_i is the difference between the rankings in P_1 and P_2 of the i-th gene. For each disease set, we randomly selected 99 genes and calculated a Spearman correlation matrix each time a defector gene was left out. Then, we averaged the Spearman matrices over all the genes in one disease set. Finally, the 29 Spearman matrices of all disease sets were averaged, and the final matrix was used to compare the similarity of the ranking results of all algorithms.
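As a small illustration, Spearman's rank correlation for two rankings without ties is 1 − 6 Σ d_i² / (n(n² − 1)), which can be computed as in the sketch below. The gene names and rankings are hypothetical, and ties are assumed to be absent.

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation of two rankings over the same genes.

    rank_a and rank_b map gene names to rank positions (1 = top).
    Uses the closed form 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid without ties.
    """
    n = len(rank_a)
    d_squared = sum((rank_a[g] - rank_b[g]) ** 2 for g in rank_a)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

# Rankings of four hypothetical genes produced by two algorithms.
p1 = {"g1": 1, "g2": 2, "g3": 3, "g4": 4}
p2 = {"g1": 1, "g2": 3, "g3": 2, "g4": 4}   # two neighbouring genes swapped
p3 = {"g1": 4, "g2": 3, "g3": 2, "g4": 1}   # completely reversed order

rho_similar = spearman(p1, p2)    # 0.8
rho_reversed = spearman(p1, p3)   # -1.0
```

Computing this value for every pair of algorithms gives one correlation matrix per left-out gene, which is then averaged as described above.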

RESULTS AND DISCUSSION
We compared the performance of the prioritization algorithms and textual gene profiles by LOO cross-validation on 9999 random genes. Some significant results are shown in Figures 2 and 3. The complete table of the overall benchmark results is shown in the Supplementary Material (Table 2). The performance obtained on IDF profiles is significantly better than that on TFIDF profiles. When IDF profiles are used, the eVOC and MeSH domain vocabularies are significantly better than GO, LDDB and OMIM. In general, the errors of the ranking algorithms based on 1-SVM, standard correlation and Ward average linkage are smaller than those of the other algorithms.

Representation schemes: IDF performs better than TFIDF
The comparison of errors between the textual representation schemes shows that IDF is generally better than TFIDF in text mining-based gene prioritization. In Figure 1, we compared the errors of the two representation schemes on all domain vocabularies and the three best ranking algorithms. The minimal error obtained with an IDF profile is (eVOC, 1-SVM, 0.0477), while the minimal error with a TFIDF profile is (GO, 1-SVM, 0.0916), which means that the error of the best IDF profile is almost 50% lower than that of the best TFIDF profile. Moreover, when the same domain vocabulary and the same ranking algorithm are used, the error with IDF is always smaller than with TFIDF. This is mainly because some rare terms play an important role in distinguishing the term vectors of genes from disease to disease; with the IDF representation, these rare discriminative terms receive large values and dominate the prioritization results. In contrast, TFIDF balances the effects of IDF and TF by multiplying them, which in fact weakens the discriminative effect for gene prioritization.

Domain vocabularies: eVOC and MeSH perform better than LDDB, GO and OMIM
When the same algorithm and representation are applied, the errors obtained on the eVOC and MeSH vocabularies are much smaller than those on the other vocabularies. For example, using 1-SVM and IDF, the errors on eVOC (0.0477) and MeSH (0.0497) are much smaller than those on LDDB (0.0877), GO (0.0758) and OMIM (0.0714). The same holds for the other algorithms as well (see Supplementary Material Table 1). This result is interesting, since the size of the MeSH vocabulary is almost 10 times larger than that of eVOC. The actual reason why they outperform the others requires further investigation. According to our experimental results obtained from random vocabularies, the size of a random vocabulary directly determines the error of the prioritization result (Fig. 2): the larger the random vocabulary, the smaller the error in prioritization. However, the size of a domain vocabulary does not affect the performance directly; it is the semantic content of the vocabulary that matters. This also raises an open question about the existence of an optimal vocabulary for the problem of gene prioritization. This topic is important, but it is beyond the scope of this article.

Prioritization algorithms
At the beginning of the article, we proposed the strategy of one-class prioritization and discussed the effect of regularization with respect to the issue of class imbalance. According to the benchmark results of 27 different linear non-parametric ranking algorithms, 1-SVM, standard correlation and Ward average linkage are the three best algorithms. These three ranking algorithms are similar in the sense that their ranking scores are almost equal to the distance to the density center of the training genes. In standard correlation, the ranking score is equal to the distance from the candidate gene to the center of all training genes. In 1-SVM, the score is the distance to the center of the ball that covers the training genes after regularization. During regularization, some training genes that are far from the original center are removed and the new center is recalculated. The Ward linkage method is a well-known agglomerative hierarchical clustering method and has been reported to give good results in many information retrieval and pattern detection applications. In the implementation of Ward average linkage in this article, the number of clusters is set to 2 and the average distance to the 2 Ward linkage clustering centroids is used as the ranking score.

Clustering of prioritization algorithms
We used the Spearman correlation to measure the similarity of the gene prioritization results obtained by two different algorithms. Similar to LOO cross-validation, in each disease benchmark set, a 'defector' gene was left out and mixed with 99 random genes. To compare the results of different algorithms, the random gene list was kept identical when the same gene was left out. The average correlation over each disease benchmark set was then computed, and the final average correlation over all 29 disease sets was regarded as the correlation of the prioritization algorithms. Based on the pairwise Spearman correlation matrix of all 27 prioritization models presented in the Supplementary Material (Table 3), we clustered these 27 models in a dendrogram by complete linkage (Fig. 3). Standard correlation is highly similar to Ward average linkage in ranking (Spearman correlation = 0.9915).

1-SVM is similar to several minimal distance methods. Nearest neighbor methods and maximal distance methods are quite different from the aforementioned methods.

Fig. 2. Comparison of prioritization performance using text profiles based on random vocabularies, domain vocabularies and no vocabulary. The horizontal line is the error of prioritization obtained by the no-voc profile, which contains 259 815 terms resulting from the text mining process without using any vocabulary. Based on this no-voc profile, we randomly selected several subsets of terms and created five random-voc profiles as comparison sets for the domain vocabulary profiles. The performance obtained by each domain vocabulary profile is compared with the random-voc profile that has the same number of terms. On the X-axis, the profiles are sorted from smaller to larger size. The performance of the random-voc profiles increases monotonically with the vocabulary size. In contrast, the performance of a domain vocabulary profile does not depend solely on the size of the vocabulary but is mainly determined by its semantic content.

Selecting the best configuration in text mining-based gene prioritization
From now on, for conciseness, we use the term configuration to denote the triplet of domain vocabulary, representation scheme and ranking algorithm. On the basis of the experimental results and the previous discussion, the configuration has a strong impact on the quality of the prioritization model. According to the results of the full benchmark experiments, an improperly selected configuration can lead to a large prioritization error (no-voc, single max, TFIDF, 0.3757), which is more than 7 times larger than the error of a carefully selected configuration (eVOC, 1-SVM, IDF, 0.0477). If the prioritization result is used as the reference list in biological validation, the efficiency gained from a good configuration will be remarkable.

Results of profile integration
According to the results on the domain vocabulary-based profiles, we picked the best two IDF profiles (eVOC and MeSH), the best two TFIDF profiles (GO and MeSH) and the best of each (eVOC-IDF and MeSH-TFIDF) and integrated them by three integration functions (min, max and average). Although integrating text profiles gives some consistent improvements, they are too small to be relevant, so we do not discuss them in this article. The explanation of the integration methods and the results are available in the Supplementary Material (Table 4).
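The three integration functions can be illustrated with a minimal sketch. The integration methods actually used are explained in the Supplementary Material; the sketch assumes that the functions are applied to the rank positions that each profile assigns to the same gene, and the gene names and rankings are hypothetical.

```python
def integrate_rankings(rankings, how="average"):
    """Combine several rankings of the same genes into one ordered gene list.

    rankings is a list of dicts mapping gene name to rank position (1 = top).
    Assumption: min, max or average is applied to the per-profile rank positions.
    """
    combine = {
        "min": min,
        "max": max,
        "average": lambda positions: sum(positions) / len(positions),
    }[how]
    combined = {gene: combine([r[gene] for r in rankings]) for gene in rankings[0]}
    # Python's sort is stable, so tied genes keep the order of the first ranking.
    return sorted(combined, key=combined.get)

# Hypothetical rankings of three genes obtained from two text profiles.
rank_evoc_idf = {"g1": 1, "g2": 2, "g3": 3}
rank_mesh_idf = {"g1": 3, "g2": 1, "g3": 2}

by_average = integrate_rankings([rank_evoc_idf, rank_mesh_idf], how="average")
by_max = integrate_rankings([rank_evoc_idf, rank_mesh_idf], how="max")
by_min = integrate_rankings([rank_evoc_idf, rank_mesh_idf], how="min")
```

With only two profiles, min and max frequently produce ties, which is one reason to treat this sketch as an illustration of the idea rather than as the procedure used to obtain the reported results.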

CONCLUSION
In this article, we presented an approach for comparing different configurations to create and rank textual profiles for gene prioritization. By integrating the TXTGate text mining profiling system and the prioritization framework of the Endeavour system, we investigated 5 domain vocabularies, 2 text mining weighting schemes and 27 ranking algorithms (270 configurations). Our main conclusions are as follows. A controlled domain vocabulary provides an effective view for conducting text mining for gene prioritization, and the selection of the configuration has a significant impact on prioritization performance. For the representation of vector-based data, the IDF representation of terms causes less error than the TFIDF representation. The eVOC and MeSH domain vocabularies give smaller errors than the GO, OMIM and LDDB vocabularies. Among the 27 models we benchmarked, 1-SVM, standard correlation and the Ward linkage method are the better candidates for the ranking algorithm. In short, the selection of the configuration is an important factor in the quality of a disease-oriented prioritization model based on text mining.