Abstract

Biomedical information retrieval systems are becoming increasingly popular and complex due to the massive amount of ever-growing biomedical literature. Users are often unable to construct a precise and accurate query that clearly represents their intended information need. Therefore, the query is expanded with terms or features that retrieve more relevant information. Selection of appropriate expansion terms plays a key role in improving the performance of the retrieval task. We propose document frequency chi-square (DFC), a modified version of chi-square, for term selection in pseudo relevance feedback. The effects of pre-processing on the performance of information retrieval, specifically in the biomedical domain, are also examined. On average, the proposed algorithm outperformed state-of-the-art term selection algorithms by 88% at pre-defined test points. Our experiments also conclude that stemming causes a decrease in the overall performance of pseudo relevance feedback based information retrieval systems, particularly in the biomedical domain.

Database URL: http://biodb.sdau.edu.cn/gan/

Introduction

Retrieving documents that match the user query is one of the foremost challenges in almost all information retrieval systems. Continuous growth of the literature aggravates the keyword mismatch problem between user queries and retrieved documents (1). Retrieving documents by measuring similarity between the user query and indexed documents is even more difficult in the biomedical domain, because genes, drugs and diseases may have numerous synonyms. For example, suppose a user inputs a query containing the keyword phrase ‘Medical Practitioner’ while the relevant documents in the corpus only contain words such as doctor or physician. All these terms convey the same information but are named differently, so a mismatch problem occurs and the documents that are more relevant to the query than others will not be retrieved. In order to tackle this problem, local and global query expansion (QE) is used. In global QE, knowledge sources and dictionaries (e.g. WordNet, PubMed) are used to generate candidate expansion terms (2).

In local QE, statistical information is used to find candidate expansion terms from the corpus. In this approach, documents are retrieved based on the user query and the top k retrieved documents are considered relevant. To select candidate expansion terms from the top retrieved documents, term selection techniques such as chi-square, information gain (IG), Kullback–Leibler divergence (KLD) and Dice are used. It has been observed that the volume of online available data has increased vividly while the number of query terms remains very small (3).

According to Lesk et al., the average query length used to be 2.30 words (4), and it remained the same even after 10 years (5). At present, there has been a rise in the trend of providing quite lengthy queries (five or more words), but the most common queries still contain only a couple of words (6). Therefore, the scope of QE has increased over time. QE can also decrease the performance of information retrieval. In global QE, candidate expansion terms extracted from dictionaries may cause a decrease in performance due to the word ambiguity problem. Consider a query like ‘Which bank provides more profit?’: to expand this query, we find synonyms of the query terms from dictionaries, but the word ‘bank’ can be used in two different senses, referring either to a financial institution or to a river bank. Therefore, in global QE, word sense disambiguation of query words is mandatory. The Lesk algorithm is used for word sense disambiguation (7).

In local QE, not all of the documents retrieved against a particular user query are relevant to it (8). This may lead to an imperfect and faulty terms pool (the pool of all terms present in the top retrieved documents) that may contain many redundant and irrelevant terms. Expanding the query with such terms may even drift the query toward retrieving irrelevant items (3). Hence, the idea behind selecting candidate expansion terms from the terms pool is to first remove these redundant or irrelevant terms. Term selection for QE allows only the most relevant terms for a particular user query to be selected. Therefore, term selection for QE is currently one of the hottest topics of research in the domain of information retrieval (9).

There are two major types of term selection methods for QE: (i) methods based on corpus statistics and (ii) methods based on term association. The choice between these methods depends on the document retrieval model, e.g. Okapi BM25, TF-IDF or language models (3). Selection methods based on term association evaluate the goodness of terms based on their co-occurrence in the feedback documents, whereas selection methods based on corpus statistics estimate the goodness of terms based on their distribution in the corpus. In the biomedical domain, it is still a huge challenge for researchers to develop a term selection method for QE that outperforms the available methods by a wide margin (10).

The most widely used term selection method, chi-square, suffers from a document misclassification problem: its ability to select the most effective and worthy terms for QE is affected by the defined threshold between the relevant and non-relevant class in pseudo relevance feedback. To tackle this problem, we propose a new technique, document frequency chi-square (DFC), and compare it with eight term selection algorithms, including two different versions of chi-square proposed by Carpineto et al. (11). Moreover, the effects of pre-processing on the performance of pseudo relevance feedback in the biomedical domain are also discussed. We used mean average precision (MAP) to evaluate the proposed algorithm on the TREC 2006 Genomics (12) dataset.

Related work

Efficient information retrieval systems are required to get relevant information for a particular user query from the rapidly growing biomedical literature (13). A major concern in information retrieval systems is the word mismatch problem, in which the same concept may be described using semantically similar but syntactically different terms in the query and the documents (14). For example, the user query may contain a phrase like ‘cure of depression’, but the corpus documents may use a different yet semantically similar phrase like ‘depression treatment’. Both refer to the same concept with different words. This problem can be addressed using two approaches: query paraphrasing and QE.

In the query paraphrasing approach, query words are replaced by their synonyms in order to generate query paraphrases. In the above example, ‘cure’ can be replaced by its synonym ‘treatment’ to generate the paraphrase ‘treatment of depression’. The generated paraphrases are then used to retrieve documents from the corpus. Zukerman et al. used WordNet (15) and part-of-speech information to find the synonyms for paraphrase generation. Their experiment revealed a reasonable improvement in retrieving relevant documents despite issues in part-of-speech (POS) tagging (16).

QE techniques can further be categorized into global and local techniques. In global QE, dictionaries and knowledge resources are used to find expansion terms (17). Chu et al. performed global QE by selecting candidate expansion terms using the knowledge resources of the UMLS Meta-Thesaurus and Semantic Networks. They showed a 33% performance improvement on 40 queries of the OHSUMED dataset by expanding these queries using domain-specific knowledge resources and document retrieval models (18). On the other hand, Stokes et al. (19) used various biomedical knowledge resources like GO, EntrezGene and ADAM to improve the overall performance of an information retrieval system. They also claimed that the performance of an information retrieval system (19) can be increased by focusing on two factors: the choice of a good document ranking algorithm, and the use of domain-specific knowledge resources.

One concern with global QE is that, due to the unstoppable progress of new discoveries and ongoing research, the available knowledge resources are in constant need of updates, yet it is difficult to update them rapidly. Therefore, researchers in the information retrieval community are focusing on improving systems using local QE. In this approach, user queries are provided to retrieval models (Okapi BM25, TF-IDF) which rank the corpus documents by measuring the similarity between queries and documents. The top k documents are labeled as relevant to the user's information need. These retrieved documents are used to generate a term pool containing all terms present in the relevant documents. Techniques like chi-square, IG, KLD and CoDice are used to select terms from the generated term pool. Jagendra et al. improved the performance of the local QE method by introducing an aggregation technique for term selection. They combined four term selection techniques [KLD, co-occurrence, Robertson selection value (RSV) and IG] using the Borda combination method: all the individual term selection methods are applied, ranked lists of candidate terms are obtained from each method, and these ranked lists are then used to select the final QE terms. The terms having the highest aggregation scores are chosen as the final expansion terms. Jagendra et al. illustrated that some of the expansion terms caused query drift (20). In order to tackle this problem, they performed semantic filtering by applying a word2vec approach and showed a 2% improvement in results.

Some researchers are also looking for ways to combine local and global QE techniques (21, 22). In this regard, Pal et al. proposed a methodology which combined the terms generated from WordNet with two local QE (23) term selection techniques [i.e. KLD (24) and RSV (25)]. They showed that the precision of a retrieval model could be improved by extending the query with candidate terms generated from both local and global QE (26). Abdulla et al. also combined terms from global and local QE. For global QE, they used knowledge resources like PubMed (27) and MetaMap (28), whereas for local QE, Lavrenko relevance feedback (LRF) (29) and MFT (30) techniques were used. A linear combination approach was introduced to combine the scores generated by the individual techniques, and this combined score was used to select the final QE terms. They paired one method from global QE with one from local QE, experimented with the various combination pairs, and found that the best performance was obtained by the linear combination of PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) and LRF (22).

In our experimentation, we have exploited pseudo relevance feedback, in which documents are ranked against a particular user query and the top k ranked documents are selected as relevant for the selection of candidate expansion terms. As there are no explicitly defined criteria for selecting the threshold (top k) for documents, there is a strong chance that an arbitrarily selected threshold may cause a document misclassification problem: some non-relevant documents may get wrongly classified as relevant and vice versa. The traditionally used chi-square does not tackle this problem while selecting expansion terms. We propose a modified version of chi-square which is able to alleviate the document misclassification problem that occurs due to the selection of an arbitrary threshold. We have evaluated our proposed term selection algorithm against eight state-of-the-art term selection algorithms and present the overall comparison. We have also tested the effect of stemming on information retrieval, particularly in the biomedical domain.

Methodology

This section presents the methodology of pseudo relevance feedback with emphasis on the pre-processing of the dataset. The dataset obtained from the TREC website exists in HTML format and contains irrelevant information like email addresses, article digital signatures, and journal publishing dates and years. In order to remove this irrelevant content from the dataset, the Apache Tika parser (https://tika.apache.org/0.7/parser.html) is used. Furthermore, all stop words such as is, am, are and about are removed from the dataset and the user query by exploiting the default stop words list of Solr, named ‘stop.txt’, which contains 33 English stop words. After this, we converted all the terms into their base form using the Porter stemmer. The steps involved in pre-processing of HTML documents are shown in Figure 1.

Figure 1.

Pre-processing.

To measure the effect of stemming on the performance of the retrieval task, we have indexed the dataset both with and without stemming.
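A minimal sketch of this pre-processing step is shown below, assuming NLTK's PorterStemmer is available; the stop list shown is a small illustrative subset, not Solr's full 33-word stop.txt.

from nltk.stem import PorterStemmer

# Illustrative subset of the Solr stop list used in our pipeline.
STOP_WORDS = {"is", "am", "are", "about", "a", "an", "and", "the", "of", "to"}

def preprocess(text, stem=True):
    """Lower-case the text, drop stop words and optionally Porter-stem."""
    stemmer = PorterStemmer()
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens] if stem else tokens

Indexing the corpus once with stem=True and once with stem=False yields the two indexes that are compared in the experiments.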

The performance of pseudo relevance feedback depends upon two significant factors: the number of top relevant documents retrieved by the document retrieval model, and the term selection algorithm (20). Popular document retrieval models are Okapi BM25, language models [unigram, bigram, n-gram (23)] and TF-IDF. In our experimentation, we have used Okapi BM25 as our document retrieval model.

Before feeding the user query to the document retrieval model, all stop words are removed from it. Since we have two different types of indexes, i.e. stemmed and non-stemmed, the user query is stemmed only for the stemmed index. The user query is then provided to the document retrieval model, which retrieves a list of ranked documents. The top k ranked documents are chosen for pseudo relevance feedback, and only the unique terms of these documents are used to create the term pool. Various term selection techniques (mentioned in Section 5) are used to rank the terms for QE. Only the top n terms are used to expand the user query, which is then sent back to the retrieval model for final document retrieval. Using this expanded query, the final ranked documents are retrieved. Figure 2 illustrates all the phases of the PRF technique sequentially.

Figure 2.

Methodology for pseudo relevance feedback.
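The loop just described can be sketched as follows; retrieve and score_terms are hypothetical stand-ins for the Okapi BM25 ranking call and for one of the term selection metrics discussed in the next section.

def pseudo_relevance_feedback(query_terms, corpus, retrieve, score_terms,
                              k=40, n=10):
    """Expand query_terms with the n best terms from the top-k feedback docs."""
    ranked_docs = retrieve(query_terms, corpus)       # initial ranking
    feedback = ranked_docs[:k]                        # top k assumed relevant
    term_pool = {t for doc in feedback for t in doc}  # unique terms only
    ranked_terms = sorted(term_pool,
                          key=lambda t: score_terms(t, feedback, corpus),
                          reverse=True)
    expanded_query = list(query_terms) + ranked_terms[:n]
    return retrieve(expanded_query, corpus)           # final retrieval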

Okapi BM25 weighting algorithm

Okapi BM25 is a probabilistic model that not only assigns weights to documents but also ranks them according to their relevance to a particular query. It has been widely used for information retrieval in the biomedical domain. The mathematical expression for document ranking is given as (31):
weight = SJ \cdot \frac{(k_1+1)\, freq_{id}}{k_1\left((1-b)+b\,\frac{dl}{avdl}\right)+freq_{id}} \times \frac{(k_3+1)\, freq_{iq}}{k_3+freq_{iq}}
(1)
where
  • k1 and k3 are parameters that weight the effect of term frequency in the document and in the query, respectively, whereas b is a tuning constant that controls document length normalization.

  • freqid depicts the frequency of occurrence of term i in document d.

  • freqiq is the frequency of occurrence of term i in query q.

  • dl and avdl illustrate the document length and the average document length in the corpus, respectively.

whereas SJ is the Robertson Sparck Jones weight, calculated using the formula below:
SJ = \log\frac{(r_t+0.5)\,/\,(R-r_t+0.5)}{(n-r_t+0.5)\,/\,(N-n-R+r_t+0.5)}
(2)
where R is the number of relevant documents for a specific topic, rt is the number of relevant documents that contain term i, N is the total number of documents present in the corpus and n denotes the number of documents containing the term.
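A sketch of Equations (1) and (2) in Python is given below; the values of k1, b and k3 shown are common BM25 defaults, not the settings tuned in our experiments.

import math

def rsj_weight(r_t, R, n, N):
    """Robertson Sparck Jones weight, Equation (2)."""
    return math.log(((r_t + 0.5) / (R - r_t + 0.5)) /
                    ((n - r_t + 0.5) / (N - n - R + r_t + 0.5)))

def bm25_weight(sj, freq_id, freq_iq, dl, avdl, k1=1.2, b=0.75, k3=8.0):
    """Per-term Okapi BM25 document weight, Equation (1)."""
    K = k1 * ((1 - b) + b * dl / avdl)   # length-normalized k1
    return (sj * ((k1 + 1) * freq_id / (K + freq_id))
               * ((k3 + 1) * freq_iq / (k3 + freq_iq)))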

Term selection metrics

The corpus may contain redundant and irrelevant terms that can cause query drift. To avoid this, all candidate terms are ranked on the basis of the statistical information used in various term ranking methods. In this section we discuss eight such term ranking methods in the context of QE.

A. Kullback–Leibler divergence (KLD)

KLD (24) is a widely used technique in information theory (32), statistical language modeling based speech processing and natural language applications (25). It assigns a score to each term based on its probability in the relevant documents and in the corpus.
KLD(term) = P_R(term)\,\log\frac{P_R(term)}{P_C(term)}
(3)
where PR(term) is the probability of the term's presence in the top retrieved relevant documents R. It can be calculated as:
P_R(term) = \frac{\sum_{D \in R} tf(term|D)}{\sum_{D \in R} \sum_{term \in D} tf(term|D)}
(4)
And PC(term) is the probability of term’s presence in the corpus, calculated as:
P_C(term) = \frac{\sum_{D \in C} tf(term|D)}{\sum_{D \in C} \sum_{term \in D} tf(term|D)}
(5)

Equation (3) is used to assign scores to the terms present in the term pool. The assigned scores fall in the range 0–1: a term with a score of 0 is considered irrelevant, while a score of 1 shows that the term is an excellent candidate for QE.
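A minimal sketch of the KLD score, assuming the term frequency statistics of Equations (4) and (5) have already been aggregated into two dictionaries:

import math

def kld_score(term, tf_relevant, tf_corpus):
    """KLD score, Equation (3); tf_relevant / tf_corpus map each term to its
    total term frequency in the feedback documents and in the corpus."""
    p_r = tf_relevant.get(term, 0) / sum(tf_relevant.values())   # Equation (4)
    p_c = tf_corpus[term] / sum(tf_corpus.values())              # Equation (5)
    return p_r * math.log(p_r / p_c) if p_r > 0 else 0.0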

B. Co-occurrence based query expansion

Co-occurrence is a term association based method used to assign scores to the terms present in the term pool. It assigns a score by measuring the relationship of candidate terms with the query words (32). Rijsbergen (33) described it as an algorithm that finds relationships between corpus and query terms. In order to find the co-occurrence association between two terms, coefficients like CoJaccard, CoDice and Cosine are used. CoDice can be calculated as:
CoDice(term_i, term_j) = \frac{df_{ij}}{df_i + df_j - df_{ij}}
(6)
where dfi and dfj are the numbers of documents in which term i and term j occur, respectively, and dfij is the number of documents in which both terms occur together.

Expanding the query with highly similar terms may also cause the query drift problem. In order to avoid query drift, the concept of inverse document frequency (IDF) is incorporated by computing a codegree. Let qi be a query term and ct be a candidate term; then the codegree and IDF can be calculated using the following expressions:
Codegree(q_i, ct) = \log_{10}(CoDice(q_i, ct) + 1) \cdot \frac{IDF(ct)}{\log_{10} D}
(7)
And
IDF(ct) = \log_{10}\left(\frac{N}{N_c}\right)
(8)
where Nc is the number of documents in the corpus that contain the candidate term ct, N is the total number of documents present in the corpus and D is the number of top retrieved documents. To obtain the value of a candidate term against all query terms, the following formula can be used:
Cooccurrence_{final}(Q, ct) = \sum_{q_i \in Q} Codegree(q_i, ct)
(9)
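The following sketch strings Equations (6)–(9) together; df and df_pair are hypothetical document frequency maps assumed to be precomputed from the corpus.

import math

def codice(df_i, df_j, df_ij):
    """CoDice coefficient, Equation (6)."""
    return df_ij / (df_i + df_j - df_ij)

def codegree(codice_value, idf_ct, D):
    """Codegree with one query term, Equation (7); D > 1 is the number of
    top retrieved documents."""
    return math.log10(codice_value + 1) * (idf_ct / math.log10(D))

def cooccurrence_score(query_terms, ct, df, df_pair, N, D):
    """Final co-occurrence score of candidate term ct, Equations (8)-(9)."""
    idf_ct = math.log10(N / df[ct])                              # Equation (8)
    return sum(codegree(codice(df[qi], df[ct], df_pair[(qi, ct)]), idf_ct, D)
               for qi in query_terms)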

C. Information gain (IG)

IG is an algorithm that utilizes knowledge about the presence or absence of a particular term in documents to find its degree of class prediction (34). Let C={C1, C2} be the set of classes, where C1 corresponds to the top retrieved relevant documents and C2 to the non-relevant documents.

The value of IG for a term t can be calculated as:
IG(t) = -\sum_{j=1}^{|C|} P(c_j)\log P(c_j) + P(t)\sum_{j=1}^{|C|} P(c_j|t)\log P(c_j|t) + P(\bar{t})\sum_{j=1}^{|C|} P(c_j|\bar{t})\log P(c_j|\bar{t})
(10)
where P(t) is the probability of occurrence of term t, and P(t̄)=1−P(t) is its non-occurrence probability. P(cj|t) is the conditional probability of the jth class given term t; similarly, P(cj|t̄) is the conditional probability of the jth class given that term t is absent, and P(cj) is the prior probability of the jth class. This value measures the importance of a term with respect to the two classes and provides the scores for the terms present in the term pool; the highest scoring terms can then be used for QE.
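Equation (10) can be transcribed directly, assuming the term and class probabilities have already been estimated from the feedback documents:

import math

def information_gain(p_t, p_class, p_class_given_t, p_class_given_not_t):
    """Information gain of a term, Equation (10); each p_class* argument is a
    list with one probability per class (here, relevant and non-relevant)."""
    def plogp_sum(probs):
        return sum(p * math.log(p) for p in probs if p > 0)
    return (-plogp_sum(p_class)
            + p_t * plogp_sum(p_class_given_t)
            + (1 - p_t) * plogp_sum(p_class_given_not_t))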

D. Probabilistic relevance feedback (PRF)

This measure assigns a score to the terms present in the term pool by calculating their probabilities in the relevant and non-relevant documents (35). A term having a higher probability in the relevant class is considered a more suitable candidate term for QE. The mathematical expression of PRF is:
PRF(t) = \frac{P_{relevance}(t)}{P_{non\text{-}relevance}(t)}
(11)
where Prelevance(t) is the probability of the term in the relevant documents and Pnon-relevance(t) is its probability in the non-relevant documents.
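In code, the metric is a simple ratio; the small epsilon below is an assumption guarding against terms that never occur in the non-relevant class.

def prf_score(p_relevance, p_non_relevance, eps=1e-12):
    """Probabilistic relevance feedback score, Equation (11)."""
    return p_relevance / (p_non_relevance + eps)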

E. Chi-square (CS)

Chi-square is a statistical measure of the divergence between two distributions (36). For a term t, it measures how independent t is from the relevant and non-relevant classes; the lower the independence, the higher the score for that term. The mathematical expression of chi-square is given below:
\text{Chi-Square}(t) = \frac{(p_R(t) - p_C(t))^2}{p_C(t)}
(12)
where pR(t) is the probability of term t in the relevant documents, and pC(t) is its probability in the corpus. In our experimentation we also used the chi-square variant without the square, as used by Carpineto et al. (11).
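Both variants are sketched below; reading the no-square version as simply dropping the exponent is our assumption about the formulation in (11).

def chi_square(p_r, p_c):
    """Chi-square term score, Equation (12)."""
    return (p_r - p_c) ** 2 / p_c

def chi_no_square(p_r, p_c):
    """Variant without the square, per (11) (assumed form)."""
    return (p_r - p_c) / p_c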

F. Lavrenko relevance feedback (LRF)

This technique uses the formula derived from the Lavrenko relevance model (37) and is based on a language model. The score for the QE terms can be found using the formula:
Score(t) = \sum_{\text{all } R} \log\frac{P(t|M_R)}{P(t|G)}
(13)
In the above equation, P(t|G) is the probability of occurrence of term t in the collection, whereas P(t|MR) can be found using the formula below:
P(t|M_R) = \lambda \times \frac{TF(t,R)}{\sum_{t \in R} TF(t,R)} + (1-\lambda) \times P(t|G)
(14)
where TF(t,R) is the frequency of the term in relevant document R and the denominator is the sum of all term frequencies in that document. λ is a parameter that can be adjusted during experimentation; researchers have found that λ=0.6 gives the best results (22).
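A sketch of Equations (13) and (14), where feedback_docs is a list of per-document term frequency dictionaries for the relevant documents:

import math

def lrf_score(term, feedback_docs, p_t_g, lam=0.6):
    """LRF score, Equations (13)-(14); lam = 0.6 follows the setting
    reported above, and p_t_g is P(t|G), the collection probability."""
    score = 0.0
    for tf in feedback_docs:
        p_t_mr = lam * tf.get(term, 0) / sum(tf.values()) + (1 - lam) * p_t_g
        score += math.log(p_t_mr / p_t_g)
    return score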

Proposed term selection metric: document frequency chi-square (DFC)

Chi-square is one of the most widely used algorithms for term selection in text classification. It has been used by Carpineto et al. for term selection in pseudo relevance feedback, but unfortunately its performance was not up to the mark, because term selection for QE in pseudo relevance feedback is very different from term selection in text classification. In pseudo relevance feedback there exist only two classes, and they are highly skewed. We first retrieve documents based on the user query and select the top k documents as relevant, while the rest of the documents are treated as non-relevant. However, there is no defined criterion for choosing the threshold between the relevant and non-relevant parts of the ranked list. There is a possibility that a non-relevant document gets classified as relevant; similarly, a relevant document may end up in the non-relevant class. In order to fully understand the effect of this thresholding, let us consider a corpus of 10 documents in which three documents (D1, D2, D3) belong to the actual relevant class and the rest are non-relevant. In pseudo relevance feedback, after document ranking, if we set the threshold at D4, we get the following sets of documents:
R = {D1, D2, D3, D4}
NR = {D5, D6, D7, D8, D9, D10}
Let there be terms t1…t50 in the corpus. Consider a scenario in which t1 occurs 10 times in R but only within document D4, while the same term occurs three times in NR, appearing twice in D5 and once in D6. When the distribution based on term frequency is considered, chi-square will regard t1 as a good term for QE, which is not true. If the distribution is instead considered in terms of document frequency, which is binary in nature and only considers the presence of a term in a document, we notice that the document frequency of t1 is only 1 in R whereas it is 2 in NR. As t1 has a higher document frequency in the non-relevant class, DFC will not rank it as a discriminative term. DFC not only considers term presence in the relevant documents, but also keeps track of other important factors: the term's absence in the relevant class, and its presence and absence in the non-relevant class. Mathematically, its formula can be written as:
DFC = \frac{(tdf_r - tdf_c \cdot ratio_r)^2}{tdf_c \cdot ratio_r} + \frac{(\overline{tdf_r} - \overline{tdf_c} \cdot ratio_r)^2}{\overline{tdf_c} \cdot ratio_r} + \frac{(tdf_{nr} - tdf_c \cdot ratio_{nr})^2}{tdf_c \cdot ratio_{nr}} + \frac{(\overline{tdf_{nr}} - \overline{tdf_c} \cdot ratio_{nr})^2}{\overline{tdf_c} \cdot ratio_{nr}}
(15)
such that
ratio_r = \frac{\text{size of relevant class}}{\text{corpus size}}
(16)
ratio_{nr} = \frac{\text{size of non-relevant class}}{\text{corpus size}}
(17)
where

tdfr = document frequency of the term in the relevant class,

tdfc = document frequency of the term in the corpus,

tdfnr = document frequency of the term in the non-relevant class,

tdfnr¯ = number of non-relevant documents in which the term is absent,

tdfr¯ = number of relevant documents in which the term is absent and

tdfc¯ = number of corpus documents in which the term is absent.
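Since every quantity in Equation (15) can be derived from document frequencies, the metric reduces to a few lines of code. The helper below assumes the term's corpus document frequency equals the sum of its document frequencies in the two classes when it is not supplied separately.

def dfc(tdfr, tdfnr, R, N, tdfc=None):
    """DFC score, Equations (15)-(17). tdfr / tdfnr: document frequency of
    the term in the relevant and non-relevant class; R: relevant class size;
    N: corpus size; tdfc: corpus document frequency (assumed tdfr + tdfnr
    when not given)."""
    if tdfc is None:
        tdfc = tdfr + tdfnr
    ratio_r, ratio_nr = R / N, (N - R) / N
    tdfc_bar = N - tdfc                     # documents not containing the term
    tdfr_bar, tdfnr_bar = R - tdfr, (N - R) - tdfnr
    def cell(observed, expected):
        return (observed - expected) ** 2 / expected
    return (cell(tdfr, tdfc * ratio_r) + cell(tdfr_bar, tdfc_bar * ratio_r)
            + cell(tdfnr, tdfc * ratio_nr) + cell(tdfnr_bar, tdfc_bar * ratio_nr))

For the toy example above, dfc(1, 2, R=4, N=10) evaluates to roughly 0.08, a low score that is consistent with t1 not being a discriminative term.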

Dataset and evaluation measure

In order to address information retrieval systems that target the needs of biomedical scientists and geneticists, the TREC 2006 Genomics Track (38) dataset is selected. This dataset consists of 162 259 documents, having a total of 1 437 356 250 unique terms, from 49 journals published electronically at Highwire Press. These are HTML documents obtained by crawling the Highwire Press website. The full collection is 12.3 GB in size.

MAP is used to evaluate the performance of the nine term selection algorithms, using Okapi BM25 as the retrieval model. This evaluation measure is widely used in information retrieval. The mathematical expressions of average precision and MAP are given below.

a: Average precision

This measure compares the documents ranked by retrieval model with pre-defined set of documents ranked by domain experts against particular query.
AverageP = \frac{\sum_{r=1}^{N} P(r) \times rel(r)}{C_{rt}}
(18)
where

r is the rank,

N denotes the number of retrieved documents,

rel(r) is a binary function indicating whether the document at rank r is relevant,

P(r) stands for the precision at rank r and

Crt is the total number of relevant documents for the query.

b: Mean average precision

It summarizes the ranking results obtained from multiple queries by averaging AverageP over the query set Q.
MAP = \frac{\sum_{q=1}^{|Q|} AverageP(q)}{|Q|}
(19)
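Both measures are straightforward to compute from a binary relevance judgement of the final ranking:

def average_precision(ranked_rel, num_relevant):
    """Average precision, Equation (18); ranked_rel is the list of rel(r)
    values (0/1) of the retrieved ranking, num_relevant is Crt."""
    hits, total = 0, 0.0
    for r, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / r          # P(r) at each relevant rank
    return total / num_relevant

def mean_average_precision(ap_values):
    """MAP, Equation (19): the mean of the per-query average precisions."""
    return sum(ap_values) / len(ap_values)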

Practical illustration of TREC data

This section summarizes the background of the strategic decisions taken in the context of the typical behavior of the system over different queries. It also traces the sources of query drift, in quest of further improvements, while producing and comparing results.

Table 1 shows the performance difference between the two algorithms (DFC and chi-square) and the baseline for 36 queries of the TREC dataset. All results have been calculated on the following benchmark: documents = 40, top terms = 10. Δ(DFC-CS) shows the difference in precision between DFC and chi-square: the most positive value of Δ(DFC-CS) shows where DFC outperformed chi-square by the widest margin, while the most negative value depicts the largest win of chi-square over DFC. By observing the differences, we notice that query 201 has the most positive value of Δ(DFC-CS) whereas query 207 has the most negative.

Table 1.

Summary of difference in precision for 36 TREC queries

Query | DFC | Chi-square | Baseline | Δ(DFC-CS) | Δ(DFC-BS) | Δ(CS-BS)
200 | 0.3819 | 0.4123 | 0.3796 | -0.0304 | 0.0023 | 0.0327
201 | 0.9755 | 0.5599 | 0.5825 | 0.4156 | 0.3930 | -0.0226
202 | 0.0390 | 0.0439 | 0.0528 | -0.0049 | -0.0138 | -0.0089
203 | 0.6378 | 0.6334 | 0.6393 | 0.0044 | -0.0014 | -0.0059
204 | 0.6649 | 0.6637 | 0.6423 | 0.0012 | 0.0226 | 0.0214
205 | 0.1658 | 0.1349 | 0.1954 | 0.0309 | -0.0296 | -0.0605
206 | 0.5315 | 0.3412 | 0.4337 | 0.1903 | 0.0978 | -0.0925
207 | 0.0744 | 0.1364 | 0.0661 | -0.0620 | 0.0083 | 0.0703
208 | 0.4222 | 0.4620 | 0.2640 | -0.0398 | 0.1583 | 0.1981
209 | 0.5115 | 0.1994 | 0.2644 | 0.3121 | 0.2471 | -0.0650
210 | 0.0651 | 0.0737 | 0.0764 | -0.0086 | -0.0113 | -0.0027
211 | 0.4594 | 0.2348 | 0.3481 | 0.2246 | 0.1113 | -0.1133
212 | 0.3631 | 0.3555 | 0.3941 | 0.0076 | -0.0310 | -0.0386
213 | 0.5186 | 0.5587 | 0.5306 | -0.0401 | -0.0121 | 0.0281
214 | 0.5580 | 0.4891 | 0.5408 | 0.0689 | 0.0171 | -0.0517
215 | 0.3900 | 0.3247 | 0.4557 | 0.0653 | -0.0657 | -0.1310
216 | 0.1085 | 0.0849 | 0.0808 | 0.0236 | 0.0277 | 0.0041
217 | 0.0031 | 0.0010 | 0.0051 | 0.0021 | -0.0020 | -0.0041
218 | 0.2355 | 0.2021 | 0.2834 | 0.0334 | -0.0479 | -0.0813
219 | 0.0201 | 0.0875 | 0.0839 | -0.0619 | -0.0638 | 0.0036
220 | 0.9151 | 0.9242 | 0.8556 | -0.0092 | 0.0595 | 0.0687
221 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001
222 | 0.0964 | 0.0648 | 0.0852 | 0.0316 | 0.0112 | -0.0204
223 | 0.2323 | 0.1166 | 0.4105 | 0.1157 | -0.1782 | -0.2939
224 | 0.0298 | 0.0215 | 0.2268 | 0.0083 | -0.1970 | -0.2053
225 | 0.0909 | 0.0833 | 0.0164 | 0.0076 | 0.0745 | 0.0669
226 | 0.7449 | 0.6847 | 0.3030 | 0.0603 | 0.4419 | 0.3817
227 | 0.1663 | 0.1455 | 0.1716 | 0.0208 | -0.0053 | -0.0260
228 | 0.0050 | 0.0046 | 0.0050 | 0.0004 | 0.0000 | -0.0004
229 | 0.6221 | 0.4988 | 0.5042 | 0.1232 | 0.1179 | -0.0053
230 | 0.1966 | 0.0864 | 0.0905 | 0.1102 | 0.1061 | -0.0041
231 | 0.1169 | 0.0380 | 0.1344 | 0.0790 | -0.0175 | -0.0964
232 | 0.0841 | 0.0797 | 0.0833 | 0.0044 | 0.0008 | -0.0036
233 | 0.1000 | 0.0488 | 0.0875 | 0.0513 | 0.0126 | -0.0387
234 | 0.1158 | 0.0850 | 0.1162 | 0.0307 | -0.0004 | -0.0311
235 | 0.1298 | 0.1315 | 0.1771 | -0.0016 | -0.0473 | -0.0457
MAP | 0.2992 | 0.2504 | 0.2663 | 0.0489 | 0.0329 | -0.0159

Δ(DFC-BS) and Δ(CS-BS) are the differences in the performance of the information retrieval system after applying QE with the respective algorithm (DFC or chi-square) and without applying any QE (baseline).

These columns show the effect of applying the QE techniques on performance. A positive delta shows an increase in performance after applying QE, whereas a negative value depicts a decline in performance due to QE; a negative delta in either case indicates query drift. It can be seen from the table that 16 out of 35 queries show a decrease in performance due to query drift using DFC. On the other hand, applying QE using chi-square, only 9 out of 36 queries show an improved performance. For Δ(DFC-BS), the best performance is observed for query 225, and for Δ(CS-BS), query 226 marks the largest increase in precision after applying QE. The values at the bottom of the table give the mean average precision of the mentioned algorithms and the corresponding differences.

In order to further explore the chi-square based term selection algorithms, queries 201 and 207 are selected, as they revealed the best performance for DFC and chi-square, respectively. The two algorithms are applied again on queries 201 and 207 to obtain the top 10 terms from the top 40 retrieved documents. The selected terms are listed in Tables 2 and 3.

Table 2.

Top 10 terms selected by chi-square and DFC for query 201 (precision is shown by adding each term to query)

Query terms: genes, associated, cancer

Top 10 terms ranked by DFC (precision after adding each term to the query):
Term | Precision
Baseline | 0.5825
calipel | 0.6426
v599e | 0.9320
nature00766 | 0.9569
mouriaux | 0.8647
v600e | 0.9267
trovisco | 0.9371
braf | 0.9404
418934a | 0.9676
shieldsj | 0.9676
klintenas | 0.9735

Top 10 terms ranked by chi-square (precision after adding each term to the query):
Term | Precision
Baseline | 0.5825
braf | 0.7632
ras | 0.7960
raf | 0.8127
transgelin | 0.6528
cref | 0.5598
9nc | 0.5018
vmm12 | 0.5657
v600e | 0.7561
uveal | 0.7977
kras | 0.8307
Table 3.

Top 10 terms selected by chi-square and DFC for query 207 (precision is shown by adding each term to query)

Query terms: toxicities, associated, etidronate

Top 10 terms ranked by DFC (precision after adding each term to the query):
Term | Precision
Baseline | 0.06613926
etidronate | 0.0798
alendronate | 0.0946
bisphosphonates | 0.147
didronel | 0.1557
risedronate | 0.1523
ibandronate | 0.1777
bisphosphonate | 0.1874
art271 | 0.1736
int1999; 9 | 0.1689
pprice | 0.164

Top 10 terms ranked by chi-square (precision after adding each term to the query):
Term | Precision
Baseline | 0.06613926
etidronate | 0.0798
fetuin | 0.0695
bisphosphonates | 0.1265
incadronate | 0.1604
paget’s | 0.1691
pamidronate | 0.2286
aminobisphosphonates | 0.204
ibandronate | 0.2056
bisphosphonate | 0.2073
tiludronate | 0.2609

The original query is expanded by adding one term at a time, and precision is measured to reveal the positive or negative effect of each newly added term on QE. The results of this incremental QE, obtained after iterating over all 10 terms, are shown in Tables 2 and 3. As observed from the tables, expanding the user query with the selected terms yields a reasonable boost in the performance of the specified query.

Tables 4 and 5 list the unique terms selected by chi-square and DFC for queries 201 and 207, respectively. Both tables also show the document frequency based parameters (tdfr, tdfr¯, tdfnr, tdfnr¯) as well as the probabilities used by chi-square. To lay out a clear picture of the importance of the terms for each algorithm, the ranks of these unique terms as determined by their chi-square and DFC scores are also shown.

Table 4.

Scores of 18 unique terms selected by chi-square and DFC for query 201 on 40 documents

Terms | P (relevance) | P (corpus) | tdfr | tdfnr¯ | tdfnr | tdfr¯ | CS | Rank CS | DFC | Rank DFC
braf | 0.00082 | 3.39E-07 | 14 | 162150 | 69 | 26 | 1.9978 | 1 | 9558.411 | 7
ras | 0.01085 | 9.00E-05 | 38 | 149277 | 12942 | 2 | 1.2877 | 2 | 411.485 | 16
raf | 0.00652 | 3.36E-05 | 40 | 157245 | 4974 | 0 | 1.2516 | 3 | 1254.757 | 13
transgelin | 0.00035 | 1.26E-07 | 1 | 162197 | 22 | 39 | 0.9541 | 4 | 174.442 | 18
cref | 0.00037 | 1.64E-07 | 1 | 162203 | 16 | 39 | 0.8312 | 5 | 236.704 | 17
9nc | 0.00015 | 4.31E-08 | 1 | 162216 | 3 | 39 | 0.5542 | 6 | 1012.394 | 14
vmm12 | 0.00012 | 3.27E-08 | 1 | 162219 | 0 | 39 | 0.4639 | 7 | 4055.5 | 11
v600e | 0.00012 | 3.34E-08 | 5 | 162217 | 2 | 35 | 0.4351 | 8 | 14481.607 | 4
uveal | 0.00031 | 2.44E-07 | 3 | 162144 | 75 | 37 | 0.3911 | 9 | 462.41 | 15
kras | 0.00043 | 4.81E-07 | 6 | 162130 | 89 | 34 | 0.3783 | 10 | 1526.484 | 12
calipel | 0.00002 | 5.57E-09 | 5 | 162219 | 0 | 35 | 0.0791 | 12 | 20278 | 1
v599e | 0.00011 | 4.80E-08 | 8 | 162214 | 5 | 32 | 0.2523 | 11 | 19960.861 | 2
nature00766 | 0.00003 | 2.23E-08 | 12 | 162199 | 20 | 28 | 0.0444 | 15 | 18238.237 | 3
mouriaux | 0.00003 | 9.04E-09 | 5 | 162217 | 2 | 35 | 0.0759 | 13 | 14481.607 | 5
trovisco | 0.00001 | 3.48E-09 | 3 | 162219 | 0 | 37 | 0.0494 | 14 | 12166.65 | 6
418934a | 0.00001 | 7.65E-09 | 5 | 162213 | 6 | 35 | 0.0224 | 17 | 9212.159 | 8
shieldsj | 0.00001 | 2.09E-09 | 2 | 162219 | 0 | 38 | 0.0296 | 16 | 8111.05 | 9
klintenas | 0.00001 | 1.39E-09 | 2 | 162219 | 0 | 38 | 0.0197 | 18 | 8111.05 | 10
Table 5.

Scores of 16 unique terms selected by chi-square and DFC for query 207 on 40 documents

Terms | P (relevance) | P (corpus) | tdfr | tdfnr¯ | tdfnr | tdfr¯ | CS | Rank CS | DFC | Rank DFC
etidronate | 0.0019 | 3.10E-07 | 40 | 162186 | 33 | 0 | 11.6418 | 1 | 88911 | 1
fetuin | 0.0029 | 1.70E-06 | 4 | 161944 | 275 | 36 | 4.9412 | 2 | 225.14 | 16
bisphosphonates | 0.0017 | 9.03E-07 | 32 | 161948 | 271 | 8 | 3.2083 | 3 | 13674 | 3
incadronate | 0.0003 | 5.90E-08 | 4 | 162214 | 5 | 36 | 1.5254 | 6 | 7205.69 | 12
paget’s | 0.0008 | 3.70E-07 | 6 | 162119 | 100 | 34 | 1.7285 | 5 | 1366.93 | 15
pamidronate | 0.0012 | 7.50E-07 | 18 | 162081 | 138 | 22 | 1.9186 | 4 | 8399.17 | 11
aminobisphosphonates | 0.0003 | 7.60E-08 | 6 | 162202 | 17 | 34 | 1.1847 | 7 | 6339.73 | 13
ibandronate | 0.0006 | 3.60E-07 | 12 | 162184 | 35 | 28 | 0.9998 | 9 | 12411 | 6
bisphosphonate | 0.0007 | 4.80E-07 | 26 | 162023 | 196 | 14 | 1.0199 | 8 | 12320.1 | 7
tiludronate | 0.0001 | 3.40E-08 | 5 | 162206 | 13 | 35 | 0.294 | 10 | 5626.01 | 14
alendronate | 0.0005 | 6.40E-07 | 24 | 162096 | 123 | 16 | 0.3912 | 12 | 15865.1 | 2
didronel | 1.80E-05 | 5.50E-09 | 4 | 162218 | 1 | 36 | 0.059 | 16 | 12976.3 | 4
risedronate | 0.0003 | 1.90E-07 | 12 | 162186 | 33 | 28 | 0.4731 | 11 | 12963.5 | 5
art271 | 1.40E-05 | 2.08E-09 | 3 | 162219 | 0 | 37 | 0.098 | 14 | 12166.7 | 8
int1999; 9 | 1.40E-05 | 2.08E-09 | 3 | 162219 | 0 | 37 | 0.098 | 15 | 12166.7 | 9
pprice | 3.20E-05 | 7.60E-09 | 4 | 162217 | 2 | 36 | 0.135 | 13 | 10812.3 | 10

As shown in Table 4, chi-square assigned the highest score to the term braf, while DFC ranks calipel as the top term. A close inspection of the document frequency parameters shows that braf is present in 14 relevant documents and 69 non-relevant documents. On the other hand, calipel is present in five documents of the relevant class and is entirely absent from the non-relevant documents. For this reason, DFC considers it a highly discriminative term for differentiating between the relevant and non-relevant class. Similarly, we can observe the second term ranked by each algorithm. DFC has placed v599e at the second rank, whereas ras is selected as the second best term by chi-square. We explain this by observing that ras is present far more often in the non-relevant documents than v599e.

Likewise, other terms can be examined in the table. Similarly, from Table 5 it can be seen that etidronate is ranked as the best term by both algorithms. DFC selected alendronate as the second best term, while chi-square placed fetuin at the second rank. Fetuin is present in only 4 documents of the relevant class and in 275 documents of the non-relevant class, whereas alendronate is present in 24 relevant documents and 123 non-relevant documents. By analyzing and comparing these parameters, it is easy to see that alendronate is a more suitable candidate than fetuin, as it occurs more often in the relevant documents and relatively less often in the non-relevant class.

Experimental setup and results

We use the open source search platform Solr (39) for experimentation; it includes full text search and real-time indexing. In the experiments, Okapi BM25 is used as the retrieval model. In this section we briefly explain the experimental setup and compare the results of all term selection techniques at the defined test points.

A. Results without stemming

To analyze the effect of pre-processing on biomedical data, we have used the two different indexing methods for the corpus documents discussed in Section 3. This section reports the results of the nine term selection algorithms, in the form of tables, at pre-defined test points. Expectedly, the feature selection techniques do not all produce their peak results at the same set of parameters, namely the number of top retrieved relevant documents and the number of candidate expansion terms merged with the query. For the sake of laying out a clear picture of the performance of pseudo relevance feedback, and for a better comparison of the term ranking algorithms, we also show a graph containing only the peak results for each technique, at the best parameters found in the tables below.

Tables 6–10 illustrate the MAP of the nine term selection algorithms on pre-defined benchmark test points, at top terms (5, 10, 15, …, 50) and documents (10, 20, …, 50). Boldface values in these tables indicate the highest performance across all the mentioned term selection algorithms at a specific number of terms.

Table 6.

Mean average precision of nine term selection algorithms, choosing 10 as the threshold to divide the ranked corpus documents into relevant and non-relevant

No. of terms | Chi-square | Chi | DFC | KLD | RSV | CoDice | IG | LRF | PRF
5 | 0.2661 | 0.2632 | 0.2748 | 0.2724 | 0.2713 | 0.2696 | 0.2712 | 0.2859 | 0.2661
10 | 0.2547 | 0.2649 | 0.2801 | 0.2665 | 0.2619 | 0.2777 | 0.2801 | 0.2884 | 0.2692
15 | 0.2506 | 0.2617 | 0.2810 | 0.2539 | 0.2647 | 0.2821 | 0.2817 | 0.2839 | 0.2653
20 | 0.2569 | 0.2624 | 0.2815 | 0.2543 | 0.2493 | 0.2720 | 0.2798 | 0.2842 | 0.2643
25 | 0.2619 | 0.2587 | 0.2814 | 0.2580 | 0.2439 | 0.2726 | 0.2790 | 0.2790 | 0.2616
30 | 0.2593 | 0.2590 | 0.2800 | 0.2501 | 0.2456 | 0.2697 | 0.2790 | 0.2746 | 0.2603
35 | 0.2627 | 0.2591 | 0.2800 | 0.2488 | 0.2488 | 0.2675 | 0.2717 | 0.2701 | 0.2603
40 | 0.2596 | 0.2550 | 0.2780 | 0.2499 | 0.2466 | 0.2648 | 0.2678 | 0.2666 | 0.2577
45 | 0.2606 | 0.2570 | 0.2764 | 0.2455 | 0.2500 | 0.2647 | 0.2652 | 0.2644 | 0.2558
50 | 0.2595 | 0.2576 | 0.2765 | 0.2445 | 0.2426 | 0.2654 | 0.2596 | 0.2648 | 0.2565
Table 7.

Mean average precision of nine term selection algorithms, choosing 20 as the threshold to categorize the ranked corpus documents as relevant and non-relevant

No. of terms | Chi-square | Chi | DFC | KLD | RSV | CoDice | IG | LRF | PRF
5 | 0.2574 | 0.2614 | 0.2859 | 0.2925 | 0.2824 | 0.2727 | 0.2700 | 0.2835 | 0.2624
10 | 0.2668 | 0.2519 | 0.2892 | 0.2754 | 0.2657 | 0.2777 | 0.2666 | 0.2787 | 0.2651
15 | 0.2660 | 0.2531 | 0.2878 | 0.2600 | 0.2553 | 0.2714 | 0.2674 | 0.2660 | 0.2637
20 | 0.2616 | 0.2515 | 0.2890 | 0.2556 | 0.2495 | 0.2672 | 0.2677 | 0.2605 | 0.2663
25 | 0.2656 | 0.2514 | 0.2892 | 0.2522 | 0.2441 | 0.2657 | 0.2663 | 0.2587 | 0.2643
30 | 0.2745 | 0.2504 | 0.2910 | 0.2443 | 0.2381 | 0.2667 | 0.2635 | 0.2554 | 0.2591
35 | 0.2642 | 0.2496 | 0.2887 | 0.2406 | 0.2375 | 0.2620 | 0.2608 | 0.2530 | 0.2619
40 | 0.2613 | 0.2482 | 0.2912 | 0.2391 | 0.2381 | 0.2598 | 0.2580 | 0.2495 | 0.2598
45 | 0.2630 | 0.2452 | 0.2891 | 0.2358 | 0.2380 | 0.2601 | 0.2598 | 0.2476 | 0.2577
50 | 0.2630 | 0.2460 | 0.2903 | 0.2343 | 0.2338 | 0.2586 | 0.2569 | 0.2441 | 0.2565
Table 8.

Mean average precision of nine term selection algorithms, choosing 30 as the threshold to divide the ranked corpus documents into relevant and non-relevant

No. of terms | Chi-square | Chi | DFC | KLD | RSV | CoDice | IG | LRF | PRF
5 | 0.2504 | 0.2372 | 0.2925 | 0.2954 | 0.2758 | 0.2748 | 0.2798 | 0.2856 | 0.2559
10 | 0.2617 | 0.2298 | 0.2951 | 0.2720 | 0.2569 | 0.2642 | 0.2674 | 0.2760 | 0.2527
15 | 0.2594 | 0.2294 | 0.2935 | 0.2657 | 0.2639 | 0.2591 | 0.2628 | 0.2580 | 0.2521
20 | 0.2574 | 0.2290 | 0.2885 | 0.2605 | 0.2544 | 0.2612 | 0.2586 | 0.2577 | 0.2523
25 | 0.2597 | 0.2299 | 0.2894 | 0.2554 | 0.2453 | 0.2644 | 0.2507 | 0.2570 | 0.2552
30 | 0.2619 | 0.2278 | 0.2851 | 0.2513 | 0.2402 | 0.2648 | 0.2502 | 0.2552 | 0.2531
35 | 0.2575 | 0.2280 | 0.2852 | 0.2507 | 0.2356 | 0.2628 | 0.2486 | 0.2499 | 0.2521
40 | 0.2531 | 0.2260 | 0.2828 | 0.2337 | 0.2300 | 0.2599 | 0.2470 | 0.2431 | 0.2490
Table 9.

Mean average precision of nine term selection algorithms using 40 as the threshold to categorize ranked corpus documents as relevant and non-relevant

No. of terms   FR metrics
               Chi-square   Chi      DFC      KLD      RSV      CoDice   IG       LRF      PRF
5              0.2480       0.2508   0.2897   0.2744   0.2674   0.2636   0.2699   0.2705   0.2523
10             0.2533       0.2369   0.2983   0.2702   0.2599   0.2597   0.2586   0.2595   0.2525
15             0.2616       0.2381   0.2983   0.2659   0.2578   0.2541   0.2514   0.2427   0.2533
20             0.2613       0.2293   0.2926   0.2611   0.2447   0.2503   0.2473   0.2281   0.2523
25             0.2624       0.2249   0.2862   0.2490   0.2381   0.2512   0.2417   0.2425   0.2503
30             0.2599       0.2262   0.2850   0.2392   0.2316   0.2543   0.2403   0.2339   0.2481
35             0.2588       0.2216   0.2880   0.2326   0.2318   0.2532   0.2404   0.2276   0.2415
40             0.2541       0.2190   0.2866   0.2362   0.2338   0.2538   0.2432   0.2223   0.2413
45             0.2518       0.2208   0.2844   0.2374   0.2330   0.2517   0.2423   0.2304   0.2389
50             0.2503       0.2195   0.2836   0.2365   0.2256   0.2499   0.2421   0.2284   0.2349
Table 10.

Mean average precision of nine term selection algorithms using 50 as the threshold to categorize ranked corpus documents as relevant and non-relevant

No. of terms   FR metrics
               Chi-square   Chi      DFC      KLD      RSV      CoDice   IG       LRF      PRF
5              0.2491       0.2342   0.2926   0.2810   0.2630   0.2502   0.2400   0.2479   0.2540
10             0.2423       0.2330   0.2987   0.2575   0.2517   0.2489   0.2276   0.2485   0.2530
15             0.2629       0.2212   0.3015   0.2579   0.2539   0.2441   0.2366   0.2400   0.2572
20             0.2685       0.2197   0.2994   0.2561   0.2403   0.2464   0.2397   0.2204   0.2552
25             0.2633       0.2192   0.2904   0.2459   0.2354   0.2450   0.2338   0.2134   0.2553
30             0.2613       0.2153   0.2871   0.2394   0.2264   0.2413   0.2324   0.2140   0.2517
35             0.2581       0.2137   0.2898   0.2366   0.2303   0.2426   0.2365   0.2088   0.2477
40             0.2566       0.2145   0.2881   0.2342   0.2211   0.2418   0.2289   0.1998   0.2447
45             0.2521       0.2132   0.2875   0.2304   0.2235   0.2408   0.2353   0.1982   0.2428
50             0.2509       0.2116   0.2870   0.2272   0.2176   0.2414   0.2303   0.1998   0.2413

Table 6 highlights the best-performing term selection algorithms over the defined set of test points (i.e. top documents = 10; top terms = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50). It can be clearly seen that LRF outperforms the rest of the term selection algorithms at test points T = 5, 10, 15 and 20, while DFC exhibits the best performance at the remaining test points. RSV does not perform up to the mark, as its performance decreases gradually as the number of terms increases. KLD follows in RSV's footsteps but stays slightly ahead of it, leaving RSV the worst-performing algorithm.

It has also been observed that the performance of chi (without square) and PRF declines overall as the number of top selected terms increases. The scores of CoDice and IG keep increasing until the term test point T = 15 and decrease over the remaining test points. Chi-square, on the other hand, follows a mixed trend: its performance first decreases slightly over a couple of test points, then rises abruptly, and finally declines gradually over the remaining term test points. The sketch below shows how such per-test-point winners can be read off the tables.
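As a minimal illustration (our own sketch, not code from the paper), the following Python snippet picks the winning algorithm at each term test point from MAP columns laid out like those in Tables 6–10. The sample values are copied from the DFC and KLD columns of Table 7; all function and variable names are ours.

TERM_TEST_POINTS = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

def best_algorithm_per_test_point(map_scores):
    """Return the algorithm with the highest MAP at each term test point."""
    return {
        t: max(map_scores, key=lambda alg: map_scores[alg][i])
        for i, t in enumerate(TERM_TEST_POINTS)
    }

# Two of the nine columns of Table 7 (top documents = 20) as sample input.
map_scores = {
    "DFC": [0.2859, 0.2892, 0.2878, 0.2890, 0.2892,
            0.2910, 0.2887, 0.2912, 0.2891, 0.2903],
    "KLD": [0.2925, 0.2754, 0.2600, 0.2556, 0.2522,
            0.2443, 0.2406, 0.2391, 0.2358, 0.2343],
}
print(best_algorithm_per_test_point(map_scores))
# {5: 'KLD', 10: 'DFC', 15: 'DFC', ...}: KLD wins only at T = 5.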

Table 7 illustrates the performance of the term selection algorithms for 20 top documents and top terms T = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50. The table shows that KLD outperforms the rest of the term selection algorithms only at test point T = 5; surprisingly, DFC exhibits the best performance at all the remaining test points. In addition, RSV again does not perform up to the mark even with the increase in top documents, as its performance keeps decreasing gradually as the number of terms increases. IG, LRF and chi (without square) follow a pattern in which they mark their highest MAP at the earliest term test points, with gradually decreasing scores thereafter. Probability-based chi-square shows a peculiar behavior: its performance first increases with the gradual increase of top selected terms, up to T = 30, after which it drops and remains almost constant. As far as CoDice and PRF are concerned, no clear pattern is observed in their performance; some test points cause a slight increase or decrease while others leave it nearly constant.

In Table 8, we depict the results of the term selection algorithms at the document test point 30 for all defined term test points (5, 10, 15, 20, 25, 30, 35, 40, 45, 50). Once again, KLD outperforms the rest of the term selection algorithms only at test point T = 5, and DFC exhibits the best performance at all the remaining test points. Chi (without square) is the worst performer, and its performance keeps decreasing gradually as the number of terms increases. LRF and IG start with very good scores at T = 5, but their performance also keeps getting worse as the number of top selected terms grows. PRF, on the other hand, follows an almost constant trend, as the difference between its best and worst scores is only 0.011. Chi-square, RSV and CoDice follow a mixed pattern: as the number of top terms increases, their results sometimes rise and then suddenly fall at the very next test point.

For top documents = 40 and 50, the performance of the nine term selection algorithms is shown in Tables 9 and 10, respectively. Both tables make it clear that DFC exhibits the best performance at all test points. Table 9 shows that the performance of KLD, RSV, CoDice, IG and PRF keeps decreasing gradually as the number of terms increases, and that chi (without square) is the worst performer, with the lowest score at T = 50. Algorithms such as LRF and chi-square, however, follow no clear pattern, as their scores vary from one test point to the next, suddenly increasing or decreasing.

Studying the performance of the term selection algorithms in Table 10, we observe that LRF performs the worst, with a gradual decrease in performance as the number of terms increases. KLD, RSV, CoDice and PRF also trend downward overall, scoring near their best at the smallest test points and lowest at the term test point 50. Conversely, chi-square and IG show unpredictable behavior: the scores of chi-square first increase, peaking at T = 20, and then decrease as the number of terms approaches 50, while IG behaves even more erratically, its score rising and falling across the term test points.

Figure 3 summarizes the performance of the nine term selection algorithms in terms of MAP against the number of documents. Trends of all the term selection algorithms (chi-square, KLD, RSV, CoDice, IG, LRF), along with the newly proposed technique (DFC) and the baseline, are shown only at the peak values taken from Tables 6–10. The graph shows that DFC and KLD outperform the rest, and in a direct comparison DFC is the clear winner. Although at the start there is a clear gap between the performance of DFC and LRF, DFC gradually improves as the number of documents increases and reaches the highest value of 0.30. LRF outperforms the rest of the algorithms between 10 and nearly 15 documents, DFC is highest for roughly the next five documents, and KLD shows slightly better performance than DFC for around the next ten documents, after which DFC emerges as the winner among all term selection algorithms.
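Figure 3 is straightforward to regenerate from the tables. The matplotlib sketch below is our own illustration, not the authors' plotting code: it plots only DFC's peak MAP per document test point, using peak values read off Tables 7–10; the remaining curves would be added the same way.

import matplotlib.pyplot as plt

# Peak MAP of DFC at each document test point, read off Tables 7-10.
doc_points = [20, 30, 40, 50]
dfc_peaks = [0.2912, 0.2951, 0.2983, 0.3015]

plt.plot(doc_points, dfc_peaks, marker="o", label="DFC")
plt.xlabel("Number of feedback documents")
plt.ylabel("Peak MAP")
plt.title("Peak results of term selection techniques (cf. Figure 3)")
plt.legend()
plt.show()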

Figure 3.

Peak results of nine term selection techniques.

B. Results with stemming

This section compares the performance of the nine term selection algorithms before and after stemming.

Tables 11–15 depict the difference in MAP of the nine term selection algorithms at the pre-defined test points (i.e. number of top documents = 10, 20, 30, 40, 50 and number of top terms = 5, 10, 15, …, 50). For every algorithm, this MAP difference is denoted by Δ and is calculated as:
Δ_algorithm = MAP(before stemming) − MAP(after stemming)    (20)
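As a concrete example of Equation (20) (our own sketch, assuming the MAP values of Tables 6–10 are the unstemmed runs), KLD at 20 feedback documents and T = 5 has an unstemmed MAP of 0.2925 in Table 7 and a Δ of 0.0367 in Table 12, which implies a post-stemming MAP of about 0.2558:

def delta(map_before: float, map_after: float) -> float:
    """Equation (20): MAP lost (or gained, if negative) after stemming."""
    return map_before - map_after

# KLD at 20 documents, T = 5: unstemmed MAP from Table 7, Δ from Table 12.
map_before = 0.2925
map_after = map_before - 0.0367          # implied post-stemming MAP
print(round(delta(map_before, map_after), 4))   # -> 0.0367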
Table 11.

Mean average precision difference of nine term selection algorithms at 10 documents

No. of terms   FR metrics
               ΔChi-square   ΔChi     ΔDFC     ΔKLD     ΔRSV     ΔCoDice  ΔIG      ΔLRF     ΔPRF
5              0.0805        0.1032   0.0897   0.0155   0.1091   0.0784   0.0460   0.0625   0.0526
10             0.0850        0.1206   0.1124   0.0044   0.1107   0.0830   0.0557   0.0639   0.0458
15             0.0906        0.1399   0.1222   0.0164   0.1278   0.0948   0.0507   0.0628   0.0781
20             0.0898        0.1427   0.1105   0.0174   0.1289   0.0930   0.0544   0.0596   0.0674
25             0.1069        0.1318   0.1242   0.0194   0.1340   0.0980   0.0511   0.0608   0.0781
30             0.0901        0.1312   0.1247   0.0105   0.1273   0.0947   0.0545   0.0582   0.0818
35             0.0807        0.1294   0.1206   0.0203   0.1319   0.1012   0.0582   0.0572   0.0755
40             0.0870        0.1246   0.1185   0.0111   0.1358   0.0958   0.0649   0.0622   0.0798
45             0.0883        0.1286   0.1257   0.0161   0.1394   0.1007   0.0660   0.0646   0.0810
50             0.0816        0.1301   0.1242   0.0200   0.1434   0.0995   0.0658   0.0613   0.0806
Table 12.

Mean average precision difference of nine term selection algorithms at 20 documents

No. of terms   FR metrics
               ΔChi-square   ΔChi     ΔDFC     ΔKLD     ΔRSV     ΔCoDice  ΔIG      ΔLRF     ΔPRF
5              0.0888        0.0832   0.0898   0.0367   0.1368   0.0679   0.0474   0.0510   0.0597
10             0.1042        0.1150   0.1050   0.0372   0.1529   0.0818   0.0607   0.0611   0.0820
15             0.1224        0.1356   0.1267   0.0372   0.1528   0.0871   0.0504   0.0713   0.0924
20             0.1267        0.1290   0.1369   0.0231   0.1438   0.0839   0.0555   0.0698   0.0864
25             0.1214        0.1321   0.1491   0.0236   0.1513   0.0813   0.0570   0.0713   0.0958
30             0.1259        0.1345   0.1545   0.0304   0.1429   0.0881   0.0644   0.0604   0.1046
35             0.1260        0.1353   0.1559   0.0268   0.1416   0.0891   0.0655   0.0591   0.1095
40             0.1217        0.1335   0.1621   0.0278   0.1400   0.0937   0.0656   0.0555   0.1077
45             0.1244        0.1315   0.1556   0.0309   0.1300   0.0930   0.0739   0.0551   0.1143
50             0.1218        0.1268   0.1609   0.0357   0.1220   0.0946   0.0775   0.0577   0.1165
Table 13.

Mean average precision difference of nine term selection algorithms at 30 documents

No. of terms   FR metrics
               ΔChi-square   ΔChi     ΔDFC     ΔKLD     ΔRSV     ΔCoDice  ΔIG      ΔLRF     ΔPRF
5              0.0836        0.1183   0.0955   0.0405   0.1320   0.0623   0.0531   0.0469   0.0616
10             0.1206        0.1120   0.1070   0.0503   0.1678   0.0788   0.0542   0.0550   0.0706
15             0.1263        0.1302   0.1226   0.0467   0.1711   0.0756   0.0510   0.0601   0.1002
20             0.1267        0.1460   0.1378   0.0502   0.1711   0.0818   0.0552   0.0672   0.1206
25             0.1227        0.1433   0.1460   0.0518   0.1666   0.0915   0.0513   0.0645   0.1300
30             0.1163        0.1320   0.1464   0.0326   0.1670   0.0995   0.0640   0.0628   0.1333
35             0.1239        0.1323   0.1449   0.0318   0.1630   0.1016   0.0629   0.0572   0.1327
40             0.1224        0.1359   0.1488   0.0315   0.1653   0.0965   0.0679   0.0582   0.1154
45             0.1279        0.1357   0.1527   0.0422   0.1647   0.0973   0.0770   0.0538   0.1182
50             0.1344        0.1382   0.1606   0.0390   0.1686   0.0998   0.0828   0.0596   0.1149
Table 14.

Mean average precision difference of nine term selection algorithms at 40 documents

No. of terms   FR metrics
               ΔChi-square   ΔChi     ΔDFC     ΔKLD     ΔRSV     ΔCoDice  ΔIG      ΔLRF     ΔPRF
5              0.0777        0.1074   0.1011   0.0496   0.1468   0.0696   0.0388   0.0436   0.0451
10             0.1042        0.1277   0.1329   0.0644   0.1648   0.0740   0.0412   0.0412   0.0761
15             0.1280        0.1167   0.1221   0.0444   0.1630   0.0738   0.0577   0.0452   0.1005
20             0.1146        0.1229   0.1321   0.0404   0.1616   0.0862   0.0578   0.0395   0.1260
25             0.1177        0.1261   0.1402   0.0354   0.1663   0.0883   0.0582   0.0384   0.1340
30             0.1241        0.1369   0.1504   0.0378   0.1586   0.0978   0.0551   0.0440   0.1362
35             0.1304        0.1396   0.1609   0.0390   0.1652   0.0958   0.0642   0.0374   0.1335
40             0.1201        0.1383   0.1610   0.0448   0.1718   0.0962   0.0658   0.0410   0.1376
45             0.1241        0.1385   0.1562   0.0479   0.1655   0.0983   0.0714   0.0474   0.1417
50             0.1238        0.1356   0.1613   0.0396   0.1648   0.0928   0.0630   0.0458   0.1430
Table 15.

Mean average precision difference of nine term selection algorithms at 50 documents

No. of terms   FR metrics
               ΔChi-square   ΔChi     ΔDFC     ΔKLD     ΔRSV     ΔCoDice  ΔIG      ΔLRF     ΔPRF
5              0.0830        0.0956   0.0901   0.0443   0.1319   0.0462   0.0069   0.0120   0.0436
10             0.0952        0.1179   0.1178   0.0431   0.1563   0.0648   0.0281   0.0409   0.0707
15             0.1227        0.1249   0.1231   0.0444   0.1558   0.0567   0.0340   0.0383   0.0935
20             0.1236        0.1302   0.1295   0.0383   0.1519   0.0571   0.0389   0.0246   0.1138
25             0.1197        0.1274   0.1306   0.0475   0.1573   0.0666   0.0369   0.0252   0.1254
30             0.1150        0.1297   0.1446   0.0533   0.1614   0.0772   0.0383   0.0183   0.1271
35             0.1195        0.1288   0.1503   0.0330   0.1593   0.0717   0.0363   0.0205   0.1294
40             0.1197        0.1326   0.1511   0.0359   0.1665   0.0736   0.0390   0.0275   0.1340
45             0.1226        0.1239   0.1566   0.0309   0.1675   0.0752   0.0491   0.0221   0.1385
50             0.1187        0.1217   0.1576   0.0316   0.1655   0.0762   0.0491   0.0271   0.1381

From the above equation, we can deduce that a very large value of Δ implies that the algorithm is negatively affected by stemming, i.e. its performance drops considerably after stemming is applied. Conversely, the smallest values of Δ indicate that stemming has little effect, i.e. the algorithm's performance before and after stemming is almost the same.
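One way to make this reading quantitative (a sketch of ours, not an analysis from the paper) is to average each algorithm's Δ over all term test points. Using the KLD and RSV columns of Table 11, the gap between the least and most stemming-sensitive algorithms is roughly a factor of eight:

# Δ columns of Table 11 (10 feedback documents) for two algorithms.
deltas = {
    "KLD": [0.0155, 0.0044, 0.0164, 0.0174, 0.0194,
            0.0105, 0.0203, 0.0111, 0.0161, 0.0200],
    "RSV": [0.1091, 0.1107, 0.1278, 0.1289, 0.1340,
            0.1273, 0.1319, 0.1358, 0.1394, 0.1434],
}

# Rank the algorithms from least to most affected by stemming.
for alg in sorted(deltas, key=lambda a: sum(deltas[a]) / len(deltas[a])):
    mean_delta = sum(deltas[alg]) / len(deltas[alg])
    print(f"{alg}: mean Δ = {mean_delta:.4f}")
# KLD: mean Δ = 0.0151  (least affected)
# RSV: mean Δ = 0.1288  (most affected)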

Table 11 illustrates the performance difference of the term selection algorithms for 10 top documents and terms T = 5, 10, 15, 20, …, 50. We can clearly observe that the overall performance of KLD is the least affected by stemming, whereas chi and RSV are badly affected and perform very poorly on the stemmed dataset.

Table 12 highlights the difference in the performance of the term selection algorithms for documents = 20 and terms = 5, 10, 15, …, 50. The largest values of Δ are obtained by RSV and DFC, which shows the strong effect of stemming on these two algorithms. The opposite result is obtained by KLD once again, as it shows resistance to stemming and behaves the same after stemming as before.

In Table 13, we depict the Δ values of the nine term selection algorithms for 30 top documents and the pre-defined term test points (T = 5, 10, …, 50). KLD once again refuses to change its results after stemming: its precision before and after stemming is almost identical, and it shows the smallest Δ values at all term test points. However, at test point T = 25, IG has a very small Δ that is almost comparable to the value obtained by KLD at the same test point. The term selection algorithm most affected by stemming is RSV, which has the largest difference in precision at every term test point.

Tables 14 and 15 show the results of the algorithms before and after applying stemming for 40 and 50 documents, respectively, with the top expansion terms varied from 5 to 50 in steps of 5. A thorough inspection of both tables shows that the performance of RSV is once again the most affected by stemming: the precision of RSV after stemming is much lower than its precision without stemming. At the document test point D = 40, the results obtained by IG, LRF and KLD after stemming are almost the same as before stemming, while in Table 15, IG and LRF are the algorithms least affected by stemming.

In conclusion, stemming in the biomedical domain decreases the overall performance of the term selection algorithms. RSV is the most vulnerable to the effect of stemming, as its performance decreases the most on the stemmed dataset, whereas KLD shows the most resistance, with its precision before and after stemming staying almost the same.

Conclusion

We have proposed a new term selection algorithm, named 'DFC', for QE. DFC has been compared with eight other state-of-the-art term selection algorithms, and experiments show that DFC outperforms all eight at 88% of the pre-defined test points. DFC also addresses the document misclassification problem that occurs when setting the threshold between the relevant and non-relevant classes in pseudo relevance feedback. From Table 1 it can be concluded that chi-square caused query drift for 25 of the queries, whereas DFC improved the precision of 20 queries. Summarizing the performance of all nine term selection algorithms, at the defined set of document thresholds (10, 20, 30, 40, 50), the comparative performance of DFC is (60, 90, 90, 100, 100%), which averages to 88%. We also noticed that as the number of feedback documents increases, the performance of DFC increases, while the other term selection algorithms show an unexpected decrease. Finally, for PRF-based information retrieval in the biomedical domain, stemming tends to decrease the precision of all nine term selection algorithms.

Conflict of interest. None declared.

References

1. Jerome, R.N., Giuse, N.B., Gish, K.W. et al. (2001) Information needs of clinical teams: analysis of questions received by the Clinical Informatics Consult Service. Bull. Med. Libr. Assoc., 89, 177.
2. Rivas, A.R., Eva, L.I., Borrajo, L. (2014) Study of query expansion techniques and their application in the biomedical information retrieval. The Scientific World J., 2014.
3. Singh, J., Sharan, A. (2015) Relevance feedback based query expansion model using Borda count and semantic similarity approach. Comput. Intel. Neuro., 2015, 96.
4. Lesk, M.E. (1969) Word-word associations in document retrieval systems. Am. Doc., 20, 27–38.
5. van Rijsbergen, C.J. (1977) A theoretical basis for the use of co-occurrence data in information retrieval. J. Doc., 33, 106–119.
6. Mitra, B., Craswell, N. (2017) Neural text embeddings for information retrieval. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM.
7. Chaplot, D.S., Salakhutdinov, R. (2018) Knowledge-based word sense disambiguation using topic models. arXiv preprint arXiv:1801.01900.
8. Wang, Y., Huang, H., Feng, C. (2017) Query expansion based on a feedback concept model for microblog retrieval. In: Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee.
9. Jothilakshmi, R., Shanthi, N. (2017) Combining multiple term selection methods for automatic query expansion in pseudo relevance feedback using rank score method. Asian J. Res. Soc. Sci. Human., 7, 910–922.
10. Miao, J., Huang, J.X., Ye, Z. (2012) Proximity-based Rocchio's model for pseudo relevance. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
11. Carpineto, C., Romano, G. (1999) Towards more effective techniques for automatic query expansion. In: International Conference on Theory and Practice of Digital Libraries. Springer, Berlin, Heidelberg.
12. Tsai, M.-F. et al. (2007) FRank: a ranking method with fidelity loss. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
13. Bouadjenek, M.R., Verspoor, K. (2017) Multi-field query expansion is effective for biomedical dataset retrieval. Database, 2017.
14. Wei, C.-P., Hu, P.J.-H., Tai, C.-H. et al. (2007) Managing word mismatch problems in information retrieval: a topic-based query expansion approach. J. Manag. Inf. Syst., 24, 269–295.
15. Miller, G.A. (1995) WordNet: a lexical database for English. Communications of the ACM, 38, 39–41.
16. Zukerman, I., Raskutti, B. (2002) Lexical query paraphrasing for document retrieval. In: Proceedings of the 19th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics.
17. Kakde, Y. (2012) A survey of query expansion until June 2012. Indian Institute of Technology, Bombay.
18. Chu, W.W., Liu, Z., Mao, W. et al. (2005) A knowledge-based approach for retrieving scenario-specific medical text documents. Cont. Eng. Prac., 13, 1105–1121.
19. Stokes, N., Li, Y., Cavedon, L., Zobel, J. (2009) Exploring criteria for successful query expansion in the genomic domain. Inf. Retr. Boston, 12, 17–50.
20. Singh, J., Sharan, A. (2016) Relevance feedback-based query expansion model using ranks combining and Word2Vec approach. IETE J. Res., 62, 591–604.
21. Houle, M.E., Ma, X., Oria, V., Sun, J. (2017) Query expansion for content-based similarity search using local and global features. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 13, 25.
22. Abdulla, A.A.A., Lin, H., Xu, B., Banbhrani, S.K. (2016) Improving biomedical information retrieval by linear combinations of different query expansion techniques. BMC Bioinformatics, 17, 238.
23. Xu, J., Bruce Croft, W. (1996) Query expansion using local and global document analysis. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
24. Pérez-Agüera, J.R., Lourdes, A. (2008) Comparing and combining methods for automatic query expansion. arXiv preprint arXiv:0804.2057.
25. Robertson, S.E. (1990) On term selection for query expansion. J. Doc., 46, 359–364.
26. Pal, D., Mitra, M., Datta, K. (2014) Improving query expansion using WordNet. J. Assoc. Inf. Sci. Technol., 65, 2469–2478.
27. PubMed Help (2005) https://www.ncbi.nlm.nih.gov/books/NBK3830/ (28 March 2018, date last accessed).
29. The Lemur Toolkit - Indri Query Language Quick Reference. http://lemurproject.org/lemur/IndriQueryLanguage.php (25 April 2018, date last accessed).
30. Xu, J., Croft, W.B. (1996) Query expansion using local and global document analysis. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
31. Robertson, S., Zaragoza, H. (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3, 333–389.
32. Cover, T.M., Thomas, J.A. (1991) Entropy, relative entropy and mutual information. Elem. Info. Theo., 2, 1–55.
33. Yang, J., Peng, W., Ward, M.O. (2003) Interactive hierarchical dimension ordering, spacing and filtering for exploration of high dimensional datasets. In: IEEE Symposium on Information Visualization (INFOVIS 2003). IEEE.
34. Lee, C., Lee, G.G. (2006) Information gain and divergence-based feature selection for machine learning-based text categorization. Info. Process. Manag., 42, 155–165.
35. Salton, G., Buckley, C. (1990) Improving retrieval performance by relevance feedback. J. Am. Soc. Info. Sci., 41, 288–297.
36. Forman, G. (2003) An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res., 3, 1289–1305.
37. Lavrenko, V., Croft, W.B. (2017) Relevance-based language models. ACM SIGIR Forum, 51. ACM.
38. Text REtrieval Conference (TREC) (2006) Genomics Track. https://dmice.ohsu.edu/trec-gen/2006data.html (27 February 2018, date last accessed).
39. Apache Solr Lucene. http://lucene.apache.org/solr/ (4 April 2018, date last accessed).

Author notes

Citation details: Nabeel Asim, M., Wasim, M., Usman Ghani Khan, M. et al. Improved biomedical term selection in pseudo relevance feedback. Database (2018) Vol. 2018: article ID bay056; doi:10.1093/database/bay056

Present address: Muhammad Nabeel Asim, Al-Khawarzmi Institute of Computer Science, University of Engineering & Technology, GT Road, Lahore, Punjab, Pakistan.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.