Improved biomedical term selection in pseudo relevance feedback

Abstract Biomedical information retrieval systems are becoming popular and complex due to the massive amount of ever-growing biomedical literature. Users are often unable to construct a precise and accurate query that clearly represents the intended information. Therefore, the query is expanded with terms or features that retrieve more relevant information. Selection of appropriate expansion terms plays a key role in improving the performance of the retrieval task. We propose document frequency chi-square, a newer version of chi-square, for term selection in pseudo relevance feedback. The effects of pre-processing on the performance of information retrieval, specifically in the biomedical domain, are also depicted. On average, the proposed algorithm outperformed state-of-the-art term selection algorithms by 88% at pre-defined test points. Our experiments also show that stemming causes a decrease in the overall performance of a pseudo relevance feedback based information retrieval system, particularly in the biomedical domain. Database URL: http://biodb.sdau.edu.cn/gan/


Introduction
Retrieving documents that match the user query is one of the foremost challenges in almost all information retrieval systems. The continuous increase in literature causes a keyword mismatch problem between the user query and the retrieved documents (1). Retrieving documents by measuring similarity between the user query and indexed documents is even more difficult in the biomedical domain because genes, drugs and diseases may have numerous synonyms. For example, suppose a user inputs a query containing keywords like 'Medical Practitioner' and the corpus has relevant documents, but all of those documents contain words such as doctor, physician etc. All of these terms convey the same information, but because they are named differently a mismatch problem will occur, and documents that are more relevant to the query than others will not be retrieved. In order to tackle this problem, local and global query expansion (QE) is used. In global QE, knowledge sources and dictionaries (e.g. WordNet, PubMed) are used to generate candidate expansion terms (2).

© The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

In local QE, statistical information is used to find candidate expansion terms from the corpus. In this approach, documents are retrieved based on the user query and the top k retrieved documents are considered relevant. To select candidate expansion terms from the top retrieved documents, different term selection techniques such as chi-square, information gain (IG), Kullback-Leibler divergence (KLD) and Dice are used. It has been observed that the volume of data available online has increased dramatically while the number of query terms remains very small (3).
According to Lesk et al. (4), the average query length used to be 2.30 words, and it remained the same even after 10 years (5). At present, there is a rising trend of lengthier queries (five or more words), but the most common queries still contain only a couple of words (6). Therefore, the scope of QE has increased over time. QE can also decrease the performance of information retrieval. In global QE, candidate expansion terms extracted from dictionaries may decrease performance due to the word ambiguity problem. For a query like 'Which bank provides more profit?', expansion would look up synonyms of the query terms in dictionaries. In this query, the word 'bank' can have two different senses: it can refer either to a financial institution or to a river bank. Therefore, in global QE, word sense disambiguation of query words is mandatory. The Lesk algorithm is used for word sense disambiguation (7).
In local QE, not all of the documents retrieved against a particular user query are relevant to that query (8). This may lead to an imperfect and faulty term pool (the pool of all terms present in the top retrieved documents) that may contain many redundant and irrelevant terms. Expanding the query with such terms may even drift the query towards retrieving irrelevant items (3). Hence, the idea behind the selection of candidate expansion terms is to first remove these redundant or irrelevant terms from the term pool. Term selection for QE allows only the terms most relevant to a particular user query to be selected. Therefore, term selection for QE is currently one of the most active research topics in information retrieval (9).
There are two major types of term selection methods for QE: (i) based on corpus statistics and (ii) based on term association. The choice between these methods depends on the document retrieval model, e.g. Okapi BM25, TFIDF and language models (3). Selection methods based on term association evaluate the goodness of terms based on their co-occurrence in the feedback documents, whereas selection methods based on corpus statistics estimate the goodness of terms based on their distribution in the corpus. In the biomedical domain, it is still a huge challenge for researchers to develop a term selection method for QE that outperforms the available methods by a wide margin (10).
The most widely used term selection method, 'Chi-Square', suffers from a document misclassification problem: its ability to select the most effective and worthy terms for QE is affected by the threshold that separates the relevant and non-relevant classes in pseudo relevance feedback. To tackle this problem, we propose a new technique, document frequency chi-square (DFC), and compare it with eight term selection algorithms, including two different versions of chi-square proposed by Carpineto (11). Moreover, the effects of pre-processing on the performance of pseudo relevance feedback in the biomedical domain are also discussed. We used mean average precision (MAP) to evaluate the presented algorithm on the TREC 2006 Genomic (12) dataset.

Related work
Efficient information retrieval systems are required to obtain relevant information for a particular user query from the rapidly growing biomedical literature (13). A major concern in information retrieval systems is the word mismatch problem, in which the same concept is described using semantically similar but syntactically different forms of terms in the query and in the documents (14). For example, a user query may contain a phrase like 'cure of depression', while the corpus documents contain a different yet semantically similar phrase like 'depression treatment'. Both refer to the same concept with different words. This problem can be solved using two approaches: query paraphrasing and QE.
In the query paraphrasing approach, query words are replaced by their synonyms in order to generate query paraphrases. In the above example, 'cure' can be replaced by its synonym 'treatment' to generate the paraphrase 'treatment of depression'. The generated paraphrases are then used to retrieve documents from the corpus. Zukerman et al. used WordNet (15) and part-of-speech information to find synonyms for paraphrase generation. Their experiment revealed a reasonable improvement in retrieving relevant documents despite issues in part-of-speech (POS) tagging (16).
QE techniques can further be categorized as global and local techniques. In global QE, dictionaries and knowledge resources are used to find expansion terms (17). Chu et al. performed global QE by selecting candidate expansion terms using the knowledge resources of UMLS Meta-Thesaurus and Semantic Networks. They showed a 33% improvement in performance on 40 queries from the OHSUMED dataset by expanding these queries using domain specific knowledge resources and document retrieval models (18). On the other hand, Stokes et al. (19) used various biomedical knowledge resources like GO, EntrezGene, ADAM etc. to improve the overall performance of an information retrieval system. They also claimed that the performance of an information retrieval system (19) can be increased by focusing on two factors: the choice of a good document ranking algorithm, and the use of domain specific knowledge resources.
One concern with global QE is that, due to continual progress in new discoveries and ongoing research, the available knowledge resources are in constant need of updating. However, it is difficult to update these resources rapidly. Therefore, researchers in the information retrieval community are focusing on improving systems using local QE. In this approach, user queries are provided to retrieval models (Okapi BM25, TFIDF), which rank the corpus documents by measuring the similarity between queries and documents. The top K documents are labeled as relevant to the user's information need. These retrieved documents are used to generate a term pool which contains all terms present in the relevant documents. Different techniques like chi-square, IG, KLD, CoDice etc. are used to select terms from the generated term pool. Jagendra et al. improved the performance of the local QE method by introducing an aggregation technique for term selection. They combined four term selection techniques [KLD, co-occurrence, Robertson selection value (RSV) and IG] using the proposed aggregation method. In order to apply the Borda combination technique, all of the individual term selection methods are applied and a ranked list of candidate terms is obtained from each method. These ranked lists are then used to select the final QE terms: the terms with the highest aggregation scores are chosen as the final expansion terms. Jagendra et al. illustrated that some of the expansion terms caused query drift (20). In order to tackle this problem, they performed semantic filtering by applying a word2vec approach and showed a 2% improvement in results.
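The rank aggregation step described above can be sketched as a plain Borda count. This is only an illustration: the function name `borda_aggregate`, the toy ranked lists and the scoring convention (a term at position p in a list of length n earns n - p points) are assumptions, not details taken from Jagendra et al. (20).

```python
from collections import defaultdict

def borda_aggregate(ranked_lists, top_n):
    """Combine several ranked term lists with a simple Borda count (illustrative)."""
    scores = defaultdict(int)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, term in enumerate(ranking):
            scores[term] += n - position  # best-ranked term earns n points
    # Sort by descending score; break ties alphabetically for determinism.
    ordered = sorted(scores, key=lambda t: (-scores[t], t))
    return ordered[:top_n]

# Hypothetical rankings produced by four term selection methods.
kld = ["kinase", "braf", "mutation"]
cooc = ["braf", "kinase", "melanoma"]
rsv = ["braf", "mutation", "kinase"]
ig = ["kinase", "braf", "melanoma"]
final_terms = borda_aggregate([kld, cooc, rsv, ig], top_n=2)
```

With these toy lists, braf collects 10 points and kinase 9, so they become the two expansion terms.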
Some researchers are also looking for ways to combine local and global QE techniques (21,22). In this regard, Pal et al. proposed a methodology which combined the terms generated from WordNet with two local QE (23) term selection techniques [i.e. KLD (24) and RSV (25)]. They showed that the precision of the retrieval model could be improved by extending the query with candidate terms generated from local and global QE (26). Abdulla et al. also combined terms from both global and local QE. For global QE, they used knowledge resources like PubMed (27) and MetaMap (28), whereas for local QE, Lavrenko relevance feedback (LRF) (29) and MFT (30) techniques were used. A linear combination approach was introduced to combine the scores generated by the individual techniques, and this combined score was used to select the final QE terms. They selected one method from global QE and one from local QE, experimented with various combination pairs, and found that the best performance was obtained using the linear combination approach on PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) and LRF (22).
In our experimentation, we have exploited pseudo relevance feedback, in which documents are ranked against a particular user query. The top k ranked documents are selected as relevant for the selection of candidate expansion terms. As there are no explicitly defined criteria for selecting the threshold (top k), there is a strong chance that an arbitrarily selected threshold may cause a document misclassification problem, as some non-relevant documents may get wrongly classified as relevant and vice versa. The traditionally used chi-square does not tackle this problem while selecting expansion terms. We propose a modified version of 'Chi-Square' which is able to alleviate the document misclassification problem caused by the selection of an arbitrary threshold. We have evaluated our proposed term selection algorithm against eight state-of-the-art term selection algorithms and have shown the overall comparison. We have also tested the effect of stemming on information retrieval, particularly in the biomedical domain.

Methodology
This section presents the methodology of pseudo relevance feedback, with emphasis on the pre-processing of the dataset. The dataset obtained from the TREC website is in HTML format and contains irrelevant information such as email addresses, article digital signatures, journal publishing dates and years etc. In order to remove this irrelevant content from the dataset, the Apache Tika parser (https://tika.apache.org/0.7/parser.html) is used. Furthermore, all stop words such as is, am, are, about, etc. are removed from the dataset and the user query using the default stop word list of Solr, named 'stop.txt', which contains 33 English stop words. After this, we converted all terms into their base form using the Porter Stemmer. The steps involved in pre-processing of HTML documents are shown in Figure 1.
To measure the effect of stemming on the performance of retrieval task, we have indexed the dataset with and without stemming.
The performance of pseudo relevance feedback depends upon two significant factors: the number of top relevant documents retrieved by the document retrieval model, and the term selection algorithm (20). Well-known document retrieval models are Okapi BM25, language models [unigram, bigrams, n-grams (23)], TF-IDF etc. In our experimentation, we have used Okapi BM25 as our document retrieval model.
Before the user query is fed to the document retrieval model, all stop words are removed from it. Since we have two different versions of the dataset, i.e. stemmed and non-stemmed, the user query is stemmed only for the stemmed dataset. The user query is then provided to the document retrieval model, which retrieves a ranked list of documents. The top k ranked documents are chosen for pseudo relevance feedback, and only the unique terms of these documents are used to create the term pool. Various term selection techniques (mentioned in Section 5) are used to rank the terms for QE. Only the top n terms are used to expand a particular user query, which is then sent back to the retrieval model for final document retrieval. Using this expanded query, the final ranked documents are retrieved. Figure 2 illustrates all the phases of the PRF technique sequentially.
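The feedback loop described above can be sketched in a few lines. The names `expand_query` and `df_difference`, the toy documents and the placeholder scoring function are hypothetical; excluding the original query terms from the term pool is also an assumption made here for illustration. Any of the term selection metrics discussed later could be plugged in as `score_term`.

```python
def expand_query(query_terms, ranked_docs, k, n, score_term):
    """One round of pseudo relevance feedback (illustrative sketch).

    ranked_docs: list of token lists, already ordered by the retrieval model.
    score_term:  callable(term, relevant_docs, other_docs) -> float.
    """
    relevant, rest = ranked_docs[:k], ranked_docs[k:]
    # Term pool: unique terms of the top-k documents (query terms excluded by assumption).
    pool = {t for doc in relevant for t in doc} - set(query_terms)
    # Rank by descending score; ties are broken alphabetically.
    ranked = sorted(pool, key=lambda t: (-score_term(t, relevant, rest), t))
    return list(query_terms) + ranked[:n]

def df_difference(term, relevant, rest):
    # Placeholder scorer: document frequency in the top-k set minus that elsewhere.
    return sum(term in d for d in relevant) - sum(term in d for d in rest)

docs = [
    ["braf", "mutation", "melanoma"],
    ["braf", "kinase"],
    ["cell", "kinase"],
    ["cell", "growth"],
]
expanded = expand_query(["braf"], docs, k=2, n=2, score_term=df_difference)
```

The expanded query would then be sent back to the retrieval model for the final ranking.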

Okapi BM25 weighting algorithm
Okapi BM25 is a probabilistic model that not only assigns weights to documents but also ranks them according to their relevance to a particular query. It has been widely used in the biomedical domain for information retrieval. The mathematical expression of document ranking is given as (31):
where
• $k_1$ and $k_3$ are the parameters that weight the effect of term frequency in the document and in the query, whereas $b$ is a tuning constant that controls normalization.
• $freq_{i,d}$ is the frequency of occurrence of term i in document d.
• $freq_{i,q}$ is the frequency of occurrence of term i in query q.
• $dl$ and $avdl$ are the document length and the average document length in the corpus, respectively.
whereas SJ is the Robertson Sparck Jones weight, calculated using the formula below:
$$SJ = \log \frac{(r_t + 0.5)/(|R| - r_t + 0.5)}{(n - r_t + 0.5)/(N - n - |R| + 0.5)} \qquad (2)$$
where $|R|$ is the number of relevant documents of a specific topic, $r_t$ is the number of relevant documents that contain term i, $N$ is the total number of documents in the corpus and $n$ denotes the number of documents containing that term.
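A minimal sketch of the scoring follows. It assumes the standard Okapi BM25 form built from the symbols defined above (the SJ weight multiplied by a length-normalised document term-frequency factor and a query term-frequency factor). The function names, the toy data and the default values k1 = 1.2, k3 = 8 and b = 0.75 are illustrative assumptions, not settings reported in this paper. `rsj_weight` follows Equation (2) as printed; with no relevance information (|R| = r_t = 0) it reduces to an IDF-like weight.

```python
import math

def rsj_weight(N, n, R=0, rt=0):
    """Robertson Sparck Jones weight, following Equation (2) as printed."""
    return math.log(((rt + 0.5) / (R - rt + 0.5)) / ((n - rt + 0.5) / (N - n - R + 0.5)))

def bm25_score(query, doc, doc_freq, N, avdl, k1=1.2, k3=8.0, b=0.75):
    """Standard Okapi BM25 score of one document (assumed form, see lead-in)."""
    dl = len(doc)
    score = 0.0
    for term in set(query):
        freq_d = doc.count(term)
        if freq_d == 0:
            continue
        freq_q = query.count(term)
        K = k1 * ((1 - b) + b * dl / avdl)             # document length normalisation
        tf_doc = (k1 + 1) * freq_d / (K + freq_d)      # document term-frequency factor
        tf_query = (k3 + 1) * freq_q / (k3 + freq_q)   # query term-frequency factor
        score += rsj_weight(N, doc_freq[term]) * tf_doc * tf_query
    return score

# Hypothetical corpus statistics: 1000 documents, 'braf' occurs in 10 of them.
doc_freq = {"braf": 10, "gene": 500}
s_match = bm25_score(["braf"], ["braf", "mutation", "braf"], doc_freq, N=1000, avdl=3)
s_miss = bm25_score(["braf"], ["cell", "growth", "gene"], doc_freq, N=1000, avdl=3)
```

A document that contains the query term receives a positive score, while a document without it scores zero.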

Term selection metrics
The corpus may contain redundant and irrelevant terms that can cause query drift. To avoid this, all terms are ranked on the basis of the statistical information used in various term ranking methods. In this section we discuss eight such term ranking methods in the context of QE.

A. Kullback-Leibler divergence (KLD)
KLD (24) is a widely used technique in information theory (32), statistical language modeling based speech processing and natural language applications (25). It assigns scores to terms based on their probability in the relevant documents and in the corpus.
where $P_R(term)$ is the probability of the term's presence in the top retrieved relevant documents R. It can be calculated as:
And $P_C(term)$ is the probability of the term's presence in the corpus, calculated as:
Equation (3) is used to assign scores to the terms present in the term pool. The scores assigned by this technique fall in the range 0-1. A term with a score of 0 is considered irrelevant. Similarly, a score of 1 shows that the term is an excellent candidate for QE.
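As a sketch of how such a score can be computed, the snippet below assumes the commonly used form KLD(term) = P_R(term) · log(P_R(term) / P_C(term)), with both probabilities estimated as relative term frequencies. The function name `kld_scores`, the estimation choice and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter

def kld_scores(relevant_docs, corpus_docs):
    """Score each term of the term pool by P_R(t) * log(P_R(t) / P_C(t))."""
    tf_r = Counter(t for d in relevant_docs for t in d)
    tf_c = Counter(t for d in corpus_docs for t in d)
    total_r, total_c = sum(tf_r.values()), sum(tf_c.values())
    scores = {}
    for term, freq in tf_r.items():
        p_r = freq / total_r
        p_c = tf_c[term] / total_c  # non-zero because relevant_docs is part of the corpus
        scores[term] = p_r * math.log(p_r / p_c)
    return scores

corpus = [["braf", "mutation"], ["braf", "kinase"], ["cell", "growth"], ["cell", "cycle"]]
scores = kld_scores(corpus[:2], corpus)  # treat the first two documents as the top-k set
```

Here braf is twice as probable in the feedback documents as in the corpus and also more frequent than mutation, so it receives the higher score.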

B. Co-occurrence based query expansion
Co-occurrence is a term association based method used to assign scores to the terms present in the term pool. This method assigns scores by measuring the relationship of candidate terms with query words (32). Rijsbergen (33) has described it as an algorithm that finds the relationship between corpus and query terms. In order to find the co-occurrence association between two terms, coefficients like CoJaccard, CoDice and Cosine are used. It can be calculated as:
where $df_i$ and $df_j$ are the numbers of documents in which term i and term j occur, respectively. Similarly, $df_{ij}$ is the number of documents in which both terms i and j occur together.
Expanding the query with highly similar terms may also cause the query drift problem. In order to avoid query drift, the concept of inverse document frequency (IDF) is used: codegree is calculated, which also takes IDF into account. Let $q_i$ be the query term and $ct$ be the candidate term; then codegree and IDF can be calculated using the following expressions:
And
$$IDF(ct) = \log_{10} \frac{N}{N_c} \qquad (8)$$
where $N_c$ is the number of documents in the corpus that contain the candidate term ct, $N$ is the total number of documents in the corpus and $D$ is the number of top retrieved documents. To obtain the value for a candidate term against all query terms, the following formula can be used:
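The association coefficients named above can be illustrated with their textbook document-frequency definitions (Jaccard, Dice and cosine), together with the IDF of Equation (8). The function names and the toy documents are hypothetical, and the codegree combination itself is not sketched here.

```python
import math

def cooccurrence_coefficients(docs, term_i, term_j):
    """Document-frequency based Jaccard, Dice and cosine association of two terms."""
    df_i = sum(term_i in d for d in docs)
    df_j = sum(term_j in d for d in docs)
    df_ij = sum(term_i in d and term_j in d for d in docs)
    if df_i == 0 or df_j == 0:
        return {"jaccard": 0.0, "dice": 0.0, "cosine": 0.0}
    return {
        "jaccard": df_ij / (df_i + df_j - df_ij),
        "dice": 2 * df_ij / (df_i + df_j),
        "cosine": df_ij / math.sqrt(df_i * df_j),
    }

def idf(docs, term):
    """IDF(ct) = log10(N / N_c); assumes the term occurs in at least one document."""
    n_c = sum(term in d for d in docs)
    return math.log10(len(docs) / n_c)

docs = [
    {"braf", "kinase"},
    {"braf", "mutation"},
    {"kinase", "cell"},
    {"braf", "kinase", "cell"},
]
coeffs = cooccurrence_coefficients(docs, "braf", "kinase")
cell_idf = idf(docs, "cell")
```

In this toy collection braf and kinase each occur in three documents and co-occur in two, which gives a Jaccard coefficient of 0.5.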

C. Information gain (IG)
IG is an algorithm that utilizes knowledge about the presence or absence of a particular term in documents to find the degree of class prediction (34). Let $C = \{C_1, C_2\}$ be the set of classes, where $C_1$ contains the top retrieved relevant documents and $C_2$ contains the non-relevant documents. The value of IG for term t can be calculated as:
where $P(t)$ is the probability of term t's occurrence and $\bar{t}$ denotes non-occurrence, i.e. $P(\bar{t}) = 1 - P(t)$. $P(c_j|t)$ is the conditional probability that the j-th class occurs given term t. Similarly, $P(c_j|\bar{t})$ stands for the conditional probability of the j-th class given that term t is absent, whereas $P(c_j)$ is the probability of the j-th class itself. This value measures the importance of a term with respect to the two classes and gives a score to each term present in the term pool. Ultimately, high scoring terms can then be used for QE.
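The quantities defined above can be combined as in the textbook information gain formula, i.e. the class entropy minus the expected class entropy after observing the presence or absence of the term. The sketch below assumes that form and estimates all probabilities from document counts; the function name and argument names are hypothetical.

```python
import math

def information_gain(tdf_r, n_r, tdf_nr, n_nr):
    """Information gain of a term from document counts (textbook form, assumed).

    tdf_r / tdf_nr: documents containing the term in the relevant / non-relevant class.
    n_r / n_nr:     number of documents in each class.
    """
    total = n_r + n_nr

    def plogp(p):
        return p * math.log2(p) if p > 0 else 0.0

    p_t = (tdf_r + tdf_nr) / total
    ig = -sum(plogp(n / total) for n in (n_r, n_nr))  # entropy of the class prior
    for present, p_side in ((True, p_t), (False, 1 - p_t)):
        if p_side == 0:
            continue
        counts = (tdf_r, tdf_nr) if present else (n_r - tdf_r, n_nr - tdf_nr)
        side_total = sum(counts)
        # Add P(t) * sum_j P(c_j|t) log P(c_j|t), and the same for the absent case.
        ig += p_side * sum(plogp(c / side_total) for c in counts)
    return ig

perfect = information_gain(tdf_r=5, n_r=5, tdf_nr=0, n_nr=5)      # term only in relevant docs
uninformative = information_gain(tdf_r=2, n_r=4, tdf_nr=2, n_nr=4)  # term equally spread
```

A term that occurs in every relevant document and in no non-relevant one recovers the full class entropy, while a term spread evenly over both classes gains nothing.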

D. Probabilistic relevance feedback (PRF)
This measure assigns scores to the terms present in the term pool by calculating their probability in relevant and non-relevant documents (35). A term having a higher probability in the relevant class is considered a more suitable candidate term for QE. The mathematical expression of PRF is obtained as:
where $P_{relevance}(term)$ is the probability of the term in relevant documents and $P_{non-relevance}(term)$ is the probability of the term in non-relevant documents.

E. Chi-square (CS)
Chi-square is a statistical measure of the divergence between two events (36). For a term t, it measures how independent t is of the relevant and irrelevant classes. The lesser the independence, the higher the score for that term. The mathematical expression of chi-square is given below:
where $p_R(t)$ is the probability of term t in the relevant documents, and $p_C(t)$ is the probability of the term in the corpus.
In our experimentation we also used the version of chi-square without the square, as used by (11).
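For illustration, the sketch below assumes the probability-based form $(p_R(t) - p_C(t))^2 / p_C(t)$ that is commonly used for chi-square in query expansion, and takes the variant without the square to be $(p_R(t) - p_C(t)) / p_C(t)$. Both forms, and the function name `chi_square_score`, are assumptions made here rather than formulas confirmed by the text.

```python
def chi_square_score(p_r, p_c, squared=True):
    """Probability-based chi-square score of a term (assumed form, see lead-in)."""
    diff = p_r - p_c
    return (diff * diff if squared else diff) / p_c

# Hypothetical term: probability 0.5 in the feedback documents, 0.25 in the corpus.
cs = chi_square_score(0.5, 0.25)
chi = chi_square_score(0.5, 0.25, squared=False)
```

Note that the squared form cannot distinguish a term that is over-represented in the feedback documents from one that is under-represented, whereas the unsquared variant keeps the sign.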

F. Lavrenko relevance feedback (LRF)
This technique uses the formula derived from the Lavrenko relevance model (37) and is based on a language model. The score for the QE terms can be found using the formula:
In the above equation, $P(t|G)$ is the probability of occurrence of the term t in the collection, whereas $P(t|M_R)$ can be found using the formula below:
where $TF(t, R)$ is the frequency of the term in relevant document R and the denominator is the summation of all the term frequencies for a relevant document. k is a parameter that can be adjusted during experimentation. Researchers have found that k = 0.6 shows the best results (22).

Proposed term selection metric: document frequency chi-square (DFC)
Chi-square is one of the most widely used algorithms for term selection in text classification. It has been used by Carpineto et al. for pseudo relevance feedback based term selection, but its performance was not up to the mark because term selection for QE in pseudo relevance feedback is very different from term selection in text classification. In pseudo relevance feedback, there exist only two classes, which are highly skewed. We first retrieve documents based on the user query and select the top k documents as relevant, while the rest of the documents are treated as non-relevant. However, there is no defined criterion for choosing the threshold between the relevant and non-relevant parts of the ranked list. A non-relevant document may therefore get classified as relevant, and a relevant document may equally end up in the non-relevant class. In order to fully understand the effect of this thresholding, let us consider a corpus of 10 documents which contains three documents (D1, D2, D3) of the actual relevant class, while the rest are from the non-relevant class. In pseudo relevance feedback, after document ranking, if we set the threshold at D4, we get the following sets of documents:
R = {D1, D2, D3, D4}
NR = {D5, D6, D7, D8, D9, D10}
Let there be terms $t_1$-$t_{50}$ in the corpus. We consider a scenario in which $t_1$ occurs 10 times in R but only in document D4. The same term occurs three times in NR: twice in D5 and once in D6. When the distribution based on term frequency is considered, chi-square will regard $t_1$ as a good term for QE, which is not true. If the distribution is instead considered in terms of document frequency, which is binary in nature and only considers the presence of the term in documents, the document frequency of $t_1$ is only 1 in R whereas it is 2 in NR.
As $t_1$ has a higher document frequency in the non-relevant class, DFC will not rank it as a discriminative term. DFC not only considers the term's presence in relevant documents, but also keeps track of other important factors: the term's absence in the relevant class, and its presence and absence in the non-relevant class. Mathematically, its formula can be written as:
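The worked example above can be checked numerically. Because the DFC formula itself is not reproduced here, the sketch substitutes the textbook 2x2 contingency-table chi-square statistic computed over the same four document-frequency quantities (presence and absence of the term in R and NR). It is a stand-in under that assumption, not the authors' exact metric, and all names are hypothetical. The example assumes that the threshold at D4 puts D1-D4 in R and D5-D10 in NR.

```python
def contingency_chi_square(tdf_r, n_r, tdf_nr, n_nr):
    """Textbook 2x2 chi-square over document frequencies (stand-in, not the exact DFC)."""
    a, b = tdf_r, tdf_nr                 # term present in R / NR
    c, d = n_r - tdf_r, n_nr - tdf_nr    # term absent in R / NR
    n = n_r + n_nr
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Term frequencies of t1 per document, as in the example.
R = {"D4": 10}             # t1 occurs 10 times, all in D4
NR = {"D5": 2, "D6": 1}    # t1 occurs twice in D5 and once in D6
tf_r, tf_nr = sum(R.values()), sum(NR.values())  # term-frequency view: 10 vs 3
df_r, df_nr = len(R), len(NR)                    # document-frequency view: 1 vs 2

score_t1 = contingency_chi_square(df_r, 4, df_nr, 6)
# Hypothetical contrasting term found in 3 of the 4 feedback documents and nowhere else.
score_good = contingency_chi_square(3, 4, 0, 6)
```

Under the term-frequency view t1 looks strongly tied to R (10 occurrences against 3), but under the document-frequency view it scores far lower than a term that is spread across the feedback documents.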

Dataset and evaluation measure
To evaluate an information retrieval system that targets the needs of biomedical scientists and geneticists, the TREC 2006 Genomic Track (38) dataset is used. MAP is used to evaluate the performance of the nine term selection algorithms, with Okapi BM25 as the retrieval model. This evaluation measure is widely used in information retrieval. The mathematical expressions of average precision and MAP are given below.
a: Average precision
This measure compares the documents ranked by the retrieval model with a pre-defined set of documents judged by domain experts for a particular query.
where r is the rank, N denotes the number of retrieved documents, $rel(r)$ is a binary function that indicates whether the document at rank r is relevant, and $P(r)$ stands for the precision at rank r.
b: Mean average precision
It summarizes the ranking results obtained from multiple queries by averaging the average precision (AverageP) over all queries.
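Both measures can be sketched as follows, assuming the usual definitions: average precision sums P(r) · rel(r) over the ranks and divides by the number of relevant documents, and MAP is the mean of that value over the queries. The function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Sum of P(r) * rel(r) over ranks r, divided by the number of relevant documents."""
    hits, total = 0, 0.0
    for r, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:   # rel(r) = 1
            hits += 1
            total += hits / r        # P(r): precision at rank r
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Two hypothetical queries with their rankings and relevance judgements.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
ap_first = average_precision(*runs[0])
map_score = mean_average_precision(runs)
```

For the first query the relevant documents appear at ranks 1 and 3, so its average precision is (1 + 2/3) / 2.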

Practical illustration of TREC data
This section summarizes the background of the strategic decisions taken in light of the typical behavior of the system over different queries. It also identifies the source of query drift, with a view to further improvements while producing and comparing results. Table 1 shows the performance difference between two algorithms (DFC and chi-square) and the baseline for 36 queries of the TREC dataset. All results have been calculated at the following benchmark: documents = 40, top terms = 10. Delta(DFC-CS) is the difference in precision between DFC and chi-square. The most positive value of delta(DFC-CS) marks the query on which DFC outperformed chi-square by the largest margin, whereas the most negative value marks the query on which chi-square outperformed DFC by the largest margin. Query 201 has the most positive value of delta(DFC-CS), whereas query 207 has the most negative.
Delta(DFC-BS) and delta(CS-BS) are the differences between the performance of the information retrieval system after applying QE with the respective algorithm (DFC or chi-square) and its performance without any QE (baseline).
These columns show the effect of applying QE techniques on performance. A positive delta shows an increase in performance after applying QE, whereas a negative delta shows a decline in performance due to QE. In both cases, a negative delta indicates query drift: the more negative the delta, the stronger the drift. It can be seen from the table that 16 out of 35 queries have shown a decrease in performance due to query drift using DFC. On the other hand, by applying QE using chi-square, only 9 out of 36 queries have shown an improved performance. For delta(DFC-BS), the best performance is observed for query 225, and for delta(CS-BS), query 226 shows the largest increase in precision after applying QE. The highlighted values at the bottom of the table show the mean average precision differences of the mentioned algorithms.
In order to further explore the chi-square based term selection algorithms, queries 201 and 207 are selected, as they show the best performance for DFC and chi-square, respectively. The two algorithms are applied again to queries 201 and 207 to obtain the top 10 terms from the top 40 retrieved documents. The selected terms are listed in Tables 2 and 3.
The original query is expanded by adding one term at a time, and precision is measured to reveal the positive or negative effect of each newly added term on QE. The results of incremental QE obtained after iterating over all 10 terms are shown in Tables 2 and 3. As observed from the tables, expanding the user query with the selected terms gives a reasonable boost in the performance of the specified query.
Tables 4 and 5 show the unique terms selected by chi-square and DFC for queries 201 and 207, respectively. Both tables also show the document frequency based parameters $(tdf_r, \overline{tdf}_r, tdf_{nr}, \overline{tdf}_{nr})$ as well as the probabilities used by chi-square. To give a clear picture of the importance of terms for each algorithm, the ranks of these unique terms as determined by their chi-square and DFC scores are also shown.
As shown in Table 4, chi-square has assigned the highest score to the term braf, while DFC ranks calipel as the top term. A close inspection of the document frequency parameters shows that braf is present in 14 relevant documents and 69 non-relevant documents. On the other hand, calipel is present in five documents of the relevant class and is entirely absent from the non-relevant documents. For this reason, DFC considers it a highly discriminative term for differentiating between the relevant and non-relevant classes. Similarly, we observe the second term ranked by both algorithms. DFC has placed v599e at second rank, whereas chi-square selects a different term as its second best. We explain this by observing that the term selected by chi-square is present more times in non-relevant documents than v599e.
Likewise, other terms can also be observed from the table. From Table 5 it can be seen that etidronate is ranked as the best term by both algorithms. DFC has selected alendronate as the second best term, while chi-square placed fetuin at second rank. Fetuin is present in only 4 documents of the relevant class and 275 documents of the non-relevant class, whereas alendronate is present in 24 relevant documents and 123 non-relevant documents. Comparing these parameters shows that alendronate is a more suitable candidate than fetuin, as it is present in more relevant documents and fewer non-relevant documents.

Experimental setup and results
We use the open source search platform 'Solr' (39) for experimentation. It includes features such as full text search and real time indexing. In our experimentation, Okapi BM25 is used as the retrieval model. In this section we briefly explain the experimental setup and compare the results of all term selection techniques at the defined test points.

A. Results without stemming
To analyze the effect of pre-processing on biomedical data, we have used two different methods for indexing of the corpus documents as discussed in the Section 3. This section depicts the results of nine term selection algorithms in the form of tables at pre-defined test points. Expectedly, all feature selection techniques do not produce their peak results at the same defined set of parameters. These parameters are number of top retrieved relevant documents and candidate expansion terms that get merged with the query. For sake of laying out the clear picture of the performance of pseudo relevance feedback and better comparison of term ranking algorithms, we have shown a graph containing the peak results only against the best parameters of terms for all techniques found from below mentioned tables. Tables 6-10 illustrate MAP of nine term selection algorithms on pre-defined benchmark test points at top terms (5, 10, 15, . . ., 50) and documents (10, 20, . . ., 50). Boldface values in these tables indicate the highest performance of a particular term selection algorithm across all the mentioned term selection algorithms at a specific number of terms. Table 6 highlights the best performing term selection algorithms over following defined set of test points (i.e. top documents ¼10, top terms ¼5, 10,15,20,25,30,35,40,45,50). It can be clearly seen that LRF outperforms the rest of term selection algorithms at following test points T ¼ 5, 10, 15, 20. Likewise, DFC exhibits best performance in the remaining test points. RSV does not perform up to the mark as its performance kept decreasing gradually with the increase in number of terms. KLD follow the footsteps of RSV but it somehow manages to beat RSV in a race of being called as worst performing algorithm.
It has also been observed that the performance of chi (without square) and PRF show an overall decline in score with gradual increase in number of top selected terms. We can also observe from the table that the scores of CoDice and IG kept increasing until the term test point T ¼ 15, and for the remaining test points, decrease in performance is observed. On the other hand, chi-square follows a mix sort of trend as its performance kept decreasing slightly on couple of test points at first and then all of a sudden start increasing but then it gradually decreases for remaining term test points. Table 7 illustrates the performance of term selection algorithms for 20 number of documents and top terms T ¼ 5, 10,15,20,25,30,35,40,45,50. As the table suggests, it is pretty obvious to say that KLD outperforms the rest of term selection algorithms only at following test point T ¼ 5. Surprisingly, DFC exhibits best performance in all the remaining test points. In addition, RSV does not perform up to the mark again even with the increase of top documents, as its performance (32) kept decreasing gradually with the increase in number of terms. The performance of IG, LRF and chi (without square) follow a pattern in which they have highest MAP at term test point ¼10, whereas for the rest of the test points, gradually decreasing scores are observed. Chi-square based on probability shows a peculiar behavior as the performance first arbitrarily increases with gradual increase of top selected terms. This increase in performance is observed until T ¼ 30 and after that the performance drops and an almost constant score is observed. As far as CoDice and PRF are concerned, no clear pattern is observed in their performance. Some test points cause a slight increase or decrease in performance while others keep the performance constant.
Table 8 reports the results of the term selection algorithms at document test point 30 for all term test points T = 5, 10, 15, . . ., 50. KLD outperforms the other term selection algorithms only at T = 5, and DFC performs best at all remaining test points. Chi (without square) is the worst performer, and its performance decreases gradually as the number of terms increases. LRF and IG start with a very good score at T = 5, but their performance also worsens as the number of top selected terms increases. PRF, on the other hand, is almost constant: the difference between its best and worst scores is only 0.011. Chi-square, RSV and CoDice follow a mixed pattern: as the number of top terms increases, their results sometimes rise and then drop suddenly at the very next test point.
Tables 9 and 10 show the performance of the nine term selection algorithms for top documents = 40 and 50, respectively. In both tables, DFC performs best at all test points. In Table 9, the performance of KLD, RSV, CoDice, IG and PRF decreases gradually as the number of terms increases. Chi (without square) is the worst performer, with the lowest score at T = 50. LRF and chi-square follow no clear pattern, as their scores rise or fall suddenly from one test point to the next.
In Table 10, LRF shows the worst performance, decreasing gradually as the number of terms increases. KLD, RSV, CoDice and PRF also follow a decreasing pattern: they reach their best performance at T = 5 and decline until T = 50. Conversely, chi-square and IG behave unpredictably. The scores of chi-square first increase up to T = 15 and then decrease as the number of terms approaches 50. IG is even more erratic, as its score alternately increases and decreases across the term test points.
Figure 3 summarizes the performance of the term selection algorithms in terms of MAP against the number of documents. The trends of the term selection algorithms (chi-square, KLD, RSV, CoDice, IG, LRF), the newly proposed technique (DFC) and the baseline are shown only at the peak values taken from Tables 6-10. The graph shows that DFC and KLD outperform the rest, and in a direct comparison DFC is the clear winner. Although there is initially a clear gap between the performance of DFC and LRF, the performance of DFC improves gradually as the number of documents increases and reaches the highest value of 0.3. We conclude that LRF outperforms the other algorithms between 10 and nearly 15 documents, whereas DFC performs best for almost the next 5 documents. For around the next 10 documents, KLD performs slightly better than DFC, after which DFC emerges as the best of all term selection algorithms.
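The selection of boldface cells in Tables 6-10 and of the peak values plotted in Figure 3 can be illustrated with a short Python sketch. It is a hypothetical example only: the dictionary layout, the function names and the MAP values are assumptions made for illustration and are not taken from the paper's tables or code.

```python
from collections import defaultdict

# Hypothetical MAP values for illustration only (NOT the values of Tables 6-10).
# Key: (algorithm, top documents, top terms) -> MAP
map_scores = {
    ("DFC", 10, 5): 0.20, ("DFC", 10, 10): 0.22,
    ("LRF", 10, 5): 0.23, ("LRF", 10, 10): 0.21,
    ("DFC", 20, 5): 0.24, ("DFC", 20, 10): 0.26,
    ("LRF", 20, 5): 0.22, ("LRF", 20, 10): 0.20,
}


def best_per_test_point(scores):
    """Map each (documents, terms) test point to the algorithm with the
    highest MAP, i.e. the cell that would be set in boldface."""
    best = {}
    for (algo, docs, terms), value in scores.items():
        key = (docs, terms)
        if key not in best or value > best[key][1]:
            best[key] = (algo, value)
    return {key: algo for key, (algo, _) in best.items()}


def peak_per_document_count(scores):
    """For each algorithm and document count, keep the peak MAP over all
    term test points together with the number of terms at that peak."""
    peaks = defaultdict(dict)
    for (algo, docs, terms), value in scores.items():
        current = peaks[algo].get(docs)
        if current is None or value > current[0]:
            peaks[algo][docs] = (value, terms)
    return dict(peaks)


winners = best_per_test_point(map_scores)
peaks = peak_per_document_count(map_scores)
```

Here `winners` plays the role of the boldface entries of a table, and `peaks` holds one point per algorithm and document count, which is the kind of series a plot such as Figure 3 would draw.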

B. Results with stemming
This section compares the performance of the nine term selection algorithms before and after stemming.
Tables 11-15 report the difference in MAP of the nine term selection algorithms at the pre-defined test points (i.e. number of top documents = 10, 20, 30, 40, 50 and number of top terms = 5, 10, 15, . . ., 50). For every algorithm, this MAP difference is denoted by Delta and is calculated as:
Delta = MAP (without stemming) - MAP (with stemming)
From the above equation, a very large value of Delta implies that the algorithm is negatively affected by stemming, i.e. its performance decreases substantially after stemming is applied. A small value of Delta, on the other hand, indicates that stemming has little effect, i.e. the performance of the algorithm is almost the same before and after stemming.
Table 11 reports the performance difference of the term selection algorithms for 10 documents and terms T = 5, 10, 15, 20, . . ., 50. The overall performance of KLD is the least affected by stemming. Chi and RSV are badly affected and perform very poorly after the dataset is stemmed.
Table 12 reports the performance difference of the term selection algorithms for documents = 20 and terms = 5, 10, 15, . . ., 50. The largest Deltas are obtained by RSV and DFC, which shows that stemming strongly affects these two algorithms. KLD again shows the opposite result: it is resistant to stemming, and its behavior after stemming remains the same as before.
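The Delta measure can be sketched in a few lines of Python. The sketch assumes that Delta is the MAP without stemming minus the MAP with stemming, which is consistent with the statement that a large Delta means performance dropped after stemming. The function names and the MAP values are hypothetical and are not taken from Tables 11-15.

```python
def stemming_delta(map_unstemmed, map_stemmed):
    """Delta per term test point, assumed here to be
    MAP without stemming minus MAP with stemming."""
    if map_unstemmed.keys() != map_stemmed.keys():
        raise ValueError("both runs must cover the same term test points")
    return {t: map_unstemmed[t] - map_stemmed[t] for t in sorted(map_unstemmed)}


def mean_delta(deltas):
    """Average Delta over all term test points of one algorithm."""
    return sum(deltas.values()) / len(deltas)


# Hypothetical MAP values for one algorithm at T = 5, 10, 15 (illustration only).
unstemmed = {5: 0.25, 10: 0.24, 15: 0.22}
stemmed = {5: 0.20, 10: 0.21, 15: 0.21}

deltas = stemming_delta(unstemmed, stemmed)
average = mean_delta(deltas)
```

Under this assumption, a positive Delta at a test point means that stemming lowered the MAP there, and comparing `average` across algorithms gives a rough ranking of how sensitive each one is to stemming.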
Table 13 reports the Deltas of the nine term selection algorithms for 30 documents and the pre-defined term test points (T = 5, 10, . . ., 50). KLD once again ...
In Tables 14 and 15, we observe the results of the algorithms before and after applying stemming, using 40 and 50 documents, respectively, and varying the number of top expansion terms from 5 to 50 in steps of 5. A thorough inspection of both tables shows that the performance of RSV is once again the most affected by stemming: the precision of RSV obtained after stemming is much lower than the precision without stemming. At document test point D = 40, the results obtained by IG, LRF and KLD after stemming are almost the same as before stemming. In Table 15, by contrast, only the results of IG and LRF are badly affected by stemming.
In conclusion, stemming in the biomedical domain decreases the overall performance of term selection algorithms. RSV is highly vulnerable to stemming, as its performance decreases the most on the stemmed dataset. KLD, however, shows the most resistance to stemming, and its precision stays almost the same before and after stemming.

Conclusion
We have proposed a new term selection algorithm, named DFC, for QE. DFC has been compared with eight other state-of-the-art term selection algorithms. Experiments show that DFC outperforms all eight at 88% of the pre-defined test points. DFC also addresses the problem of document misclassification that occurs when setting the threshold between the relevant and non-relevant classes in pseudo relevance feedback. From Table 1, it can be concluded that chi-square caused query drift for 25 of the queries, whereas DFC improved the precision of 20 queries. Summarizing the performance of all nine term selection algorithms at the defined document thresholds (10, 20, 30, 40, 50), DFC is the best-performing algorithm at 60, 90, 90, 100 and 100% of the term test points, respectively. We also noticed that as the number of feedback documents increases, the performance of DFC also increases, while the other term selection algorithms show an unexpected decrease. Finally, for PRF-based information retrieval in the biomedical domain, stemming tends to decrease the precision of all nine term selection algorithms.
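The 88% figure follows from the per-table percentages by simple arithmetic, as the short Python check below shows. The win counts are read from the discussion of Tables 6-10 (DFC is best at 6, 9, 9, 10 and 10 of the 10 term test points); the variable names are illustrative assumptions.

```python
# Number of term test points (out of 10) at which DFC has the highest MAP,
# keyed by the number of top documents, as described for Tables 6-10.
dfc_wins = {10: 6, 20: 9, 30: 9, 40: 10, 50: 10}
TERM_TEST_POINTS = 10  # T = 5, 10, ..., 50

# Percentage of term test points won at each document threshold.
win_rate = {docs: 100 * wins / TERM_TEST_POINTS for docs, wins in dfc_wins.items()}

# Percentage of all 50 (documents, terms) test points won by DFC.
overall = 100 * sum(dfc_wins.values()) / (TERM_TEST_POINTS * len(dfc_wins))

print(win_rate)  # {10: 60.0, 20: 90.0, 30: 90.0, 40: 100.0, 50: 100.0}
print(overall)   # 88.0
```

In other words, DFC is the best algorithm at 44 of the 50 test points, which gives the reported 88%.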
Conflict of interest. None declared.