Baseline and extensions approach to information retrieval of complex medical data: Poznan's approach to the bioCADDIE 2016

Abstract Information retrieval from biomedical repositories has become a challenging task because of their increasing size and complexity. To facilitate research aimed at improving the search for relevant documents, various information retrieval challenges have been launched. In this article, we present the improved medical information retrieval systems designed by Poznan University of Technology and Poznan University of Medical Sciences as a contribution to the bioCADDIE 2016 challenge, a task focusing on information retrieval from a collection of 794 992 datasets generated from 20 biomedical repositories. The system developed by our team utilizes the Terrier 4.2 search platform enhanced by a query expansion method using word embeddings. This approach, after post-challenge modifications and improvements (with particular regard to assigning proper weights for original and expanded terms), allowed us to achieve the second-best infNDCG measure (0.4539) compared with the challenge results, with an infAP of 0.3978. This demonstrates that proper utilization of word embeddings can be a valuable addition to the information retrieval process. Some analysis is provided on related work involving other bioCADDIE contributions. We discuss the possibility of improving our results by using better word embedding schemes to find candidates for query expansion. Database URL: https://biocaddie.org/benchmark-data


© The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction
Biomedical research produces an ever-increasing amount of digital data, which is stored in a variety of formats and hosted at a multitude of different sites. The data may be generated by the original researchers, attached to journals as supplementary material, or organized as datasets and kept in databases or repositories. The most common information source is literature in the form of indexed journals that, in electronic form, reside on the PubMed platform or publisher portals. The article format has one advantage: ease of reading. Articles contain mostly unstructured information that is hard to use for specialized processing, comparison, aggregation and integration. Therefore, this information needs to be transformed into a more structured form that can be stored in databases, collections and repositories. This process requires the development of useful data structures and of indexing and extraction tools. Data is a set of values of qualitative or quantitative variables. Pieces of data are individual pieces of information. A dataset or collection of data often corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable. Generic ontologies and metadata models designed for the description of datasets supplement domain-specific ontologies that describe the research field. The enormous amount of biomedical literature, the existence of data of different granularity and data heterogeneity, as well as the lack of common metadata, make it difficult to selectively access increasingly complex relevant information.
As pointed out by (20), 'A typical dataset available in, for instance, the gene expression repositories may contain a description, a list of keywords and a list of organisms. A typical dataset available in the protein structure repositories contains, in addition, a list of genes and a list of research articles'. Thus, a global pharmaceutical company, for instance, may need close to 30 different databases to complete a clinical study. These sources of data require recording provenance for datasets and data curation. Moreover, the data resulting from biomedical experiments often possess an implicit hierarchy (1). In terms of the granularity needed for specific databases, a PubMed article needs to be decomposed into snippets which describe structured data markup. Snippets may be organized using a comprehensive data type ontology which provides definitions of types of data (Protein, Phenotype, Gene Expression, Nucleotide Sequence, Clinical Trials, Imaging Data, Morphology, Proteomics Data, Physiological Signals, Epigenetic Data, Data from Papers, Omics Data, Survey Data, Cell Signalling and Unspecified). Snippets in different databases may often be found at different levels of a database schema. Since different types of metadata are of importance for given specialized databases, historically their schemas were developed independently and do not conform to any standardized pattern. Since datasets are a combination of structured and unstructured data, often presented in incompatible ways (e.g. the same information with different tags), using them in complex processing can be quite difficult. Furthermore, a significant percentage of the specific data reported in clinical reports does not make its way into journals (2). Nevertheless, data needs to be compared and verified.
Often, cost and utility considerations make it necessary to try a multi-sponsored clinical development approach termed Portfolio of Innovative Platform Engines, Longitudinal Investigations and Novel Effectiveness to generate a new hypothesis. In such an environment (3), the need for shared collaborative data governance forces the use of integrated data; therefore, improving the effectiveness of retrieval is paramount to finding state-of-the-art methods of diagnosis, testing and treatment for individual patients. Existing platforms such as Google and PubMed serve their purpose by providing up-to-date sources of information with various additional functionalities, but it is difficult to assess their effectiveness. Thus, the crucial aspect for addressing this complexity is the availability of annotated distributed datasets created by the scientific community, with which researchers can test the effectiveness of various approaches. That in turn leads to better data structures and indexes of various granularities. This can be achieved only within a shared task environment, which enables researchers from many different institutions to work together on solving important scientific problems. In the biomedical area, the Text REtrieval Conference (TREC) and bioASQ have contributed the most towards achieving this goal. Collaboration occurs at multiple levels: definition of test collections, task definition, evaluation and analysis of results. For the last several years, the National Institute of Standards and Technology's TREC has concentrated on finding the most relevant PubMed articles and clinical trial data in response to selected medical records within its clinical decision support (CDS) track, which has evolved into Precision Medicine (4). In this context, the bioASQ (5) challenge concentrates mainly on the following broad tasks:

1. bioASQ Task on Online Biomedical Semantic Indexing: classification of new PubMed documents into the MeSH hierarchy concepts.
2. bioASQ Task on Biomedical Semantic query answering (QA), related to information retrieval and query answering: one of the most complex semantic tasks in natural language processing (NLP).
Previous TREC CDS and earlier medical tracks and bioASQ challenges had many specific task orientations, data sources and retrieval conditions. For example, some TREC sources were either full publications or abstracts. The topics of a question could be electronic health record (EHR) admission notes curated by physicians. Notes could be of the Diagnosis, Test or Treatment type, and could be much longer than the concise bioCADDIE questions. Currently, the format for run submissions of TREC and bioCADDIE is the standard trec_eval format. The bioASQ contest shares a deep semantic approach to answering questions with bioCADDIE when word embeddings (WEs) are used for query expansion or within a document vector framework.
Based on these tasks, the Biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium, funded by the US National Institutes of Health Big Data to Knowledge program, aims to empower researchers to find data in the most efficient way and to expand the sources and types of data. These would include opinion on research on non-scientific portals (i.e. conversations about scholarly content) together with monitoring of the attention surrounding a particular work (altmetric).
BioCADDIE (6) has developed DataMed, a search engine prototype of Data Discovery Index (DDI), using the data tag suite (DATS) model to support the DataMed discovery index (7). This enables searching data of various types and formats (while maintaining a core set of elements), curated by separate institutions. DataMed based on ISA formatted metadata aims to facilitate the discovery of a digital object. At this time, DataMed has indexed close to 1 400 000 datasets drawn from 66 repositories (8).
The bioCADDIE challenge concerned finding the most relevant docnos (elements of datasets) in response to 15 questions provided by bioCADDIE experts. The structure of the questions followed the DataMed prototype idea of the rdf type of relations between entities ('data type' = w, 'biological process' = x, 'species/organism' = y and 'phenotype' = z) (9). The graph structure of a query suggests that if we also transformed documents into a graph structure, the matching process would be at the level of relations and not keywords.
The aim of the 2016 bioCADDIE Challenge (9) was the retrieval of datasets from a collection that is relevant to the needs of biomedical researchers; the purpose was to facilitate the reutilization of collected data and enable the replication of published results. Such work is the focus of WG4 of the bioCADDIE consortium: Use Cases and Testing Benchmarks. The goal is to develop usability specifications/requirements and appropriate benchmarks with associated testing content for DataMed.
To address this goal, the following sections discuss these aspects:
• The Related work section discusses the content of already published bioCADDIE articles
• The Methodology section presents the methods, algorithms and solutions prepared by our team, divided into the following subsections:
  • The Overview, describing the model of our information retrieval system
  • The Collection, with information on the bioCADDIE datasets
  • An Analysis of document structure and content, presenting the differences among various repositories
  • A Selection of documents with valuable data for indexing, with the description of our algorithm evaluating whether a document is worth indexing
  • An Index of data, including information on corpus preparation for indexing
  • Query preprocessing
  • Query expansion, describing the methods chosen to expand the query
  • Information retrieval and evaluation, with information on the retrieval platform
• The Results and discussion section is divided into the following subsections:
  • Selection of the optimal baseline system
  • Query expansion
  • Further analysis
• The Conclusions and future work section summarizes the main outcome of the article

Related work
At present, details of bioCADDIE Challenge systems exist for selected contributions. Apart from standard preprocessing similar to that presented in this work, processing can be divided into advanced preprocessing, retrieval and re-ranking. The University of California San Diego (UCSD) team that obtained the top infNDCG result (9) implemented a two-step 'retrieval plus re-ranking' strategy (10). Based on this idea, they developed a method to find the top 10 documents returned by Google and then transformed these documents into queries for relevant datasets. This strategy was used by East China University in their winning contribution to TREC CDS 2015 (11). Their baseline was Elasticsearch (a Lucene-based search engine that is part of the DataMed technology).
The top 5000 datasets retrieved by Elasticsearch were re-ranked based on the concatenated documents using the pseudo sequential dependence (PSD) model (12). The best run used the PSD-allwords model. UCSD used the concept matching formula with Dirichlet smoothing, with weights based on the annotated dataset repository. In contrast to the original algorithm in (12), the actual term frequency was increased by a constant = 5. UCSD found (as we did) that neither ordered nor unordered bigrams improved performance. We would like to point out that the UCSD results presented in (10) do not exactly match the official results (9). Elsevier (13) used two approaches: word embeddings and ontology-based indexing (queries and data sources were tagged with named entities from MeSH and Entrez Gene), with Apache Solr as the indexing and search platform. For WEs, fastText (14) gave better results than word2vec (15) and GloVe (16), both of which we used. FastText, based on a skip-gram model, uses character n-grams and smaller windows, which translate to better WEs for query expansion.
Elsevier used an additional advanced modification of queries:
• Abbreviated species names were expanded to full names (e.g. M to Mus).
• Greek characters were replaced with English spelling.
It has been noted in (13), for example, that for 'glycolysis' (a word that does not appear in the bioCADDIE questions), the word2Vec model returned 'tca_cycle', 'mitochondria_remodelling' and 'reroute'. FastText delivered more reasonable similar words/phrases. For example, for the phrase 'glycolysis', the top three similar phrases returned by fastText were 'gluconeogenesis', 'glycolytic' and 'glycolytic_pathway'.
However, it is well known that WE methods are extremely sensitive to the training corpus (we used the PubMed abstracts). With word2vec, we obtained the following most [...]
Elsevier obtained their best result with the Elsevier 4 run (modified queries (all additional modifications) + concept expansion + multi-phase execution; search: Apache Solr, stemmed index), but it was only 2% better than their baseline.
SIBTex (17) divided query terms into non-relevant, relevant and key, assigning larger weights to key relevant terms compared with relevant terms. This is the same strategy that we used for expanded terms. Universal protein resource (UniProt) was used to constrain query and datasets to a set of 14 biomedical topics (18). They used the Gensim word2vec library (as we did) for finding expansion candidates. Their best run, SIBTex 3, was achieved with a baseline + query expansion with weighted terms + results categorization in the post-processing phase.
OHSU assumed a variable number and relative weighting of MeSH terms for query expansion in the work after the challenge. Additional runs determined the optimal number of MeSH terms and weighting. Their best overall score used five MeSH terms with a 1:5 terms: words weighting ratio (19). This is the same ratio we used in our best run when query expanded terms are derived from word2vec.
The University of Melbourne (UM) (20) usefully determined how often the most important metadata appear in the repositories used by bioCADDIE. This information could help to determine whether a query term belongs to a concept expressed by metadata, or to assign weights to answers coming from different repositories. UM transformed the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets.

Methodology
The overview
The information retrieval process we used was divided into four steps:
1. Analysis of the structure of the repositories and their information content.
2. Selection of the optimal baseline system.
3. Selection of the optimal possible system extension.
4. Optimization of the parameters of the complete system.
The model of the system developed to generate information retrieval for the bioCADDIE challenge includes the following elements:
1. Preparation of a database with valuable information from datasets
2. Indexing of the data collection
3. Query preprocessing
4. Preparation of two vector space models based on data from bioCADDIE datasets and PubMed abstracts
5. Query expansion with the use of the prepared vector space models and pseudo-relevance feedback (PRF) (provided by Terrier)
6. Information retrieval by the Terrier engine
7. Evaluation of the results.

The collection
The bioCADDIE corpus was a collection of metadata (structured and unstructured) from biomedical datasets generated from a set of 20 individual repositories (Table 1). A total of 794 992 XML documents were made available for use from the set of indices that was frozen from the DataMed backend on 24 March 2016 (21). Data in each document was organized into the following tags: <DOCNO>: document number, <TITLE>: document title, <REPOSITORY>: biomedical repository used to generate the document, <METADATA>: various data from the repository presented in JSON format.
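The document layout described above can be read with standard tooling. The following is a minimal sketch, not the code used in this work: the enclosing <DOC> element, the sample values and the keys inside the JSON metadata are hypothetical and serve only to make the example runnable.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: the <DOC> wrapper and the metadata keys are assumptions.
SAMPLE = """<DOC>
<DOCNO>12345</DOCNO>
<TITLE>Example dataset title</TITLE>
<REPOSITORY>example_repo</REPOSITORY>
<METADATA>{"dataItem": {"description": "Example description", "keywords": ["k1", "k2"]}}</METADATA>
</DOC>"""


def parse_document(xml_text):
    """Return the four tagged fields, with METADATA decoded from JSON."""
    root = ET.fromstring(xml_text)
    return {
        "docno": root.findtext("DOCNO").strip(),
        "title": root.findtext("TITLE").strip(),
        "repository": root.findtext("REPOSITORY").strip(),
        "metadata": json.loads(root.findtext("METADATA")),
    }


doc = parse_document(SAMPLE)
print(doc["docno"], doc["repository"], sorted(doc["metadata"]["dataItem"]))
```

Because each repository organizes its metadata differently, any code that walks the decoded metadata would need per-repository handling.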

Analysis of document structures and their information content
Each repository uses a different JSON schema to organize data. Moreover, in some cases a variation was noted within the same repository (Table 1).
To prepare a text corpus for indexing, tags and keys with potentially valuable information were selected and their values were exported to an SQL database. The data was then assigned to one of three categories: Title, Keywords or generalized Description. For one of the repositories (geo), the generalized description contained additional text data, obtained from the geo database online resources, based on the 'geo_accesion' code found in the metadata (Table 2).

Selection of documents with valuable data for indexing
Because documents from some repositories (e.g. dryad, geo) contained very little useful information (see examples in Table 3), we decided to assess if a document's content is worth indexing using MeSH. MeSH, which stands for 'Medical Subject Headings', is a vocabulary thesaurus used by the National Library of Medicine (NLM) to index articles stored in PubMed (22).
At this point it does not matter whether we use words or lemmatized words, so we chose to remain with the former. Terrier tokenises queries and documents, so various word forms are treated exactly the same. In the WE methodology, various word forms are represented by different elements of the space, but when these words become expansion terms only the token form counts.
For each category (Title, Keywords or Description) of each record (from the previously prepared SQL database), a score was calculated according to the following heuristic algorithm, which removes meaningless records (e.g. those shown in Table 3) before indexing:
1. Let X represent the total number of words in the record.
2. Let Y1 represent the number of words which are recognized as English words.
3. Let Y2 represent the number of words which are not recognized as English words (e.g. 'MIP-2', 'CD69' and 'LDLR') but are recognized as MeSH words found in [...] and X is >2, take the record for indexing.
7. For the Keywords category, if the Score is >0 (as follows from the condition at Step 4), take the record for indexing.
The Descriptor/Concept/Term structure makes it possible to attach various data elements in MeSH to the appropriate object (this sentence is taken directly from https://www.nlm.nih.gov/mesh/concept_structure.html). Word (a linguistic notion), term (which appears in a query) and data element (part of a taxonomy structure) differ in context; here they are all used in the meaning of word.
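The counting part of the heuristic (the quantities X, Y1 and Y2) can be sketched as follows. This is an illustration only: the tokenization rule and the two tiny vocabularies are assumptions, and the score formula itself is deliberately not reproduced here.

```python
import re

# Tiny stand-in vocabularies (hypothetical); a real run would load an
# English word list and the MeSH vocabulary.
ENGLISH = {"expression", "of", "in", "mouse", "liver"}
MESH = {"cd69", "ldlr", "mip-2"}


def count_words(text, english=ENGLISH, mesh=MESH):
    """Return (X, Y1, Y2): total words, English words, non-English MeSH words."""
    # Assumed tokenization: alphanumeric runs, keeping inner hyphens ('MIP-2').
    words = re.findall(r"[A-Za-z0-9][A-Za-z0-9-]*", text.lower())
    x = len(words)
    y1 = sum(w in english for w in words)
    y2 = sum(w not in english and w in mesh for w in words)
    return x, y1, y2


print(count_words("Expression of LDLR in mouse liver zzxq"))
```

In this toy input, 'zzxq' is counted in X but in neither Y1 nor Y2, which is the kind of token the filter is meant to penalize.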

Indexing of data
After the removal of documents without valuable data, the text corpus for indexing was prepared in the form of an XML file, with the content of every document placed within a DOC tag (a format required by Terrier). The text corpora prepared in this way were tokenized and indexed by the Terrier 4.2 engine (8).
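A corpus file of this kind can be produced with a few lines of code. The sketch below assumes a TREC-style layout with a DOCNO element inside each DOC element and simply concatenates the three text categories; the exact field layout used in this work is not stated, so both choices are assumptions, as is the escaping of markup characters as a precaution.

```python
from xml.sax.saxutils import escape


def to_trec_doc(docno, title, keywords, description):
    """Serialize one record as a DOC element; the field layout is an assumption."""
    body = "\n".join(part for part in (title, keywords, description) if part)
    return f"<DOC>\n<DOCNO>{escape(docno)}</DOCNO>\n{escape(body)}\n</DOC>\n"


# Hypothetical records: (docno, Title, Keywords, Description).
records = [
    ("1", "Example title", "k1 k2", "Example description"),
    ("2", "Another title", "", "Text with <angle> & ampersand"),
]
corpus = "".join(to_trec_doc(*r) for r in records)
print(corpus)
```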

Query preprocessing
The queries were provided as natural language sentences, containing a lot of noise words. To improve the retrieval, stop-words and common non-informative phrases (e.g. 'find', 'data' and 'related to') were removed from each query.
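The preprocessing step can be sketched as below. Only 'find', 'data' and 'related to' come from the text; the remaining stop-words, the tokenization rule and the sample query are hypothetical.

```python
import re

# Illustrative lists; only 'find', 'data' and 'related to' come from the text.
NOISE_PHRASES = ["related to"]
STOP_WORDS = {"find", "data", "the", "of", "in", "on", "for", "a", "an", "and"}


def preprocess_query(query):
    """Lowercase, drop noise phrases, then drop stop-words."""
    text = query.lower()
    for phrase in NOISE_PHRASES:
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)
    tokens = re.findall(r"[a-z0-9][a-z0-9-]*", text)
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess_query("Find data related to the TGF-beta signaling pathway in mice"))
```

Phrases are removed before single words so that a multi-word phrase such as 'related to' is dropped as a unit rather than leaving 'related' behind.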

Query expansion
To expand the queries, we used WEs, choosing the word2vec algorithm (15). Two vector space models were calculated: the first based on the corpus from the bioCADDIE collection and the second utilizing the much larger text corpus based on PubMed article abstracts. The calculated vectors were then used to find the words most similar to query terms. To allow different weights to be set for original and expanded query terms, the query was not passed through the tokenizer (class SingleLineTRECQuery). Additional query expansion was carried out by the Terrier engine in the form of PRF utilizing the Rocchio algorithm.

Information retrieval and evaluation
Information retrieval was done using the Terrier 4.2 platform. The results were then evaluated using the qrel file provided by the challenge organizers.

Results and discussion
The complexity and fragmentation of the repositories made it difficult to index the data. For the original challenge, due to lack of time and our team's inexperience with DataMed, the data was not fully indexed and we achieved a poor result, shown in Table 4 (9).
After modifying our system, our present results are much better. Applying our algorithm for the selection of documents with valuable data for indexing revealed that 97.71% of documents had a 'Title' assessed as valid for indexing (see Table 1 for details). A similar value was observed for 'Description' (97.95%). Only slightly more than half of the documents (54.49%) had valid keywords (mainly because keywords were not present in many datasets). One hundred and fifty-five datasets were assessed as having no valid 'Title', 'Keywords' and 'Description'. Only one of them was present in the qrels file (dataset no. 5322) and was marked as 'nonjudged' (−1).

Selection of the optimal baseline system
Our selection of Terrier (23), an open-source search engine written in Java, was motivated by its maturity and its use of state-of-the-art retrieval weighting models and techniques that can be used to index large collections of various documents.
In particular, the notable weighting models implemented include Okapi BM25 (best matching model), term frequency-inverse document frequency (TFIDF) and the whole group of Divergence From Randomness (DFR) framework models [mostly originating in (24)]. DFR models have their origin in information theory (Amati, Encyclopedia). A word that is randomly distributed across documents according to some distribution is not informative, whereas a word that does not obey this distribution conveys information. The models are obtained by instantiating the three components of the framework: selecting a basic randomness model, applying the first normalization and normalizing the term frequencies with respect to the document length. In this work, the so-called Normalization 2 was applied with the hyperparameter c = 1.
We refer the reader to the original source (26) for the complex model formulas. So far, it has not been demonstrated theoretically why some of these models perform better than others.
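As an illustration of the third component, Normalization 2 in the DFR framework rescales the raw term frequency tf by document length as tfn = tf · log2(1 + c · avg_l / l), where l is the document length and avg_l the average document length in the collection. The sketch below shows this formula only; the function and argument names are our own choice for the example and not Terrier identifiers.

```python
import math


def normalization2(tf, doc_len, avg_doc_len, c=1.0):
    """DFR Normalization 2: tfn = tf * log2(1 + c * avg_doc_len / doc_len)."""
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


# With c = 1, a document of average length keeps its raw tf, since log2(2) = 1.
print(normalization2(tf=3, doc_len=100, avg_doc_len=100))
# A document three times longer than average has its tf discounted.
print(normalization2(tf=3, doc_len=300, avg_doc_len=100))
```

With c = 1, as used in this work, term frequencies in documents longer than average are discounted and those in shorter documents are boosted.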
Another valuable feature implemented in Terrier is PRF query expansion, a mechanism that extracts the n most informative terms from the m top-ranked documents (the ranking created in the first search run), which are then added to the original query in the second retrieval run. Terrier provides both parameter-free (Bose-Einstein 1; Bose-Einstein 2; Kullback-Leibler) and parameterized (Rocchio) models for query expansion (27). The Rocchio feedback approach was developed using the vector space model. The modified query vector is moved closer to, or farther away from, documents depending on whether they are relevant or non-relevant.
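The idea behind Rocchio feedback can be sketched in its textbook, positive-only form, in which the query vector is moved towards the centroid of the top-ranked documents: q' = q + beta · mean(d). This is a simplified illustration under our own assumptions (dictionary-based vectors, toy weights, a single global cut-off for new terms) and does not reproduce Terrier's implementation.

```python
from collections import Counter


def rocchio_expand(query_weights, feedback_docs, beta=0.5, n_terms=2):
    """Positive-only Rocchio: q' = q + beta * mean(feedback document vectors)."""
    centroid = Counter()
    for doc in feedback_docs:
        for term, weight in doc.items():
            centroid[term] += weight / len(feedback_docs)
    # Keep only the n_terms heaviest terms that are not already in the query.
    new_terms = [t for t, _ in centroid.most_common() if t not in query_weights][:n_terms]
    expanded = dict(query_weights)
    for term in query_weights:
        expanded[term] += beta * centroid.get(term, 0.0)
    for term in new_terms:
        expanded[term] = beta * centroid[term]
    return expanded


# Hypothetical query and pseudo-relevant documents with toy term weights.
query = {"leptin": 1.0, "obesity": 1.0}
docs = [
    {"leptin": 2.0, "mouse": 4.0, "adipose": 1.0},
    {"obesity": 2.0, "mouse": 2.0, "diet": 1.0},
]
print(rocchio_expand(query, docs))
```

In PRF the top-ranked documents are simply assumed to be relevant, so no negative (non-relevant) component appears in this sketch.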
In recent work (28), several leading systems were evaluated within the Open-Source Information Retrieval (IR) Reproducibility Challenge on the Gov2 test collection to select the best DFR variant. Among the options was Terrier 4.0 with the DPH ranking function, which is a hypergeometric parameter-free model from the Divergence from Randomness family of functions (8). The query expansion version, 'DPH + Bo1 QE', uses PRF, which finds potentially relevant terms by first querying the index and then looking for new terms in high-ranking documents. Specifically, 10 terms are added from three PRF documents.
Table footnote: The results of the current Poznan consortium work are shown in italics.
The study in (28) found that the 'DPH + Bo1 QE' run of Terrier 4.0 was statistically significantly better than all other runs, including Terrier's BM25 run, with all other differences not significant. In particular, it was 0.04 better than the Lucene-based solutions for the mean average precision (MAP) at 1000 measure. We corroborated this finding with the relatively successful Poznan University of Technology (PUT) TREC CDS 2016 contribution (29), where Terrier DPH Bo1 was used, and the data consisted of a subset of the PubMed articles.
The baseline information retrieval results are presented in Table 5. Fourteen weighting models implemented in Terrier were tested, with the log-logistic DFR model providing the best infNDCG.
For the bioCADDIE data, which are not continuous data, the best results for infNDCG were, surprisingly, achieved with LGD and not with BB2 (DPH Bo1), which provides the best results for infAP and P@10. These results could not have been predicted before the evaluation of the Challenge results. Therefore, for the original challenge our results could have been 0.02 lower than what we present now.
Our baseline results compare quite favourably with the best original baseline results of the bioCADDIE teams, in spite of the fact that no advanced preprocessing was used. The best Terrier option, LGD, gives an infNDCG value of 0.4355, compared with UCSD 0.4498 (official bioCADDIE evaluation)/0.433 (10), Elsevier 0.4292 (13), UIUC GSIS 0.4207 and SIBTex 0.3898 (17).

Query expansion
Expanding queries by adding potentially relevant terms is a common practice for improving relevance in IR systems. There are many methods of query expansion. Relevance feedback takes the documents at the top of a ranking list and adds terms appearing in these documents to a new query. In this work, we add synonyms and other similar terms to the query terms before the PRF. This type of expansion can be divided into two categories. The first category involves the use of ontologies or lexicons (relational knowledge). In the biomedical area, UMLS, MeSH (22), SNOMED-CT, ICD-10, WordNet and Wikipedia are used (30). Generally, the result of lexicon-type expansion is positive (in the bioCADDIE contest see, for example, (19,20)). We did not use this method in our work because of the lack of access to the MeSH medical text indexer service. The second category is WE, e.g. word2vec, which maps a word to a corresponding vector. This belongs to a class of distributional semantics, feature learning techniques in natural language processing. Such language modelling derives a word space from linguistic items in context. A space with one dimension per word is transformed into a continuous vector space of much lower dimension. Meaning is obtained by defining a distance measure between vectors corresponding to lexical entities (here, words). In the WE query expansion methods, terms are added to a query based on their similarity to the original query terms. Goodwin and Harabagiu (31) used the skip-gram word2vec method for query expansion with a negative effect compared with the baseline, as we did for TREC CDS (29).
Analysis of the effects of query expansion is difficult, as stressed in (32). There, it was shown that various methods gave very different top expansion terms in response to the query 'foreign minorities Germany' in Google (as of April 2009). The methods were automatic query expansion, mutual information, local context analysis, Rocchio, binary independent model, Chi-square, Robertson selection value, Kullback-Leibler and relevance model. Only the binary independent model, Chi-square and Kullback-Leibler gave 'frisians' and 'sorbs' as the top two expanded terms. Some of the methods got none of the intended correct terms among the first eight expanded terms.
In this work, we used MeSH only for filtering, so that query expansion terms stayed in the medical domain. The query was expanded with the most similar terms obtained from a collection of PubMed biomedical journal citations (titles and abstracts) and from the bioCADDIE challenge data collection. Similarity was calculated for each collection using word2vec, an efficient model for learning vector representations of words from unstructured text data (15), with the following parameters:
• PubMed collection: number of dimensions = 100; window size = 5; minimum word count = 10; this resulted in a vocabulary of 1 498 219 words;
• bioCADDIE collection: number of dimensions = 100; window size = 20; minimum word count = 5; this resulted in a vocabulary of 296 503 words.
A similarity threshold was set to 0.9 for vectors generated from PubMed abstracts and 0.8 for vectors calculated on the basis of bioCADDIE datasets (lower values yielded expansion terms that were dissimilar to the query terms).
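Thresholded selection of expansion candidates can be sketched as follows. We assume here that similarity means cosine similarity between word vectors; the three-dimensional toy vectors and their values are hypothetical, whereas the trained models used 100 dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expansion_candidates(term, vectors, threshold):
    """Return (word, similarity) pairs at or above the threshold, best first."""
    if term not in vectors:
        return []
    scored = [(w, cosine(vectors[term], vec)) for w, vec in vectors.items() if w != term]
    return sorted((p for p in scored if p[1] >= threshold), key=lambda p: -p[1])


# Toy vectors (hypothetical values).
toy = {
    "glycolysis": [1.0, 0.0, 0.0],
    "glycolytic": [0.9, 0.1, 0.0],
    "mitochondria": [0.5, 0.5, 0.5],
}
print(expansion_candidates("glycolysis", toy, threshold=0.9))
```

With the threshold at 0.9 only 'glycolytic' survives in this toy vocabulary; lowering the threshold admits progressively less similar words.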
As in (29) and (31), if queries are expanded with WE obtained terms and added to a list of query terms with the same weight as the original terms, the results, in general, get worse, because a query drift is introduced. In Question 9 (question pertains to 'ob' and Mus musculus), adding terms such as 'mouse' or 'mice' to a question does not improve the result.
The most important result of this work is the observation that the results improve if the expanded query terms are given a much smaller weight than the original terms.
The weight of original query terms was set to 100, terms obtained from PubMed to 20 and terms provided with bioCADDIE embeddings to 1. This is justified by the relative smallness of the bioCADDIE dataset.
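The weighting scheme can be illustrated by assembling a query string in which each term carries its weight. The sketch assumes the caret notation (term^weight) of the Terrier query language; the function name and the example terms are hypothetical, and the exact query files used in this work are not reproduced.

```python
def build_weighted_query(original, pubmed_terms, biocaddie_terms,
                         w_original=100, w_pubmed=20, w_biocaddie=1):
    """Join terms as term^weight; a term found in several lists keeps its highest weight."""
    weights = {}
    for terms, weight in ((original, w_original),
                          (pubmed_terms, w_pubmed),
                          (biocaddie_terms, w_biocaddie)):
        for term in terms:
            weights[term] = max(weight, weights.get(term, 0))
    return " ".join(f"{term}^{weight}" for term, weight in weights.items())


# Hypothetical original terms and expansion candidates.
print(build_weighted_query(["leptin", "obesity"], ["adiposity"], ["obese", "leptin"]))
```

Keeping the highest weight for duplicated terms ensures that an original query term is never down-weighted merely because an embedding model also proposed it.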
In (26), we used MeSH not only for filtering but also for query expansion, with positive results. For the purpose of this work, we use MeSH only for filtering because the free access interface was discontinued.
We tried query expansion with WE using two approaches:
1. The skip-gram method (15) on abstracts of the entire PubMed using the Gensim library (33).
2. The GloVe method (16) on free TREC 2016 PubMed documents.
In our case, the vectors obtained from word2vec and GloVe were quite different, and in the case of GloVe gave negative results (data not shown). However, this may be related to the relative smallness of the corpora used. We plan to extend the current work to larger corpora (e.g. 34) for neural network training. We focused on the Terrier Rocchio method, optimizing the beta parameter, the number of top documents and the number of extracted terms to obtain an optimal infNDCG result. Under the same conditions, the Rocchio query expansion method slightly outperforms the Terrier parameter-free expansion method Bo1 (http://terrier.org/docs/v3.5/javadoc/org/terrier/matching/models/queryexpansion/Bo1.html). For LGD with word2vec, the difference is 0.0049. For infAP the reverse occurs: the parameter-free expansion slightly outperforms Rocchio, by 0.0034.
Terrier PRF was configured to use the Rocchio algorithm with the following parameters: number of top documents used for query expansion = 2; number of terms extracted from each document = 2; beta parameter for the Rocchio algorithm = 0.5.
The results of information retrieval with expanded query are presented in Table 6. Once again, LGD was found to provide the best infNDCG measure. The percentage-wise gain obtained by the query expansion over the baseline result is a little over 4%, smaller than achieved in (29). However, the bioCADDIE data have quite irregular structure (some data types missing in many documents), and this might make a difference.

Further analysis
To better understand the results, we evaluated individual questions (Table 7) for our best result: LGD with the query expanded with word2vec and Terrier PRF. Strikingly, the highest value of the measure is for Question 15 (for which, as for Question 7, no evaluation Score 2 was assigned).
Further analysis of which particular databases carry information gain is required. For example, neuromorpho provided 11% of the contribution to the infNDCG measure, although it constitutes <5% of the data volume. Table 8 presents the details of run options for the LGD algorithm using the same or different weights for original and expanded terms and shows that expansion terms should not have the same weight as original terms.
We evaluated the results using the query relevance file with partially relevant documents denoted as non-relevant. We have noticed that search results benefit from query expansion in any form. We have evaluated three forms of expanding the query: no expansion (denoted as NoEXP), Terrier default query expansion (denoted as Terrier) and query expansion with the WEs (denoted as Emb). Results are presented in Table 9.
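The recoding of the relevance file can be sketched as follows. The sketch assumes a four-column TREC-style qrels layout (topic, iteration, document id, grade) in which grade 2 means relevant and grade 1 means partially relevant; it also assumes that negative grades mark unjudged documents and must be left untouched. These conventions and the function name are assumptions for illustration.

```python
def binarize_qrels(lines, relevant_grade=2):
    """Map graded judgments to binary ones; partially relevant becomes 0.

    Lines that do not have exactly four whitespace-separated fields are skipped.
    Negative grades (assumed to mean 'unjudged') are kept as they are.
    """
    recoded = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        topic, iteration, doc_id, grade_text = parts
        grade = int(grade_text)
        if grade < 0:
            binary = grade
        else:
            binary = 1 if grade >= relevant_grade else 0
        recoded.append(f"{topic} {iteration} {doc_id} {binary}")
    return recoded
```

Under these assumptions, only documents judged fully relevant count as hits in the stricter evaluation.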
We can see that the commonly used BM25 and its extension InL2 give surprisingly good results, better than LGD, the best-performing algorithm in the full evaluation. In terms of cumulative gain, TF-IDF is the worst-performing algorithm. The improvement obtained with query expansion is consistent across all algorithms. Combining both types of query expansion gives the best results, reaching a normalized discounted cumulative gain of 0.2687 for the InL2 algorithm and an Average Precision of 0.2086 for the LGD algorithm.
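For readers less familiar with the two measures, the sketch below computes plain Average Precision and plain NDCG (with linear gain) for a single ranked list. These are the classic definitions, not the inferred variants (infAP, infNDCG) used in the challenge, which estimate the same quantities from sampled judgments; the function names are hypothetical.

```python
import math


def average_precision(relevance, total_relevant=None):
    """AP for one ranked list of binary relevance flags (best-ranked first).

    If total_relevant is not given, it is assumed that every relevant
    document appears somewhere in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    denom = total_relevant if total_relevant is not None else hits
    return precision_sum / denom if denom else 0.0


def ndcg(gains):
    """NDCG for one ranked list of graded gains, using linear gain."""
    def dcg(values):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(values, start=1))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0
```

For example, a list with relevant documents at ranks 1 and 3 has an Average Precision of (1/1 + 2/3) / 2.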

Conclusions and future work
The bioCADDIE shared-task challenge fulfilled an important role in the advancement of biomedical information retrieval methods that use data snippets as datasets. Our post-challenge analysis indicates that bioCADDIE data are quite different from continuous biomedical data. Quite a number of documents essentially present the same information, duplicated in NML databases. Manual expansion, in general, makes the results worse. Word2vec-based query expansion improves the results, but the expansion term weights have to be much smaller than the original weights. For word2vec to be effective, the method for calculating the similarity of candidate expansion terms to the original query terms is crucial. In this work, we use pure word2vec.
The work of the Fudan group within the BioASQ contest (43) used deep semantics, comparing query and document text on a sentence basis (D2V, document vectors). D2V-TFIDF, which concatenates dense and sparse semantic representations, performed very well when applied to the ranking of MeSHLabeler. It should be stressed that in (15), the pure word2vec method (with cosine similarity) was presented as better than it actually is by choosing an easy type of corpus, such as countries and capitals. Much better results are obtained when sense disambiguation (44) and hubness reduction are applied to the vector space. For similarity tasks, the results in (45), where three different corrections to word2vec were used (retrofit, hubness removal and ranking-type similarity), are up to 30% better than with the pure method (15). Such a method (extended to relatedness) could allow direct comparison of query and target terms.
Other query expansion schemes based on WE exist (41)(42). Terrier provides a state-of-the-art baseline system, but our view is that PRF and phrase query expansion could be significantly improved within Terrier.
Direct comparison of the results of this work with the original bioCADDIE results is not warranted. Nevertheless, our results are strong. They are close to the top in most measures, and the best in the infAP measure.
To summarize, the main conclusions of this article are the following:
1. Use of language models created on the basis of distributional semantics to expand the query (using WE) has the potential to significantly improve WE results in the near future.
2. Assigning different weights to words in a query, depending on whether the words were added in the expansion process or originate from the original content of the query, significantly improves the result.
3. The selection of the appropriate ranking function and the adjustment of the PRF expansion parameters (the parametric variant with the coefficient beta, using the two best documents instead of the standard three) also affect the result.
In achieving the competitive results of this work, we used no advanced preprocessing, manual tasks or system training. These results could be treated as a new baseline. We believe that by adding the aforementioned elements, particularly in application to individual questions, we can potentially improve infNDCG by 0.05. Even a small improvement amounts to a large economic gain, because a 2012 survey (46) found that doctors performed an average of six professional searches a day in the course of their work.
The bioCADDIE challenge results need to be further analysed to understand which features of the participating teams' algorithms contributed to the effectiveness of results for particular measures. Such an extended analysis was performed for TREC CDS 2014 (47).
A comparison of all bioCADDIE runs based on infAP, infNDCG, NDCG and P@10 shows surprisingly little correlation between the results for these measures (20). The UCSD team was ranked first in terms of infNDCG but would rank ninth based on the classic NDCG metric. The UCSD method was optimized for infNDCG but was not universally strong across measures. This challenge deserves further work and should contribute to the development of a DDI prototype.
Finally, the results of the bioCADDIE effort could be useful for determining the relevance of particular data. For example, the evaluation performed in (48) showed that the genome-wide association studies dataset finder significantly outperformed PubMed in retrieving literature with desired datasets. This could indicate that datasets are more useful than literature for some semantic tasks.