Discovering biomedical semantic relations in PubMed queries for information retrieval and database curation

Identifying relevant papers from the literature is a common task in biocuration. Most current biomedical literature search systems primarily rely on matching user keywords. Semantic search, on the other hand, seeks to improve search accuracy by understanding the entities and contextual relations in user keywords. However, past research has mostly focused on semantically identifying biological entities (e.g. chemicals, diseases and genes) with little effort on discovering semantic relations. In this work, we aim to discover biomedical semantic relations in PubMed queries in an automated and unsupervised fashion. Specifically, we focus on extracting and understanding the contextual information (or context patterns) that is used by PubMed users to represent semantic relations between entities such as ‘CHEMICAL-1 compared to CHEMICAL-2.’ With the advances in automatic named entity recognition, we first tag entities in PubMed queries and then use tagged entities as knowledge to recognize pattern semantics. More specifically, we transform PubMed queries into context patterns involving participating entities, which are subsequently projected to latent topics via latent semantic analysis (LSA) to avoid the data sparseness and specificity issues. Finally, we mine semantically similar contextual patterns or semantic relations based on LSA topic distributions. Our two separate evaluation experiments of chemical-chemical (CC) and chemical–disease (CD) relations show that the proposed approach significantly outperforms a baseline method, which simply measures pattern semantics by similarity in participating entities. The highest performance achieved by our approach is nearly 0.9 and 0.85 respectively for the CC and CD task when compared against the ground truth in terms of normalized discounted cumulative gain (nDCG), a standard measure of ranking quality. 
These results suggest that our approach can effectively identify and return related semantic patterns in a ranked order covering diverse bio-entity relations. To assess the potential utility of our automated top-ranked patterns of a given relation in semantic search, we performed a pilot study on frequently sought semantic relations in PubMed and observed improved literature retrieval effectiveness based on post-hoc human relevance evaluation. Further investigation in larger tests and in real-world scenarios is warranted.


Introduction
Many natural language queries are submitted to search engines on the Web every day, and an increasing number of online search engines target domain-specific search services. For example, Yelp (www.yelp.com) facilitates restaurant searching while PubMed (www.ncbi.nlm.nih.gov/pubmed) retrieves scholarly publications in biomedicine.
Today's search engines typically treat natural language queries as lists of terms and retrieve documents containing those terms. However, documents with different words but similar semantics may be overlooked. Take PubMed (1), the search engine of the biomedical domain, for example. Although the queries chlorthalidone vs hydrochlorothiazide and chlorthalidone versus hydrochlorothiazide are semantically equivalent, PubMed returns 2.5 times more relevant articles when users compare these two drugs using versus than using vs. Such a difference in retrieval effectiveness may be reduced, and user satisfaction maintained, if queries of similar semantic meaning were presented at search time. To this end, this paper learns to discover semantic relations between bio-concepts (such as chemicals and diseases) on the Web to support biocuration and retrieval effectiveness. Specifically, this paper aims to identify semantically similar context words (like the vs and versus example), referred to as context patterns hereafter, in PubMed queries that assert specific relations between two entities. We focus on semantically understanding PubMed queries with exactly two bio-entities because bio-NLP research in entity relations has long focused on relations between two entities: chemical-disease relations (2), protein-protein interaction (3), gene events (4), drug-drug interaction (5) and disease co-morbidities (6).
We present a novel unsupervised framework, SIP (semantically similar pattern finder), that discovers two-argument context patterns that are semantically similar but lexically different. Table 1 shows example SIP discovery of synonymous context patterns associated with semantic bio-relations involving chemicals/drugs (denoted as #C) and diseases (denoted as #D). SIP leverages the semantic information of biological entities in Web queries to differentiate pattern semantics, based on the observation that semantically similar patterns such as #C induced #D and #D due to #C share significantly more chemical and disease pairs among Web queries than patterns like #C induced #D and treatment of #D with #C, which are not semantically similar. Intuitively, SIP estimates patterns' semantic similarity by their distributional similarity, whether their distributional contexts are participating entities or semantic topics. Specifically, the SIP framework discovers patterns of similar semantics in three main steps. First, it determines patterns' participating entities, which constitute the entity space. Next, SIP transforms the entity space into a latent topic space for pattern semantics analysis. It learns the transformation by analyzing PubMed queries using latent semantic analysis (LSA). Finally, SIP yields pattern pairs with high distributional similarity in LSA topics and proposes them as semantically similar patterns.
Our SIP framework is unique in several respects. First, it targets biomedical queries, which are gaining importance in Web searches and biomedical research (1,7). Second, SIP leverages search crowds' wisdom (i.e. user entities in Web queries) to discern context patterns' semantics and estimate patterns' semantic similarity. This makes SIP unsupervised, requiring no training/seed data for related pattern discovery. Third, SIP is one of the pioneering works to analyze pattern semantics based on real-world user queries in either the NLP or the bio-NLP community. Last but not least, SIP exploits LSA to project entities in queries into lower-dimension latent topics, avoiding specificity in entity mentions, and SIP transforms the problem of finding semantically similar patterns into one of finding patterns with distributional similarity in LSA topics.
The results of our work can benefit biocuration and semantic information retrieval. For example, the automated semantically similar patterns can be used by biocurators for assisting bio-relation curation and article triaging (e.g. (8)), or can be passed on to search engines to expand search results for better recall of relevant documents (e.g. (9)). This paper focuses on discovering semantically similar patterns and its evaluation, together with its real-life applications in two use cases (see Application Section for more details).

Related work
Curating relationships between biological entities and concepts is an active task carried out by many groups such as CTD (gene-disease-chemical) (10), BioGrid (protein-protein interaction) (11) and PharmGKB (drug-gene) (12). The proposed work could potentially contribute to improved curation quality and productivity in two main ways: a) our discovered patterns could be directly used by curators to locate relevant papers more effectively (i.e. with better coverage and precision) in their routine literature search; and b) our patterns could be integrated into automated text-mining systems for assisting relation curation. Semantic search, or searching with semantics, has been an area of active research for improving keyword-based retrieval systems by taking semantics into account. Semantics of the documents to be searched or semantics of the search terms may be leveraged in the process. In biomedicine, understanding the semantics of user queries has received much attention since (13,14). For instance, (15) analyzes query length, query specificity and query clarity of TREC and CLEF shared tasks. Another interesting work (16) imposes a position constraint on search terms in retrieved documents. Such an in-proximity constraint aims to preserve semantic relations of search terms in multi-word queries. Moreover, past research has studied the effectiveness of semantically expanding queries on biological entities, concepts, or controlled vocabulary for improved retrieval performance (17,18). Following this line of research and term disambiguation (19), here we aim to understand the semantics of biomedical queries at a deeper level than individual concepts, namely in the form of context patterns and entity relations.
In contrast to the previous work, we are the first to examine the applicability of LSA in query/pattern semantics and to discover semantically similar context patterns in user queries, inspired by the success of using LSA for lexical similarity estimation (20). Furthermore, compared to (21)'s single drug side-effect pattern recognition, we automatically discover bio-relational patterns related to diverse semantics of #C compared with #C, #C in combination with #C, #C #C interaction, #C induced #D, treatment of #D with #C, #D #C deficiency, dietary #C and #D, etc. simply by using bio-entities in PubMed queries as knowledge. The unsupervised nature of our framework makes it highly scalable: needing no seeds, it can easily be extended to cover various entity types (e.g. genes) and to understand the semantics of corresponding relations (e.g. #G responsible for #D where #G denotes genes).

Problem statement
We now formally state the problem that we are addressing: We are given a collection of PubMed queries QL and a context pattern p that specifies a biological relationship between two entities. Our goal is to automatically discover a reasonable-sized set of patterns in QL that are semantically similar to p in biomedical search context. For this, we represent queries in QL as context patterns in entity space and project such representations into latent topic space using LSA, such that patterns' semantic similarity can be estimated by their distributional similarity among LSA latent topics and those patterns having high LSA topic similarity with p can be proposed as its paraphrases. Figure 1 summarizes the workflow of our method while Figure 2 elaborates on semantically similar patterns identification at run-time. Detailed process is discussed in the following sections.

Transforming spaces
We propose to address the problem of finding semantically similar context patterns in an unsupervised manner by finding patterns with high distributional similarity in LSA-learned latent topics. Figure 1 outlines the procedure to transform PubMed queries into patterns in entity space and LSA space for this purpose. Algorithm 1 shows the corresponding steps. Note that we consider SIP unsupervised in that SIP does not require any training/seed data for pattern semantics understanding.
In the first step, we perform stemming and named entity recognition on PubMed queries QL. We use (22) to stem query words (e.g. reduce third-person singular verb 'induces' to the base form 'induce' and plural noun 'differences' to singular 'difference') for pattern analysis. We then use tmChem (23), DNorm (24) and GNormPlus (25) to recognize chemical/drug, disease/disorder and gene/protein mentions in queries, respectively. These are state-of-the-art entity recognition tools that are publicly available (http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/). Although different text genres may lead to different performance, they can in general achieve 0.8-0.9 in F-measure based on previous benchmarking evaluations (2). Sample stemmed and semantically tagged queries are shown in Table 2, where <X> denotes the start of an entity and </X> the end; in our paper X can be C, D or G, which respectively correspond to a chemical, disease and gene entity. Note that our bio-entities are identified in a greedy fashion with priority given to longer text spans. Step 2 of Figure 1 collects queries with exactly two entities into QL′. In contrast to using entity seeds for pattern recognition (21), unsupervised SIP leverages participating entity pairs in user queries to semantically constrain the 'contexts' of the queries' non-entity words (i.e. Step 4a and 4b), thus understanding the semantic relations between entities. The wisdom of search crowds and searchers' perception, encoded in search queries, are also valued in (26,27), and our experiments in the Experiments Section suggest that user entity pairs in Web queries serve as good knowledge to capture query/pattern semantics.
In the third step, dual-entity queries in QL′ are formulated and mapped to distinct context patterns. This is done based on recognized named entity types. For instance, the semantically tagged query <C>chlorthalidone</C> vs <C>hydrochlorothiazide</C> becomes the pattern #C vs #C (see Table 2). Note that this paper focuses on patterns (a) involving two chemicals (e.g. #C vs #C) and (b) between a chemical and a disease (e.g. #D due to #C). Hereafter, we call the former the CC task, discovering semantically similar chemical-chemical patterns, and the latter the CD task, discovering semantically similar chemical-disease patterns.
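The mapping from a tagged query to a context pattern can be illustrated with a minimal sketch. It assumes the <X>...</X> tag format of Table 2; the function name to_pattern and the regular expression are hypothetical and are not taken from the SIP implementation.

```python
import re

# Assumed tag format (after Table 2): <C>...</C>, <D>...</D>, <G>...</G>.
ENTITY_TAG = re.compile(r"<\s*([CDG])\s*>(.*?)<\s*/\s*\1\s*>")

def to_pattern(tagged_query):
    """Return (pattern, entities) for a dual-entity query, else None."""
    entities = [(m.group(1), m.group(2).strip())
                for m in ENTITY_TAG.finditer(tagged_query)]
    if len(entities) != 2:  # Step 2: keep queries with exactly two entities
        return None
    # Step 3: replace each tagged entity by its type placeholder (#C, #D, #G).
    pattern = ENTITY_TAG.sub(lambda m: "#" + m.group(1), tagged_query)
    return " ".join(pattern.split()), entities

print(to_pattern("<C>chlorthalidone</C> vs <C>hydrochlorothiazide</C>"))
```

Queries with fewer or more than two tagged entities yield None, which mirrors the dual-entity restriction of Step 2.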
Inspired by distributional similarity (28)(29)(30), Step 4a learns a pattern's semantics from its contextual/participating entities in PubMed queries, i.e. entity space. For example, the pattern #D associate with #C is distributionally and semantically associated with a set of disease-chemical entity pairs in the query log: <skin necrosis, warfarin>, <myocardial infarction, isoproterenol>, <intraoperative floppy iris syndrome, tamsulosin>, etc. We use an i × j matrix M to represent our context patterns in entity space, where i denotes the number of unique patterns and j the number of unique co-occurring entity pairs. Matrix element M[x,y] records whether entity pair y occurs with pattern x in QL′: value 1 indicates that it does, 0 otherwise. Each of the CC and CD tasks has its own M, ensuring that the subsequent LSA transformation and semantically similar pattern finding are confined to a specific entity type pair. Table 3 shows a sample M for the CC task while Table 4 shows one for the CD task. As we can see in these two sample matrices, the contextual entity pairs (reflected by zeros and ones) coarsely categorize the patterns into upper-left and bottom-right groups. This is essentially how SIP learns to discern pattern semantics.
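As a rough illustration of the entity-space representation, the sketch below builds a binary pattern-by-entity-pair matrix from a list of (pattern, entity pair) observations. The function name build_entity_space and the toy observations are assumptions made for this example; only the first pattern and its two entity pairs come from the text above.

```python
def build_entity_space(observations):
    """observations: list of (pattern, entity_pair) taken from dual-entity queries."""
    patterns = sorted({p for p, _ in observations})
    pairs = sorted({e for _, e in observations})
    row = {p: x for x, p in enumerate(patterns)}
    col = {e: y for y, e in enumerate(pairs)}
    M = [[0] * len(pairs) for _ in patterns]
    for p, e in observations:
        M[row[p]][col[e]] = 1  # 1: pair y occurs with pattern x; 0 otherwise
    return patterns, pairs, M

# Toy data; the last two observations are invented for illustration only.
obs = [
    ("#D associate with #C", ("skin necrosis", "warfarin")),
    ("#D associate with #C", ("myocardial infarction", "isoproterenol")),
    ("#D due to #C", ("skin necrosis", "warfarin")),
    ("treatment of #D with #C", ("asthma", "albuterol")),
]
patterns, pairs, M = build_entity_space(obs)
for name, vector in zip(patterns, M):
    print(name, vector)
```

In this toy matrix the first two patterns share an entity pair while the third shares none, which is the kind of coarse grouping described for Tables 3 and 4.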
Learning pattern semantics from patterns' specific participating entities, however, comes with issues of data sparseness and specificity: a certain entity pair may be mentioned in only a handful of patterns, and entities may be topically related (e.g. carcinoma and tumor are related to cancer, malignant melanoma to skin cancer, simvastatin to statin and simvastatin to lovastatin). Therefore, we further transform entity space into latent topic space to avoid these issues (Step 4b). Specifically, we leverage LSA (31) to learn entity pairs' semantic topics and to reduce dimensionality from the number of distinct entity pairs (j) to the number of distinct LSA semantic topics (t) where t ≪ j. This equates to transforming pattern representations in entity space, M, into pattern representations in LSA topic space, M′. LSA constructs the t-topic semantic space in a number of steps, namely, performing rank-reduced singular value decomposition on the matrix in entity space, retaining the t largest (most significant) singular values and approximating the matrix in the least-squares sense. Finally, a lower-dimension i-by-t matrix approximation (M′) to the original i-by-j matrix (M) is learned in an attempt to model pattern semantics in terms of t LSA topics. Note that although a similar method such as probabilistic LSA (pLSA) (32) could also be used for rank reduction, pLSA does not outperform LSA in either of our tasks.
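The rank reduction described above can be sketched with a truncated singular value decomposition. This is a minimal illustration using NumPy, not the authors' implementation; the function name lsa_project is hypothetical, and taking the rows of U_t Σ_t as the i-by-t pattern representations is one common LSA convention that we assume here.

```python
import numpy as np

def lsa_project(M, t):
    """Project an i-by-j pattern/entity-pair matrix onto t latent topics."""
    M = np.asarray(M, dtype=float)
    # Thin SVD: M = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    t = min(t, len(s))            # cannot keep more topics than singular values
    return U[:, :t] * s[:t]       # rows are patterns in t-topic space (M')

# Toy entity-space matrix (invented): two groups of patterns.
M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 0]]
print(lsa_project(M, 2).shape)
```

Because the rows of Vt are orthonormal, cosine similarities between rows of this projection equal those between rows of the rank-t least-squares approximation of M, so either form can feed the similarity step.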
In this paper, we refer to SIP as an unsupervised framework because it requires no specific manually annotated seeds or training data for pattern semantics analysis. Although the open-source entity recognition tools (i.e. (23)(24)(25)) used in Step 1 need entity annotations, such annotations and these tools are not designed and re-trained for the purpose of discovering context patterns with similar meaning, and entity recognition can always be achieved by less-satisfying dictionary methods.

Discovering semantically similar patterns
Once context patterns are semantically represented in LSA space as the i × t matrix M′, instead of by their lexical forms, SIP estimates patterns' semantic similarity by their distributional similarity in LSA latent topics. SIP proposes semantically similar candidate patterns using the procedure in Figure 2. Algorithm 2 shows the detailed steps.
First, an i × i matrix Sim is initialized to record (semantic) similarity scores between patterns, and an i × N list List to store each pattern's N top-scoring patterns in similarity. As with the space transformations, finding candidates of semantically similar patterns is done independently for each entity type pair. As a result, the similarity calculation of chemical-chemical patterns does not concern that of chemical-disease patterns, and i refers to the number of unique patterns in our CC task or in our CD task.
Next, for patterns p and p′ (p ≠ p′), SIP first extracts their LSA topic vectors from the i × t matrix M′. These vectors represent the patterns in LSA space and describe pattern semantics in t LSA topics. Then, SIP estimates the semantic similarity of patterns p and p′ by the cosine similarity of their LSA t-topic distributions as
$$\mathrm{Sim}(p, p') = \frac{\sum_{t'=1}^{t} V_p[t'] \, V_{p'}[t']}{\sqrt{\sum_{t'=1}^{t} V_p[t']^2} \, \sqrt{\sum_{t'=1}^{t} V_{p'}[t']^2}}$$
where V_x denotes the LSA vector for pattern x and V_x[t′] denotes the scalar component of V_x along the axis of LSA topic t′ (1 ≤ t′ ≤ t).
For each pattern p, SIP yields a set of patterns whose similarity scores are among its top N as its semantically similar candidates. At last, sets of paraphrasable pattern pairs are obtained. Table 1 shows example discovery of semantically similar context patterns on our working prototype.
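The ranking step can be sketched as follows, assuming each pattern already has a topic vector. The names cosine and top_n_similar and the toy two-topic vectors are hypothetical; ties are broken by input order, which is an arbitrary choice made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_n_similar(vectors, n):
    """vectors: dict pattern -> topic vector. Returns pattern -> top-n [(other, score)]."""
    ranked = {}
    for p, vp in vectors.items():
        scores = [(q, cosine(vp, vq)) for q, vq in vectors.items() if q != p]
        scores.sort(key=lambda item: item[1], reverse=True)
        ranked[p] = scores[:n]
    return ranked

# Invented two-topic vectors for three CD patterns.
vecs = {
    "#C induce #D": [0.9, 0.1],
    "#D due to #C": [0.8, 0.2],
    "treatment of #D with #C": [0.1, 0.9],
}
print(top_n_similar(vecs, 1)["#C induce #D"])
```

With these toy vectors, #D due to #C ranks first for #C induce #D, while treatment of #D with #C scores much lower.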

Experiments
SIP is designed to learn the semantics of context patterns by entities involved. Although both scholarly publications and Web queries provide such information (i.e. the entities that patterns keep), we prefer Web queries because user queries tend to bond entities in proximity. As such, SIP is trained and evaluated over Web queries. In this section, we first present our PubMed query data, for discovering semantically similar entity relations or context patterns and the process to construct our test set. Then, we describe the parameter settings for SIP and outline the evaluation process. Finally, experimental results are reported and discussed.

Knowledge source and test set
Knowledge source: PubMed queries
Six months' worth of PubMed queries, 35 968 309 in total (24.3 million unique queries), were collected for our experiment on pattern semantics understanding. Queries with exactly two entities were stemmed, entity-tagged and re-formulated into context patterns following the procedure in Figure 1 for semantically similar pattern finding in Figure 2. Table 5 shows some frequent dual-entity context patterns or entity relations in PubMed queries. Frequent chemical-chemical patterns cover relations of drug/chemical comparison (e.g. #C versus #C), interaction (e.g. #C and #C interaction) and so on, whereas frequent chemical-disease patterns cover semantics of chemical-induced side effects (e.g. #C induce #D), drugs' therapeutic effects (e.g. treatment of #D with #C), etc.

Test set construction
We constructed our test set semi-automatically in two steps. We first ordered PubMed context patterns according to their frequency and the diversity of their participating entity pairs in our query log. We then manually examined the top-ranked patterns and considered a pattern suitable for testing if it was a common, general biomedical pattern (in contrast to specific ones such as #C oxidase #C and #C transporter #C) and was unambiguous about entity relations. Our final test set consisted of 68 chemical-chemical and 120 chemical-disease testing patterns (see Table 6 for examples). For each of these patterns, we performed the evaluation on the list of top-ranked similar patterns returned by SIP.

System settings and evaluation process
System settings for SIP
We evaluated the SIP framework with different numbers of LSA topics: 10, 20, 40, 60, 80, 100, 150, 200 and 300. We started with a small topic number of 10 and increased it in progressively larger steps up to 300 because 300 ± 100 topics have been used to analyze lexical semantics of general documents (33) and because, compared to full-text general documents, we had a much smaller and more constrained vocabulary. In addition, to avoid possible noise in Web queries, we restricted SIP to the most frequent 500, 1000, 1500, 2000, 2500 and 3000 chemical-chemical/chemical-disease entity pairs in PubMed queries when constructing the CC/CD task's entity space in Figure 1.
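For reference, the settings grid implied by the two parameter lists above can be enumerated in a few lines. The variable names are illustrative only.

```python
from itertools import product

TOPIC_COUNTS = [10, 20, 40, 60, 80, 100, 150, 200, 300]      # LSA topics
ENTITY_PAIR_CUTOFFS = [500, 1000, 1500, 2000, 2500, 3000]    # most frequent pairs

# Every (topics, entity pairs) combination is one SIP setting.
settings = list(product(TOPIC_COUNTS, ENTITY_PAIR_CUTOFFS))
print(len(settings))  # 9 x 6 = 54
```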

Evaluation process
All 54 system settings for SIP (9 different numbers of LSA topics × 6 different numbers of frequent entity pairs) were evaluated in our CC and CD tasks. In evaluation, candidate semantically similar pattern pairs were pooled from the 54 SIP alternatives and our baseline, and were manually judged for semantic similarity. As the authors concurred on each other's semantic judgement most of the time (85%) in a pre-experiment analysis, only one of the authors examined the pooled results, blind to their source. In total, 1687 unique pattern pairs in the CC task and 3609 unique pairs in the CD task were manually evaluated and annotated as:
Strict match. A pattern pair is considered to be strict-match if, in biomedical context, its patterns are semantically the same (e.g. #C induce #D and #D due to #C) or highly similar (e.g. #D child #C and pediatric #D #C).
Relaxed match. A pattern pair is considered to be relaxed-match if, in biomedical context, its patterns are semantically related and one of its patterns entails or contextually subsumes the other. For example, #C reduce #C and #C effect on #C are relaxed-match semantically similar patterns since #C reduce #C entails #C effect on #C, whereas #C induce #D and #C induce #D in rat are relaxed-match since #C induce #D subsumes the contexts of #C induce #D in rat (the same applies to #C induce #D and #C induce #D treatment).
No match. A pattern pair is considered to be no-match if it is neither of the above.
Based on the annotations, two standard information retrieval measures, mean reciprocal rank (MRR) and normalized discounted cumulative gain (nDCG) (34), were used to evaluate the systems' ability to return relevant, semantically similar patterns at the top N positions. While MRR measures the effort to locate the first true semantically similar pattern pair in the candidate list (the closer to 1, the less effort), nDCG measures system performance in ranking true semantically similar pairs earlier in the list (the closer to 1, the better the performance). In our experiments, systems were expected to discover strict-match pattern pairs. However, finding relaxed-match ones could also be beneficial to biocuration and information retrieval. For instance, #C reduce #C depicts a specific context of its relaxed-match counterpart #C effect on #C and narrows down the information need in search, and #C induce #D treatment provides the also-want-to-know for its relaxed-match #C induce #D, which indicates an opportunity for automatic query suggestion/completion (35). As a result, we also evaluated systems on finding relaxed-match pattern pairs. Specifically, system performance on discovering strict-match/relaxed-match semantically similar patterns was measured in terms of MRR@N and averaged nDCG@N where N = 1, 3, 5 or 10. Since similar trends were observed across different values of N, we only present the results with N = 3 in the next subsection for simplicity.
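The two measures can be sketched for a single ranked candidate list as below. This is a minimal illustration under stated assumptions: gains are binary (1 for a judged match, 0 otherwise), the discount is log2(rank + 1), and the ideal ranking is obtained by re-sorting the same judged list. The exact gain values and ideal ranking used in the experiments are not given in the text, and the function names are hypothetical.

```python
import math

def reciprocal_rank(relevance, n):
    """1/rank of the first relevant item within the top n, or 0.0 if none."""
    for rank, rel in enumerate(relevance[:n], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def ndcg(relevance, n):
    """nDCG@n with a log2(rank + 1) discount; ideal = same gains sorted descending."""
    def dcg(gains):
        return sum(g / math.log2(k + 1) for k, g in enumerate(gains, start=1))
    ideal = dcg(sorted(relevance, reverse=True)[:n])
    return dcg(relevance[:n]) / ideal if ideal else 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Invented judgements for two test patterns (1 = match, 0 = no match).
judged = [[1, 0, 1], [0, 1, 0]]
print(mean(reciprocal_rank(r, 3) for r in judged))  # MRR@3
print(mean(ndcg(r, 3) for r in judged))             # averaged nDCG@3
```

MRR@N and averaged nDCG@N are then the means of these per-pattern scores over all test patterns.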

Results of chemical-chemical (CC) semantic relations
The performance of SIP on finding strict-match chemical-chemical patterns (i.e. the strict-match CC task) is summarized in Figure 3. In this figure, histograms represent the (a) MRR and (b) nDCG performance of different SIP settings concerning the LSA topics and the most frequent entity pairs, and colors are used to differentiate LSA topic numbers. For instance, green bars, labelled as T60, denote the SIP performance when set with 60 LSA topics. The 60-topic SIP (i.e. green bars) performed differently when combined with different numbers of frequent entity pairs (i.e. 500, 1000, 1500, 2000, 2500, 3000): it achieved around 0.4 MRR using 500 frequent entity pairs but around 0.8 MRR using 3000. The results of T200 and T300 are omitted as system performance degraded drastically after T100 (i.e. at T150, T200 and T300). Figure 3 also plots SIP's best performance (i.e. the solid lines) with respect to each number of entity pairs used. For example, when using 2000 frequent entity pairs, SIP achieved its best MRR of 0.81 with 20 LSA topics, hence the label T20 0.81. For comparison, the dotted lines represent the performance of our baseline, which simply estimated patterns' semantic similarity by the cosine similarity of their specific participating entity pairs in the queries without using LSA topic information. In other words, our baseline is the SIP framework without the latent semantic analysis component (i.e. Step 4b in Figure 1).
As shown in Figure 3, the performance of smaller topic numbers (t ≤ 80) tends to improve with increasing entity pairs and becomes steady at 2500-3000 entity pairs: increasing the number of frequent entity pairs from 500 to 1000 gave MRR and nDCG the largest margin of improvement whereas increasing from 1000 to 1500 yielded the second largest. Nonetheless, with larger topic numbers (t ≥ 150), SIP did not always benefit from the entity pair increase and did not perform well.
Encouragingly, SIP with small topic numbers significantly outperformed the baseline, which also tends not to benefit from using more entity pairs. SIP achieved the highest MRR score of 0.86 and the highest nDCG score of 0.87 when as few as 20 LSA topics were used with 3000 entity pairs. An MRR and nDCG above 0.85 indicate that the first-ranked candidate pairs were almost always correct.

Results of chemical-disease (CD) semantic relations
Using the same strict-match criterion and figure configuration as in Figure 3, Figure 4 summarizes the results on discovering semantically similar chemical-disease semantic relations. Similar to the CC task, SIP generally benefited from more entity pairs in the CD task and the 500-1000 entity pair increase led to SIP's largest margin of improvement. Again, SIP outperformed the baseline by a large margin. One thing worth mentioning is that, compared to the CC task, both SIP and the baseline yielded lower performance: while SIP dropped from an MRR of 0.86 to 0.73 and an nDCG of 0.87 to 0.74, the baseline dropped drastically to an MRR of 0.28 and an nDCG of 0.28. This is mainly because our CD task contained a broader spectrum of semantic contexts/relations (i.e. the chemical-disease relations in PubMed queries were more diverse).
Figure 3. System performance on the CC task with different LSA topic numbers (10-150) and different numbers of the most frequent entity pairs (500-3000). Strict match is required. The solid line represents best-performing SIP while the dotted line represents the baseline.
Since discovering relaxed-match patterns can also be beneficial, we further examined system performance with both strict- and relaxed-match patterns allowed. Figure 5 reports corresponding nDCG results on our CC and CD tasks. As expected, SIP gained from relaxing the matching criterion and achieved an improved performance of nDCG closer to 0.9 and 0.85 in semantically understanding the chemical-chemical and chemical-disease patterns, respectively.
Overall, entities in user queries serve as good knowledge to differentiate query semantics. Projecting user entities into LSA latent topics further helps discover semantically similar entity-relations, or context patterns, on the Web. Also, compared to word sense induction in general documents, smaller LSA topic numbers in the range of 30 ± 10 can yield the best results for biomedical strict-match CC and CD tasks. Using as few as the 2500-3000 most frequent entity pairs in the PubMed query log can achieve satisfying performance across diverse semantic relations.
Tables 7 and 8 show SIP-proposed strict-match patterns in the CC and CD tasks, respectively. They are consolidated across different SIP settings at N = 10. SIP effectively discovered synonymous patterns for a spectrum of entity relations (e.g. #C vs #C and comparison between #C and #C; combine #C and #C and #C in combination with #C; #C induce #D and #D due to #C). One thing worth mentioning is that SIP discovered the multiple-sense patterns #C with #C and #D with #C and semantically associated them with the senses #C combine with #C and #C interaction with #C for CC (Table 7), and the senses #D associate with #C and #D treatment with #C for CD (Table 8), respectively. Although SIP focused on context patterns instead of words to avoid ambiguity and yielded satisfying results, it would be interesting to investigate the impact of such ambiguous context patterns (e.g. short context patterns containing only prepositions) on SIP in the future.

Discussion
In the experiments, we compared SIP with our baseline to highlight the importance of LSA space transformation. We did not directly compare our method with other pattern recognition methods such as (21) that are based on entity co-occurrence at either sentence or abstract level because our goal was to discover information needs (relations) that are frequently sought by PubMed users. Additionally, SIP was designed to work in query space where semantics (in queries) tend to be clear and specific, and entities and relations (in queries) tend to bond in proximity, which may not hold for some entity relations in literature space. Finally, SIP was designed to discover semantically similar patterns without supervision. That is, no training/seed data are required for the purpose of pattern semantics understanding. Therefore, it is not straightforward to compare SIP with traditional methods which require and start with entity seeds describing the pattern (e.g. (21) leverages chemical-disease entity seeds having chemical-induced-disease relation to recognize the said relation). Nonetheless, when examining the output of (21), we observed complementary results for the chemical-cause-disease relation. For instance, SIP had query-specific #D from #C but missed #D during #C.

Applications of SIP-derived semantically similar patterns
In this section, we apply SIP strict-match pattern pairs to two specific biomedical tasks: biomedical document retrieval and bio-entity relation extraction (See Figure 1). We show that SIP output can benefit the process of biocuration and semantic information retrieval (IR).

Biomedical document retrieval
In the field of biology and life sciences, where entities have abundant aliases, retrieving documents containing the exact user search words may not be sufficient. As a result, PubMed (1) uses Medical Subject Headings (i.e. MeSH terms) expansion by default (9) and searches for query words not only in documents but also in associated MeSH headings. By doing so, PubMed alleviates the issue of biomedical term mismatch between document words and query words. Take the query albuterol for example. PubMed will return documents containing albuterol and documents without albuterol but (annotated) with the same MeSH heading as albuterol. Thus, documents not containing albuterol but containing synonyms of albuterol, such as proventil, salbutamol and ventolin, will also be returned. Nonetheless, since general terms/phrases are out of the scope of MeSH headings, PubMed can still suffer from general-purpose vocabulary mismatch during search. Consider a real user query albuterol vs levalbuterol. PubMed's retrieval effectiveness could be improved if PubMed semantically understood the query by exploiting SIP synonymous pattern pairs, (#C vs #C, #C versus #C) in this case, and returned the accumulated search results from both the original query albuterol vs levalbuterol and its SIP-motivated counterpart albuterol versus levalbuterol (see Table 9). As shown, PubMed retrieves relatively 118% more documents for the new query (35 documents vs 16 documents). In addition, examining the retrieved PubMed titles shows that with SIP's query expansion, adding albuterol versus levalbuterol to albuterol vs levalbuterol, one can obtain relatively 100% more relevant documents (22 vs 11) in this case. Retrieving more relevant documents is essential to biocuration, semantic IR, and the article triaging tasks of many biomedical shared challenges (36).
The benefit of SIP in semantic IR, alleviating vocabulary mismatch that is not covered by MeSH, can also be observed in another real user query methotrexate combined with tofacitinib where SIP proposes pattern #C combine with #C and #C in combination with #C are synonymous (see Table 10). Based on the first-page PubMed responses shown in Table 10, PubMed clearly achieves better retrieval performance with the expanded query methotrexate combined with tofacitinib OR methotrexate in combination with tofacitinib. In the near future, the applicability of SIP patterns in PubMed literature search as query expansion will be examined more extensively and quantitatively.
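The query expansion described above can be illustrated with a minimal Python sketch. It is not the SIP implementation: the chemical lexicon `CHEMICALS` stands in for an automatic entity tagger, the `SYNONYMS` table holds only the two pattern pairs quoted in this section (in surface form rather than stemmed form), and the function names are hypothetical.

```python
import re

# Assumption: a toy lexicon standing in for an automatic chemical tagger.
CHEMICALS = {"albuterol", "levalbuterol", "methotrexate", "tofacitinib"}

# Assumption: synonymous pattern pairs quoted in the text, kept in surface
# form here for simplicity (SIP itself works on stemmed patterns).
SYNONYMS = {
    "#C vs #C": ["#C versus #C"],
    "#C combined with #C": ["#C in combination with #C"],
}


def to_pattern(query):
    """Replace known chemical names with #C; return (pattern, entities)."""
    tokens, entities = [], []
    for tok in query.split():
        if tok.lower() in CHEMICALS:
            entities.append(tok)
            tokens.append("#C")
        else:
            tokens.append(tok)
    return " ".join(tokens), entities


def fill(pattern, entities):
    """Put the original entities back into a pattern, left to right."""
    remaining = iter(entities)
    return re.sub(r"#C", lambda m: next(remaining), pattern)


def expand_query(query):
    """OR the original query with its synonymous-pattern rewrites, if any."""
    pattern, entities = to_pattern(query)
    variants = [query] + [fill(p, entities) for p in SYNONYMS.get(pattern, [])]
    return " OR ".join(variants)


print(expand_query("albuterol vs levalbuterol"))
print(expand_query("methotrexate combined with tofacitinib"))
```

A query whose pattern has no known synonym is returned unchanged, so the expansion can only add results to those of the original query.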

Biomedical relation extraction
Similar to many bio-NLP challenge tasks such as chemical-disease relation extraction (2), protein-protein interaction extraction (3), drug-drug interaction extraction (5) and identification of gene events (4) and disease co-morbidities (6), SIP focuses on two-argument (dual-entity) relations. In this subsection, we examine the applicability of SIP to a real-life relation extraction problem and compare its effectiveness in helping biocuration with that of a simple co-occurrence method. Specifically, we exploit SIP strict-match patterns to address the 2016 BioCreative chemical-disease relation extraction subtask (2): extraction of chemical-induced-disease (CID) relations. We experiment on the 2016 official development set, consisting of 500 PubMed abstracts, and extract CID relations in three steps. First, starting with a representative CID pattern, #C induce #D (words are stemmed), we consolidate its SIP strict-match patterns from different settings. This process examines newly discovered patterns and adds their synonymous patterns iteratively. In total, we collect 24 SIP context patterns associated with the CID relation, including #C cause #D, #D due to #C, #D associate with #C, #D cause by #C and #C and the risk of #D. Second, for each chemical-disease pair in a PubMed abstract, we extract its context (or contextual words) in the abstract and attempt to find the best match between the contextual words and one of the 24 SIP patterns, if any.
Table 11 shows example PubMed contextual words surrounding chemical-disease pairs, together with their best-matched SIP CID patterns. Note that in this step we require the chemical and the disease to appear in the same sentence. Finally, we consider the chemical-disease pairs whose PubMed contextual words match one of our SIP patterns to be candidates for a CID relation.
Table 9. PubMed responses to query submission (a) albuterol vs levalbuterol and (b) albuterol vs levalbuterol OR albuterol versus levalbuterol, where (a) is the original user query and (b) is the new query expanded from (a) using SIP pattern knowledge. (a) Search results for the original query albuterol vs levalbuterol. (b) Search results for the new query albuterol vs levalbuterol OR albuterol versus levalbuterol.
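The three steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure actually used: the synonym links in `SIP_PAIRS` are invented (only the pattern strings themselves come from the text), `toy_stem` is a crude stand-in for a real stemmer, entities are assumed to be single tokens, and "best match" is reduced to the first pattern that occurs as a contiguous token run in the sentence.

```python
from collections import deque

# Assumption: a hypothetical synonym graph. The pattern strings are quoted
# from the text; the links between them are invented for illustration.
SIP_PAIRS = {
    "#C induce #D": ["#C cause #D", "#D due to #C"],
    "#C cause #D": ["#D cause by #C"],
    "#D due to #C": ["#D associate with #C"],
}


def consolidate(seed, pairs):
    """Step 1: iteratively add the synonyms of newly found patterns."""
    seen = [seed]
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for synonym in pairs.get(current, []):
            if synonym not in seen:
                seen.append(synonym)
                queue.append(synonym)
    return seen


def toy_stem(word):
    """Assumption: crude suffix rewriting in place of a real stemmer."""
    w = word.lower()
    if len(w) > 4:
        for suffix, repl in (("ed", "e"), ("es", "e"), ("s", "")):
            if w.endswith(suffix):
                return w[: -len(suffix)] + repl
    return w


def match_pattern(sentence, chemical, disease, patterns):
    """Steps 2 and 3: map one sentence to a pattern, or None if no match."""
    tokens = []
    for tok in sentence.rstrip(".").split():
        low = tok.lower().strip(",;")
        if low == chemical.lower():
            tokens.append("#C")
        elif low == disease.lower():
            tokens.append("#D")
        else:
            tokens.append(toy_stem(low))
    text = " " + " ".join(tokens) + " "
    for pattern in patterns:
        if " " + pattern + " " in text:
            return pattern
    return None


patterns = consolidate("#C induce #D", SIP_PAIRS)
# Made-up example sentence, not taken from the development set.
print(match_pattern("Hepatotoxicity caused by methotrexate was reported.",
                    "methotrexate", "hepatotoxicity", patterns))
```

Because the matcher works on one sentence at a time, the same-sentence requirement is satisfied by construction; a pair that merely co-occurs without a matching context returns `None` and is not proposed as a CID candidate.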
As Table 12 shows, the above pattern-matching approach assisted by SIP output outperforms the two co-occurrence baselines, which propose all chemical-disease pairs co-occurring in an abstract or in a sentence as CID candidates, by a relative 47% and 10%, respectively. We believe that, without the computational overhead of stemming and machine learning/training, such an approach can be a first step toward accelerating biocuration, and that its performance in relation extraction can be further improved by incorporating more CID patterns and/or by combining it with machine learning techniques.