PIPE: a protein–protein interaction passage extraction module for BioCreative challenge

Identifying the interactions between proteins mentioned in the biomedical literature is one of the frequently discussed topics of text mining in the life science field. In this article, we propose PIPE, an interaction pattern generation module used in the Collaborative Biocurator Assistant Task at BioCreative V (http://www.biocreative.org/) to capture frequent protein-protein interaction (PPI) patterns within text. We also present an interaction pattern tree (IPT) kernel method that integrates the PPI patterns with the convolution tree kernel (CTK) to extract PPIs. Methods were evaluated on the LLL, IEPA, HPRD50, AIMed and BioInfer corpora using cross-validation, cross-learning and cross-corpus evaluation. Empirical evaluations demonstrate that our method is effective and outperforms several well-known PPI extraction methods. Database URL:


Introduction
Due to the rapidly growing number of research articles, researchers have found it difficult to retrieve the articles of interest to them. To identify the specific ones that meet their requirements, biomedical researchers tend to leverage the relationships between entities mentioned in these publications. Among all types of biomedical relations, protein-protein interaction (PPI) plays a critical role in the field of molecular biology due to the increasing demand for the automatic discovery of molecular pathways and interactions from the literature (1,2). Understanding PPIs can help in predicting the function of uncharacterized proteins by distinguishing their role in the PPI network or comparing them to proteins with similar functionality (3). Additionally, composing networks of molecular interactions is useful in identifying functional modules or uncovering novel associations between genes and diseases. In essence, the ultimate goal of PPI extraction is to recognize various interactions, including transcriptional and translational regulations, post-translational modifications and dissociation between proteins within the biomedical literature (4), so as to find the criteria for judging whether a pair of proteins in the same sentence interact.

To extract PPIs from the biomedical literature in an effective manner, we present PIPE, a module named for PPI Pattern Extraction and used in the Collaborative Biocurator Assistant Task (BioC) (5) at BioCreative V. The purpose of BioC is to create BioC (6)-compatible modules which complement one another and to integrate them into a system that assists BioGRID curators. The track is divided into eight subtasks, each of which was addressed independently, including gene/protein/organism named entity recognition and protein-protein/genetic interaction passage identification. The mission of our team is to develop a module that can identify passages that convey interactions between protein mentions. To develop a PPI passage extraction module, we model PPI extraction as a classification problem and propose an IPT structure to represent syntactic, content and semantic information in text. The CTK is then adopted to integrate the IPT with support vector machines (SVMs) to identify sentences referring to PPIs within the biomedical literature. Results of experiments demonstrate that the proposed IPT kernel method is effective in extracting PPIs. In addition, the proposed interaction pattern generation approach successfully exploits the interaction semantics of text by capturing frequent PPI patterns. The method consequently outperforms the feature-based PPI methods (7-10), the kernel-based PPI methods (4, 11-13) and the shortest path-enclosed tree (SPET) (14) detection method, which are all widely used to identify relations between named entities. Our method also achieves performance comparable to that of multi-kernel-based methods (7,15,16).

© The Author(s) 2016. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
The rest of the article is organized as follows. In 'Related Work' section, we review previous work, and briefly introduce kernel-based PPI methods. We describe the proposed PIPE system in 'Methodology' section. 'Experiments' section shows the experimental results and presents further comparison of our work with related work. Finally, we conclude our work in 'Concluding Remarks' section.

Related work
Most PPI extraction methods can be regarded as supervised learning approaches. Given a training corpus containing a set of manually tagged examples, a supervised classification algorithm is employed to train a PPI classifier to recognize whether an interaction exists in a sentence. Feature-based approaches and kernel-based approaches are frequently used for PPI extraction, where the former exploit instances of both positive and negative relations in a training corpus to identify effective text features. For instance, Landeghem et al. (8) proposed a rich-feature-based (RFB) method that applied feature vectors in combination with automated feature selection for PPI extraction. In addition, a co-occurrence-based method was introduced by Airola et al. (7), which explored co-occurrence features of dependency graphs for representing the sentence structure. However, feature-based methods often have difficulty in finding effective features to extract entity relations (14). To address this problem, kernel-based methods have been proposed to implicitly explore various features in a high dimensional space by employing a kernel to directly calculate the similarity between two objects (17). Formally, a kernel function is a mapping $K: X \times X \to [0, \infty)$ from input space $X$ to a similarity score $K(x, y) = \sum_i \phi_i(x)\phi_i(y) = \phi(x) \cdot \phi(y)$, where $\phi_i(x)$ is a function that maps $X$ to a higher dimensional space without needing to know its explicit representation. Such a kernel function makes it possible to compute the similarity between objects without enumerating all features, thereby reducing the burden of feature engineering on structured objects in Natural Language Processing (NLP) research, such as the tree structure in PPI extraction (18,19). Examples include Erkan et al. (20), who defined two kernel functions based on the cosine similarity and the edit distance among the shortest paths between protein names in a dependency parse tree. Sætre et al. (9) also developed a system named AkanePPI, which extracted features by combining a deep syntactic parser, used to capture the semantic meaning of the sentences, with a shallow dependency parser for the tree kernels. The latter enabled automatic generation of rules to identify pairs of interacting proteins from a training corpus.

The tree kernel-based method is widely used in PPI extraction due to its capability to utilize the structured information derived from sentences, especially knowledge of constituent dependencies. Vishwanathan and Smola (12) proposed a subtree (ST) kernel which considered all mutual subtrees in the tree representation of two compared sentences. Here an ST comprised a node with all its descendants in the tree; two STs were considered identical if the nodes in both STs had identical labels and the same order of children. Likewise, Collins et al. (21) introduced a subset tree (SST) kernel that relaxed the constraint requiring all leaves to be included in the substructures at all times, while preserving the grammatical rules; for any given tree node, either none or all of its children were included in the resulting subset tree. In addition, Moschitti (11) adopted a partial tree (PT) kernel which was more flexible in that it allowed virtually any tree sub-structure; the only constraint was that the order of child nodes must be identical. Both SST and PT kernels are CTKs. Kuboyama et al. (13) proposed a spectrum tree (SpT) kernel which emphasized the simplest syntax-tree substructures among these four tree kernels; it compared all directed vertex-walks, each of which was represented by a sequence of edges connecting syntax tree nodes of length q. When comparing two protein pairs, the number of shared sub-patterns, or tree q-grams, was measured as the similarity score.
Current studies attempt to use multiple kernels to overcome the information loss of single kernel approaches. For instance, Miwa et al. (16) proposed a composite kernel (CK) approach for PPI extraction that extracted and combined several different layers of information from a sentence with its syntactic structure by using several parsers. They outperformed other state-of-the-art PPI systems on four out of the five corpora because the combination of multiple kernels and parsers could gather more information and cover a certain fraction of the losses. In addition, Giuliano et al. (15) defined the Shallow Linguistic (SL) kernel as the sum of the global context kernel and the local context kernel. For the global context kernel, the feature set was generated based on the position of words appearing in a sentence under three types of patterns ('fore-between', 'between' and 'between-after') relative to the pair of investigated proteins. Each pattern was represented using 'bag of words' as a term frequency vector; the global context kernel was in turn defined as the total count of mutual words in these three vectors. For the local context kernel, they utilized orthographic and SL features of sentences with respect to the candidate proteins of the pair, for which the similarity was calculated using the dot product. On the other hand, Airola et al. (7) integrated a parse structure sub-graph and a linear order sub-graph to develop the all-paths graph kernel (GK). The former sub-graph represented the parse structure of a sentence and included word or link vertices; a word vertex contained its lemma and its part-of-speech (POS), while a link vertex contained its link only. Both types of vertices possessed their positions relative to the shortest path. The linear order sub-graph represented the word sequence in the sentence. Thus, it accommodated word vertices, each of which contained its lemma, relative position to the target pair and POS. The experimental results demonstrated that their method is effective in retrieving PPIs from the biomedical literature.
The above discussion suggests that the hierarchically structured features in a parse tree might not be fully utilized in previous work. We believe that the tree structure features could play a more important role than previously reported. Since convolution kernels (22) are capable of capturing structured information in terms of sub-structures (which provides a viable alternative to flat features), we integrate the syntactic, content and semantic information of text into an interaction pattern tree structure to capture the sophisticated nature of PPIs. The concept is incorporated in PIPE to discriminate interactive text segments.

Methodology
Figure 1 shows an overview of the proposed PPI extraction method, which comprises three key components: 'interaction pattern generation', 'IPT construction' and 'CTK'. The paragraphs and the related protein names in them are extracted from the original XML files with the help of the official BioC API, while candidate sentence generation produces a set of candidate sentences by capturing every sentence that contains at least two different protein names. The candidate sentences then undergo the semantic class labeling (SCL) process, which helps group synonyms together. Since we treat PPI extraction as a classification problem, we use the interaction pattern generation component to automatically produce representative patterns for mentioned interactions between proteins. Subsequently, the IPT construction is used to integrate the syntactic and content information with the generated interaction patterns for text representation. Finally, the CTK measures similarity between IPT structures via SVM to classify interactive expressions, and the results are saved in XML format using the official BioC API. Each component is elucidated in detail in the following sections.

Candidate sentence generation
Our system is constructed with the official BioC library. The API offers built-in functions for parsing the documents into paragraphs and retrieving the annotations. It also provides a function for separating the sentences in a paragraph; nevertheless, to be able to use this function, the input XML files are required to place each sentence between specific tags. Since the example files for this sub-task do not possess such information, we had no other option but to write our own sentence splitter. To extract the sentences that contain a possible PPI from the paragraphs, we retrieve each sentence that has at least two different protein names. Since the names are specified with annotations and are already available in the metadata that comes with each paragraph, we can do a first-level filtering of each paragraph to see whether it contains at least two different proteins; the program proceeds only if it does. We save the distinct protein names in a paragraph to a set PG.
Next, we segment the sentences in a paragraph. Intuitively, using the string splitting functions of a programming language library with the period as the separator seems like a good choice; unfortunately, biomedical documents tend to use periods for purposes other than dividing sentences, such as in floating-point numbers or abbreviations. As such, we took a rule-based approach by inserting several conditions for the program to ignore the period under certain situations. Examples include neglecting the period in Figure 3, 'ph 6.5', 'Lin et al.', etc. After each paragraph has been broken down, we save the sentences into a set $S = \{s_1, \ldots, s_i\}$, where each sentence is denoted as $s_i$. Each sentence is tokenized to find the protein names, since most protein names exist in the form of a unigram. For protein names that contain spaces, a substring match against the whole sentence is used. In cases where more than one protein exists in a single sentence, a function is used to list all protein names contained in the sentence and save them into a temporary set K. The program then generates all possible pair-wise protein combinations, each of which is attached to the end of the original sentence to produce a 'candidate sentence'. The algorithm is illustrated in Figure 2. The output is then written to a text file, each line of which consists of the original sentence and two protein names.
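The rule-based splitting described above can be sketched as follows. This is only an illustration under stated assumptions: the abbreviation list and the function name `split_sentences` are hypothetical and do not reproduce the rules actually used in PIPE. A period counts as a sentence boundary only when it is followed by whitespace and an uppercase letter and does not end a listed abbreviation, so decimal numbers such as 6.5 are never split.

```python
import re

# Hypothetical abbreviation list; the actual PIPE rules are not given in the text.
ABBREVIATIONS = ("et al.", "Fig.", "e.g.", "i.e.")

def split_sentences(paragraph):
    """Split a paragraph at periods that are followed by whitespace and an
    uppercase letter, unless the period terminates a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+(?=[A-Z])", paragraph):
        candidate = paragraph[start:match.start() + 1]
        if candidate.endswith(ABBREVIATIONS):
            continue  # abbreviation, not a sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = paragraph[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = ("As shown by Lin et al. Sec24p binds Sec16p. "
        "The buffer was at pH 6.5 throughout.")
result = split_sentences(text)
```

A lookahead is used so that the uppercase letter starting the next sentence is not consumed by the match.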
Provided with the generated candidate sentences, we further process them with normalization and parsing before feeding them to the next step; normalization replaces all protein names found in a sentence with the label 'PROTEIN', whereas parsing identifies the POS of each term. For instance, the sentence 'We have identified a third Sec24p family member, which we call Iss1p, as a protein that binds to Sec16p' contains the recognized genes 'Sec24p', 'Iss1p' and 'Sec16p'; thus we obtain a corresponding entity set $E = \{s_1, p_1, p_2, p_3\}$ = {'We have identified a third Sec24p family member, which we call Iss1p, as a protein that binds to Sec16p.', 'Sec24p', 'Iss1p', 'Sec16p'}, which produces the candidate sentences $\{s_1, p_1, p_2\}$, $\{s_1, p_2, p_3\}$ and $\{s_1, p_1, p_3\}$. The corresponding normalized sentences ($n_i$) and parsed sentences ($q_i$) for an original sentence $s_i$ are added to its candidate sentence to form 'expanded candidate sentences'; in this case, we get $\{s_1, n_1, q_1, p_1, p_2\}$, $\{s_1, n_2, q_2, p_2, p_3\}$ and $\{s_1, n_3, q_3, p_1, p_3\}$, as illustrated in Figure 3. To explore in more detail, the content of the expanded candidate sentence set $\{s_1, n_2, q_2, p_2, p_3\}$ is shown in Figure 4 as an example. The data are now ready for SCL.
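As a minimal sketch of the pair generation and normalization steps, the snippet below applies them to the Sec24p example. The function names are hypothetical, and plain substring matching is assumed for every protein name (the text states that substring matching is used for names containing spaces); POS parsing is omitted.

```python
from itertools import combinations

def candidate_sentences(sentence, paragraph_proteins):
    """Return one (sentence, protein_a, protein_b) triple per protein pair
    found in the sentence. Substring matching is an assumption here."""
    found = sorted(p for p in paragraph_proteins if p in sentence)
    return [(sentence, a, b) for a, b in combinations(found, 2)]

def normalize(sentence, proteins):
    """Replace every protein name with the label PROTEIN, longest names
    first so that a name containing a shorter name is replaced whole."""
    for name in sorted(proteins, key=len, reverse=True):
        sentence = sentence.replace(name, "PROTEIN")
    return sentence

s1 = ("We have identified a third Sec24p family member, which we call "
      "Iss1p, as a protein that binds to Sec16p.")
proteins = {"Sec24p", "Iss1p", "Sec16p"}
pairs = candidate_sentences(s1, proteins)
normalized = normalize(s1, proteins)
```

Three proteins in one sentence yield three candidate sentences, matching the example in the text.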

Learning interaction pattern from biomedical literature
The human perception of a PPI is obtained through the recognition of important events or semantic contents to rapidly narrow down the scope of possible candidates. For example, when an expression contains strongly correlated words such as 'beta-catenin', 'alpha-catenin 57-264' and 'binding' at the same time, it is natural to conclude that this is an expression of a PPI, and it is less likely to be a non-interactive one. This phenomenon can be used to explain how humans can skim through an article to quickly capture the interactive expressions. In light of this rationale, we propose an interaction pattern generation approach to automatically produce representative patterns from sequences of PPI expressions.
We formulate interaction pattern generation as a frequent pattern mining problem, starting by feeding the expanded candidate sentence sets obtained in the previous phase into the SCL process. To illustrate the process of SCL, consider the instance $I_n$ = 'Abolition of the gp130 binding site in hLIF created antagonists of LIF action', as shown in Figure 5. 'gp130' and 'hLIF' are two given protein names, first tagged as PROTEIN1 and PROTEIN2, respectively. The remaining tokens are then stemmed using the Porter stemming algorithm (23). Finally, the trigger words 'bind' and 'antagonist' are labeled with their corresponding types by using our compiled trigger word list extracted from a BioNLP corpus (24). Evidently, SCL can group synonyms together under the same label. This enables us to find distinctive and prominent semantic classes for PPI expressions in the following stage.
After SCs are introduced into the sequences, we construct a graph based on the co-occurrence of distinct SCs to describe the strength of relations between them. Since these sequences are of an ordered nature, the graph is directed and can be made with association rules. In order to avoid generating interaction patterns with insufficient length, we empirically set the minimum support of a SC to 20 and the minimum confidence to 0.5 in our association rules.
According to (25), rule support and confidence are two measures of rule value. Typically, association rules are considered valuable if they satisfy both a minimum support threshold and a minimum confidence threshold. Therefore, in this article, we set the minimum support of an SC to 20; i.e. we only consider SCs whose frequency of occurrence is more than 20. This setting is derived from the observation that the rank-frequency distribution of SCs follows Zipf's law (26), as does the normalized frequency of interaction patterns. SCs with lower frequencies are generally irrelevant to PPI. For that reason, we select the most frequently occurring SCs with accumulated frequencies exceeding 70% of the total SC frequency count in the positive PPI sentences. An association rule is represented as Equation (1). Figure 6 is an illustration of a semantic graph. In this graph, vertices ($SC_x$) represent semantic classes, and edges represent the co-occurrence of two classes, $SC_i$ and $SC_j$, where $SC_i$ precedes $SC_j$. The number on an edge denotes the confidence of the two connected vertices.

After constructing all semantic graphs, we generate interaction patterns by applying random walk theory (27) in search of high-frequency and representative classes for PPIs. Assuming we have a semantic graph $G$ defined as $G = (V, E)$ ($|V| = v$, $|E| = u$), a random walk process consists of a series of random selections on the graph. Every edge $E_{nm}$ has its own weight $M_{nm}$, which denotes the probability of a semantic class $SC_n$ being followed by another class $SC_m$. For each class, the sum of weights to all neighboring classes $N(SC_n)$ is defined as Equation (2), while the probability matrix of the graph is defined as Equation (3). A series of random walk steps thus essentially becomes a Markov chain.
According to (28), the cover time of a random walk process on a normal graph satisfies $\forall SC_n,\ E_n \le 4u^2$, with the selection of frequent SCs and their neighbors as the starting nodes of a random walk process. We conclude that by using random walk to find frequent patterns on the interactive graph, we can not only capture combinations with low probability but also shorten the processing time.
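The graph construction and random walk can be illustrated with the sketch below. Equations (1) to (3) are not reproduced in this text, so the definitions used here are assumptions: the support of an SC is the number of sequences containing it, the confidence of an edge from $SC_i$ to $SC_j$ is the fraction of sequences containing $SC_i$ in which $SC_i$ precedes $SC_j$, and the walk picks the next class with probability proportional to edge confidence. All function names are hypothetical.

```python
import random
from collections import Counter

def build_semantic_graph(sequences, min_support=20, min_confidence=0.5):
    """Return {sc_i: {sc_j: confidence}} for ordered co-occurrences that
    pass the (assumed) support and confidence thresholds."""
    support = Counter(sc for seq in sequences for sc in set(seq))
    pair_count = Counter()
    for seq in sequences:
        seen = set()
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if a != b and (a, b) not in seen:
                    seen.add((a, b))  # count each ordered pair once per sequence
                    pair_count[(a, b)] += 1
    graph = {}
    for (a, b), n in pair_count.items():
        if support[a] < min_support or support[b] < min_support:
            continue
        confidence = n / support[a]
        if confidence >= min_confidence:
            graph.setdefault(a, {})[b] = confidence
    return graph

def random_walk(graph, start, max_steps, rng):
    """Walk from `start`, choosing each next class in proportion to edge
    confidence, until a sink is reached or max_steps edges are taken."""
    path = [start]
    while len(path) <= max_steps and graph.get(path[-1]):
        neighbours = graph[path[-1]]
        step = rng.choices(list(neighbours), weights=list(neighbours.values()))[0]
        path.append(step)
    return path

sequences = [["PROTEIN1", "Binding", "PROTEIN2"]] * 30
graph = build_semantic_graph(sequences)
pattern = random_walk(graph, "PROTEIN1", max_steps=5, rng=random.Random(0))
```

Each walk yields one candidate interaction pattern; repeated walks from frequent SCs would then be merged and ranked as described in the text.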
Although the random walk process can help us capture representative interaction patterns in semantic graphs, it can also create some redundancy; a merging procedure is required to eliminate the redundant results by retaining patterns with long length and high coverage, and disposing of bigram patterns that could be covered by another pattern.

Reduction of SC labels through pattern selection is critical; it allows the successful execution of more sophisticated text classification algorithms, which leads to improved performance for PPI extraction. These algorithms cannot be executed on patterns before they are processed, since redundant SC labels will result in excessively high execution time, making them impractical (26). To perform pattern selection, we use the log likelihood ratio (LLR) (26), an effective feature selection method, to discriminate SCs for PPI instances. Given a training dataset comprised of positive instances, LLR employs Equation (4) to calculate the likelihood of the assumption that the occurrence of a semantic class SC in the expressions of PPI is not random, where $I$ denotes the set of positive PPI sentences in the training dataset; $N(I)$ and $N(\neg I)$ are the numbers of positive and negative PPI sentences, respectively; and $N(SC \wedge I)$ is the number of positive PPI sentences containing the semantic class SC. The probabilities $p(SC)$, $p(SC|I)$ and $p(SC|\neg I)$ are estimated using maximum likelihood estimation. An SC with a large LLR value is thought to be closely associated with the interaction. Lastly, we rank the interaction patterns in the training dataset based on the summation of their semantic classes' LLR values and retain the top 20 for representing PPIs.
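Equation (4) is not reproduced in this text, so the following sketch assumes the standard binomial log-likelihood ratio for feature selection: the null hypothesis uses a single occurrence probability $p(SC)$ for both positive and negative sentences, while the alternative uses $p(SC|I)$ and $p(SC|\neg I)$. The function names and the example counts are hypothetical.

```python
import math

def _log_binomial(k, n, p):
    """log of p^k * (1 - p)^(n - k), treating 0 * log(0) as 0."""
    total = 0.0
    if k:
        total += k * math.log(p)
    if n - k:
        total += (n - k) * math.log(1 - p)
    return total

def llr(n_sc_pos, n_pos, n_sc_neg, n_neg):
    """Log-likelihood ratio of a semantic class.
    n_sc_pos: positive sentences containing the SC, i.e. N(SC and I)
    n_pos, n_neg: numbers of positive and negative sentences
    n_sc_neg: negative sentences containing the SC"""
    p = (n_sc_pos + n_sc_neg) / (n_pos + n_neg)   # p(SC), MLE
    p_pos = n_sc_pos / n_pos                      # p(SC | I), MLE
    p_neg = n_sc_neg / n_neg                      # p(SC | not I), MLE
    null = _log_binomial(n_sc_pos, n_pos, p) + _log_binomial(n_sc_neg, n_neg, p)
    alt = _log_binomial(n_sc_pos, n_pos, p_pos) + _log_binomial(n_sc_neg, n_neg, p_neg)
    return -2.0 * (null - alt)

uninformative = llr(50, 100, 50, 100)   # same rate in both classes
weak = llr(55, 100, 45, 100)
strong = llr(80, 100, 10, 100)
```

Under this assumed form, a class that occurs at the same rate in positive and negative sentences scores zero, and the score grows as the two rates diverge; a pattern's score would then be the sum of the scores of its classes.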

IPT construction
Next, we represent a candidate sentence by the proposed IPT structure, which is the SPET of a sentence enhanced by three operations: 'branching', 'ornamenting' and 'pruning'. In reference (14), the authors show that SPET is effective in identifying the relation between two entities mentioned in a text sentence. Specifically, the SPET of a candidate sentence is the smallest sub-tree of the sentence's syntactic parse tree that links the target proteins $p_i$ and $p_j$. To show how we improve SPET with IPT, we exemplify the operators by applying them to the sentence 'Active, phosphorylated CREB, which is important to brain development, effects CRE-dependent genes via interaction with CBP which tightened the connection between CREB and downstream components.', which expresses the interaction between 'CREB' and 'CBP'. Figure 7a shows the syntactic parse tree of the example sentence, and the corresponding SPET is illustrated in Figure 7b. The three operations are described as follows.
IPT branching. Although Zhang et al. (29) demonstrated that the SPET is effective in identifying the relation between two entities mentioned in a textual sentence, the information in SPET is sometimes insufficient for detecting interaction between target proteins. For instance, in Figure 7a, the term 'tightened' and the corresponding syntactic constituent are critical for recognizing the interaction between CREB and CBP. However, they are excluded from the sentence's SPET, as shown in Figure 7b. To include useful sentence context, the branching operator first examines the existence of verb behind the last target protein of the sentence. If a verb and the target protein form a verb phrase in the sentence's syntactic parse tree, the verb is treated as a modifier of the target protein and is concatenated into the IPT. As shown in Figure 8, the branched IPT has included richer context information than its original.
IPT pruning. In a later process we adopt SVM to classify interactive patterns. SVM is a vector space classification model that hypothesizes that data of different classes form distinct contiguous regions in a high-dimensional vector space (25,26); the hypothesis, however, is invalid if the data representation is chosen improperly. We observed that IPTs can contain redundant elements that influence the performance of interaction classification; therefore, we use the pruning operator to condense IPTs via the following procedures.
i. Middle clause removal: Middle clauses of inter-clause candidate sentences may (or may not) be irrelevant to protein interactions. To discriminate middle clauses associated with the proteins, we adopted the Stanford parser (30) to label dependencies between text tokens (words). A labeled dependency is a triple of dependency name, governor token and dependent token. The labeled dependencies form a directed graph $G = \langle V, E \rangle$, where each vertex in $V$ is a token and the edges in $E$ denote the set of dependencies. Figure 9 shows the dependencies extracted from the example sentence, and the corresponding dependency graph is shown in Figure 10. Next, we search for the protein dependency path, which we define as the shortest connecting path of the target protein pair in $G$. The example's protein dependency path is highlighted in red in Figure   make the target proteins associated. In Figure 11, the middle clause 'which is important to brain development' is pruned because it is the complement of CREB, which is irrelevant to protein interactions.

ii. Stop word removal: Frequent words are not useful for expressing interactions between proteins. For instance, the word 'with' in Figure 8 is a common preposition and cannot be utilized to discriminate interactive expressions. To remove stop words and the corresponding syntactic elements from the IPT, we sort words according to their frequency in the text corpus, and the most frequent words are used to compile a stop word list. More specifically, we selected the most frequent words whose accumulated frequencies reached 80% of the total word frequency count in the five corpora, since the rank-frequency distribution of words follows Zipf's law (26). Protein names and verbs are excluded from the list for refinement, since both are key constructs of protein-protein interactions.

iii. Duplicate element removal: Nodes in an IPT may be duplicated and are therefore redundant. A node is duplicated if it has a single child and its tag is identical to that of its parent. For instance, the node VP in the last branch of Figure 8 is a duplicate node. Since the tree kernel we adopted to compute the similarity between text sentences is based on the percentage of overlap between IPTs, duplicate nodes would degrade our system performance. To reduce their influence, the pruning operator deletes all duplicate nodes in an IPT.

As shown in Figure 11, the pruned IPT is more concise and clearer than its original.
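The duplicate element removal step can be sketched as below. The tree encoding (a nested list whose first item is the node tag, with words as plain strings) and the function name are assumptions made for illustration; the example tree is a simplified fragment, not the tree of Figure 8.

```python
def prune_duplicates(node):
    """Remove every node that has a single child and carries the same tag
    as its parent, splicing that child into the parent's child list."""
    if isinstance(node, str):          # a word (leaf)
        return node
    tag, children = node[0], node[1:]
    pruned = []
    for child in children:
        # Skip over a chain of duplicate nodes directly under this parent.
        while not isinstance(child, str) and child[0] == tag and len(child) == 2:
            child = child[1]
        pruned.append(prune_duplicates(child))
    return [tag] + pruned

tree = ["S",
        ["NP", ["NNP", "CBP"]],
        ["VP",
         ["VP", ["VBD", "tightened"]],      # duplicate VP with a single child
         ["NP", ["NN", "connection"]]]]
pruned_tree = prune_duplicates(tree)
```

Only the redundant inner VP is removed; all other nodes and the word order are preserved.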
IPT ornamenting. Finally, the generated interaction patterns can help us capture the most prominent and representative patterns for expressing PPI. Highlighting interaction patterns closely associated with PPIs in an IPT would improve the interaction extraction performance. For each IPT that matches an interaction pattern, we add an IP tag as a child of the tree root to incorporate the interactive semantics into the IPT structure (as shown in Figure 12).

Convolution tree kernel
Kernel approaches are frequently used in SVM to compute the dot product (i.e. similarity) between instances modeled in a complex feature space; here we employ the CTK (21) for measuring the similarity between sentences. A convolution kernel captures structured information in terms of substructures; hence we can represent a parse tree $T$ by a vector of integer counts of each sub-tree type (regardless of its ancestors):

$$\phi(T) = \left(\#subtree_1(T), \ldots, \#subtree_i(T), \ldots, \#subtree_n(T)\right) \tag{5}$$

where $\#subtree_i(T)$ is the number of occurrences of the $i$th sub-tree type ($subtree_i$) in $T$. Since the number of different sub-trees is exponential in the parse tree size, it is computationally infeasible to use the feature vector $\phi(T)$ directly. To solve this computational issue, the CTK computes the syntactic similarity between the above high dimensional vectors implicitly as follows:

$$K_{CTK}(T_1, T_2) = \langle \phi(T_1), \phi(T_2) \rangle = \sum_i \Big( \sum_{n_1 \in N_1} I_{subtree_i}(n_1) \Big) \Big( \sum_{n_2 \in N_2} I_{subtree_i}(n_2) \Big) \tag{6}$$

where $N_1$ and $N_2$ are the sets of nodes in trees $T_1$ and $T_2$, and $I_{subtree_i}(n)$ is a function whose value is 1 if there is a $subtree_i$ rooted at node $n$, and zero otherwise. Specifically, the CTK $K_{CTK}$ considers the number of common sub-trees as the measurement of syntactic similarity between two interaction pattern trees $IPT_1$ and $IPT_2$ as follows:

$$K_{CTK}(IPT_1, IPT_2) = \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} \Delta(n_1, n_2) \tag{7}$$

Here $N_1$ and $N_2$ are the sets of nodes in $IPT_1$ and $IPT_2$, respectively. In addition, $\Delta(n_1, n_2)$ evaluates the common sub-trees rooted at $n_1$ and $n_2$ and is computed recursively as follows:

i. if the productions (i.e. the nodes with their direct children) at $n_1$ and $n_2$ are different, $\Delta(n_1, n_2) = 0$;
ii. else if both $n_1$ and $n_2$ are pre-terminals (POS tags), $\Delta(n_1, n_2) = 1 \times \lambda$;
iii. else calculate $\Delta(n_1, n_2)$ recursively as:

$$\Delta(n_1, n_2) = \lambda \prod_{k=1}^{\#ch(n_1)} \left(1 + \Delta(ch(n_1, k), ch(n_2, k))\right) \tag{8}$$

where $\#ch(n_1)$ is the number of children of node $n_1$; $ch(n, k)$ is the $k$th child of node $n$; and $\lambda$ ($0 < \lambda < 1$) is the decay factor used to make the kernel value less variable with respect to different sized sub-trees. The parse tree kernel counts the number of common sub-trees as the syntactic similarity measure between two PPI instances. The time complexity for computing this kernel is $O(|N_1| \cdot |N_2|)$ (21).
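The recursion above can be written out directly. The sketch below is a plain illustration, not the tree kernel toolkit used in the experiments: trees are assumed to be nested tuples whose first item is the node label, words are plain strings that occur only under pre-terminals, and all function names are hypothetical. No memoization is used, so this naive version does not attain the stated $O(|N_1| \cdot |N_2|)$ bound.

```python
def production(node):
    """A node's label together with the labels of its direct children."""
    return (node[0], tuple(c if isinstance(c, str) else c[0] for c in node[1:]))

def is_preterminal(node):
    """A POS tag node dominating a single word."""
    return len(node) == 2 and isinstance(node[1], str)

def nodes(tree):
    """All internal nodes of a tree (words are not nodes)."""
    if isinstance(tree, str):
        return []
    found = [tree]
    for child in tree[1:]:
        found.extend(nodes(child))
    return found

def delta(n1, n2, lam):
    """Weighted count of common sub-trees rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0                      # rule i
    if is_preterminal(n1):
        return lam                      # rule ii
    value = lam                         # rule iii
    for c1, c2 in zip(n1[1:], n2[1:]):
        value *= 1.0 + delta(c1, c2, lam)
    return value

def ctk(t1, t2, lam=0.4):
    """Convolution tree kernel: sum of delta over all node pairs."""
    return sum(delta(a, b, lam) for a in nodes(t1) for b in nodes(t2))

t1 = ("NP", ("DT", "the"), ("NN", "protein"))
t2 = ("NP", ("DT", "a"), ("NN", "protein"))
same = ctk(t1, t1, lam=1.0)
different = ctk(t1, t2, lam=1.0)
```

With the decay factor set to 1 the kernel simply counts common sub-trees: the tree `t1` shares six sub-trees with itself and three with `t2`, which differs only in the determiner.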

Evaluation dataset
Due to the very recent completion of the BioCreative V BioC task, at the time of writing we had not yet received the official annotation of the data used; therefore, we evaluated our method with five publicly available corpora that contain PPI annotations: LLL (31), IEPA (32), HPRD50 (33), AIMed (34) and BioInfer (35). AIMed, IEPA, HPRD50 and LLL were constructed specifically for PPI, while BioInfer is a more general-purpose corpus. All of them commonly serve as the standard corpora for training and testing PPI extraction programs. Specifically, AIMed contains 200 abstracts from PubMed that were identified as containing PPIs by the Database of Interacting Proteins [DIP (32)], from which the interactions between human genes and proteins in the abstracts were annotated manually. Additionally, certain abstracts that do not contain PPIs were added as negative examples. The current release of the AIMed corpus comprises 225 abstracts (10). BioInfer contains annotations for not only PPI but also other types of events. Pairs of interacting entities were extracted from DIP and used as query inputs to the PubMed retrieval system, from which the returned abstracts were broken down into sentences; only sentences possessing more than one pair of interacting entities were kept. A random subset of the sentences was also annotated for entities of protein, gene and RNA relationships. After combining the above resultant sets into a PPI corpus, BioInfer contains the largest number of instances among the five corpora, within 1100 sentences. In addition, IEPA was constructed from 486 sentences containing a specific pair of co-occurring chemicals from PubMed abstracts; the interactions between pairs of entities were annotated, and the majority of the entities were proteins. Unlike the above corpora, HPRD50 was constructed by taking 50 random abstracts referenced by the Human Protein Reference Database [HPRD (33)]. Human proteins and genes were identified by ProMiner (36)

Experimental setting and evaluation methods
The description of the corpora is shown in Table 1; both the size and the distribution of positive/negative elements are shown. All corpora are parsed using the Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml) to generate the parse tree and POS tagging output. For our implementation, we use Moschitti's tree kernel toolkit (22) to develop the convolution kernel of an IPT. Following conventions, we set the parameter C for SVM to the ratio of negative instances to positive ones in the respective corpora, and $\lambda$ for the CTK to the default 0.4 (18,19,29). The evaluation measures are the precision, recall and $F_1$-measure (26), as well as the micro-average used for comparing the average performance. These measures are defined based on a contingency table of predictions for a target corpus $C_k$. The precision $P(C_k)$, recall $R(C_k)$, $F_1$-measure $F_1(C_k)$ and micro-average $F^{\mu}$ are defined as follows:

$$P(C_k) = \frac{TP(C_k)}{TP(C_k) + FP(C_k)}, \qquad R(C_k) = \frac{TP(C_k)}{TP(C_k) + FN(C_k)}$$

$$F_1(C_k) = \frac{2 \times P(C_k) \times R(C_k)}{P(C_k) + R(C_k)}, \qquad F^{\mu} = \frac{2 \times P^{\mu} \times R^{\mu}}{P^{\mu} + R^{\mu}}$$

where $P^{\mu}$ and $R^{\mu}$ are the precision and recall computed from the counts summed over all corpora. $TP(C_k)$ denotes the number of true positives, i.e. the number of positive instances that are correctly classified. $FP(C_k)$ denotes the number of false positives, which are negative instances that are erroneously classified as positives. Analogously, $TN(C_k)$ and $FN(C_k)$ stand for the number of true negatives and false negatives, respectively. The $F_1$ value is used to determine the relative effectiveness of the compared methods.
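For concreteness, the measures can be computed from contingency counts as in the sketch below. The micro-average is assumed here to pool the true positive, false positive and false negative counts over all corpora before computing the scores; the function names and the example counts are hypothetical and are not results from this article.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from contingency counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def micro_average(tables):
    """Pool (tp, fp, fn) over corpora, then compute the three measures."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    return precision_recall_f1(tp, fp, fn)

# Illustrative counts only.
corpus_a = (8, 2, 2)    # (TP, FP, FN)
corpus_b = (2, 8, 8)
per_corpus = precision_recall_f1(*corpus_a)
pooled = micro_average([corpus_a, corpus_b])
```

Because the counts are pooled, larger corpora contribute more to the micro-average than smaller ones.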

Results and Discussion
For our CV experiment, the proposed IPT structure uses three operators, 'branching', 'pruning' and 'ornamenting', to enhance SPET. In the following, we evaluate the performance of these operators to demonstrate the effectiveness of IPT. Table 2 shows the marginal performance in 10-fold cross-validation of applying IPT branching, pruning and ornamenting, denoted as +IPT_branching, +IPT_pruning and +IPT_ornamenting, respectively. As shown in the table, IPT branching (i.e. +IPT_branching) outperforms SPET because the branching operator correctly incorporates extra context information to remedy the context-limited problem of the SPET (see Section 'IPT construction'). The pruning operator further improves the system performance by eliminating indiscriminative and redundant IPT elements, thereby helping the SVM learn representative syntactic structures of PPIs. Notably, the IPT ornamenting operator improves the F1 performance significantly, because the generated interaction patterns are highly correlated with PPIs. Thus, tagging them in the IPT structure helps our method discriminate PPI passages. Because the operators refine the SPET from different perspectives without conflicting with one another, applying them all together achieves the best performance. The proposed IPT kernel uses the PPI patterns to enhance the SPET, and it is compared with several feature-based, kernel-based and multiple kernel PPI extraction methods mentioned in related work to demonstrate its effectiveness. As shown in Table 3, the proposed method significantly outperforms AkanePPI. Furthermore, the syntax tree-based kernel methods (i.e. ST, SST, PT and SpT) only examine the syntactic structures within texts and cannot sense the semantics of protein interactions. In contrast, our method analyzes the semantics and contents (i.e. PPI patterns) within the text to identify PPIs, making its performance superior to that of the syntax tree-based kernel methods. 
It is noteworthy that the syntax tree-based kernel methods are at times only on par with the co-occurrence approach in terms of F1-measure. This can be observed on the relatively small LLL corpus, on which their results practically coincide with those of the co-occurrence method. On the other hand, PIPE delivers good results in both precision and F1-measure on a broader corpus such as BioInfer. The RFB and Cosine methods also outperform SPET, AkanePPI and the syntax tree-based kernel methods, as they incorporate dependency features to distinguish PPIs. Nevertheless, although the Cosine method can achieve higher performance by further considering term weighting, it is difficult to express word relations through symbolic representations in this approach. In contrast, our method can extract word semantics and generate PPI patterns that capture relations between distantly located mentions in the text; consequently, we achieve a comparable outcome. The SL, GK and CK approaches outperform our method because their hybrid kernels can adequately encapsulate the information required for relation prediction based on the sentential structures involving the two entities; nonetheless, our method is able to capture more PPI instances through the acquired PPI patterns. Thus, we achieve higher recall than both CK-based approaches on all five corpora, which leads to a comparable overall performance. Table 4 lists our results for the CL setting. Five additional methods were compared with our proposed method. First, it is interesting to note that while the SPET had an F1-measure of 41.6% in the CV setting, it decreased by 12% in the CL setting due to lower performance on AIMed and BioInfer; the SpT, Cosine and edit methods also suffered a significant drop in performance. SpT achieved rather poor performance in this scenario, especially on the IEPA corpus, where it obtained a very low score due to extremely low recall. 
The Cosine and edit methods were on par with SpT, each surpassing the other two on certain corpora. The SL kernel showed a modest drop of about 6% in average F1-measure and demonstrated relatively consistent performance across all five corpora in terms of the major evaluation measures. Finally, our method exhibited the highest stability, with every result under the CL setting exceeding the corresponding CV result. The overall performance of our IPT kernel improved under the CL setting and also exceeded that of all other methods on the five corpora.
Because the five corpora differ in nature, for example in the types of named entities annotated, the definition of what exactly constitutes an interaction, and the relative positive/negative distributions of relation pairs, we conducted a CC evaluation to shed light on whether the learned models can be generalized beyond the specific characteristics of the training data. Table 5 shows the CC results, in which different methods were trained on one corpus and subsequently tested on the four remaining corpora. The rows and columns correspond to the training and test corpora, respectively. Cross-validated results were ' generated from the HPRD50 corpus is capable of matching the positive instance 'Amyloid beta protein stimulation of phospholipase C was absent from LA-N-2 cells previously treated with norepinephrine, trans-1-amino-1,3-cyclopentanedicarboxylic acid, bombesin, or amyloid beta peptide' in the IEPA corpus, which describes the interaction between the proteins 'Amyloid beta protein' and 'phospholipase C'. In addition, our method trained on the IEPA corpus achieved performance comparable to that of the CK when tested on LLL and HPRD50. This also demonstrates that the PPI patterns generated by our method from IEPA are effective in matching positive instances of the tested corpora. For instance, the generated interaction pattern '[PROTEIN1]->[Negative_regulation]->[PROTEIN2]->[Localization]' from the IEPA corpus is able to capture texts such as 'Both leptin and insulin can reduce hypothalamic NPY production and secretion', in which 'leptin' and 'NPY' represent PROTEIN1 and PROTEIN2, respectively. On the other hand, the performance of our method is slightly inferior to both multi-kernel-based approaches when trained on the smallest corpus, LLL. By contrast, when trained on larger corpora (IEPA and HPRD50), our method can generate more extensive PPI patterns, leading to broader coverage and hence higher recall. 
As a result, our method is more effective than the others, since the generated PPI patterns can retrieve more information within PPIs.
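To make the pattern-matching example above concrete, the following Python sketch labels the tokens of the leptin/NPY sentence and checks whether the interaction pattern occurs as an ordered subsequence of those labels. This is a simplified illustration and not the PIPE implementation: the trigger lexicon `TRIGGERS`, the helper names and the choice of gap-tolerant subsequence matching are all assumptions made for this example.

```python
import re

# Hypothetical trigger lexicon mapping words to semantic classes (assumption).
TRIGGERS = {
    "reduce": "Negative_regulation",
    "production": "Gene_expression",
    "secretion": "Localization",
}


def parse_pattern(pattern: str) -> list:
    """Split '[A]->[B]->[C]' into ['A', 'B', 'C'], tolerating stray spaces."""
    return re.findall(r"\[([^\]]+)\]", pattern)


def label_tokens(sentence: str, protein1: str, protein2: str, triggers: dict) -> list:
    """Keep only tokens that are target proteins or known triggers, as labels."""
    labels = []
    for raw in sentence.split():
        token = raw.strip(".,;:()")
        if token == protein1:
            labels.append("PROTEIN1")
        elif token == protein2:
            labels.append("PROTEIN2")
        elif token.lower() in triggers:
            labels.append(triggers[token.lower()])
    return labels


def matches(pattern_slots: list, labels: list) -> bool:
    """True if the slots occur in order within labels (gaps are allowed)."""
    remaining = iter(labels)
    # 'slot in remaining' advances the iterator until the slot is found.
    return all(slot in remaining for slot in pattern_slots)


PATTERN = parse_pattern(
    "[PROTEIN1]-> [Negative_regulation]-> [PROTEIN2]-> [Localization]"
)
SENTENCE = "Both leptin and insulin can reduce hypothalamic NPY production and secretion"
LABELS = label_tokens(SENTENCE, "leptin", "NPY", TRIGGERS)
```

Under these assumptions, `matches(PATTERN, LABELS)` holds because 'leptin', 'reduce', 'NPY' and 'secretion' appear in the required order; the intervening 'production' label is skipped, which is why gap-tolerant matching was chosen for this sketch.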
Note that the evaluation results obtained with models trained on other corpora are generally no better than those from internal 10-fold CV. This is because the annotation policies differ between corpora, and the classifiers cannot predict these differences. In such cases, the model trained on the original corpus performs better than the models trained on other corpora, with an F1-score up to 7.3% higher than that of the best-performing model trained on another corpus. However, the results on the LLL corpus using classifiers trained on IEPA are better than the 10-fold CV result using the LLL corpus itself for training. Based on our further analysis, we conclude that IEPA and LLL are very similar regarding PPI. Thus, learning with IEPA is more robust than 10-fold CV within LLL. It is interesting to note that PIPE is able to perform well when trained on IEPA, which is much smaller than AIMed and BioInfer. In general, learning with larger corpora produces better performance. Nevertheless, the better annotation quality of IEPA enables PIPE to learn discriminative interaction patterns.
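The train/test configurations behind the CL and CC evaluations discussed above can be enumerated with a short Python sketch. The CC split follows the description in the text (train on one corpus, test on each of the four others); for CL, the sketch assumes the usual definition of cross-learning, namely training on the union of four corpora and testing on the fifth. The function names are hypothetical.

```python
CORPORA = ["LLL", "IEPA", "HPRD50", "AIMed", "BioInfer"]


def cross_learning_splits(corpora):
    """Assumed CL protocol: train on all other corpora, test on the held-out one."""
    return [
        ([c for c in corpora if c != held_out], held_out)
        for held_out in corpora
    ]


def cross_corpus_splits(corpora):
    """CC protocol: train on a single corpus, test on each remaining corpus."""
    return [
        (train, test)
        for train in corpora
        for test in corpora
        if test != train
    ]


CL_SPLITS = cross_learning_splits(CORPORA)
CC_SPLITS = cross_corpus_splits(CORPORA)
```

With five corpora this yields 5 CL configurations and 20 CC configurations, the latter corresponding to the off-diagonal cells of a table whose rows are training corpora and whose columns are test corpora.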
Based on our preliminary observations, PIPE is able to achieve comparable performance on the BioC corpus, which contains mostly full-text articles. This is because a relatively high proportion of PPI passages are short, and PIPE can thus capture the interaction expressions. For instance, the candidate segment 'We have identified a third Sec24p family member also known as Iss1p, as a protein that binds to Sec16p.' is correctly recognized as a PPI passage due to a successful match of the generated interaction pattern '[PROTEIN1]->[Binding]->[PROTEIN2]->[Negative_regulation]'. However, based on our further analysis of the detection performance, our approach cannot effectively deal with longer candidate segments. For example, PIPE incorrectly classifies 'Chromosomal deletion of LST1 is not lethal, but inhibits transport of the plasma membrane proton-ATPase (Pma1p) to the cell surface, causing poor growth on media of low pH' as a PPI passage. A possible reason is that the syntactic structures of long text segments are so complex that they confuse the dependency parsing process. As a result, the generated protein dependency paths were prone to errors that affected the accuracy of the removed middle clause and the corresponding extraction performance. In addition, we paired proteins in order to enumerate text segments that may convey PPIs; nevertheless, the issue of coreference resolution is not considered in this article, as related studies are still in progress (38,39). Therefore, a relatively low proportion of PPI passages cannot be captured by the candidate segment generation algorithm when the target protein name is referred to by a pronoun. We acknowledge this as an important issue for future research. In summary, the proposed IPT kernel approach is able to generate discriminative interaction patterns that describe the syntactic and semantic relations within a PPI expression and assist in detecting the interactions. 
We consider it a foundation for a more profound understanding of PPI structures to enhance the SPET. This method not only outperforms feature-based and kernel-based approaches, but also achieves performance comparable to that of multi-kernel-based methods. In addition, the patterns are easily interpretable by humans and can be considered fundamental knowledge for understanding PPI expressions.
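The protein pairing step mentioned above, in which proteins are paired to enumerate text segments that may convey PPIs, can be illustrated with the following Python sketch. It only shows the enumeration of unordered protein pairs within one sentence and the masking of the selected pair; it is not the candidate segment generation algorithm itself, and the function name and the PROTEIN1/PROTEIN2 masking convention applied by plain string replacement are assumptions made for this example.

```python
from itertools import combinations


def candidate_pairs(sentence: str, proteins: list) -> list:
    """Enumerate every unordered protein pair and mask it in the sentence.

    Returns (protein1, protein2, masked_sentence) triples. Plain string
    replacement is used for brevity, so overlapping names are not handled.
    """
    candidates = []
    for first, second in combinations(proteins, 2):
        masked = sentence.replace(first, "PROTEIN1").replace(second, "PROTEIN2")
        candidates.append((first, second, masked))
    return candidates


SEGMENT = (
    "We have identified a third Sec24p family member also known as Iss1p, "
    "as a protein that binds to Sec16p."
)
CANDIDATES = candidate_pairs(SEGMENT, ["Sec24p", "Iss1p", "Sec16p"])
```

Three protein mentions give three candidate pairs. As noted above, a pairing of this kind only sees explicit protein names, so an interaction whose participant is referred to by a pronoun produces no candidate.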

Concluding remarks
Automated extraction of PPIs is an important and widely studied task in biomedical text mining. To this end, we proposed an interaction pattern generation approach for acquiring PPI patterns, which was utilized in the Collaborative Biocurator Assistant Task at BioCreative V. We also developed a method that combines the SPET structure with the generated PPI patterns to analyse the syntactic, semantic and context information in text. It then exploits the derived information to identify PPIs in biomedical literature. Our experimental results demonstrate that the proposed method is effective and also outperforms well-known PPI extraction methods.
Bold typeface indicates our best overall result for a corpus (differences under 1 base point are ignored).
In the future, we will investigate other aspects, such as the dependency construction in texts, to incorporate even deeper semantic information into the IPT structures. We will also utilize information extraction algorithms to extract interaction tuples from positive instances and construct an interaction network of proteins.