Abstract

Motivation: Wnt signaling is a very active area of research with highly relevant publications appearing at a rate of more than one per day. Building and maintaining databases describing signal transduction networks is a time-consuming and demanding task that requires careful literature analysis and extensive domain-specific knowledge. For instance, more than 50 factors involved in Wnt signal transduction have been identified as of late 2003. In this work we describe a natural language processing (NLP) system that identifies references to biological interaction networks in free text and automatically assembles a protein association and interaction map.

Results: A ‘gold standard’ set of names and assertions was derived by manual scanning of the Wnt genes website (http://www.stanford.edu/~rnusse/wntwindow.html) including 53 interactions involved in Wnt signaling. This system was used to analyze a corpus of peer-reviewed articles related to Wnt signaling including 3369 Pubmed and 1230 full text papers. Names for key Wnt-pathway associated proteins and biological entities are identified using a chi-squared analysis of noun phrases over-represented in the Wnt literature as compared to the general signal transduction literature. Interestingly, we identified several instances where generic terms were used on the website when more specific terms occur in the literature, and one typographic error on the Wnt canonical pathway. Using the named entity list and performing an exhaustive assertion extraction of the corpus, 34 of the 53 interactions in the ‘gold standard’ Wnt signaling set were successfully identified (64% recall). In addition, the automated extraction found several interactions involving key Wnt-related molecules which were missing or different from those in the canonical diagram, and these were confirmed by manual review of the text. These results suggest that a combination of NLP techniques for information extraction can form a useful first-pass tool for assisting human annotation and maintenance of signal pathway databases.

Availability: The pipeline software components are freely available on request to the authors.

Contact: dstates@umich.edu

Supplementary information: http://stateslab.bioinformatics.med.umich.edu/software.html

INTRODUCTION

Detailed signal pathway annotation and model construction can be an arduous task for human readers to accomplish. The task is complicated for heavily investigated pathways like the Wnt signal transduction cascade or other major cellular pathways due to the large volume of papers published for biological interactions involving members of those pathways. In the Wnt signal transduction literature, for instance, there were 239 MeSH-annotated ‘Signal Transduction’ Wnt pathway MEDLINE articles in 2003, and 889 articles for the period from 2000 to 2004. Expanding the search to include other co-factors or major proteins in the pathway expands the results to many thousands of articles.

For a pathway like the Wnt pathway, up-to-date models are essential for investigators in the field; without accurate models, experimental results may be placed outside of the proper biological context or key insights may be missed altogether if the model structure is incorrect. Comprehensively annotated models of complex pathways like Wnt are also essential for hypothesis-generation and experiment validation, yet with the exception of periodic reviews on the subject, there are few sources of Wnt-signaling information that are kept consistent with the latest published literature.

In the past, various groups (Andrade and Valencia, 1997; Blaschke, 1999; Daraselia et al., 2004; Iliopoulos et al., 2001; Koike et al., 2003; Raychaudhuri et al., 2002; Stephens et al., 2001; Wilbur and Yang, 1996) have used natural language processing (NLP) systems to extract biological molecule annotation information (Andrade and Valencia, 1997), to detect protein–protein interaction information (Bader and Hogue, 2002; Blaschke, 1999; Marcotte et al., 2001), or to improve indexing and recall into searches from MEDLINE abstracts (Iliopoulos et al., 2001; Stapley and Benoit, 2000). Methods included a mixture of text mining and indexing, with some groups using classification by Bayesian statistics (Wilbur and Yang, 1996), structured grammar matches (Temkin and Gilder, 2003), or word filtering of known entities, as well as the use of partial and full parsers. Full parsers have been employed to discover protein–protein interactions with promising results, highlighting the utility of this approach (Daraselia et al., 2004); however they are not available as open-source.

We have developed an automated NLP-based system to assist in the generation of up-to-date pathway models from the literature that can automatically detect and rank key interacting proteins in an article corpus like that of Wnt signaling.

The named entity module we present employs a word-statistic chi-squared test, but begins with a partial parser to derive the necessary named entities. Then, the full parser module provides deep phrase attachment, syntax annotation and grammatical relations, and extracts interaction statements by filtering results with a list of verbs and the named entity list derived from the partial parse.

We avoid the need to generate and maintain a large-scale named entity list by taking advantage of both the Link parser's (Sleator and Temperly, 1991) phrase attachment facilities and the fast partial parser's (Abney, 1996a) noun phrase annotation to generate a list of words specific to Wnt signal transduction. Our system couples the fast partial parser with a simple statistical test to build a corpus-specific named entity list automatically, without requiring an extensive pre-computed synonym list. While this approach is only a first-pass disambiguation of the named entities found within the corpus, we find that, for the queries likely to be of interest to a human domain expert, this automated named entity annotation is at least as specific as the human-constructed signaling pathway entities available in the public domain.

Following named entity extraction, we detect the actual interactions and protein associations with the Link parser (Sleator and Temperly, 1991). The parser allows us to reduce grammatically complicated sentences into simplified ‘tuples’ which roughly correspond to the specific biological assertions made in a sentence. The 3-tuple representation allows a fast search for a direct linking verb between two named entities. The search we perform yields various relevant possible additions to the canonical Wnt pathway, and provides provenance and annotation for a majority of the interactions present in the pathway where source material was not annotated.

METHODS: ARTICLE XML PROCESSING AND FULL PARSE

HTML retrieval and XML conversion

Full-text and MEDLINE articles are retrieved using NCBI's LinkOut e-retrieval utility (National Center for Biotechnology Information, Entrez Programming Utilities, 2004). For an initial query, an XML file of retrieved UI (PubMed ID) entries serves as a corpus index, from which a local Perl script retrieves, where possible, the full-text article (via LinkOut URL) and the MEDLINE entry. The MEDLINE entry serves as a backup for cases where the full text is not available, or where the NCBI LinkOut URL yields only a PDF file.

For the Wnt signaling pathway, we queried Pubmed with:

(‘Signal Transduction’[MeSH] OR Wnt[All fields] OR Akt[All Fields] OR catenin[All Fields] OR frizzled[All Fields])

The query yielded 3523 articles (full analysis in supplementary data), of which 3369 could be retrieved in XML. Of these 3369 documents, the majority (2914) had a parseable abstract field (either from HTML or MEDLINE record), and of the 455 that did not, the papers were often review papers, with the XML tag marked as ‘TOP’. The full corpus composition is available as supplementary information at: http://stateslab.bioinformatics.med.umich.edu/software.html. The query was restricted to the past five years (1999/03/03 to 2004/03/01).
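As an illustration of how the query and date restriction above could be issued programmatically, the following minimal sketch builds a request URL for the NCBI E-utilities ESearch endpoint. It is not the authors' Perl script: the helper name `build_esearch_url`, the choice of `datetype=pdat` (publication date) and the `retmax` value are assumptions made for the example.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The PubMed query quoted in the text, with straight quotes.
WNT_QUERY = ('"Signal Transduction"[MeSH] OR Wnt[All Fields] OR Akt[All Fields] '
             'OR catenin[All Fields] OR frizzled[All Fields]')


def build_esearch_url(term, mindate, maxdate, retmax=10000):
    """Return an ESearch URL for `term`, restricted to a date window.

    Hypothetical helper: datetype and retmax are illustrative assumptions.
    """
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # assumption: filter on publication date
        "mindate": mindate,   # YYYY/MM/DD
        "maxdate": maxdate,
        "retmax": retmax,     # assumption: upper bound on returned IDs
    }
    return ESEARCH + "?" + urlencode(params)


url = build_esearch_url(WNT_QUERY, "1999/03/03", "2004/03/01")
print(url)
```

The returned list of PubMed IDs would then play the role of the corpus index from which full-text and MEDLINE records are fetched.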

XML document structure parsing

To normalize successfully retrieved HTML papers, we developed a document-structure parsing script in Perl (v. 5.6.0) that extracts into XML format the Titles, PMID, Abstract, Methods/Materials, Conclusions, Figures, Tables, and References sections of full-text articles. We parse sentences within all sections by default, explicitly excluding only sections parsed as ‘References’. It is important to note that of the 3369 retrieved papers, over 10% had no explicitly labeled ‘abstract’ section (even if one was provided in the MEDLINE record).

Pre-processing and parse

For parsing, we process and exclude non-parseable sections like references and tables in each paper. Articles are then processed through a Link grammar parser (Sleator and Temperly, 1991) (version 4.1a; http://www.link.cs.cmu.edu/link/ftp.html) on a 16-node Linux cluster.

For each sentence, the parser yields word associations as a flat list with left-hand terms ‘attached’ by a grammar relation to terms on the right. The ‘subject–verb-object’ relations provided by the parser form the core assertions we wish to capture from the parse. The parser captures the main verb of each clause or sentence, links it with the proper subject noun, and object if present, yielding a subject–verb–object assertion which we extract as a 3-tuple.

METHODS: ASSERTION REPRESENTATION VIA LINK PARSING: SUBJECT–VERB–OBJECT TUPLES

Tuple format

The structures we call tuples are structured, hierarchical representations, derived from the Link grammar parser, of the grammatical relations between phrases and words within sentences. Generally, each tuple takes the form of a three-component structure (subject, verb, object). In our tuple format:

<int pmid="12952940"><protA>Wnt</protA><protB>Frizzled</protB><assert><src_sent>…</src_sent><tup><subj>…</subj><verb> … </verb><obj> … </obj></tup></assert></int>

Each interaction (int) contains two named entities (protA and protB) and an assert element, which in turn contains the source sentence (src_sent) and a tuple element (tup). The tup contains a subject (subj), a verb (verb) and an object (obj). The subject and object terms can be either single or multi-word nouns, attached to modifying prepositional phrases, adjectives and articles. Verbs are single words and are marked as verb. Objects follow the specific verb marked.

Some authors (Koike et al., 2003) employ sophisticated template-matching with partial parse-based algorithms when detecting interactions. These systems are faster than our parse, but often require substantial manual template generation for the partial parser.

Our interaction detection searched for phrases with two named entities flanking any of a select group of stemmed verbs. The verb list itself was manually compiled from a listing of verbs found in the corpus and from verbs in general usage likely to be found describing protein-interactions. These ‘direct’ and ‘indirect’ physical interaction verbs are split into:

• Direct interaction verbs: bind (bound), interact(-s,-ed), stabilize(-s,-d), phosphorylate(-s,-d), ubiquinate(-s,-d), sumoylate(-s,-d), degrade(-s,-d), block(s).

• Indirect interaction verbs: induc(-es,-ed), trigger(-s,-ed), block(s), enhance(s), synergize(s), cooperate(s), localizes, regul(-ates,-ion), activate(s), inhibit(s), control(s), translocate(s), antagonize(s), amplif(-y,-ies), transduce(s), degrade(s), trigger(s).
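The verb-based filter described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the stem tuples are hand-shortened from the two lists (prefix matching stands in for real stemming), the function names are hypothetical, and verbs that appear in both lists (e.g. block, degrade) are arbitrarily treated as direct.

```python
import re

# Assumed stems, shortened by hand from the verb lists in the text.
# "ubiqui" is deliberately short so that spelling variants still match.
DIRECT_STEMS = ("bind", "bound", "interact", "stabiliz", "phosphorylat",
                "ubiqui", "sumoylat", "degrad", "block")
INDIRECT_STEMS = ("induc", "trigger", "enhanc", "synergiz", "cooperat",
                  "localiz", "regul", "activat", "inhibit", "control",
                  "translocat", "antagoniz", "amplif", "transduc")


def classify_verb(verb):
    """Return 'direct', 'indirect' or None for a verb such as 'binds.v'."""
    stem = verb.lower().split(".")[0]   # drop a parser suffix like ".v"
    if stem.startswith(DIRECT_STEMS):
        return "direct"
    if stem.startswith(INDIRECT_STEMS):
        return "indirect"
    return None


def find_entities(phrase, entities):
    """Return the named entities that occur as whole terms in the phrase."""
    text = phrase.lower()
    found = []
    for entity in entities:
        pattern = r"(?<![\w-])" + re.escape(entity.lower()) + r"(?![\w-])"
        if re.search(pattern, text):
            found.append(entity)
    return found


def match_interaction(subj, verb, obj, entities):
    """Pair entities in the subject with entities in the object
    when the linking verb is on one of the interaction verb lists."""
    kind = classify_verb(verb)
    if kind is None:
        return []
    return [(a, b, kind)
            for a in find_entities(subj, entities)
            for b in find_entities(obj, entities)
            if a != b]


names = ["Wnt8", "LRP6", "Frizzled8"]
print(match_interaction("Wnt8", "binds.v", "to LRP6", names))
```

Prefix matching is crude (for instance, "regul" would also match "regular"), so a production filter would more likely use a proper stemmer together with the part-of-speech tag supplied by the parser.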

Tuple examples

The system outputs tuple assertions from sentences in XML:

<assert><src_sent>Wnt8 binds to LRP6 and Frizzled8.</src_sent><tup><subj>Wnt8</subj><verb mod=“v”>binds.v</verb><obj><p pp=“to”>LRP6</p></obj></tup></assert>

The sentence above, ‘Wnt8 binds to LRP6 and Frizzled8.’ yields two assertion tuples: the binding of ‘Wnt8’ to ‘LRP6’ and a matching tuple (not shown) for the binding of ‘Wnt8’ to ‘Frizzled8’.
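To show how such an assertion record might be consumed downstream, the sketch below reads the example with Python's standard `xml.etree.ElementTree` module. The function name `read_assertion` and the returned dictionary layout are assumptions for illustration; the attribute values are written with straight quotes, as well-formed XML requires.

```python
import xml.etree.ElementTree as ET

# The example assertion from the text, with straight quotes in attributes.
ASSERT_XML = (
    '<assert><src_sent>Wnt8 binds to LRP6 and Frizzled8.</src_sent>'
    '<tup><subj>Wnt8</subj><verb mod="v">binds.v</verb>'
    '<obj><p pp="to">LRP6</p></obj></tup></assert>'
)


def read_assertion(xml_text):
    """Extract sentence, subject, verb and object from one <assert> element."""
    root = ET.fromstring(xml_text)
    tup = root.find("tup")
    # Strip the part-of-speech suffix (".v") that the parser attaches.
    verb = tup.findtext("verb").strip().rsplit(".", 1)[0]
    # The object may be wrapped in nested phrase elements such as <p pp="to">.
    obj = "".join(tup.find("obj").itertext()).strip()
    return {
        "sentence": root.findtext("src_sent"),
        "subject": tup.findtext("subj").strip(),
        "verb": verb,
        "object": obj,
    }


record = read_assertion(ASSERT_XML)
print(record)
```

The preposition that governs the object is available as the `pp` attribute of the `p` element if the attachment type is needed.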

In addition to direct interactions, in sentences where a verb suggesting an interaction is found within the object, we take as the asserted interaction the closest preceding matching verb or gerund within the phrase that contains the named entity in the object.

METHODS: AUTOMATIC NAME EXTRACTION FROM A PARTIAL PARSER

The Cass parser (Abney, 1996b) is a fast (10 000 sentences/hour) deterministic partial parser that we use to construct a named entity set specific to the current domain. The parser has several key advantages over a parser like Link that make it a worthwhile choice for a named entity recognizer, primarily its good specificity for detecting selected ‘phrase chunks’ of sentences at speeds which are many orders of magnitude greater than those achieved with a full parser like Link. This markup allows us to statistically compile named entity candidates (noun phrases) from the small topic-specific corpus against a massive background corpus (all ‘signal transduction’), while reserving the use of a computationally expensive full parser only for determining tuples in the small corpus.

We used the Cass parser to select named entities (noun phrases) for the Wnt pathway by comparing the occurrence of named entities in the Wnt-specific article corpus against their occurrence in a ‘background’ signal transduction literature corpus (10 000 records, yielding 8873 parsed articles corresponding to the PubMed query ‘Signal Transduction’[MeSH] from the previous two years).

By comparing the frequency of ‘Wnt’ to ‘signal transduction’ noun phrases, we calculated one-degree of freedom chi-squared values for Wnt Cass noun phrases relative to the signal transduction corpus and ranked them according to that chi-squared value. Significance was set as p < 0.001. Examples of over-represented Wnt terms included both single phrases and compound phrases.

For every NX term, $X^{2}$ was calculated as:

(1)
$X^{2}=\sum_{i=1}^{k}\frac{\left(\frac{w_{i}}{W}-\frac{s_{i}}{S}\right)^{2}}{s_{i}/S}$

where:

• $w_i$: the number of occurrences of NX term i in the Wnt-specific corpus;

• $W$: the total number of NX terms in the Wnt-specific corpus;

• $s_i$: the number of occurrences of term i in the signal transduction corpus;

• $S$: the total number of NX terms in the signal transduction corpus.

Note that not all terms were proteins, since the terms are noun phrases in general; proteins of interest were filtered at search time. Noun phrases we detected included both single (‘wnt’) and multiple-word forms that would otherwise be missed by a dictionary-based search (e.g. ‘casein kinase i epsilon’).
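The ranking of noun phrases by over-representation can be sketched as below. As a stand-in for the statistic in the text, the sketch uses the standard Pearson chi-squared test on a 2×2 table (term versus all other terms, Wnt corpus versus background corpus), which has one degree of freedom; it may differ in detail from the authors' calculation. The function names and the toy counts are invented for illustration.

```python
import math


def chi2_1df(w_i, W, s_i, S):
    """Pearson chi-squared for the 2x2 table
    [[w_i, W - w_i], [s_i, S - s_i]] (one degree of freedom)."""
    observed = [[w_i, W - w_i], [s_i, S - s_i]]
    rows = [W, S]
    cols = [w_i + s_i, (W - w_i) + (S - s_i)]
    total = W + S
    x2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = rows[r] * cols[c] / total
            if expected > 0:
                x2 += (observed[r][c] - expected) ** 2 / expected
    return x2


def p_value_1df(x2):
    """Upper-tail p-value of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x2 / 2.0))


def rank_terms(wnt_counts, background_counts, alpha=0.001):
    """Return (term, X2) pairs over-represented in the Wnt corpus, best first."""
    W = sum(wnt_counts.values())
    S = sum(background_counts.values())
    scored = []
    for term, w_i in wnt_counts.items():
        s_i = background_counts.get(term, 0)
        if w_i / W <= s_i / S:
            continue  # keep only terms more frequent in the Wnt corpus
        x2 = chi2_1df(w_i, W, s_i, S)
        if p_value_1df(x2) < alpha:
            scored.append((term, x2))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy noun-phrase counts (invented, not taken from the paper).
wnt = {"wnt": 300, "protein": 500, "cell": 200}
background = {"wnt": 20, "protein": 5000, "cell": 4980}
ranked = rank_terms(wnt, background)
print(ranked)
```

With one degree of freedom, the p < 0.001 cut-off corresponds to a chi-squared value of roughly 10.83, so any term scoring above that would pass the significance filter.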

METHODS: AUTOMATIC NAME EXTRACTION USING A FULL PARSER

Full-parse phrase-derived named entity extraction from the Link parser

The second named entity-extracting module in the pipeline scans the tuples generated (Wnt-specific tuples) from the Link parse for tuples derived from sentences such as ‘X is … a protein’ and ‘the Y protein’. For every tuple formatted with ‘is’ as the verb, we find the subject, and if it is a single word or phrase, capture the predicate phrase for that tuple, and append the subject into an index entry one word at a time, recursively.

After categories are formed and the first set of names is input, the system re-scans the entire corpus for phrases of the form ‘article X Y’, where article is either ‘a’, ‘an’, or ‘the’, Y is a term category (e.g. ‘protein’), and X is a non-whitespace term. This second pass allows us to capture a small additional fraction of terms of the form ‘the Wnt protein’, where the last word in the phrase is a solid term category like ‘protein’.
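The two passes can be illustrated with a short sketch. It rests on one reading of the description, labeled here as an assumption: the subject of an ‘is’ tuple is indexed under each progressively shorter suffix of the predicate phrase (e.g. ‘scaffold protein’, then ‘protein’). The function names, the input tuple and the example sentence are hypothetical, and plain tuples and strings stand in for the Link parser output.

```python
import re
from collections import defaultdict

ARTICLES = {"a", "an", "the"}
# 'article X Y' with X and Y as whitespace-free terms.
ARTICLE_X_Y = re.compile(r"\b(?:a|an|the)\s+(\S+)\s+(\S+)", re.IGNORECASE)


def first_pass(tuples):
    """Index the subject of each 'X is ...' tuple under every suffix
    of its predicate phrase (assumed reading of 'one word at a time')."""
    categories = defaultdict(set)
    for subj, verb, obj in tuples:
        if verb != "is":
            continue
        words = [w for w in obj.lower().split() if w not in ARTICLES]
        for start in range(len(words)):
            categories[" ".join(words[start:])].add(subj)
    return categories


def second_pass(sentences, categories):
    """Add X to category Y for phrases of the form 'article X Y',
    where Y is a category already found in the first pass."""
    known = set(categories)
    for sentence in sentences:
        for match in ARTICLE_X_Y.finditer(sentence):
            term = match.group(1)
            category = match.group(2).lower().rstrip(".,;:")
            if category in known:
                categories[category].add(term)
    return categories


cats = first_pass([("Axin", "is", "a scaffold protein")])
cats = second_pass(["Secretion of the Wnt protein requires lipid modification."], cats)
print({name: sorted(members) for name, members in cats.items()})
```

Restricting the second pass to categories discovered in the first pass keeps the pattern from indexing arbitrary three-word phrases.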

The end result of both passes is a series of categories or category files, comprising a shallow ontology. This auto-categorization system yielded 7066 distinct categories for the 3306-article Wnt-signaling specific corpus, and 24 474 terms within those categories, of which 24 323 were unique terms. Not surprisingly, the largest categories are commonly discussed terms, including ‘protein’, ‘gene’, ‘proteins’, etc.

We find the terms extracted are very specific as they are directly extracted from direct declarative statements in the corpus.

MANUAL ANNOTATION RESULTS

Precision is measured as the fraction of returned interactions that are correct, and recall as the percentage of the interactions in the gold standard (Nusse, 2004) that are captured. Results are given in Table 1.

Calculation of precision

We define precision as the fraction of correct tuples returned by the parser. A tuple is counted as correct when its source sentence supports a direct physical binding interaction, or mentions an indirect but biological relationship, between the two protein entities in the tuple.

From the corpus, we derived a set of 6787 tuples/interactions, of which 1210 were unique pair-wise. We tested 5% (randomly selected) of the data set (340 sentences), representing individual unique sentences with their tuples and the two interacting proteins, and hand-scored assertions for the accuracy of the tuple and named entity search to determine if the sentences support the interactions noted. This tests the performance of the parse/extraction software without explicitly biasing the sampling towards a subset of the corpus (e.g. interactions which only contain a few papers in the entire corpus). For the parser evaluation, we tally but ignore from the final count all name-detection errors as these are a function of the named entity module or of the human input.

‘Direct’ verb tuples are more useful for actual diagramming of physical pathways, but the ‘indirect’ interactions are still indicative of relationships between distant pathway components, and may be useful for validation of models built with the system. We are not measuring interaction directionality at present in the system.

Calculation of recall

The exact recall metric for a system like ours is difficult to calculate manually, as it would require determining the total number of ‘facts’ made about binding proteins in the articles scanned. We therefore calculate recall as the fraction of the gold standard interaction set we are able to reproduce compared to the Wnt genes homepage, rather than as the fraction of interactions detected against the absolute ‘assertion or interaction’ count in the corpus.
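The two definitions reduce to simple ratios. The sketch below applies them to the counts reported in Table 1 (340 sampled sentences, 27 name errors excluded, 175 indirect and 108 direct correct interactions, and 34 of 53 gold standard interactions recovered); the function names are ours.

```python
def precision(indirect_correct, direct_correct, sampled, name_errors):
    """Correct tuples over sampled tuples, after excluding name-detection errors."""
    return (indirect_correct + direct_correct) / (sampled - name_errors)


def recall(detected, gold_total):
    """Fraction of gold standard interactions that were reproduced."""
    return detected / gold_total


p = precision(indirect_correct=175, direct_correct=108, sampled=340, name_errors=27)
r = recall(detected=34, gold_total=53)
print(f"precision = {p:.3f}, recall = {r:.3f}")
```

Because the denominator of the precision excludes name-detection errors, the figure measures the parse and extraction step alone, not the named entity module.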

Domain specificity

By default, all returned interactions that are ‘correct’ are within the domain. The corpus itself is the domain we examine, and we expect a ‘Wnt’ corpus to therefore contain only within-domain interactions.

DISCUSSION: USE OF A PARTIAL PARSER FOR NAMED ENTITY EXTRACTION

The Cass parser lacks certain phrase attachment and coordination capabilities of Link, but we found that its relatively good accuracy and very high speed allowed us to use Cass as a named entity extractor. Cass' finite-state grammar rules allow us to extract multiple-word noun phrases without requiring the use of an external dictionary or coordination and integration with existing synonym lists.

In actual usage, we found that compiling extensive named entity lists from other databases provided little benefit, as in the end, interactions adding to the gold standard will be manually verified before being submitted as authoritative. Extracting the named entities from the text itself yields word phrases that are guaranteed to match (even if they are spelling variants), and allows extraction of useful assertions that can later be verified for accuracy. As expected, this process is extremely fast, but can occasionally introduce spurious ‘interactions’ between terms and common phrases.

RESULTS: COMPARISON OF AUTOMATIC WNT PATHWAY ANNOTATION AND THE EXISTING GOLD STANDARD

The system discovered various high chi-squared terms with additional or different annotations than those present in the gold standard:

The phosphorylation interaction between CKI-epsilon (CK1e) and APC

In the diagrammed gold standard Wnt-signaling pathway, no specific mention is made of an interaction between CK1-epsilon (CKIe, CKI epsilon) and APC; on closer inspection, however, Kishida et al. (2001) do report direct phosphorylation between the two molecules.

The phosphorylation of beta-catenin by CKII (CK2)

The Wnt genes gold standard mentions CK2 as CKII in the context of binding to Dishevelled, but does not specifically show a direct interaction of CK2 with beta-catenin in the protein interaction figures, although links to a paper describing phosphorylation of beta-catenin by CK2 are provided. Our search independently found two articles, the cited article (Song et al., 2003) and a morphological study (Rosner et al., 2002), which describe the direct interaction of CK2 with beta-catenin. The chi-squared values for CK2 and beta-catenin are 1179.50 and 40537.69, respectively, suggesting that these terms are significantly over-represented in the Wnt literature as a whole and that this interaction should be a directly featured pair in the gold standard map.

Six3 and Wnt regulation

The Wnt genes website lists Six3 [Sine oculis homeobox (Drosophila) homologue 3] as a Wnt target gene (Lagutin et al., 2003). Six3 also feeds back to repress Wnt expression, an interaction not mentioned on the website and specifically absent from the table of Wnt feedback target genes, although again a paper cited by the website describes this interaction (Braun et al., 2003).

Pathway expansion: Wnt downstream targets

Chen et al. (2001) report that Wnt-1 signaling inhibits apoptosis and caspase activation induced by cancer chemotherapy. Such distant pathway cross-talk events of activation and regulation between Wnt and other pathways are difficult to curate manually and by definition are often not fully referenced in ‘canonical’ diagrams. In particular, remote downstream activation or cross-talk between proteins downstream of the canonical pathway are areas where statements in the literature could be mined by automatic annotation software.

Wnt-7a and LMX-1b

Lmx1b is induced in the mouse dorsal mesenchyme by wnt-7a and it is both necessary and sufficient to specify dorsal limb pattern (Liu et al., 2003). The activation pattern was not noted in the Wnt genes website, but was found amongst the interactions by the machine parse (in article PMID 12588849) (Liu et al., 2003).

Typographical corrections: Pygopus and Pygopos

Human typists are not infallible. The name recognizer component of the pipeline automatically discovered the Pygopus name but missed the interaction with ‘Pygopos’. The latter term entered the term list through human entry, and manual review showed that the misspelling originated in the annotation of the Wnt signaling canonical pathway itself. The example serves not as any particular criticism of the pathway map, but rather highlights the risk of relying on human-typed input for pathway annotations: automated systems do not fatigue or commit unintentional typographical mistakes, whereas human input can lead to a certain degree of error even in highly curated databases.

CONCLUSION

Our results with automatic component identification and interaction detection in the Wnt signaling pathway suggest that natural language techniques are able to substantially improve the coverage of canonical reference literature and signaling models. The high precision and processing speed of this automated signaling interaction pipeline demonstrate the value of full parsers and statistical techniques. Using this approach as a ‘first-pass’ filter into the literature can usefully assist scientists maintaining databases and information resources in complex and rapidly evolving fields such as signaling pathways. As with any fully automated system, however, the recall rates with respect to the known canonical models do not yet match those of an expert human reviewer.

In the future, we expect to capture directionality and type of interaction in a more robust way for our assertions. This will require more template development, and may require the use of an ontology for an outside reference source for error-detection of incorrect assertions. The role we most expect this system to serve is a real-time scanning facility for new articles, searching for newly discovered interactions. Automated computational methods are capable of analyzing a much broader coverage of literature than would be feasible for a human reviewer to perform. In this role, there is a premium on specificity to avoid overloading the manual reviewer with erroneous matches, and our results suggest that deep-parsing, automated natural language processing technology is now capable of achieving this requirement.

We found that our auto-categorization module, using statistical and natural-language parsing techniques, allowed us to build a named entity list at run-time, rather than requiring a cumbersome fixed named entity assembler before processing. This approach was perhaps our main advantage in this pipeline because, unlike general English-language texts, the biomedical literature enjoys a substantial human-built hierarchical index via the MeSH tags provided by MEDLINE.

MeSH indexing provides a powerful tool for building reference and background article sets that can be used to search a specific article corpus for biologically relevant named entities which are typically over-represented with high statistical significance. The fast partial parser CASS serves a useful role in assigning multiple-word entities. CASS is uniquely powerful in its ability to efficiently process very large collections of text. This speed is a result of algorithmic efficiencies which are unlikely to be matched by more complete full parsers. The combination of fast partial-parse, exploiting MeSH indexing and statistical analysis of multiple word phrases significantly simplifies our task of assembling a comprehensive term list.

At a deeper level of text interpretation, the Link parser provides us with grammatical relations, which allows us to move beyond simple association statistics to access the information encoded in the grammatical structure of sentences. While some sentences in biomedical text are too complex to be accurately parsed using current technology, we find that parsers such as Link are able to accurately and efficiently parse the majority of sentences in the molecular biology literature. Using the integrated approach described above, we are beginning to be able to analyze the knowledge encoded in biomedical text.

Table 1

Performance metrics from Wnt hand-input named entities

Category | Count (%) | Interaction as detected | PubMed ID | Example sentence | Example tuple (short format)
Total manually counted sample | 340 | - | - | - | -
Total false positive names (ignored from both tallies below) | 27 (7.9%) | Akt <-> Tir | 12896980 | Although Akt activity was also induced by Tiron and DPI, the other two free-radical scavengers examined, only selenite supported cell growth. | LIN: [Akt activity.n] v:<was.v> [m:<induced.v> only [pp by Tiron]]
Total indirectly/categorically correct interactions (A pathway…B pathway…ignoring name errors) | 175 (55.9%) | Akt <-> PI3-kinase | 14557259 | Akt is activated by many growth factors and cytokines in a PI3-kinase-dependent manner. | Akt v:<is.v> [m:<activated.v> [pp in [a PI3-kinase-dependent manner.n]] [pp by [many cytokines.n]]]
Total directly/physical interaction correct (A->binds->B, ignoring name errors) | 108 (34.5%) | Dvl <-> Axin | 11113207 | Consistent with these results, Dvl interacts with Axin and inhibits GSK-3 beta-dependent phosphorylation of beta-catenin, APC, and Axin in the Axin complex. | Dvl v:<interacts.v> [pp with Axin]
Total correct names, but error in the parse (ignoring name errors) | 30 (9.5%) | Dvl <-> Axin | 11113207 | Consistent with these results, Dvl interacts with Axin and inhibits GSK-3 beta-dependent phosphorylation of beta-catenin, APC, and Axin in the Axin complex. | Dvl v:<inhibits.v> [pp in [the Axin complex.n]]
Total gold standard associations detected | 34 of 53 (64.1%)
Parse/extract precision for assertions with correctly selected names | (175 + 108)/(340 − 27) = 90.4%
Recall versus gold standard (Wnt genes website) | 34/53 (64.1%)
Separate unique interactions (overall) | 1210

Interacting proteins are represented in bold.

We wish to thank Dr Stephen Abney, Dragomir Radev and H.V. Jagadish for many hours of thoughtful discussion and critical feedback. This project was supported in part by a grant from the NIH/National Library of Medicine (R01 LM008106).

REFERENCES

Abney, S. (1996a) J. Natural Language Eng., 2, 337–344.
Abney, S. (1996b) Statistical Methods and Linguistics. The MIT Press, Cambridge, MA.
Andrade, M.A. and Valencia, A. (1997) Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Development of a prototype system. Proc. Int. Conf. Intell. Syst. Mol. Biol., 5, 25–32.
Bader, G.D. and Hogue, C.W. (2002) Analyzing yeast protein-protein interaction data obtained from different sources. Nat. Biotechnol., 20, 991–997.
Blaschke, C., et al. (1999) Automatic extraction of biological information from scientific text: protein-protein interactions. Proceedings of the AAAI Conference on Intelligent Systems for Molecular Biology (ISMB). AAAI Press, pp. 60–67.
Braun, M.M., Etheridge, A., Bernard, A., Robertson, C.P., Roelink, H. (2003) Wnt signaling is required at distinct stages of development for the induction of the posterior forebrain. Development, 130, 5579–5587.
Chen, S., Guttridge, D.C., You, Z., Zhang, Z., Fribley, A., Mayo, M.W., Kitajewski, J., Wang, C.Y. (2001) Wnt-1 signaling inhibits apoptosis by activating beta-catenin/T cell factor-mediated transcription. J. Cell Biol., 152, 87–96.
Daraselia, N., Yuryev, A., Egorov, S., Novichkova, S., Nikitin, A., Mazo, I. (2004) Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics, 20, 604–611.
Iliopoulos, I., Enright, A.J., Ouzounis, C.A. (2001) Textquest: document clustering of Medline abstracts for concept discovery in molecular biology. Pac. Symp. Biocomput., 384–395.
Kishida, M., Hino, S., Michiue, T., Yamamoto, H., Kishida, S., Fukui, A., Asashima, M., Kikuchi, A. (2001) Synergistic activation of the Wnt signaling pathway by Dvl and casein kinase Iepsilon. J. Biol. Chem., 276, 33147–33155.
Koike, A., Kobayashi, Y., Takagi, T. (2003) Kinase pathway database: an integrated protein-kinase and NLP-based protein-interaction resource. Genome Res., 13, 1231–1243.
Lagutin, O.V., Zhu, C.C., Kobayashi, D., Topczewski, J., Shimamura, K., Puelles, L., Russell, H.R., McKinnon, P.J., Solnica-Krezel, L., Oliver, G. (2003) Six3 repression of Wnt signaling in the anterior neuroectoderm is essential for vertebrate forebrain development. Genes Dev., 17, 368–379.
Liu, C., Nakamura, E., Knezevic, V., Hunter, S., Thompson, K., Mackem, S. (2003) A role for the mesenchymal T-box gene Brachyury in AER formation during limb development. Development, 130, 1327–1337.
Marcotte, E.M., Xenarios, I., Eisenberg, D. (2001) Mining literature for protein–protein interactions. Bioinformatics, 17, 359–363.
National Center for Biotechnology Information (NCBI) (2004) Entrez Programming Utilities.
Nusse, R. (2004) The Wnt gene Homepage (Howard Hughes Medical Institute).
Raychaudhuri, S., Schutze, H., Altman, R.B. (2002) Using text analysis to identify functionally coherent gene groups. Genome Res., 12, 1582–1590.
Rosner, A., Miyoshi, K., Landesman-Bollag, E., Xu, X., Seldin, D.C., Moser, A.R., MacLeod, C.L., Shyamala, G., Gillgrass, A.E., Cardiff, R.D. (2002) Pathway pathology: histological differences between ErbB/Ras and Wnt pathway transgenic mammary tumors. Am. J. Pathol., 161, 1087–1097.
Sleator, D. and Temperly, D. (1991) Parsing English with a Link Grammar. Computer Science Technical Report CMU-CS-91-916, Carnegie Mellon University.
Song, D.H., Dominguez, I., Mizuno, J., Kaut, M., Mohr, S.C., Seldin, D.C. (2003) CK2 phosphorylation of the armadillo repeat region of beta-catenin potentiates Wnt signaling. J. Biol. Chem., 278, 24018–24025.
Stapley, B.J. and Benoit, G. (2000) Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pac. Symp. Biocomput., 529–540.
Stephens, M., Palakal, M., Mukhopadhyay, S., Raje, R., Mostafa, J. (2001) Detecting gene relations from Medline abstracts. Pac. Symp. Biocomput., 483–495.
Temkin, J.M. and Gilder, M.R. (2003) Extraction of protein interaction information from unstructured text using a context-free grammar. Bioinformatics, 19, 2046–2053.
Wilbur, W.J. and Yang, Y. (1996) An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts. Comput. Biol. Med., 26, 209–222.