Abstract

Motivation: Many practical tasks in biomedicine require accessing specific types of information in scientific literature; e.g. information about the methods, results or conclusions of the study in question. Several approaches have been developed to identify such information in scientific journal articles. The best of these have yielded promising results and proved useful for biomedical text mining tasks. However, relying on fully supervised machine learning (ml) and a large body of annotated data, existing approaches are expensive to develop and port to different tasks. A potential solution to this problem is to employ weakly supervised learning instead. In this article, we investigate a weakly supervised approach to identifying information structure according to a scheme called Argumentative Zoning (az). We apply four weakly supervised classifiers to biomedical abstracts and evaluate their performance both directly and in a real-life scenario in the context of cancer risk assessment.

Results: Our best weakly supervised classifier (based on the combination of active learning and self-training) performs well on the task, outperforming our best supervised classifier: it yields a high accuracy of 81% when just 10% of the labeled data is used for training. When cancer risk assessors are presented with the resulting annotated abstracts, they find relevant information in them significantly faster than when presented with unannotated abstracts. These results suggest that weakly supervised learning could be used to improve the practical usefulness of information structure for real-life tasks in biomedicine.

Availability: The annotated dataset, classifiers and the user test for cancer risk assessment are available online at http://www.cl.cam.ac.uk/~yg244/11bioinfo.html.

Contact: anna.korhonen@cl.cam.ac.uk

1 INTRODUCTION

Many practical tasks in biomedicine require accessing specific types of information in scientific literature. For example, a biomedical scientist may be looking for information about the objective of the study in question, the methods used, the results obtained or the conclusions drawn by the authors. Similarly, many biomedical text mining tasks (e.g. information extraction, summarization) focus on the extraction of specific types of information in documents only.

To date, a number of approaches have been proposed for the classification of sentences in scientific literature according to categories of information structure (or discourse, rhetorical, argumentative or conceptual structure, depending on the framework in question). Some of the approaches classify sentences according to typical section names seen in scientific documents (Hirohata et al., 2008; Lin et al., 2006), while others are based e.g. on argumentative zones (Mizuta et al., 2006; Teufel and Moens, 2002; Teufel et al., 2009), qualitative dimensions (Shatkay et al., 2008) or conceptual structure (Liakata et al., 2010) of documents.

The best current approaches have yielded promising results and proved useful for information retrieval, information extraction and summarization tasks (Mizuta et al., 2006; Ruch et al., 2007; Tbahriti et al., 2006; Teufel and Moens, 2002). However, relying on fully supervised machine learning (ml) and a large body of annotated data, existing approaches are expensive to develop and port to different domains and tasks, and thus intractable for use in real-life applications.

A potential solution to this bottleneck is to develop techniques based on weakly supervised ml instead. Making use of a small amount of labeled data and a large pool of unlabeled data, weakly supervised learning (e.g. semi-supervision, active learning, co/tri-training, self-training) aims to keep the advantages of fully supervised approaches. It has been applied to a wide range of natural language processing (nlp) and text mining tasks, including named-entity recognition, question answering, information extraction, text classification and many others (Abney, 2008), yielding performance levels similar or equivalent to those of fully supervised techniques.

In this article, we investigate the potential of weakly supervised learning for Argumentative Zoning (az) of biomedical abstracts. az provides an analysis of the argumentative structure (i.e. the rhetorical progression of the argument) of a scientific document (Teufel and Moens, 2002). It has been used to analyze scientific texts in various disciplines—including computational linguistics (Teufel and Moens, 2002), law (Hachey and Grover, 2006), biology (Mizuta et al., 2006) and chemistry (Teufel et al., 2009)—and has proved useful for nlp tasks such as summarization (Teufel and Moens, 2002). However, the application of az to different domains has required laborious annotation exercises, which suggests that a weakly supervised approach would be more practical for the real-world application of az.

Taking two supervised classifiers as a comparison point—Support Vector Machines (svm) and Conditional Random Fields (crf)—we investigate the performance of four weakly supervised classifiers for az: two based on semi-supervised learning (transductive svm and semi-supervised crf) and two on active learning (active svm alone and in combination with self-training). We apply these classifiers to the az-annotated biomedical abstracts in the recent dataset of Guo et al. (2010). The results are promising. Our best weakly supervised classifier (active svm with self-training) outperforms the best supervised classifier (svm), yielding a high accuracy of 81% when using just 10% of the labeled data for training. When using just one-third of the labeled data, it performs as well as a fully supervised svm using 100% of the labeled data.

The abstracts in the dataset of Guo et al. (2010) were selected on the basis of their suitability for cancer risk assessment (cra). This enables us to conduct a user-based evaluation of the practical usefulness of our approach for the real-world task of cra. We investigate whether cancer risk assessors find relevant information in abstracts faster when the abstracts are annotated for az using our best weakly supervised approach. The results are promising: although manual annotations yield the biggest time savings, 10–13% (compared with the time it takes to examine unannotated abstracts), considerable savings of 7–8% are also obtained with weakly supervised ml annotations (using active svm with self-training).

In sum, our investigation shows that weakly supervised az can be employed to improve the practical applicability and portability of az to different information access tasks and that its accuracy is high enough to benefit a real-life task in biomedicine.

2 METHODS

2.1 Data

We used in our experiments the recent dataset of Guo et al. (2010), consisting of 1000 cra abstracts (7985 sentences and 225 785 words) annotated according to az. Originally introduced by Teufel and Moens (2002), az is a scheme that provides an analysis of the argumentative structure of a document, following the knowledge claims made by the authors. The dataset of Guo et al. (2010) has been annotated according to the version of az developed for biology papers (Mizuta et al., 2006), with only minor modifications concerning zone names. Seven categories of this scheme (out of the 10 possible) actually appear in abstracts and hence in the dataset. These are shown and explained in Table 1. The table also shows one example sentence per category.

Table 1.

Categories of az appearing in the corpus of Guo et al. (2010)

Category Abbreviation Definition and example 
Background bkg The circumstances pertaining to the current work, situation, or its causes, history, etc. 
  e.g. Concerns about the possible toxic effects of workplace exposures in the synthetic rubber industry have centered on 1,3-butadiene (BD), styrene and dimethyldithiocarbamate (DMDTC)
Objective obj A thing aimed at or sought, a target or goal 
  e.g. The objective of this research was to evaluate techniques for the rapid detection of chromosomal alterations occurring in humans exposed to butadiene
Method meth A way of doing research, esp. according to a defined and regular plan; a special form of procedure or characteristic set of procedures employed in a field of study as a mode of investigation and inquiry 
  e.g. The hypoxanthine-guanine phosphoribosyltransferase (HPRT) and thymidine kinase (TK) mutant frequencies (MFs) were measured using a cell cloning assay
Result res The effect, consequence, issue or outcome of an experiment; the quantity, formula, etc. obtained by calculation 
  e.g. Replication past the N3 2'-deoxyuridine adducts was found to be highly mutagenic with an overall mutation yield of approximately 97%. 
Conclusion con A judgment or statement arrived at by any reasoning process; an inference, deduction, induction; a proposition deduced by reasoning from other propositions; the result of a discussion, or examination of a question, final determination, decision, resolution, final arrangement or agreement 
  e.g. Thus, in terms of mutagenic efficiency, stereochemical configurations of EB and DEB are not likely to play a significant role in the mutagenicity and carcinogenicity of BD
Related work rel A comparison between the current work and the related work 
  e.g. These data are much lower compared to previously reported values measured by GC-MS/MS
Future work fut The work that needs to be done in the future 
  e.g. Additional studies are needed to examine the importance of base excision repair (BER) in maintaining genomic integrity, the differential formation of DNA and protein adducts in deficient strains, and the potential for enhanced sensitivity to BD genotoxicity in mice either lacking or deficient in both biotransformation and DNA repair activity

Table 2 shows the distribution of sentences per category in the corpus: Result (res) is by far the most frequent category (accounting for 40% of the corpus), while Background (bkg), Objective (obj), Method (meth) and Conclusion (con) cover 8–18% of the corpus each. Two categories, Related work (rel) and Future work (fut), are low-frequency categories, each covering only 1% of the corpus.

Table 2.

Distribution of sentences in the az-annotated corpus

 bkg obj meth res con rel fut 
Word 36 828 23 493 41 544 89 538 30 752 2456 1174 
Sentence 1429 674 1473 3185 1082 95 47 
Sentence (%) 18 8 18 40 14 1 1 

Guo et al. (2010) reported the inter-annotator agreement between their three annotators: one linguist, one computational linguist and one domain expert. The agreement (κ=0.85) is relatively high according to Cohen (1960).

2.2 Automatic classification

2.2.1 Features and feature extraction

The first step in automatic classification is to select a set of features that may indicate az categories in abstracts. Following Guo et al. (2010), we implemented a set of features that have proved successful in related work (e.g. Hirohata et al., 2008; Lin et al., 2006; Mullen et al., 2005; Teufel and Moens, 2002):

  • Location. The parts where a sentence begins and ends. Each abstract was divided into 10 parts (1–10, measured by the number of words).

  • Word. All the words in the corpus (the value of the Word feature is 1 if the word occurs in the sentence and 0 otherwise; the same applies to the following features).

  • Bi-gram. Any combination of two adjacent words in the corpus.

  • Verb. All the verbs in the corpus.

  • Verb Class. 60 verb classes appearing in biomedical journal articles.

  • Part-of-Speech – pos. The pos tag of each verb in the corpus.

  • Grammatical Relation – gr. Subject (ncsubj), direct object (dobj), indirect object (iobj) and second object (obj2) relations in the corpus. e.g. (ncsubj observed_14 difference_5 obj).

  • Subj and Obj. The subjects and objects appearing with any verbs in the corpus (extracted from grs).

  • Voice. The voice of verbs (active or passive) in the corpus.

These features were extracted from the corpus using a number of tools. A tokenizer was used to detect sentence boundaries and to perform basic tokenization (handling, in extreme cases, complex biomedical terms such as 2-amino-3,8-diethylimidazo[4,5-f]quinoxaline). The C&C tools (Curran et al., 2007) were used for pos tagging, lemmatization and parsing. The lemma output was used for the Word, Bi-gram and Verb features, and the gr output for the gr, Subj, Obj and Voice features. The ‘obj’ marker in a subject relation indicates passive voice [e.g. (ncsubj observed_14 difference_5 obj)]. Verb classes were obtained automatically using unsupervised spectral clustering (Sun and Korhonen, 2009). To reduce data sparsity, we lemmatized the lexical items for all the features, and removed words and grs with <2 occurrences and bi-grams with <5 occurrences.
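As an illustration, the Location feature can be computed as follows. This is a minimal sketch, not the paper's implementation: the function name and the 0-based, end-exclusive word-offset convention are ours.

```python
def location_feature(abstract_words, sent_start, sent_end, n_parts=10):
    """Map a sentence's start and end word offsets to parts 1..n_parts.

    The abstract is divided into n_parts equal spans by word count, as in
    the Location feature described above; sent_start/sent_end are 0-based
    word offsets with sent_end exclusive.
    """
    total = len(abstract_words)

    def part(i):
        # integer bin of word i, clamped to the last part
        return min(i * n_parts // total + 1, n_parts)

    return part(sent_start), part(sent_end - 1)

# A 100-word abstract: a sentence covering words 0-11 starts in part 1
# and ends in part 2.
abstract = ["w%d" % i for i in range(100)]
print(location_feature(abstract, 0, 12))    # -> (1, 2)
print(location_feature(abstract, 95, 100))  # -> (10, 10)
```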

2.2.2 Machine learning methods

The next step is to assign sentences in abstracts to zone categories using machine learning. Support vector machines (svm) and conditional random fields (crf) have proved to be the best performing fully supervised methods in recent related work (e.g. Guo et al., 2010; Hirohata et al., 2008; Mullen et al., 2005; Teufel and Moens, 2002). We therefore implemented these methods as well as weakly supervised variations of them: active svm with and without self-training, transductive svm and semi-supervised crf.

Supervised methods: svm aims to find the maximum-margin hyperplane, which has the largest distance to the nearest data points of any class. The problem is defined as:

\[
\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \ \ \text{for all } i,
\]

where x_i is a data point, y_i is its label, w is a normal vector to the hyperplane and 2/||w|| is the margin. We used the Weka software (Hall et al., 2009) employing the smo algorithm (Platt, 1999b) with a linear kernel for the svm experiments.

crf is an undirected graphical model that defines a probability distribution over the hidden states (e.g. label sequences) given the observations. The probability of a label sequence y given an observation sequence x can be written as:

\[
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_j \theta_j F_j(y, x) \Big),
\]

where F_j(y, x) is a real-valued feature function of the states and the observations, θ_j is the weight of F_j, and Z(x) is a normalization factor. We used the Mallet software (http://mallet.cs.umass.edu) employing the l-bfgs algorithm (Nocedal, 1980) for the crf experiments.

Weakly supervised methods: Active svm (asvm) starts with a small amount of labeled data, and iteratively chooses a certain amount of unlabeled data, about which the classifier is least certain, to be manually labeled (the labels can be restored from the fully annotated corpus) for the next round of learning. We used an uncertainty sampling query strategy (Lewis and Gale, 1994). In particular, we compared the posterior probabilities of the best estimate given each unlabeled instance, and chose the instances with the lowest probabilities to be labeled for later use. The probabilities can be calculated by fitting a sigmoid after the standard svm (Platt, 1999a) and, in the multi-class case, combined using a pairwise coupling algorithm (Hastie and Tibshirani, 1998). We used the -M flag in Weka for computing the posterior probabilities.
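The uncertainty sampling step can be sketched as follows, using scikit-learn in place of the Weka tools actually used in the experiments (an assumption for illustration); `SVC(probability=True)` fits a sigmoid on the svm outputs as in Platt (1999a):

```python
import numpy as np
from sklearn.svm import SVC

def select_uncertain(clf, X_unlabeled, n):
    """Indices of the n unlabeled instances the classifier is least certain about."""
    posteriors = clf.predict_proba(X_unlabeled)  # Platt-scaled posteriors
    confidence = posteriors.max(axis=1)          # probability of the best estimate
    return np.argsort(confidence)[:n]            # lowest-confidence first

# Toy data standing in for sentence feature vectors and zone labels:
rng = np.random.RandomState(0)
X_lab = rng.randn(40, 5)
y_lab = (X_lab[:, 0] > 0).astype(int)
X_unl = rng.randn(200, 5)

clf = SVC(kernel="linear", probability=True).fit(X_lab, y_lab)
query = select_uncertain(clf, X_unl, 10)         # send these for manual labeling
```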

Active svm with self-training (assvm) is an extension of asvm where each round of learning has two steps:

  • Active learning

    • Train a new classifier on all the labeled examples.

    • Apply the current classifier to each unlabeled example.

    • Find n examples about which the classifier is least certain to be manually labeled.

  • Self-training

    • Train a new classifier on both labeled and unlabeled/machine-labeled data using the estimates from step (i)(b).

    • Test the current classifier on test data.

Transductive svm (tsvm) is an extension of svm that aims to:

\[
\min_{\mathbf{w},\,b,\,y^{(u)}}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \quad y^{(u)}_j(\mathbf{w}\cdot\mathbf{x}^{(u)}_j + b) \ge 1,
\]

where x^(u) is unlabeled data and y^(u) is the estimate of its label. The idea is to find a prediction on the unlabeled data such that the decision boundary has the maximum margin on both the labeled and the unlabeled (now labeled) data. The latter guides the decision boundary away from dense regions. We used the UniverSVM software (http://3t.kyb.tuebingen.mpg.de/bs/people/fabee/universvm.html) employing the cccp algorithm (Collobert et al., 2006) for the tsvm experiments.
Semi-supervised crf (sscrf) can be implemented by entropy regularization (Jiao et al., 2006), which extends the objective function on the labeled data, \(\sum_{l \in L} \log p(y^{(l)} \mid x^{(l)}, \theta)\), with an additional term, \(\lambda \sum_{u \in U} \sum_{y} p(y \mid x^{(u)}, \theta) \log p(y \mid x^{(u)}, \theta)\), which minimizes the conditional entropy of the model's predictions on the unlabeled data. We used the Mallet software for the sscrf experiments.
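The two-step assvm round described above can be sketched as follows, again with scikit-learn rather than the Weka/UniverSVM/Mallet tools used in the experiments (an assumption); the oracle array plays the role of the fully annotated corpus from which the "manual" labels are restored:

```python
import numpy as np
from sklearn.svm import SVC

def assvm_round(X_lab, y_lab, X_unl, oracle, n_query=10):
    """One round of active learning followed by self-training (a sketch)."""
    # (i) active learning: train, score the unlabeled data, query the
    # least-certain points for manual labels
    clf = SVC(kernel="linear", probability=True).fit(X_lab, y_lab)
    confidence = clf.predict_proba(X_unl).max(axis=1)
    queried = np.argsort(confidence)[:n_query]
    rest = np.setdiff1d(np.arange(len(X_unl)), queried)
    # (ii) self-training: manual labels for the queried points, machine
    # labels (the current classifier's estimates) for the rest
    X_all = np.vstack([X_lab, X_unl[queried], X_unl[rest]])
    y_all = np.concatenate([y_lab, oracle[queried], clf.predict(X_unl[rest])])
    return SVC(kernel="linear", probability=True).fit(X_all, y_all)

# Toy data standing in for sentence feature vectors and zone labels:
rng = np.random.RandomState(1)
X = rng.randn(120, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = assvm_round(X[:20], y[:20], X[20:], y[20:])
```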

2.2.3 Evaluation methods

We evaluated the ml results in terms of accuracy, precision (p = tp/(tp + fp)), recall (r = tp/(tp + fn)) and F-score (F = 2pr/(p + r)) against manual annotations. We used 10-fold cross-validation for all the methods to avoid the possible bias introduced by relying on any particular split of the data. The data were randomly assigned to 10 folds of roughly the same size. Each fold was used once as test data and the remaining nine folds as training data (with x% being manually labeled). The results were then averaged. As randomly selected labeled data were used for svm, crf, tsvm and sscrf, the results for these methods were averaged over five runs. Following Dietterich (1998), we used McNemar's test (McNemar, 1947) to measure the statistical significance of the differences between the results of supervised and weakly supervised learning. The chosen significance level was 0.05.
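McNemar's test compares two classifiers on the same test items using only the counts b and c of items that exactly one of the two gets right. A minimal sketch with the standard continuity correction (the counts below are invented for illustration):

```python
def mcnemar_statistic(b, c):
    """Chi-square statistic (1 degree of freedom) with continuity correction.

    b: items classifier 1 alone labels correctly; c: items classifier 2
    alone labels correctly. Items both get right or both get wrong cancel out.
    """
    return (abs(b - c) - 1) ** 2 / (b + c)

def significant_at_005(b, c):
    # 3.841 is the 0.95 quantile of the chi-square distribution with 1 df
    return mcnemar_statistic(b, c) > 3.841

# e.g. one classifier alone correct on 40 sentences, the other alone on 18:
print(round(mcnemar_statistic(40, 18), 2))  # -> 7.6
print(significant_at_005(40, 18))           # -> True
```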

2.3 User test in the context of cra

A major time-consuming component of chemical cancer risk assessment (cra) is the review and analysis of existing scientific literature on the chemical in question. MEDLINE (http://www.nlm.nih.gov/databases/databases_medline.html) abstracts are typically used as a starting point in this work. Risk assessors (e.g. toxicologists, biologists) read the abstracts of interest, looking for various information in them (e.g. about the methods, results and conclusions of the study in question) (Korhonen et al., 2009). One way to speed up this work is to annotate the abstracts with categories of information structure so that the information of interest can be found faster. Guo et al. (2011) investigated this idea first and showed that time savings can be obtained in literature review when abstracts are annotated either manually or automatically (using fully supervised ml) according to different information structure schemes.

We evaluated our weakly supervised approach to az in a similar way, but re-designed the evaluation of Guo et al. (2011) so that it is better controlled and covers a wider range of information. Cancer risk assessors working at Karolinska Institutet (Stockholm, Sweden) provided us with a list of 10 questions considered when studying abstracts for cra purposes. We turned any open-ended questions (e.g. Author's conclusions?) into more controlled ones (e.g. Is the outcome of the study expected, unexpected, or neither/neutral?) so that each question has either a yes/no or a multiple-choice answer (Table 3). We then designed an online questionnaire in which each question–answer pair is displayed to an expert on a separate page and the zone(s) most relevant for answering the question are highlighted with colors so as to attract the expert's attention (Fig. 1).

Fig. 1.

An example of the questionnaire.


Table 3.

Questions and highlighted zones

Question Zone 
Q1 Do the authors discuss previous or related research on the topic? y/n bkg rel 
Q2 Do the authors describe the aim of the research? y/n obj 
Q3 What is the main type of study the abstract focuses on? animal study/human study/in vitro study meth 
Q4 Is exposure length mentioned? y/n meth 
Q5 Is dose mentioned? y/n meth 
Q6 Is group size mentioned? y/n meth 
Q7 How many endpoints are mentioned? 0/1/more res 
Q8 Are the results positive? y/n/unclear res 
Q9 Is the outcome of the study expected/unexpected/neutral? con 
Q10 Do the authors mention a need for future research on the topic? y/n fut 

Two experts participated in the test: one professor-level expert (A) with long experience in cra (over 25 years) and one more junior expert (B) with a PhD in toxicology and over 5 years of experience in cra. Each expert was presented with the same set of 200 abstracts focusing on four chemicals (butadiene, diethylnitrosamine, diethylstilbestrol and phenobarbital): (i) 50 unannotated, (ii) 50 manually annotated, (iii) 50 assvm-annotated and (iv) 50 randomly annotated abstracts (i.e. annotated so that sentences were assigned to zones on the basis of their observed distribution in the training data). We compared the time it took for the experts to answer the questions when presented with abstracts in (i)–(iv), and examined whether the differences are statistically significant [significance level of 0.05, Mann–Whitney U test (Mann and Whitney, 1947; Wilcoxon, 1945)]. In addition, we evaluated the impact of (i)–(iv) on the quality of the experts' answers by examining inter-expert agreement.
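The Mann–Whitney U statistic used above counts, over all cross-sample pairs, how often a value from one sample is smaller than one from the other (ties count one half). In practice a library routine would be used; the definition is simple enough to sketch directly (the timing values below are toy numbers, not the experts' actual times):

```python
def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U: the smaller of U and n1*n2 - U."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x < y:
                u += 1.0
            elif x == y:
                u += 0.5   # ties contribute one half
    return min(u, len(xs) * len(ys) - u)

# Toy answer times (seconds) with annotated vs unannotated abstracts;
# the smaller U is, the stronger the evidence of a difference:
print(mann_whitney_u([12, 15, 11, 9], [20, 18, 25, 17]))  # -> 0.0
```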

3 RESULTS

3.1 Automatic classification

Table 4 shows the results for the four weakly supervised and two supervised methods when using 10% of the labeled data (i.e. ~700 sentences). assvm is the best performing method, with an accuracy of 81% and a macro F-score of 0.76. asvm performs nearly as well, with an accuracy of 80% and an F-score of 0.75. Both methods outperform supervised svm with a statistically significant difference (P<0.001). tsvm is the lowest performing svm-based method: its performance is lower than that of the supervised svm. Yet, it outperforms both crf-based methods. sscrf performs slightly better than crf, with 1% higher accuracy and a 0.01 higher F-score. Only two methods (asvm, tsvm) find six out of the seven possible zone categories; the other methods find five. The one or two missing categories are low-frequency categories, each accounting for only 1% of the corpus data (Table 2). The results for the other categories also seem to reflect the amount of corpus data available per category (Table 2), with res being the highest and obj the lowest performing category for most methods.

Table 4.

Results when using 10% of the labeled data

      Acc  F-score
           mf   bkg  obj  meth res  con  rel  fut
svm   0.77 0.73 0.79 0.60 0.70 0.84 0.69 –    –
crf   0.70 0.64 0.74 0.52 0.46 0.77 0.73 –    –
asvm  0.80 0.75 0.88 0.56 0.68 0.87 0.78 0.33 –
assvm 0.81 0.76 0.86 0.56 0.76 0.88 0.76 –    –
tsvm  0.76 0.72 0.82 0.57 0.69 0.82 0.72 0.08 –
sscrf 0.71 0.65 0.78 0.50 0.48 0.77 0.73 –    –

mf: Macro F-score calculated for the five high-frequency categories (bkg, obj, meth, res and con), which are found by all the methods.

Figure 2 shows the learning curves of the different methods (in terms of accuracy) when using 0–100% of the labeled data. assvm outperforms the other methods, reaching its best performance of 88% accuracy when using ~40% of the labeled data. It outperforms asvm (the second best method) in particular when 20–40% of the labeled data is used. When using 33% of the labeled data, it already performs as well as fully supervised svm (i.e. using 100% of the labeled data). svm and tsvm tend to perform quite similarly when >10% of the labeled data are used, but when less data are available, tsvm performs better. Looking at the crf-based methods, sscrf outperforms crf in particular when 10–25% of the labeled data are used. However, neither of them reaches the performance level of the svm-based methods, which is in line with the results for fully supervised crf and svm in Guo et al. (2011).

Fig. 2.

Learning curve of different methods when using 0–100% of the labeled data.


To investigate which features are the most useful for weakly supervised learning, we took our best performing method, assvm, and conducted a leave-one-out analysis of the features with 10% of the labeled data. The results in Table 5 show that Location is by far the most useful feature, in particular for bkg, meth and con: the overall performance drops by 8% in accuracy and 0.09 in F-score when this feature is removed. Removing POS has an almost equally strong effect, in particular on bkg and meth. Voice, Verb Class and GR also contribute to general performance. Among the least helpful features are those which suffer from sparse data problems, e.g. Word, Bi-gram and Verb. They perform particularly badly on low-frequency zones.

Table 5.

Leave-one feature-out results for assvm with 10% of labeled data

         Acc. F-score
              mf   bkg  obj  meth res  con  rel  fut
Location 0.73 0.67 0.67 0.55 0.62 0.85 0.65 – – 
Word 0.80 0.78 0.87 0.70 0.74 0.85 0.72 – – 
Bigram 0.81 0.75 0.83 0.57 0.71 0.87 0.78 0.33 – 
Verb 0.81 0.79 0.84 0.77 0.73 0.87 0.75 – – 
VC 0.79 0.75 0.86 0.62 0.72 0.84 0.70 – – 
POS 0.74 0.70 0.66 0.65 0.66 0.82 0.73 – – 
GR 0.79 0.75 0.83 0.67 0.69 0.84 0.72 – – 
Subj 0.80 0.76 0.87 0.65 0.73 0.85 0.72 – – 
Obj 0.80 0.78 0.84 0.75 0.70 0.85 0.75 – – 
Voice 0.78 0.75 0.88 0.70 0.71 0.83 0.62 – – 
Φ 0.81 0.76 0.86 0.56 0.76 0.88 0.76 – – 

Φ: Employing all the features.

3.2 User test

Table 6 shows the time (measured in seconds) it took for experts A and B to answer the questions (individual and total) when presented with (i) unannotated, (ii) manually annotated, (iii) assvm-annotated and (iv) randomly annotated abstracts (see Section 2 for details of the experts and abstract groups), along with the percentage of time savings obtained when using annotations (ii)–(iv) (compared with (i)). time stands for the sample mean, and save for the percentage of time savings. Table 7 shows the statistical significance (P-values, Mann–Whitney U test) of the differences between the results for different abstract groups [e.g. (i) v. (ii)]. Looking at the overall figures (i.e. Total), both manual (ii) and assvm (iii) annotations help users find relevant information significantly faster than plain text abstracts (i): the percentage of time savings ranges between 7% and 13%, and the corresponding P-values range between <0.001 and 0.027. Although manual annotations save more time than assvm annotations (13% versus 7% for A, and 10% versus 8% for B), assvm annotations are surprisingly useful. Random annotations (iv) have a negative effect: both experts spend more time examining (iv) than (i) abstracts: 6% more for A and 19% more for B.

Table 6.

Time savings

      Q1        Q2        Q3        Q4        Q5        Q6        Q7        Q8        Q9        Q10       Total
      time save time save time save time save time save time save time save time save time save time save time save
(time: sample mean in seconds; save: percentage of time savings relative to (i))

Expert A
(i)   14.1  6.3  11.8  7.2  5.2  6.1  12.5  7.7  8.7  3.4  83.0
(ii)  11.9 16 5.7 9.1 23 7.2 4.7 5.6 10.6 15 7.2 7.2 17 3.1 72.2 13
(iii) 12.0 15 6.0 9.8 16 8.5 -17 5.9 -12 5.1 17 11.5 7.5 7.6 12 3.2 77.1 7
(iv)  12.6 11 7.5 -19 14.7 -25 8.6 -19 5.2 5.5 12.8 -2 7.4 9.5 -10 3.9 -12 87.8 -6

Expert B
(i)   10.1  9.8  8.8  9.6  4.9  5.7  12.4  9.2  12.7  3.8  87.1
(ii)  8.7 14 9.3 7.3 17 10.0 -5 4.9 5.0 12 12.1 7.3 21 9.6 24 3.9 -3 78.3 10
(iii) 9.0 11 10.0 -2 8.8 10.0 -5 5.2 -6 4.9 15 11.8 6.9 26 9.9 22 4.0 -6 80.5 8
(iv)  12.2 -21 12.4 -27 12.7 -44 11.2 -17 5.5 -12 4.9 14 16.0 -29 7.9 15 15.6 -24 5.0 -30 103.5 -19
Table 7.

Significance of the results in the previous table

              Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10   Total

Expert A
(i) v. (ii)   0.035 0.837 0.063 0.975 0.397 0.421 0.032 0.296 0.015 0.285 <0.001*
(i) v. (iii)  0.083 0.924 0.405 0.162 0.221 0.075 0.154 0.550 0.139 0.413 0.005
(i) v. (iv)   0.200 0.135 0.005 0.018 0.864 0.248 0.872 0.232 0.315 0.781 0.159
(ii) v. (iii) 0.570 0.633 0.235 0.141 0.052 0.242 0.326 0.530 0.321 0.851 0.041

Expert B
(i) v. (ii)   0.122 0.923 0.180 0.666 0.986 0.149 0.901 0.006 0.002 0.781 0.005
(i) v. (iii)  0.266 0.321 0.565 0.381 0.338 0.070 0.532 0.018 0.005 0.786 0.027
(i) v. (iv)   0.024 0.008 0.003 0.106 0.385 0.027 0.008 0.193 0.009 0.077 <0.001*
(ii) v. (iii) 0.682 0.188 0.050 0.729 0.535 0.693 0.477 0.341 0.795 0.667 0.619
  Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Total 
(i) v. (ii) 0.035 0.837 0.063 0.975 0.397 0.421 0.032 0.296 0.015 0.285 <0.001* 
 (i) v. (iii) 0.083 0.924 0.405 0.162 0.221 0.075 0.154 0.550 0.139 0.413 0.005 
 (i) v. (iv) 0.200 0.135 0.005 0.018 0.864 0.248 0.872 0.232 0.315 0.781 0.159 
 (ii) v. (iii) 0.570 0.633 0.235 0.141 0.052 0.242 0.326 0.530 0.321 0.851 0.041 
(i) v. (ii) 0.122 0.923 0.180 0.666 0.986 0.149 0.901 0.006 0.002 0.781 0.005 
 (i) v. (iii) 0.266 0.321 0.565 0.381 0.338 0.070 0.532 0.018 0.005 0.786 0.027 
 (i) v. (iv) 0.024 0.008 0.003 0.106 0.385 0.027 0.008 0.193 0.009 0.077 <0.001* 
 (ii) v. (iii) 0.682 0.188 0.050 0.729 0.535 0.693 0.477 0.341 0.795 0.667 0.619 

As we mentioned in Section 2.3: 'we compared the time it took for experts to answer the questions when presented with abstracts in (i)–(iv), and examined whether the differences are statistically significant [significance level of 0.05, Mann–Whitney U test (Mann and Whitney, 1947; Wilcoxon, 1945)]'. Values below 0.05 indicate that the differences are statistically significant.

*After rounding, this value is 0.00.

Looking at the results for individual questions, (ii) and (iii) are more helpful for answering broader questions (e.g. Q9: Is the outcome of the study expected/unexpected/neutral?) than more specific questions (e.g. Q4: Is exposure length mentioned?). Although (ii) is more helpful than (iii) for most questions, the majority of the differences are not statistically significant, showing that assvm annotations are almost as useful as manual annotations. assvm annotations have a negative effect on Q4 and Q5, which focus on meth, a high-frequency (accounting for 18% of the corpus) but less predictable (0.76 F-score for assvm) category.

Table 8 shows the joint probability of the experts' agreement on the answers. Annotations (ii), (iii) and (iv) do not affect agreement much: 0.82–0.86 with annotations versus 0.81 without. Interestingly, the experts agree most on the answers when using assvm-annotated abstracts. This demonstrates that automatic annotation does not degrade the quality of the answers.
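The joint probability of agreement used here is simply the proportion of identical answers. A minimal sketch (the answer lists below are hypothetical, not the experts' actual responses):

```python
# Joint probability of agreement: the fraction of items on which both
# experts give the same answer to a question. Hypothetical answers,
# for illustration only.
answers_a = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes", "yes", "no"]
answers_b = ["yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes", "yes"]

agreement = sum(a == b for a, b in zip(answers_a, answers_b)) / len(answers_a)
print(f"agreement = {agreement:.2f}")  # 8 of 10 answers match -> 0.80
```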

Table 8. Quality of answers (inter-expert agreement)

|       | Q1   | Q2   | Q3a  | Q3b  | Q3c  | Q4   | Q5   | Q6   | Q7   | Q8   | Q9   | Q10  | Total |
|-------|------|------|------|------|------|------|------|------|------|------|------|------|-------|
| (i)   | 0.63 | 0.51 | 0.78 | 0.90 | 0.80 | 0.86 | 0.90 | 0.92 | 0.86 | 0.76 | 0.92 | 0.90 | 0.81  |
| (ii)  | 0.70 | 0.66 | 0.92 | 0.96 | 0.92 | 0.82 | 0.92 | 0.90 | 0.72 | 0.76 | 0.88 | 0.88 | 0.84  |
| (iii) | 0.82 | 0.68 | 0.96 | 0.98 | 0.90 | 0.86 | 0.86 | 0.90 | 0.74 | 0.86 | 0.90 | 0.88 | 0.86  |
| (iv)  | 0.66 | 0.54 | 0.90 | 0.92 | 0.92 | 0.74 | 0.88 | 0.90 | 0.82 | 0.82 | 0.84 | 0.90 | 0.82  |

Since Q3 is a multiple-choice question, we report the inter-expert agreement for each option: Q3a,b,c.

4 DISCUSSION AND CONCLUSIONS

Our results show that weakly supervised ml can be used for the identification of information structure in biomedical abstracts. In our experiments, the majority of the weakly supervised methods (assvm, asvm and sscrf) outperformed their corresponding supervised methods (svm and crf). assvm/asvm select the most difficult instances (or the instances most distinct from the existing labeled data) to be manually labeled and then used in the next round of learning, offering a wider coverage of the possible inputs than svm. sscrf extends crf by taking into account the conditional entropy of the model's predictions on unlabeled data (favoring peaked, confident predictions), so that the decision boundary is moved into the sparse regions of the input space.
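As a sketch, the semi-supervised crf objective of Jiao et al. (2006) takes roughly the following form (notation mine: $\mathcal{L}$ for labeled sentences, $\mathcal{U}$ for unlabeled sentences, $\gamma$ for the trade-off weight):

```latex
% Maximize conditional log-likelihood on labeled data while penalizing
% the conditional entropy H of predictions on unlabeled data.
\max_{\theta} \;
  \sum_{(x, y) \in \mathcal{L}} \log p_{\theta}(y \mid x)
  \;-\;
  \gamma \sum_{x' \in \mathcal{U}} H\!\bigl( p_{\theta}(\,\cdot \mid x') \bigr)
```

Minimizing the entropy term pushes the model toward confident predictions on the unlabeled data, which is what moves the decision boundary into sparse regions of the input space.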

The best performing weakly supervised methods were those based on active learning. When using 10% of the labeled data, active learning combined with self-training (assvm) outperformed the best supervised method (svm) with a statistically significant difference. assvm reached its top performance (88% accuracy) when using 40% of the labeled data, and matched fully supervised svm when using just one-third of the labeled data. This result is in line with other text classification work in which active learning has proved similarly useful, e.g. Esuli and Sebastiani (2009); Lewis and Gale (1994). In addition, we have demonstrated that the accuracy of our best weakly supervised method (assvm) is high enough to benefit a real-life task in biomedicine: cancer risk assessors find relevant information in abstracts significantly faster (by 7–8%) when the abstracts are annotated using assvm than when they are unannotated. In sum, our research shows that applying az-style approaches to real-world biomedical tasks is realistic, as only a limited amount of labeled data is needed.

To the best of our knowledge, no previous work has been done on weakly supervised learning of textual information structure according to the family of schemes we have focused on (Guo et al., 2011; Hirohata et al., 2008; Liakata et al., 2010; Lin et al., 2006; Mizuta et al., 2006; Shatkay et al., 2008). Previous work on these schemes has been fully supervised in nature. In addition, although some of these works have been evaluated in the context of text mining tasks (e.g. information extraction, summarization), the only previous work reporting a user-centered evaluation in the context of a real-life biomedical task is that of Guo et al. (2011).

In the future, we plan to improve and extend this work in several directions. Semi-supervised learning (tsvm and sscrf) did not perform as well as active learning in our experiments, although it has proved successful in related work, e.g. Jiao et al. (2006). We suspect that this is due to the high dimensionality and sparseness of our labeled dataset. Given the high cost of obtaining labeled data, methods that need little of it are preferable. We therefore plan to experiment with more sophisticated active learning algorithms, e.g. margin sampling (Scheffer et al., 2001), query-by-committee (QBC) (Seung et al., 1992) and the svm simple margin (Tong and Koller, 2001). Combinations of other weakly supervised methods, e.g. EM + active learning (McCallum and Nigam, 1998) and co-training + EM + active learning (Muslea et al., 2002), would also be worth investigating. In addition, we plan to replace the svm-based model with other models, e.g. logistic regression, which outperforms svm in active learning as reported by Hoi et al. (2006). crf-based active learning might be a good option too.
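As an illustration of one such strategy, margin sampling queries the unlabeled instance whose two most probable labels the current model finds hardest to separate. The sketch below uses hypothetical sentence ids and probabilities (the category names echo az-style zones but are not the paper's outputs):

```python
# Margin sampling (Scheffer et al., 2001), sketched: query the instance
# whose best and second-best class probabilities are closest.
# The probability table is illustrative, not from the paper.
predictions = {
    "sent_1": {"OBJ": 0.70, "METH": 0.20, "RES": 0.10},
    "sent_2": {"OBJ": 0.45, "METH": 0.40, "RES": 0.15},
    "sent_3": {"OBJ": 0.90, "METH": 0.05, "RES": 0.05},
}

def margin(probs):
    # Difference between the two largest class probabilities.
    top_two = sorted(probs.values(), reverse=True)[:2]
    return top_two[0] - top_two[1]

# The instance with the smallest margin is sent to the annotator.
query = min(predictions, key=lambda s: margin(predictions[s]))
print(query)  # sent_2, whose margin (0.05) is the smallest
```

The queried instance is then labeled and added to the training pool for the next learning round, exactly the loop assvm/asvm follow with their own selection criterion.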

The work presented in this article has focused on the az scheme. In the future, we plan to investigate the usefulness of weakly supervised learning for identifying information structure according to other popular schemes (e.g. Hirohata et al., 2008; Liakata et al., 2010; Lin et al., 2006; Shatkay et al., 2008), and not only in scientific abstracts but also in full journal articles, which typically exemplify a larger set of scheme categories. Focusing on full journal articles will also enable further user-based evaluation. For example, although abstracts are a typical starting point in cra, subsequent steps of cra focus on information in full articles. These more challenging steps may benefit from az (and other types of) annotations to an even greater degree.

Funding: Royal Society (UK); Swedish Research Council; FAS (Sweden); Cambridge International Scholarship (to Y.G.) and EPSRC (EP/G051070/1 UK).

Conflict of Interest: none declared.

REFERENCES

Abney, S. (2008) Semi-Supervised Learning for Computational Linguistics. Chapman & Hall/CRC.

Cohen, J. (1960) A coefficient of agreement for nominal scales. Educ. Psychol. Measur., 20, 37–46.

Collobert, R. et al. (2006) Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning. ACM Press, pp. 201–208.

Curran, J.R. et al. (2007) Linguistically motivated large-scale NLP with C&C and Boxer. In Proceedings of the ACL 2007 Demonstrations Session. ACL, pp. 33–36.

Dietterich, T.G. (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput., 10, 1895–1923.

Esuli, A. and Sebastiani, F. (2009) Active learning strategies for multi-label text classification. In Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval. Springer, Berlin/Heidelberg, pp. 102–113.

Guo, Y. et al. (2010) Identifying the information structure of scientific abstracts: an investigation of three different schemes. In Proceedings of BioNLP. ACL, pp. 99–107.

Guo, Y. et al. (2011) A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment. BMC Bioinformatics, 12, 69.

Hachey, B. and Grover, C. (2006) Extractive summarisation of legal texts. Artif. Intell. Law, 14, 305–345.

Hall, M. et al. (2009) The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11, 10–18.

Hastie, T. and Tibshirani, R. (1998) Classification by pairwise coupling. Ann. Stat., 26, 451–471.

Hirohata, K. et al. (2008) Identifying sections in scientific abstracts using conditional random fields. In Proceedings of the 3rd International Joint Conference on Natural Language Processing. ACL, pp. 381–388.

Hoi, S.C.H. et al. (2006) Large-scale text categorization by batch mode active learning. In Proceedings of the 15th International Conference on World Wide Web. ACM, pp. 633–642.

Jiao, F. et al. (2006) Semi-supervised conditional random fields for improved sequence segmentation and labeling. In COLING/ACL. ACL, pp. 209–216.

Korhonen, A. et al. (2009) The first step in the development of text mining technology for cancer risk assessment: identifying and organizing scientific evidence in risk assessment literature. BMC Bioinformatics, 10, 303.

Lewis, D.D. and Gale, W.A. (1994) A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Springer-Verlag New York, pp. 3–12.

Liakata, M. et al. (2010) Corpora for the conceptualisation and zoning of scientific papers. In Proceedings of LREC'10. European Language Resources Association (ELRA).

Lin, J. et al. (2006) Generative content models for structural analysis of medical abstracts. In Proceedings of BioNLP-06, pp. 65–72.

Mann, H.B. and Whitney, D.R. (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat., 18, 50–60.

McCallum, A. and Nigam, K. (1998) Employing EM and pool-based active learning for text classification. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, pp. 350–358.

McNemar, Q. (1947) Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153–157.

Mizuta, Y. et al. (2006) Zone analysis in biology articles as a basis for information extraction. Int. J. Med. Informat., 75, 468–487.

Mullen, T. et al. (2005) A baseline feature set for learning rhetorical zones using full articles in the biomedical domain. Nat. Lang. Process. Text Min., 7, 52–58.

Muslea, I. et al. (2002) Active + semi-supervised learning = robust multi-view learning. In Proceedings of the Nineteenth International Conference on Machine Learning. Morgan Kaufmann, pp. 435–442.

Nocedal, J. (1980) Updating quasi-Newton matrices with limited storage. Math. Comput., 35, 773–782.

Platt, J.C. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers. MIT Press, pp. 61–74.

Platt, J.C. (1999) Using analytic QP and sparseness to speed training of support vector machines. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II. MIT Press, pp. 557–563.

Ruch, P. et al. (2007) Using argumentation to extract key sentences from biomedical abstracts. Int. J. Med. Inform., 76, 195–200.

Scheffer, T. et al. (2001) Active hidden Markov models for information extraction. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis. Springer, pp. 309–318.

Seung, H.S. et al. (1992) Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM, pp. 287–294.

Shatkay, H. et al. (2008) Multi-dimensional classification of biomedical text: toward automated, practical provision of high-utility text to diverse users. Bioinformatics, 24, 2086–2093.

Sun, L. and Korhonen, A. (2009) Improving verb clustering with automatically acquired selectional preference. In Proceedings of EMNLP. ACL, pp. 638–647.

Tbahriti, I. et al. (2006) Using argumentation to retrieve articles with similar citations. Int. J. Med. Inform., 75, 488–495.

Teufel, S. and Moens, M. (2002) Summarizing scientific articles: experiments with relevance and rhetorical status. Comput. Ling., 28, 409–445.

Teufel, S. et al. (2009) Towards domain-independent argumentative zoning: evidence from chemistry and computational linguistics. In Proceedings of EMNLP, pp. 1493–1502.

Tong, S. and Koller, D. (2001) Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2, 45–66.

Wilcoxon, F. (1945) Individual comparisons by ranking methods. Biomet. Bull., 1, 80–83.

Author notes

Associate Editor: Jonathan Wren
