Abstract

We aimed to determine the reproducibility of assessments made by independent reviewers of papers submitted for publication to clinical neuroscience journals and abstracts submitted for presentation at clinical neuroscience conferences. We studied two journals in which manuscripts were routinely assessed by two reviewers, and two conferences in which abstracts were routinely scored by multiple reviewers. Agreement between the reviewers as to whether manuscripts should be accepted, revised or rejected was not significantly greater than that expected by chance [κ = 0.08, 95% confidence interval (CI) –0.04 to 0.20] for 179 consecutive papers submitted to Journal A, and was poor (κ = 0.28, 0.12 to 0.40) for 116 papers submitted to Journal B. However, editors were very much more likely to publish papers when both reviewers recommended acceptance than when they disagreed or recommended rejection (Journal A, odds ratio = 73, 95% CI = 27 to 200; Journal B, 51, 17 to 155). There was little or no agreement between the reviewers as to the priority (low, medium, or high) for publication (Journal A, κ = –0.12, 95% CI –0.30 to 0.11; Journal B, κ = 0.27, 0.01 to 0.53). Abstracts submitted for presentation at the conferences were given a score of 1 (poor) to 6 (excellent) by multiple independent reviewers. For each conference, analysis of variance of the scores given to abstracts revealed that differences between individual abstracts accounted for only 10–20% of the total variance of the scores. Thus, although recommendations made by reviewers have considerable influence on the fate of both papers submitted to journals and abstracts submitted to conferences, agreement between reviewers in clinical neuroscience was little greater than would be expected by chance alone.

Introduction

Peer review is central to the process of modern science. It influences which projects get funded and where research is published. Although there is evidence that peer review improves the quality of reporting of the results of research (Locke, 1985; Gardner and Bond, 1990; Goodman et al., 1994), it is susceptible to several biases (Peters and Ceci, 1982; Maddox, 1992; Horrobin, 1996; Locke, 1988; Wenneras and Wold, 1997), and some have argued that it actually inhibits the dissemination of new ideas (Peters and Ceci, 1982; Horrobin, 1990). These shortcomings might be tolerable if the peer review process could be shown to be effective in maximizing the likelihood that research of the highest quality is funded and published. Unfortunately, there is no objective standard of quality of a scientific report or grant application against which the sensitivity or specificity of peer review can be assessed. However, the lack of a quality standard does not prevent measurement of the reproducibility of peer review. How often do independent referees agree about the quality of a paper or abstract? Quality is related to factors such as originality, appropriateness of methods, analysis of results, and whether the conclusions are justified by the data given. Consistency in these assessments should lead to some agreement about overall quality. Poor reproducibility casts doubt on the utility of any measurement, whether made quantitatively by an instrument or assay, or qualitatively by a reviewer assessing a manuscript or abstract.

The reproducibility of peer review has been studied in psychology (Scott, 1974; Cicchetti, 1980), the social sciences (McCartney, 1973) and the physical sciences (Cole et al., 1981), but the few systematic studies of peer review in the biomedical sciences have produced conflicting results (Ingelfinger, 1974; Locke, 1985; Strayhorn et al., 1993; Goodman et al., 1994; Scharschmidt et al., 1994). There have been no published studies of the reproducibility of peer review in neuroscience. We investigated the reproducibility of assessments made by independent reviewers of papers submitted to two journals in the field of clinical neuroscience and analysed the scores given by multiple assessors to abstracts submitted to two clinical neuroscience conferences.

Methods

Journal manuscripts

We wrote to the editors of five major clinical neuroscience journals requesting access to the assessments of manuscripts made by external reviewers. The editors of two of the journals were willing to allow this. Journal A sent all manuscripts to two independent reviewers; we studied a consecutive 6-month sample of 179 manuscripts. Journal B allowed us to study a consecutive series of 200 manuscripts, and we analysed reports on the 116 (58%) papers that had been assessed by two reviewers. The remainder had either been assessed by only one reviewer (n = 54) or only one of the two reviewers had completed the structured assessment (n = 30). Both journals required the reviewers to complete a structured assessment form as part of their review of the manuscript. In both cases, the reviewers were asked to assess (i) whether the manuscript should be accepted, revised or rejected, and (ii) whether the priority for publication was low, medium or high.

Agreement between the reviewers was calculated for each assessment. Agreement was expressed as a κ statistic (Thompson and Walter, 1988) rather than a simple percentage, in order to measure the extent to which agreement was greater than that expected by chance. A κ value of 0 represents chance agreement and a value of 1 indicates perfect agreement. Intermediate κ values are generally classified as follows: 0–0.2 = very poor; 0.2–0.4 = poor; 0.4–0.6 = moderate; 0.6–0.8 = good; 0.8–1.0 = excellent. Negative κ values indicate systematic disagreement, i.e. agreement worse than that expected by chance.
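
For reference, the κ statistic is calculated as κ = (P_o – P_e)/(1 – P_e), where P_o is the observed proportion of agreement and P_e is the proportion of agreement expected by chance from the two reviewers' marginal totals (P_o and P_e are notational shorthand introduced here purely for illustration). As a worked example using the Journal A marginal totals shown in Table 1, P_e = (24 × 11 + 89 × 95 + 66 × 73)/179² ≈ 0.423; with the observed agreement of 47%, κ = (0.470 – 0.423)/(1 – 0.423) ≈ 0.08, the value reported in the Results.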

Meeting abstracts

Scores given to abstracts by independent reviewers were obtained for two clinical neuroscience conferences. For both meetings, the majority of abstracts submitted for poster presentation only were accepted. Our analyses were, therefore, limited to abstracts submitted for oral presentation ('platform preferred'). The scoring of these abstracts determined the manner in which they were presented; abstracts with the highest mean scores were allocated time for an oral presentation whereas those with lower scores were accepted as a poster or were rejected. For both conferences, the reviewers were requested to give each abstract an integer score between 1 (poor quality, unsuitable for inclusion in the meeting) and 6 (excellent). They were asked to consider both the scientific merit of the work and the likely level of interest to the conference participants. The abstracts were scored by 16 reviewers for Meeting A and 14 reviewers for Meeting B. Abstract scores were analysed by ANOVA (analysis of variance) using the statistical package SPSS for Windows, Release 6.1. The contributions of abstract identity and reviewer identity to the total variance amongst the abstract scores were determined.
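
The original analyses were carried out in SPSS. Purely as an illustrative sketch of the same variance decomposition (not the analysis actually run), a two-way analysis of variance without an interaction term could be fitted in Python with statsmodels as follows; the column names and the toy scores below are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # Hypothetical long-format data: one row per (reviewer, abstract) score.
    scores = pd.DataFrame({
        "reviewer": ["R1", "R1", "R2", "R2", "R3", "R3"],
        "abstract": ["A1", "A2", "A1", "A2", "A1", "A2"],
        "score":    [4, 2, 5, 3, 3, 1],
    })

    # Two-way ANOVA without interaction: score ~ reviewer + abstract.
    model = smf.ols("score ~ C(reviewer) + C(abstract)", data=scores).fit()
    aov = anova_lm(model)

    # Express each source of variation as a proportion of the total sum of
    # squares (Table 3 reports adjusted r² values, which also take the
    # degrees of freedom into account).
    total_ss = aov["sum_sq"].sum()
    print(aov.assign(proportion_of_total=aov["sum_sq"] / total_ss))

The degrees of freedom in Table 3 imply a fully crossed design in which every reviewer scored every abstract (16 × 32 = 512 scores for Meeting A and 14 × 28 = 392 for Meeting B).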

Results

Journal manuscripts

Journal A accepted for publication 80 (45%) of the 179 papers submitted during the study period; Journal B accepted 47 (41%) of the 116 papers submitted during the study period. The reviewers for Journal A agreed on the recommendation for publication, or otherwise, for 47% of manuscripts and the reviewers for Journal B agreed for 61% of the manuscripts (Table 1). The corresponding κ values were 0.08 (95% CI = –0.04 to 0.20) and 0.28 (95% CI 0.12 to 0.40). The observed proportions of agreement are compared with the proportions that would have been expected by chance in Fig. 1.

For those manuscripts where both reviewers agreed that the paper was suitable for publication (with or without revision), they agreed on the priority for publication in 35% of cases for Journal A and 61% of cases for Journal B (Table 2). Corresponding κ values were –0.12 (95% CI = –0.30 to 0.11) and 0.27 (95% CI = 0.01 to 0.53). The observed proportions of agreement are compared with the proportions that would have been expected by chance in Fig. 1.

Manuscripts that both reviewers agreed were suitable for publication (with or without revision) were more likely to be published than those for which they disagreed or both recommended rejection (Fig. 2). Journal A published 66 (92%) of the 72 manuscripts that were recommended for publication by both reviewers compared with 14 (13%) of the 107 remaining manuscripts (odds ratio = 73, 95% CI = 27 to 200). Journal B published 40 (85%) of the 47 manuscripts recommended for publication by both reviewers compared with 7 (10%) of the 69 remaining manuscripts (odds ratio = 51, 95% CI = 17 to 155).
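
As an illustrative check of these figures, the odds of publication were 66/6 = 11 for the Journal A manuscripts recommended by both reviewers and 14/93 ≈ 0.15 for the remainder, giving an odds ratio of (66 × 93)/(6 × 14) ≈ 73; the corresponding calculation for Journal B is (40 × 62)/(7 × 7) ≈ 51.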

Meeting abstracts

Analysis of variance revealed statistically significant heterogeneity in the mean scores given to the abstracts submitted to the conferences (Meeting A, 32 abstracts, P < 0.001; Meeting B, 28 abstracts, P < 0.005). There were also significant differences between the mean scores given by different reviewers (Meeting A, 16 reviewers, P < 0.001; Meeting B, 14 reviewers, P < 0.001). Over a quarter of the variance in abstract scores (27% for Meeting A and 32% for Meeting B) could be accounted for by the tendency for some reviewers to give higher or lower scores than others (Table 3). Only a small proportion of the variance in abstract scores could be accounted for by differences between the mean scores given to abstracts (11% for Meeting A and 15% for Meeting B) (Table 3).
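
These percentages correspond to the adjusted r² values in Table 3; the unadjusted proportions are very similar (for Meeting A, for example, the reviewer sum of squares of 138.6 represents approximately 29% of the total of 475.5).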

Discussion

In neither of the journals that we studied was agreement between independent reviewers on whether manuscripts should be published, or their priority for publication, convincingly greater than that which would have been expected by chance alone. The scoring of conference abstracts by a larger number of independent reviewers did not lead to any greater consistency. In other words, the reproducibility of the peer review process in these instances was very poor. Although the journals and meetings which we studied were not chosen at random, we believe that they are likely to be representative of their type, i.e. specialist journals and meetings in clinical neuroscience.

Poor inter-observer reproducibility of peer review has been reported in several non-medical sciences (Scott, 1974; McCartney, 1973; Cicchetti, 1980; Cole et al., 1981), but the few previous studies of peer review for medical journals have produced conflicting results. Locke reported inter-observer κ values ranging from 0.11 to 0.49 for agreement between a number of reviewers making recommendations on a consecutive series of manuscripts submitted to the British Medical Journal during 1979 (Locke, 1985). Agreement was significantly greater than that expected by chance, but this may have been because the reviewers making the recommendations were professional journal editors. Strayhorn and colleagues reported a κ value of 0.12 (poor agreement) for the accept–reject dichotomy for 268 manuscripts submitted to the Journal of the American Academy of Child and Adolescent Psychiatry (Strayhorn et al., 1993). Similarly low levels of agreement were reported by Scharschmidt et al. for papers submitted to the Journal of Clinical Investigation (Scharschmidt et al., 1994).

Previous studies of agreement between reviewers in the grading of abstracts submitted to biomedical meetings have produced results similar to our own (Cicchetti and Conn, 1976; Rubin et al., 1993). Rubin and colleagues reported κ values ranging from 0.11 to 0.18 for agreement between individual reviewers, and found that differences between abstracts accounted for 36% of the total variance in abstract scores. It has even been shown that the likelihood of an abstract being accepted can be related to the typeface used (Koren, 1986).

We also found that the assessments made by reviewers were strongly predictive of whether or not manuscripts were accepted for publication. Manuscripts recommended for publication by both reviewers were roughly 50–70 times more likely to be accepted (odds ratios of 51 and 73) than those about which the reviewers disagreed or which both recommended rejecting.

Others have reported similar results (Locke, 1985; Scharschmidt et al., 1994). Although editors often report that they find the reviewers' comments of more use than their overall recommendations (Bailar and Patterson, 1985; Locke, 1985; Gardner and Bond, 1990; Goodman et al., 1994), our findings suggest that they do still rely on the reviewers' recommendations about suitability for publication. For scientific meetings, the grading given to abstracts by reviewers is often the sole determinant of whether they are accepted for presentation. Grant-awarding bodies also appear to place similar weight on the opinions of referees (Gillett, 1993).

Given this reliance on peer review, should we be concerned about the lack of reproducibility? Some authors have argued that poor reproducibility is not a problem, and that different reviewers should not necessarily be expected to agree (Locke, 1985; Bailar, 1991; Fletcher and Fletcher, 1993). For example, an editor might deliberately choose two reviewers who he or she knows are likely to have different points of view. This may be so, but if peer review is an attempt to measure the overall quality of research in terms of originality, the appropriateness of the methods used, analysis of the data, and justification of the conclusions, then a complete lack of reproducibility is a problem. These specific assessments should be relatively objective and hence reproducible.

Why then is the reproducibility of peer review so poor? There are several possibilities. First, some reviewers may not be certain about which aspects of the work they should be assessing. Secondly, some reviewers may not have the time, the knowledge or training required to assess research properly. When deliberately flawed papers are sent for review the proportion of major errors picked up by reviewers is certainly low (Godlee et al., 1998). Thirdly, it is possible that reviewers do agree on the more specific assessments of the quality of research, but that this consistency is undermined by personal opinions and biases. For example, assessments of reviewers have been shown to be biased by the fame of the authors or the institution in which the work was performed (Peters and Ceci, 1982), and by conflicts of interest due to friendship or competition and rivalry between the reviewer and the authors (Locke, 1988; Maddox, 1992). It has been shown that reviewers recommended by authors themselves give much more favourable assessments than those chosen by editors (Scharschmidt et al., 1994).

How might the quality and reproducibility of peer review be improved? Neither blinding reviewers to the authors and origin of the paper nor requiring them to sign their reports appears to have any effect on the quality of peer review (Godlee et al., 1998). However, the use of standardized assessment forms has been shown to increase agreement between reviewers (Strayhorn et al., 1993), and appears to be particularly important in the assessment of study methods, the analysis of data and the presentation of the results (Gardner et al., 1986). Editors might also consider publishing a short addendum to papers detailing the major comments of the reviewers, along with the reviewers' identities. However, it should be borne in mind that many researchers already spend as much time participating in peer review as they spend doing research (Gillett, 1993). Any increase in this considerable workload might be difficult to justify. Whether payment of reviewers for their reports, as is the practice of some journals, increases the quality of reports is unknown. This and other policies are quite amenable to testing in randomized controlled trials. Open peer review on the internet of articles submitted to journals is currently under investigation (Bingham et al., 1998).

Peer review of articles submitted to journals and abstracts submitted to meetings does achieve a number of important ends irrespective of whether it is reproducible. It helps those responsible to decide which papers or abstracts should be published or presented. The comments of reviewers do generally improve the quality of papers whether or not they are accepted for publication. Peer review also gives the impression that decisions are arrived at in a fair and meritocratic manner. Therefore, even if the results of peer review were essentially a reflection of chance, the process would still serve a useful purpose. However, given the biases inherent in peer review, its tendency to suppress innovation and the enormous cost of the process in terms of the time spent on the work by reviewers, the lack of reproducibility does cast some doubt on the overall utility of the process in its present form. Finally, many of the arguments that apply to peer review for journal articles and conference abstracts also apply to peer review of grant applications (Greenberg, 1998; Wessely, 1998). There is a need for further research into peer review in each of these areas.

Table 1

Agreement between two independent referees as to whether the manuscripts submitted to clinical neuroscience journals should be accepted without revision, accepted after revision or rejected

                          Reviewer 1
                      Accept   Accept if revised   Reject   Total
Journal A
Reviewer 2
  Accept                                                       11
  Accept if revised       15                  50       30      95
  Reject                                      36       31      73
  Total                   24                  89       66     179
Agreement = 47%, κ = 0.08 (95% CI –0.04 to 0.20)
Journal B
Reviewer 2
  Accept
  Accept if revised                           45       17      70
  Reject                                      12       25      41
  Total                   13                  60       43     116
Agreement = 61%, κ = 0.28 (95% CI 0.12 to 0.40)
Table 2

Assessments of two independent reviewers of the priority for publication of those papers submitted to clinical neuroscience journals where both reviewers recommended acceptance

                      Reviewer 1
                  High   Medium   Low   Total
Journal A
Reviewer 2
  High                                     11
  Medium            11       16           30
  Low                                      13
  Total             19       31           54
Agreement = 35%, κ = –0.12 (95% CI –0.30 to 0.11)
Journal B
Reviewer 2
  High
  Medium                     22            32
  Low
  Total             10       30            49
Agreement = 61%, κ = 0.27 (95% CI 0.01 to 0.53)
Table 3

Analysis of variance of the scores given by reviewers to abstracts submitted to Meetings A and B

Source of variation   Sum of squares   Degrees of freedom   Adjusted r²   P
Meeting A
  Reviewer                     138.6                   15          0.27   <0.001
  Abstract                      78.3                   31          0.11   <0.001
  Residual                     258.6                  465
  Total                        475.5                  511
Meeting B
  Reviewer                     124.4                   13          0.32   <0.001
  Abstract                      64.1                   27          0.15   <0.001
  Residual                     188.4                  351
  Total                        376.9                  391
Fig. 1

Agreement between independent reviewers on the assessment of manuscripts submitted to two journals of clinical neuroscience. Reviewers were asked to assess whether manuscripts should be accepted, revised or rejected (Manuscript acceptance) and, if suitable for publication, whether their priority was low, medium or high (Manuscript priority). The observed agreements are compared with the level of agreement expected by chance. The error bars show the 95% confidence intervals.


Fig. 2

The proportions of manuscripts submitted to two clinical neuroscience journals that were accepted for publication according to whether two independent reviewers both recommended acceptance, disagreed, or both recommended rejection. The error bars show the 95% confidence intervals.


We wish to thank Professor R. A. C. Hughes, Professor Jan van Gijn, Mrs E.B.M. Budelman-Verschuren, Suzanne Miller and Chris Holland for their help and collaboration. We are grateful to Kathy Rowan, Catharine Gale and Paul Winter for administrative help.

References

Bailar JC. Reliability, fairness, objectivity and other inappropriate goals in peer review. Behav Brain Sci 1991; 14: 137–8.
Bailar JC 3d, Patterson K. The need for a research agenda. N Engl J Med 1985; 312: 654–7.
Bingham CM, Higgins G, Coleman R, Van der Weyden MB. The Medical Journal of Australia internet peer-review study. Lancet 1998; 352: 441–5.
Cicchetti DV. Reliability of review for the American Psychologist: a biostatistical assessment of the data. Am Psychol 1980; 35: 300–3.
Cicchetti DV, Conn HO. A statistical analysis of reviewer agreement and bias in evaluating medical abstracts. Yale J Biol Med 1976; 49: 373–83.
Cole S, Cole JR, Simon GA. Chance and consensus in peer review. Science 1981; 214: 881–6.
Fletcher RH, Fletcher SW. Who's responsible? [editorial]. Ann Intern Med 1993; 118: 645–6.
Gardner MJ, Bond J. An exploratory study of statistical assessment of papers published in the British Medical Journal. JAMA 1990; 263: 1355–7.
Gardner MJ, Machin D, Campbell MJ. Use of check lists in assessing the statistical content of medical studies. Br Med J 1986; 292: 810–2.
Gillett R. Prescriptions for medical research: II. Is medical research well served by peer review? Br Med J 1993; 306: 1672–5.
Godlee F, Gale CR, Martyn CN. Effect on the quality of peer review of blinding reviewers and asking them to sign their reports: a randomized controlled trial. JAMA 1998; 280: 237–40.
Goodman SN, Berlin J, Fletcher SW, Fletcher RH. Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Ann Intern Med 1994; 121: 11–21.
Greenberg DS. Chance and grants. Lancet 1998; 351: 686.
Horrobin DF. The philosophical basis of peer review and the suppression of innovation. JAMA 1990; 263: 1438–41.
Horrobin DF. Peer review of grant applications: a harbinger for mediocrity in clinical research? Lancet 1996; 348: 1293–5.
Ingelfinger FJ. Peer review in biomedical publication. Am J Med 1974; 56: 686–92.
Koren G. A simple way to improve the chances for acceptance of your scientific paper [letter]. N Engl J Med 1986; 315: 1298.
Locke S. A difficult balance: editorial peer review in medicine. London: Nuffield Provincial Hospitals Trust; 1985.
Locke S. Fraud in medicine [editorial]. Br Med J 1988; 296: 376–7.
Maddox J. Conflicts of interest declared [news]. Nature 1992; 360: 205.
McCartney JL. Manuscript reviewing. Sociol Q 1973; 14: 440–6.
Peters DP, Ceci SJ. Peer review practices of psychological journals: the fate of published articles, submitted again. Behav Brain Sci 1982; 5: 187–255.
Rubin HR, Redelmeier DA, Wu AW, Steinberg EP. How reliable is peer review of scientific abstracts? J Gen Intern Med 1993; 8: 255–8.
Scharschmidt BF, DeAmicis A, Bacchetti P, Held MJ. Chance, concurrence and clustering: analysis of reviewers' recommendations on 1000 submissions to the Journal of Clinical Investigation. J Clin Invest 1994; 93: 1877–80.
Scott WA. Interreferee agreement on some characteristics of manuscripts submitted to the Journal of Personality and Social Psychology. Am Psychol 1974; 29: 698–702.
Strayhorn J Jr, McDermott JF Jr, Tanguay P. An intervention to improve the reliability of manuscript reviews for the Journal of the American Academy of Child and Adolescent Psychiatry. Am J Psychiatry 1993; 150: 947–52.
Thompson WG, Walter DW. A reappraisal of the kappa coefficient. J Clin Epidemiol 1988; 41: 949–58.
Wenneras C, Wold A. Nepotism and sexism in peer-review. Nature 1997; 387: 341–3.
Wessely S. Peer review of grant applications: what do we know? [Review]. Lancet 1998; 352: 301–5.