Theodoros Baimpos, Nils Dittel, Roumen Borissov, Unravelling the panel contribution upon peer review evaluation of numerous, unstructured and highly interdisciplinary research proposals, Research Evaluation, Volume 29, Issue 3, July 2020, Pages 316–326, https://doi.org/10.1093/reseval/rvz013
Abstract
In this study, we analyze the two-phase bottom-up procedure applied by the Future and Emerging Technologies Program (FET-Open) at the Research Executive Agency (REA) of the European Commission (EC) for the evaluation of highly interdisciplinary, multi-beneficiary research proposals requesting funding. In the first phase, remote experts assess the proposals and draft comments addressing the evaluation criteria pre-defined by FET-Open. In the second phase, a new set of additional experts (with more general expertise and different from the remote ones), after cross-reading the proposals and their remote evaluation reports, convene in an on-site panel where they discuss the proposals. They complete the evaluation by reinforcing, per proposal and per criterion, one or another of the assessments assigned remotely during the first phase. We analyze the level of inter-rater agreement among the remote experts and examine how it relates to the proposals that are eventually funded. Our study also provides comparative figures of the evolution of the proposals' scores during the two phases of the evaluation process. Finally, by carrying out an appropriate quantitative and qualitative analysis of all scores from the seven past cut-offs, we elaborate on the significant contribution of the panel (the second phase of the evaluation) in identifying and promoting the best proposals for funding.
Introduction
The peer review approach for evaluating various types of scientific work (grant proposals, publications in scientific journals, textbooks, etc.) has been widely criticized (Jayasinghe, Marsh and Bond 2003; Bornmann and Daniel 2005; Demicheli and Di Pietrantonj 2007; Obrecht, Tibelius and D'Aloisio 2007) for certain drawbacks, such as restraining innovation, exposing private ideas through breaches of confidentiality, lacking reliability, and wasting the time and energy of the best scientists (Horrobin 1990; Stehbens 1990; Hodgson 1997; Weber 2002; Langfeldt and Kyvik 2011). However, peer review is still the most commonly used method and provides an effective, well-established way of assessing the quality of this type of work.
In the case of grant proposals (Wessely 1998; Demicheli and Di Pietrantonj 2007), the utility of the approach rests on the ability of independent experts to evaluate scientific or technological proposals with a sufficient level of objectivity. The experts draft evaluation reports by assessing the content of each proposal against evaluation criteria set up by the authority managing the call for proposals. A common feature of the different approaches used for the evaluation of grant applications is that they all include, as a minimum, an individual assessment by independent experts. In this initial phase, each of a certain number of experts independently evaluates, in a structured manner, the proposals assigned to them on the basis of their scientific expertise. If the remote phase is the only phase of the evaluation process, this approach allows in principle the appointment and involvement of the 'best possible' experts available in the pool for each proposal. In this case, the proposals' final ranking/score is based on the individual evaluation reports of the independent experts and on their accompanying individual scores. However, the main drawback of this approach is that when there is no interaction between the experts, and especially when the subject is scientifically novel and interdisciplinary, the lack of a proper discussion among them may lead to misunderstandings or to insufficient appreciation of the proposer's ideas.
An improvement of the remote evaluation phase is the 'panel discussion', in which a group of experts, after some sort of remote and independent preparation, convene and discuss the merits of the proposals (Olbrecht and Bornmann 2010). There are many varieties of this second (panel) phase. Usually, funding organizations use the combination of a remote and a panel phase best suited and adapted to their needs and available resources. The funding decision in such a case is based on some sort of consensus report and on the fine-tuning of scores after the panel discussions. Compared to existing studies on inter-reviewer variability and reliability, specific research on panel evaluation is relatively scarce (Cicchetti 1991; Demicheli and Di Pietrantonj 2007; Marsh et al. 2008). Still, there are suggestions (e.g. Obrecht, Tibelius and D'Aloisio 2007) that panel discussions of research applications do not appear to increase or positively contribute to the fairness of the assessment, at least in most cases (see also Pina, Hren and Marusic 2015).
Independently of the methodology applied for the evaluation of proposals, the feedback to the applicants (especially in case of non-funding) is extremely important (Weber 2002), since it provides them with comprehensive information regarding the strong and weak points of their proposed activities for each criterion. Thus, in case of resubmission to future calls, after proper modifications and adjustments, the proposal may stand a better chance of success.
The Future and Emerging Technologies (FET) program aims at creating in Europe a fertile ground for responsible and dynamic interdisciplinary collaborations on new technological ideas and for kick-starting new European research and innovation eco-systems around these ideas. These eco-systems would provide seeds for future industrial leadership and for tackling society's grand challenges in new ways. FET-Open (one of the two main components of the program, the other being FET-Proactive) is a bottom-up scheme that supports the early stages of science and technology research and innovation. Multi-beneficiary proposals (at least three partners) are sought, encouraging collaborative research among scientists and organizations on cutting-edge, high-risk/high-impact interdisciplinary research satisfying all of the following essential characteristics (the so-called 'FET gatekeepers'): radical vision, breakthrough technological target, and ambitious interdisciplinary research. There are no thematic restrictions in FET-Open. All eligible applications from any scientific field are accepted and treated/evaluated on equal terms.
In this study, after a brief presentation of the two-phase (remote and panel) evaluation approach followed in the FET-Open funding scheme, we analyze the results of the last seven evaluation sessions (covering the period 2014–2018) and compare the relative scores between these two phases. Our focus is also on the level of inter-reviewer agreement during the remote phase (first phase) and on the role of the panel (second phase) in the final ranking of the proposals, which defines the granting list.
Methods
The FET-Open evaluation of multi-beneficiary, interdisciplinary research proposals involves two phases:
A remote phase, performed by four independent external evaluators per proposal, who draft Individual Evaluation Reports (IERs) and assign a score for each of the three evaluation criteria (Excellence, Impact, Implementation), and
A cross-reading/panel discussion phase, carried out by a different set of four external experts per proposal, who, after having read the proposals and their IERs, play a decisive role in the fine-tuning of the proposals' final scores.
Certain constraints have been taken into account in the design of the evaluation process:
There is a strict 15-page limit on the proposal's length, within which all the necessary information addressing the three main evaluation criteria should be included.
The proposals comprise one large unstructured set: at the submission stage there is no self-allocation to specific scientific topics or clusters. Thus, any potential impact of an up-front structuring of the proposals on the required interdisciplinarity is avoided.
The number of submitted proposals is relatively large—in the reported cut-off evaluations, there have been between 350 and 800 proposals per evaluation session/period.
The evaluation is against the set of the three aforementioned pre-defined criteria, each having several sub-criteria covering all scientific/technical aspects.
The evaluation process has to be completed within no more than 4 months.
First phase (Remote Evaluation): For the remote phase, a pool of potential evaluators is prepared, with the number of experts being roughly three times the number of expected proposals. The involvement of the best external experts from a wide range of disciplines and backgrounds is ensured, while their appointment is carried out avoiding conflicts of interest and ensuring diversity (gender, geography). After the submission deadline (cut-off date), four experts are assigned to each proposal to carry out the remote evaluation. In contrast to other evaluation procedures (Solans-Domènech 2017), the FET-Open evaluation approach is not blind. The evaluators have information about the partners (institutions) comprising the proposal's consortium as well as about the key investigators within each partner/entity. This is necessary because the evaluators, under 'Implementation', have to assess the operational capacity of the consortium, namely to judge whether it has the necessary expertise to tackle all the proposed tasks/activities planned to be performed during the project execution. Each remote evaluator drafts comments per sub-criterion, and these comments are then 'translated' (by the expert) into scores per criterion (Excellence, Impact, Implementation) using a concrete correspondence scoring table. The scores range from 1 (poor, lowest merit) to 5 (excellent, highest merit), with the possibility of half-integer scores. The combined set of comments (per sub-criterion) and scores (per criterion) represents the value assigned to the proposal by each remote evaluator in his/her IER. During the remote evaluation phase, the format of each submitted IER is quality checked by a quality controller (QC). QCs are (like the remote evaluators and cross-readers) external experts who check and verify, per IER, whether each sub-criterion has been assessed, whether the comments correspond to the sub-criterion under which they are entered, and whether the correspondence between comments and scores respects the common scoring table applied to all European Commission (EC) funding programs. It should be emphasized that the role and contribution of the QCs is restricted to the format of the IER rather than its scientific content, a distinction closely monitored by the Research Executive Agency (REA) staff. The main result of the remote evaluation phase is that each proposal is assigned a set of comments (four IERs) covering all sub-criteria and a set of scores per criterion. The final remote score (Scremote) of each proposal is calculated as the weighted average of the scores on the three criteria.
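To make this aggregation concrete, the following is a minimal sketch (in Python, not the official FET-Open tooling) of how such a remote score could be computed. It assumes that the per-criterion score is the median of the four remote scores (as indicated in the captions of Figures 4 and 6) and uses the criterion weights given in the description of the second phase below (Excellence 60%, Impact 20%, Implementation 20%); the function name and the example scores are purely illustrative.

```python
from statistics import median

# Criterion weights as stated later in the text (Excellence 60%, Impact 20%, Implementation 20%).
WEIGHTS = {"Excellence": 0.6, "Impact": 0.2, "Implementation": 0.2}

def remote_score(scores_per_criterion):
    """scores_per_criterion: dict mapping criterion -> list of four remote scores (1-5, half steps)."""
    per_criterion = {c: median(s) for c, s in scores_per_criterion.items()}
    return sum(WEIGHTS[c] * per_criterion[c] for c in WEIGHTS)

# Hypothetical proposal:
print(round(remote_score({
    "Excellence": [4.5, 4.0, 5.0, 3.5],      # median 4.25
    "Impact": [4.0, 4.0, 3.5, 4.5],          # median 4.0
    "Implementation": [3.5, 4.0, 4.0, 4.5],  # median 4.0
}), 2))  # 0.6*4.25 + 0.2*4.0 + 0.2*4.0 = 4.15
```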
Second phase (Cross-reading and panel discussion): In the second phase, a new set of four additional external experts per proposal is assigned. Each of these experts (known also as Cross-Readers or CRs) has a more general scientific expertise. They participate in the on-site panel discussion, where they validate and finalize the evaluation procedure and the list of funded proposals.
The specific novel approach of this second phase lies in the structured cross-reading and the fine-tuning of the scores. Each CR reads a set of proposals (roughly 25 proposals per CR on average) together with the accompanying four IERs per proposal, as written by the remote evaluators during the first phase. Similarly to the remote phase, the assessment in the second phase is also not blind. All relevant information about the key experts within the applying consortium is available to the CRs, who are also aware of the identities of the remote evaluators who drafted the IERs. Each CR, on the basis of his/her personal, independent reading, selects per criterion the IER comments which, in his/her opinion, provide the best assessment of the quality and merit of the proposal for that criterion. Once the CR reinforces the remote evaluation for a given criterion by selecting a specific set of comments, this selection is automatically accompanied by the score assigned by the corresponding remote evaluator, thus refining the approximation of the 'true value' of the proposal and contributing to the calculation of its final score. It should be emphasized that a CR can neither give a brand-new independent score nor provide new comments. A CR selects the most appropriate remote evaluation comments (per criterion), accompanied by their corresponding score. The four CRs per proposal act more as validators of the correctness of the remote evaluation than as evaluators, since by design the remote experts (in the great majority of cases) provide better coverage of the necessary expertise. The CRs' objective is to correct possible deficiencies arising from the remote phase without, however, re-evaluating the proposals or overwriting the original assessment. This is a major difference compared to other approaches (e.g. the consensus approach), where the same set of remote experts meet on-site and smooth out most of their differences in order to produce a common consensus report. Bearing in mind the scientific breakthrough nature of the FET-Open proposals and mainly their interdisciplinary character, a scenario where independent evaluators of the same proposal have different or even, in some cases, entirely opposite opinions regarding the merit of the proposal is considered a normal and acceptable situation. At the end of the panel discussion, the final score per criterion is computed as the median of the eight scores assigned: four from the remote evaluators plus four (taken from among the original remote score values) as reinforced by the CRs' selection of comments. Then, the overall final score (1–5) is computed as the weighted average of the three scores per criterion, with the use of pre-defined weights (Excellence 60%, Impact 20%, Implementation 20%). The output of the panel review is a list of proposals recommended for funding, with each proposal accompanied by its Evaluation Summary Report (ESR). Unlike in many other programs, the ESR of FET-Open is not the result of a consensus discussion seeking smoothed, agreed-upon comments but comprises the four collated individual remote IERs plus, possibly, a short consensus panel comment drafted during the on-site discussion. It should also be emphasized that, because of the breakthrough scientific nature and the required interdisciplinarity of the proposals, a consensus report may not be a very appropriate way of presenting the evaluators' assessment of the proposals.
A feature of the above scoring mechanism that is worth emphasizing is the following: in most cases, the outliers from the remote evaluation will not enter the final score and thus will not affect the proposal's final ranking. However, if all four CRs decide to select the comments corresponding to an outlier, the outlier becomes the median, and the final score of the proposal will be completely different from the one obtained just after the remote evaluation. This may happen only in exceptional cases, though, when, for instance, only one of the remote evaluators spotted an important (usually disqualifying) feature of a proposal but the CRs were able to recognize and properly take this feature into account through their choices. Thus, while for less controversial cases the FET-Open approach produces an outcome which would not be very different from the result of more conventional approaches (such as using the average or a consensus score), in cases of divergent opinions the final score can be very different from the average of the remote scores.
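The median-of-eight mechanism described above can be sketched as follows. This is a hedged illustration under the stated assumptions, not the agency's implementation; the function names, the encoding of the CRs' choices as indices, and the example scores are ours.

```python
from statistics import median

# Pre-defined criterion weights (Excellence 60%, Impact 20%, Implementation 20%).
WEIGHTS = {"Excellence": 0.6, "Impact": 0.2, "Implementation": 0.2}

def panel_criterion_score(remote_scores, cr_choices):
    """remote_scores: the four remote scores for one criterion (1-5, half steps).
    cr_choices: for each of the four CRs, the index (0-3) of the reinforced IER."""
    reinforced = [remote_scores[i] for i in cr_choices]
    return median(remote_scores + reinforced)  # median of eight values

def final_score(remote, choices):
    """Weighted average of the per-criterion medians of the eight scores."""
    return sum(w * panel_criterion_score(remote[c], choices[c]) for c, w in WEIGHTS.items())

# One criterion with a single low outlier among the remote scores:
scores = [3.0, 4.5, 4.5, 5.0]
print(panel_criterion_score(scores, [1, 2, 3, 2]))  # CRs back the high IERs -> 4.5
print(panel_criterion_score(scores, [0, 0, 0, 0]))  # all CRs back the outlier -> 3.0
```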
In the Supplementary information, we provide more specific details on the main stages of the FET-Open evaluation, such as the preparation of the pool of potential remote evaluators, the submission of proposals, the first phase (remote evaluation), and the second phase (cross-reading and panel discussion).
Results
From September 2014 to May 2018, within seven submission sessions, 3,764 proposals were evaluated, out of which 160 were finally funded. Table 1 presents summarized information, such as the submission deadline (cut-off date), the available budget, the total number of proposals, and the calculated success rate.
Table 1. Relevant information (cut-off date, budget, number of proposals evaluated/funded, success rate) for the seven evaluation sessions from which the data of the current study have been extracted.
| Cut-off date | 1st: Sept. '14 | 2nd: Mar. '15 | 3rd: Sept. '15 | 4th: May '16 | 5th: Jan. '17 | 6th: Sept. '17 | 7th: May '18 | TOTAL |
|---|---|---|---|---|---|---|---|---|
| Budget (M€) | 78 | 41 | 38 | 88 | 85 | 85 | 123 | 538 |
| Evaluated proposals | 639 | 665 | 800 | 544 | 365 | 395 | 356 | 3,764 |
| Funded proposals | 24 | 11 | 11 | 22 | 26 | 27 | 39 | 160 |
| Success rate (%) | 3.8 | 1.7 | 1.4 | 4.0 | 7.1 | 6.8 | 10.7 | 5.1 (average) |
The relatively low funding budget available for the first few sessions, combined with the extremely high interest shown by proposers in FET-Open funds (e.g. 805 proposals submitted at the third cut-off date), resulted in initially very low success rates. Clearly, such a situation posed questions regarding the viability of the program, since the financial and organizational costs of performing the evaluation process are disproportionally high. However, after the 4th evaluation session the participation stabilized at around 400 proposals (420 proposals were submitted in the ongoing evaluation session with cut-off date 24/01/2019), and since the available funds also increased, the success rate reached about 10%, a normal rate for a highly competitive program. Additionally, the use of four remote evaluators per proposal (instead of the three used in many other programs), necessitated by the expected interdisciplinarity of the proposals, by itself causes a substantial increase in the evaluation's total cost. However, the fact that a much smaller number of experts (compared to the high number of submitted proposals) participates in the cross-reading and on-site panel phase balances the higher expenses of the remote phase, so that, in the end, the overall spending on the FET-Open evaluation is close to the REA average.
Before the discussion of our results, it should be noted that a proposal has a real chance of ending up in the funding range only if, among the four scores from the remote evaluators under 'Excellence' (the most important criterion, with a 60% weight in the calculation of the final score), there is at least one remote score of 4.5 or 5. It is only these proposals that are later cross-read by the four additional CRs and discussed in detail during the panel (second phase). The remaining proposals with a low score (not a single 4.5 or 5) under 'Excellence' are cross-read only by two CRs, whose main role is to check for possible inconsistencies and to draft the panel comment. The number of proposals falling in the first category (cross-read by four CRs) is 2,822 out of the 3,764 submitted in total (75%). Naturally, only the scores of these proposals have been analyzed for creating the comparative figures (between the remote and the panel phase) of this study.
Impact on the final score from divergent opinions between the remote evaluators
Inter-rater agreement (AD index)
In order to assess the within-group inter-rater agreement among the four remote experts who evaluated a given proposal, the 'Average Deviation' (AD) index was calculated (Burke, Finkelstein and Dusig 1999). AD is considered a pragmatic, simple, and straightforward measure of disagreement that takes into consideration the average difference between the scores of individual raters and the average scores of all raters. AD does not require the specification of a null distribution and estimates inter-rater disagreement in the units of the original scale (Pina, Hren and Marusic 2015). There has been a long discussion about which of the available indices used in similar cases (AD, intraclass correlation (ICC), or within-group agreement index (rwg)) performs better and describes inter-rater agreement more accurately (James, Demaree and Wolf 1993; Mutz, Bornmann and Daniel 2012; Smith-Crowe et al. 2014). According to simulation research (Smith-Crowe, Burke and Kouchaki 2013; Smith-Crowe et al. 2014), the AD index has been shown to perform better (Kline and Hambley 2007; Roberson, Sturman and Simons 2007). Additionally, the fact that ICC measures both agreement and reliability simultaneously (LeBreton and Senter 2008) might potentially complicate inferences.
$$AD_j = \frac{\sum_{n=1}^{N}\left|x_{nj} - Md_j\right|}{N}$$

where ADj is the 'Average Deviation' from the median computed for a proposal j, N is the number of judges or observations (evaluators in our case), xnj is the nth single evaluator's score on proposal j, and Mdj is the median for proposal j. For the four remote scores of a FET-Open proposal this becomes

$$AD_j = \frac{\left|Sc_1 - Md_j\right| + \left|Sc_2 - Md_j\right| + \left|Sc_3 - Md_j\right| + \left|Sc_4 - Md_j\right|}{4}$$

where Sc# is the remote score given on a single proposal by each one of the four remote experts during the first phase of the process. Smaller values of AD correspond to better agreement between the four remote experts, while AD = 0 represents perfect agreement between them.
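A minimal computational sketch of the AD index defined above (Python; the function name and the example scores are illustrative only):

```python
from statistics import median

def ad_index(scores):
    """Average absolute deviation of a proposal's four remote scores from their median."""
    md = median(scores)
    return sum(abs(s - md) for s in scores) / len(scores)

# Hypothetical proposal scored 4.0, 4.5, 4.5 and 3.0 by the four remote experts:
print(ad_index([4.0, 4.5, 4.5, 3.0]))  # median 4.25 -> AD = (0.25 + 0.25 + 0.25 + 1.25) / 4 = 0.5
```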
Figure 1 presents the calculated AD index resulting from the four remote scores per proposal (the number of proposals per bin is shown on the y-axis) over a range of 0–1.5 in steps of 0.025. The positively (right) skewed distribution gives an overall median AD index for all proposals of 0.45 (on the original 1–5 scale), showing very good inter-rater agreement. The great majority of the proposals (80%) have an AD index below 0.7. In another study, Pina, Hren and Marusic (2015) calculated an overall median AD of 5.4 (on a 0–100 scale) for the Marie Sklodowska-Curie Actions (MSCA), another funding mechanism of the EC. Projected onto our scale, this would be 0.27. However, it should be taken into consideration that the two evaluation processes differ not only in the number of evaluators (in MSCA there are three remote evaluators instead of four) but, most importantly, in that MSCA has clearly pre-defined clusters at submission (e.g. Life Sciences, Physics, Mathematics, Chemistry, etc.). The smaller number of evaluators, combined with the pre-defined clustering, makes agreement between MSCA experts of the same cluster much more probable; the highly interdisciplinary character of FET-Open proposals further contributes to this difference. Our calculated AD index is lower than the values published for all different kinds of null distributions (Smith-Crowe, Burke and Kouchaki 2013). To the best of our knowledge, there is no other published study on such highly interdisciplinary proposals with which our results could be compared in a straightforward way.

Figure 1. Positively (right) skewed distribution of Average Deviation (AD) indices for all FET-Open evaluated proposals (2014–2018, seven evaluation sessions).
Inter-rater agreement (how many agree with each other)
To further analyze the inter-rater agreement during the first (remote) phase, all proposals were categorized into four categories (Figure 2). The categorization was based on the number of evaluators who agreed among themselves on the score given to the proposal just after the completion of the remote phase. An evaluator is considered to disagree when the score he/she gives to a proposal differs in absolute value by more than 0.5 (the smallest score step a remote evaluator can assign) from the median score Md (on the 1–5 scale) calculated from all four remote scores of that proposal.
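This categorization can be sketched as follows (a hedged illustration of the definition above; the helper name and example scores are hypothetical, and mapping fewer than two agreeing evaluators to 'all disagree' is our assumption):

```python
from statistics import median

def agreement_category(scores, tol=0.5):
    """Count how many of the four remote scores lie within 0.5 of their median."""
    md = median(scores)
    n_agree = sum(1 for s in scores if abs(s - md) <= tol)
    return {4: "all agree", 3: "3 agree", 2: "2 agree"}.get(n_agree, "all disagree")

print(agreement_category([4.5, 4.5, 4.0, 4.5]))  # 'all agree'    (median 4.5)
print(agreement_category([4.5, 4.0, 3.0, 2.5]))  # '2 agree'      (median 3.5)
print(agreement_category([2.0, 3.0, 4.5, 5.0]))  # 'all disagree' (median 3.75)
```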

Figure 2. (a) Proposal distribution as a function of inter-rater agreement after the remote phase (different shades of grey denote the '3 agree', '2 agree', and 'all disagree' cases, while 'all agree' is in black). Disagreement is defined as the score of one evaluator deviating in absolute value by more than 0.5 (scale 1–5) from the median score (Md) calculated from all four raters of the same proposal. The percentage distribution for each inter-rater agreement case is also shown. (b) Distribution of all 160 funded FET-Open proposals according to their initial categorization resulting from the remote phase (in parentheses, the absolute number of funded projects per case).
Figure 2a presents, for each of the four categories of inter-rater agreement, the cumulative number of proposals and the corresponding percentage of all evaluated proposals. Black is used to represent the 'all agree' case, in which all four remote evaluators agree with each other. Different shades of grey illustrate the lower agreement cases ('3 agree' in dark grey, '2 agree' in grey, and 'all disagree' in the lightest grey).
The most common inter-rater agreement case is '3 agree' (1,563 proposals, 42%); the '2 agree' case reaches 31% (1,178 proposals) and 'all agree' 18% (667 proposals), while the lowest percentage is that of 'all disagree' (9%, 356 proposals).
One possible explanation for the disagreement between evaluators is the expected highly interdisciplinary nature of the submitted proposals (ambitious interdisciplinary research is one of the FET gatekeepers). As a result, an evaluator with a certain expertise could evaluate and score the same proposal differently from another evaluator who is an expert in a different field, both scientific fields being present in the same proposal. Another explanation may be the lack of comparability between reviewers and procedures due to different backgrounds and past evaluation experience in other programs, rather than real differences of opinion. For instance, Hodgson (1997) considered grant proposals sent to two different independent funding agencies, which used similar scoring systems and instructed their reviewers independently. The results showed disagreement between the two agencies on the fundability of the same proposals. Consequently, there can be divergent opinions in the different experts' assessments and, as a result, a less robust outcome.
However, the combined result for the 'all agree' and '3 agree' cases represents a very satisfactory figure of 60%, while 'all disagree' occurs in only 9% of the cases, showing good internal consistency and overall high agreement among the remote evaluators despite the enhanced interdisciplinarity of the research ideas within the same proposal.
Impact from the panel discussion
Relation between the inter-rater agreement and the panel outcome
An interesting question related to Figure 2a is: what are the chances of a proposal that has been assessed and scored during the remote phase by a set of low-agreement experts ('2 agree' or 'all disagree') to eventually become successful and get funded? According to Figure 2a, these two low-agreement cases correspond to 40% of the remotely evaluated proposals. To answer this question, we distributed all 160 FET-Open proposals selected for funding (from all seven evaluation sessions) according to their initial (post-remote-phase) inter-rater agreement. Figure 2b presents this distribution. The absolute numbers of funded projects per case are also shown in parentheses. In particular, 71 projects (44%) belong to the 'all agree' case and 75 projects (47%) to the '3 agree' case, while 14 funded projects out of the total 160 (9%) belong to the lower agreement cases ('2 agree' and 'all disagree'). This last number (9%) shows that, even if a proposal falls within the two lower inter-rater agreement cases after the remote phase, it still has a chance and may end up among the finally funded proposals. Such a scenario would not be possible at all if remote evaluation were the only part of the evaluation process, and would be less likely if the same remote experts had been invited to on-site consensus meetings. This result also shows that, even though the panel does not change any of the original comments and scores, it can still substantially change, in a logically sound and structured way, the final outcome for some proposals, even when the disagreement among the remote evaluators is substantial and their scores are scattered.
Comparison between the remote and the panel (final) score
Figure 3 presents, for all 2,822 proposals evaluated in both phases of the seven evaluation sessions (sessions are separated by vertical dotted lines), their calculated ΔSc values, i.e. the relative change between the panel and the remote score (each symbol represents one proposal). Proposals with ΔSc < 0 (Scpanel < Scremote) are depicted as open grey triangles, proposals with ΔSc > 0 (Scpanel > Scremote) as open grey circles, while proposals whose scores did not change between the two phases, ΔSc = 0 (Scpanel = Scremote), are shown in black and placed along the zero line. An important observation in Figure 3 is that across all seven evaluation sessions there is a consistent distribution of the relative change of the scores between the remote and the panel phase. The figure also presents the overall percentage for each of these three cases. In particular, 60% of the proposals have Scpanel < Scremote, 24% have Scpanel > Scremote, while the remaining 16% correspond to no change of their score (ΔSc = 0). This shows that, in total, the majority of the proposals have their overall score lowered after the panel discussion. Once again it should be emphasized that this does not happen because the cross-readers assign new scores; they simply reinforce one or another set of remote comments and scores.

Figure 3. Relative change ΔSc (%) of the two scores (Scpanel, Scremote) given to each evaluated proposal. Each point represents a single proposal. The overall percentage distribution of ΔSc (%) across the three cases is also shown.
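A small sketch of the quantity plotted in Figure 3, under the assumption (ours) that ΔSc (%) is the relative change of the panel score with respect to the remote score; the function names and the example values (taken from the case discussed below) are illustrative:

```python
def delta_sc_percent(sc_remote, sc_panel):
    """Relative change of the score between the two phases, in percent."""
    return 100.0 * (sc_panel - sc_remote) / sc_remote

def classify(sc_remote, sc_panel):
    """Three-way classification used in Figure 3."""
    if sc_panel < sc_remote:
        return "panel lower"    # ~60% of the proposals in this study
    if sc_panel > sc_remote:
        return "panel higher"   # ~24%
    return "no change"          # ~16%

print(round(delta_sc_percent(4.25, 4.85), 1), classify(4.25, 4.85))  # 14.1 panel higher
```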
Figure 4 provides an illustration, complementary to Figure 3, of the change between the two scores (Scremote and Scpanel) for each proposal. Every dot represents a single proposal out of the 2,822 that were read by eight experts (four remote evaluators who evaluated and four CRs/panel members who validated). The position of each proposal is defined by the score it received after the first phase (remote, x-axis) and the second phase (panel, y-axis). The bold black square-dot line indicates the cases in which there has been no change between the scores (Scremote = Scpanel). Similarly to Figure 3, it can be observed that the panel has a general tendency to select more 'negative' comments and thus to decrease the score given after the completion of the remote phase, since a larger population of dots/proposals lies below the x = y line. The two dashed lines, drawn parallel to and at the same distance from the x = y line, show that the vast majority of increased scores falls within half a point of change, while there are quite a few cases in which the decrease is bigger than half a point. There are also cases of extreme changes between the two phases, in which the panel either substantially increased or substantially decreased the score given in the remote phase. For instance, one proposal received a score of 4.6 after the remote phase but was downgraded to only 2.9 after the panel discussions. On the other hand, a proposal that received 4.25 in the remote phase was later upgraded to 4.85 after the panel, which eventually led to its selection for funding. Such big changes happen when all panellists reinforce the comments of an outlier, which moves the median from very high to very low or the other way around. The two dots representing these extreme cases have been marked with circles for clarity.

Figure 4. Remote score versus panel score (Scremote vs Scpanel) for each evaluated FET-Open proposal. Scremote is the median score from the four assigned remote evaluators (phase 1), while Scpanel is the median score given after the panel (phase 2). Each dot represents the correlated scores of a single proposal. The dashed line represents Scremote = Scpanel (x = y). Proposals above the line had their scores increased after the panel discussion, while those below the line had their scores decreased. Two extreme cases (circles) of large discrepancy between the scores of the two phases are shown as indicative examples.
Change of the distribution of remote and panel scores per score range
The main observation from Figures 3 and 4 is that, in general, the panel tends to reinforce (for most of the proposals) the negative comments, thus leading to lower scores compared to the remote phase. To present some descriptive statistics in more detail, the score range 2–5 has been subdivided into steps of 0.2. Figure 5 presents the distribution of the number of proposals per score range for each of the two evaluation phases. Black columns represent the proposals' score distribution after the remote phase, while grey columns represent their distribution after the panel phase.

Figure 5. Population distribution of proposals by score range (in steps of 0.2) in the remote (black) and the panel (grey) phase.
There are some interesting observations, which deserve to be pointed out:
The total mean score (±SD) for the remote phase is Scremote = 4.04 ± 0.47, while for the panel phase it is slightly lower, Scpanel = 3.91 ± 0.55. The panel not only decreases the scores on average but also flattens the distribution of scores (the standard deviation is bigger) by selecting more scores away from the mean. At least to some extent this can be attributed to the use of the median. On the one hand, the median cuts off the outliers. On the other hand, because of the model utilized, the median allows, in a small but significant number of cases, a wider range of scores to be used. To see how this happens, consider the extreme case where a proposal receives a score of 2 from one of the four remote experts and a score of 5 from the other three. In this case, the median is 5 and the average is 4.25. If all the panellists involved in the second phase select the comments corresponding to the low score, the final median will be 2 (while the average would be 3.125); a short numerical check of this case is given after this list of observations.
The most popular score range of the remote phase is 4 < x ≤ 4.2 (505/2,822 proposals = 18%), while for the panel phase it is 3.8 < x ≤ 4.0 (380/2,547 proposals = 15%).
In the medium-high score range (3.8 < x ≤ 4.8), the remote evaluators tend to put many proposals on a 'waiting list', and most of these have their scores effectively reduced by the panel (through the selection of more negative comments). The effect of the panel discussion is to lower the number of proposals in the most popular score ranges and to produce a distribution closer to a truncated bell shape.
The panel increases the number of proposals in each score range below the most popular score ranges and decreases the number of proposals above those. This is just a consequence of the average decrease of the scores. However, all this holds with one very noticeable exception: the number of proposals in the top score bracket (4.8 < x ≤ 5.0) goes up. While all other changes can be explained by the overall lowering and spreading of the scores due to the use of the median, the change in the top score bracket underlines the important role the panel plays in reinforcing the selection of the best proposals.
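A quick check of the arithmetic in the extreme case mentioned in the first observation (one remote score of 2, three scores of 5), contrasting the median used by FET-Open with a plain average:

```python
from statistics import median, mean

remote = [2.0, 5.0, 5.0, 5.0]
print(median(remote), mean(remote))      # 5.0 vs 4.25 after the remote phase

# If all four CRs reinforce the comments that go with the low score:
combined = remote + [2.0, 2.0, 2.0, 2.0]
print(median(combined), mean(combined))  # 2.0 vs 3.125 after the panel
```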
Relation between the inter-rater agreement and the change of score
Figure 6 presents the proposal distribution of all four possible inter-rater agreement cases ('all agree', '3 agree', '2 agree', and 'all disagree') as a function of ΔSc = Scpanel − Scremote, the change of the median-based scores between the two phases of the evaluation process. In particular, five ΔSc ranges have been identified and are presented in detail: −1 ≤ ΔSc < −0.5, −0.5 ≤ ΔSc < 0, ΔSc = 0, 0 < ΔSc ≤ 0.5, and 0.5 < ΔSc ≤ 1. The great majority of the cross-read proposals (2,587/2,822 or 92%) lies within the three intermediate cases (namely −0.5 ≤ ΔSc < 0, ΔSc = 0, and 0 < ΔSc ≤ 0.5). Within each of these three cases, between 70% and 80% of the proposals correspond to the '3 agree' and '2 agree' situations, with a common ordering of the inter-rater agreement cases from higher to lower number of proposals: '3 agree' > '2 agree' > 'all agree' > 'all disagree'. This trend does not apply to the two extreme, low-population ΔSc ranges (−1 ≤ ΔSc < −0.5 and 0.5 < ΔSc ≤ 1). Additionally, for each range the percentage distribution (in a pie-chart format) is also presented.

Figure 6. Proposal distribution of the different remote experts' agreement cases ('3 agree', '2 agree', and 'all disagree') as a function of ΔSc, the change of the median scores between the two phases of the evaluation process.
All this confirms that the panel discussion does not, in general, change the remote score by much (notable exceptions being the two outermost ranges). While it is normal to expect that, when the inter-rater agreement is very low, the most probable final outcome of the evaluation will be a score insufficient for funding, our data show that this is not necessarily the case. There are 13 cases in which, despite the low remote agreement, the score after the panel increased by more than half a point. This is an example of the capability of the system to incorporate extreme cases and to select for funding proposals that, on the one hand, were very far from consensus remotely but, on the other, were identified by the panel as having great potential for success.
Portion of the funded proposals selected for funding after the first phase
A remaining task of our study is to quantify the influence and contribution of each FET-Open evaluation phase to the final ranking list of funded proposals (projects). As already mentioned, the panel members first cross-read remotely all their assigned proposals together with the accompanying remote evaluation reports and, after the scientific discussions during the panel, fix their selection of comments to be reinforced, thus finalizing the list of proposals to be funded. This analysis can be done by comparing the 'real' list of proposals suggested for funding (just after the panel) with the corresponding 'provisional' list before the start of the panel (just after the end of the remote phase, as if the second phase were not to follow). Knowing the exact number of proposals funded per cut-off (Table 1), this is feasible by identifying all the proposals (by their acronyms) that appear in both the 'real' (solid grey) and the 'provisional' (grey diagonal lines) funding lists. Figure 7 compares, per cut-off (the last column also shows the total for all seven evaluation sessions), the number of proposals that eventually got funded after the final panel in Brussels with those that would have been funded if the remote phase were the only phase of the evaluation process. For instance, after the end of the first evaluation session 24 proposals were funded. If we compare these 24 acronyms with the 24 best-ranked proposals just after the remote phase, then 17 proposals (shown in grey diagonal lines) are common to both lists. Overall, by comparing the 'provisionally' funded proposals with the 'real' funded ones, it can be concluded that the panel decides the funding of 44% (70 out of 160) of the projects. Equivalently, the decision for the remaining 56% had already been 'taken' in the remote phase (overlapping proposal acronyms between the two phases). This result shows that the panel, through (i) the cross-reading of the proposals, (ii) the reading of the IERs, and (iii) the on-site discussions, significantly contributes to the final ranking of the proposals and to the identification of the best ones which deserve to be funded.
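The comparison described above can be sketched as follows. This is a hedged illustration (the function name and the toy acronyms are ours): the 'provisional' list is taken as the top-n proposals of the remote ranking, where n is the number actually funded at that cut-off, and it is intersected with the 'real' funding list.

```python
def panel_contribution(remote_ranking, funded_after_panel):
    """remote_ranking: acronyms sorted by descending remote score.
    funded_after_panel: set of acronyms actually selected for funding."""
    n = len(funded_after_panel)
    provisional = set(remote_ranking[:n])   # top-n just after the remote phase
    common = provisional & funded_after_panel
    return len(common), n - len(common)     # (decided remotely, decided by the panel)

# Toy example with five funded proposals and hypothetical acronyms:
remote = ["ALPHA", "BETA", "GAMMA", "DELTA", "EPSILON", "ZETA", "ETA"]
funded = {"ALPHA", "BETA", "DELTA", "ZETA", "ETA"}
print(panel_contribution(remote, funded))   # (3, 2): 3 in both lists, 2 promoted by the panel
```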

Figure 7. Number of (i) the 'provisionally' funded proposals per cut-off (and in total) after the remote evaluation (grey diagonal lines) and (ii) the 'real' funded proposals after the final panel discussion (solid grey).
In the same figure, we further focus on the 70 proposals whose funding was decided by the panel after its discussions. For these proposals, we identify their remote inter-rater agreement from phase 1. In particular, 44% (31 proposals) belong to the 'all agree' case and 36% (25 proposals) to the '3 agree' case, while a quite significant share of 20% (14 proposals), despite initially belonging to the low-agreement cases ('2 agree' and 'all disagree'), were ranked high after the panel discussions and finally got funded. These 14 proposals are the same ones identified in Figure 2b. This is another illustration of our evaluation system's capability to incorporate extreme cases and to select for funding proposals which, remotely, were very far from consensus.
As already mentioned, exactly four new cross-readers are assigned per proposal. Mayo et al. (2006) calculated that about 10 reviewers would be needed to reach a high degree of reliability and consistency regarding individual grant proposals. However, the same report also highlights that this finding should not be generalized and that it could be their particular review system which led to this conclusion. In FET-Open, exactly eight experts per proposal contribute to the final report and define the final score. The more recent study by Fogelholm (2012) on medical research grant proposals showed that considerable inter-reviewer disagreement can be reduced by consensus discussion. On the other hand, Obrecht, Tibelius and D'Aloisio (2007) examined the value added by a committee discussion in the review of applications for research grants and concluded that panel discussion and rating of proposals offered no improvement in fairness and effectiveness over and above that attainable from the pre-panel evaluations. According to our analysis, the second phase (cross-reading/panel) plays a substantial role in the finalization of the list of proposals selected for funding. This is achieved without the panel re-evaluating the proposals and without changing the original comments and scores as given by the remote experts. The only action the panel members perform (after the cross-reading of the proposals and their IERs and the on-site discussions) is to reinforce one or another remote assessment together with its accompanying score.
Conclusions
A proper evaluation process undoubtedly plays a crucial role in the fair and as objective as possible selection of research proposals for funding, and therefore for research in general. In FET-Open at the REA of the EC, the evaluation process for highly interdisciplinary research proposals has two phases (a remote phase plus a cross-reading/panel phase). The outcome of this approach is a consolidated report containing the complete individual reports from the remote evaluators and the additional comments from the panel, accompanied by the final scores per criterion. Our study provides a comprehensive data analysis of this grant review process over the last seven evaluation sessions (2014–2018).
Considering the interdisciplinary nature of the FET-Open program, the Average Deviation (AD) index shows very good inter-rater agreement among the remote evaluators. In 60% of the cases, either all four remote evaluators of a proposal or at least three of them agree to a large extent in their assessment. Additionally, 91% of the finally funded projects belong to these two categories. The panel phase, in general, tends to decrease the remotely assigned scores of most proposals. The exception is the proposals that end up on the funding list, for which the panel discussion mostly reinforced the positive comments assigned remotely, thus increasing their scores and making the distinction between these successful proposals and the remaining ones more pronounced. The use of the median and the particular procedure utilized by FET-Open have the important effect that, in a small number of cases, mostly for proposals with divergent opinions among the remote evaluators, the panel may reinforce one of the more extreme sets of comments. Such a development results in a large change in the overall score. Nine percent of the funded proposals belong to the two lower inter-rater agreement cases ('2 agree', 'all disagree'), underlining the critical role of the panel, which, after the cross-reading and the on-site discussions, placed these proposals on the funding list. Finally, by comparing the acronyms of the proposals that got funded with those that would have been funded if there were no panel phase, it can be seen that 44% are selected for funding only after the panel discussion, highlighting its crucial contribution to the identification of the best proposals. Undoubtedly, the importance of the panel in grant decisions is still under investigation, and we are not yet able to draw a definite conclusion. To arrive at such a conclusion, a more comprehensive statistical investigation is needed, possibly using the results and analyses from different funding schemes with different evaluation procedures and rules.
Supplementary data
Supplementary data is available at Research Evaluation Journal online.
Acknowledgements
We would like to thank our colleagues Timo Hallantie, Martin Lange, and Adelina Nicolaie for the careful reading of the drafts of this paper and for the fruitful discussions.
Disclaimer: All views expressed in this article are strictly those of the authors and may in no circumstances be regarded as an official position of the Research Executive Agency or the European Commission.