Abstract

Many countries perform research assessment of universities, although the methods differ widely. Significant resources are invested in these exercises. Moving to a more mechanical, metrics-based system could therefore create very significant savings. We evaluate a set of simple, readily accessible metrics by comparing real Economics departments to three possible benchmarks of research excellence: a fictitious department composed exclusively of former Nobel Prize winners, actual world-leading departments, and reputation-based rankings of real departments. We examine two types of metrics: publications weighted by the quality of the outlet and citations received. The publication-based metric performs better at distinguishing the benchmarks if it requires at least four publications over a six-year period and allows for a top rating for a very small set of elite reviews. Cumulative citations received over two six-year review periods appear to be somewhat more consistent with our three benchmarks than within-period citations, although within-period citations still distinguish quality. We propose a simple evaluation process relying on a composite index with a journal-based and a citations-based component. We also provide rough estimates of the cost: assuming that all fields of research would be amenable to a similar approach, we obtain a total cost of about £12M per review period.

1. INTRODUCTION

Over the last 20 years, many countries have established systems to evaluate the performance of their institutions of higher learning, both as an incentive scheme for improving the quality of provision and as a tool for allocating government funds. The quality of the research performed at these institutions has been a particular point of focus. The manner in which academic research is assessed varies both across countries and over time. In the United Kingdom, for example, peer review-based rankings of research outputs have recently been complemented by assessments of impact; furthermore, the weighting of different elements of the exercise, such as the research environment or the published output of the unit of assessment, has changed. The system continues to evolve, with discussion of citation measures to be used in the next assessment in 2020. Systems in other countries have been reviewed by the European Commission (EC Expert Group on Assessment of University-Based Research, 2008). These systems vary widely in how and whether they evaluate quality and in how many research outputs need to be submitted. In some systems, fewer than one publication per full time equivalent member of staff is or has been submitted for review, while in others all publications are counted. In some assessments, a publication automatically surmounts the quality threshold if it appears in a journal listed by Journal Citation Reports, whereas others have a finer ranking of output quality or rely on peer review panels.

The current diversity of approaches reflects the different purposes to which research evaluations are put and the different contexts in which research operates in various countries, as well as the resources that the local government is willing to devote to the process. Research assessment exercises can be quite expensive: the most recent research assessment in the UK has been estimated as costing from £121 million, when counting only the direct costs, to over £1 billion, 2 when the time spent by all actors is fully costed. This raises the question of whether one might be able to find approaches to reviewing academic research that would be both sufficiently accurate for the purpose of decision making and cheaper than current systems. As time spent by reviewers and staff in the academic institutions is a major component of total cost, a system based on simple, readily available bibliometrics seems attractive. The challenge, however, is how one might validate the accuracy of such bibliometrics.

Validation requires some benchmark. Proposing and illustrating three such benchmarks of quality is the purpose of this paper. To construct our first benchmark, we consider the full set of past winners of the Nobel Prize for Economics. We split the working career of each laureate into six 6-year long ‘review periods’. Hence, each laureate produces six observation points that we can use to randomly build fictitious ‘Nobel-Only’ departments. The strength of this benchmark is that it captures overall research excellence that can be associated with a variety of career models: from the constant prolific publisher, to the author with one big idea. Also, precisely because Nobel Prizes are given for ‘big contributions’, one can reasonably assume that ‘Nobel-Only’ departments would score high on the ‘impact dimension’ that policymakers also care about. The main weakness is that a large proportion of Nobel Prize winners had their heyday many years ago when incentives and publication outlets might have been rather different from what they are now. While we adjust for this, some caution is required in interpreting the results.

As a second benchmark, we use the recent performance of economics departments that appear to be world-leading in a wide selection of rankings, reflecting a number of different measures of quality. As a result, this benchmark should not be affected by changes in the values and incentives of the profession over time. In addition, such a benchmark is likely to account for both academic excellence and impact.

Finally, we turn to the available rankings based on the worldwide reputation of economics departments, where reputation is gauged by opinion surveys of a large number of members of the profession. Like the Nobel benchmark, reputation has the advantage that it might be somewhat more ‘holistic’ than the second, leading department, benchmark. On the other hand, reputational assessments are subject to biases and might not always be based on accurate and up to date information.

We propose that all three benchmarks – a department composed solely of Nobel Prize winners, a pooled department reflecting the top four departments in many rankings, and the top departments by reputational standards – would be highly desirable to policymakers as target outcomes for their own university research policies. A research exercise that distinguishes these benchmarks is, therefore, useful in implementing a research evaluation policy that envisages such an outcome.

Our approach consists of evaluating real life economics departments based on our proposed bibliometrics and then checking whether the results obtained appear to be consistent with distinguishing our benchmarks. For both the ‘Nobel’ and the ‘leading departments’ approach, we claim that bibliometrics that are useful in leading to such a desired outcome should be able to differentiate between our benchmarks and most actual departments, even very good ones. The ‘reputation’ benchmark has somewhat different implications. Since it seems reasonable to assume that reputational rankings are more accurate near the top of the distribution (where experts could be expected to have good information about the departments), we look for bibliometrics that allow us to match the upper part of the reputational distribution rather closely. We interpret discrepancies further down the ranking as a sign that bibliometrics would actually add significantly to the ‘reputational’ view, as they allow better alignment of the ranking to actual quality.

We focus our attention on simple, readily available bibliometrics. We are fully aware that much more complex measures could be used. We believe, however, that to make our contribution policy-relevant we need to focus on methods with some key characteristics that limit the set of indicators we consider. Clearly, the bibliometrics must be informative, and this is the bulk of our work: we investigate how well our bibliometrics distinguish our benchmark groups from the rest. However, the bibliometrics must also be relatively difficult to manipulate. This rules out, for example, measures based on the references cited by the publication to be evaluated, such as the novelty measure suggested by Wang et al. (2016). 3 We also need our measure to be sufficiently immediate, as it could be the basis for allocation of current research funds. We therefore consider only bibliometrics that reflect short lags from the date at which research is completed. 4 Third, we believe that there should be a large premium put on simplicity, not only because it reduces costs, but also because it increases transparency: authors have a good idea of which journals are good or bad and might have a feel for the type of research that leads to citations. In contrast, the meaning of constructed ‘impact factors’, to use one example, is more difficult to grasp. In other words, if these bibliometrics are to be used to modify behaviour in order to attain targets, then transparency has value. Finally, cost itself is a main concern and is a focus of this paper. As Bertocchi et al. (2015) show, a method such as informed peer review 5 may generate rankings quite similar to bibliometrics, but peer review may be much more expensive.

We look at two families of bibliometrics that seem to meet these criteria. The first type of indicator relies on ranking publication outlets according to quality. The list of outlets is then divided into categories receiving different weights. A variety of quality-weighted totals or averages can then be computed for a given author’s publications over the assessment period. We find that, in order to be highly consistent with our three benchmarks as indicators of quality, such bibliometrics need to include a sufficient number of publications per author (at least four, possibly more). Adopting a journal ranking with only a very limited set of ‘elite’ reviews at the top also helps.

The second type of indicator is based on citations received. We look both at citations received within a given assessment period and the cumulative citations received over two periods by publications that appeared in the first of these two periods. While cumulative citations give us rankings that are somewhat more consistent with our three benchmarks as indicators of quality, this difference is small enough that it might not justify the 6-year gap between the output and the assessment that it requires.

Having established that simple bibliometrics could actually provide an accurate research assessment, we then provide a rough estimate of the total costs involved. We focus on the United Kingdom, as estimates of the current system are available. As we have seen, these estimates range from £121M to £1B. Our calculations suggest that, if all areas of research could use similarly simple bibliometrics, then the total cost of an assessment exercise would be about £12M.

Our paper fits into a general stream of literature on research rankings. In a recent special issue on the topic of the UK Research Excellence Framework, Hudson (2013) reviews and produces an adjusted ranking of journals, while Laband (2013) evaluates the role of citations in conducting quality evaluations. In the same issue, Sgroi and Oswald (2013) discuss how peer review panels should combine output counts with citations to generate an overall view of quality. Other literature has looked at research assessment more generally, investigating accusations of bias (Clerides et al., 2011), bias related to the use of bibliometrics (Wang et al., 2016, and references therein) and a lack of stability in rankings when the balance of quality/quantity weightings and citations systems change (Frey and Rost, 2010). 6 We have nothing to add on the issue of indicator aggregation and hence on the relative weight that should be put on the two types of simple bibliometrics that we propose. We do, however, address another aspect of ‘stability’, namely how much one would expect the performance of a given department to vary naturally over time even if the underlying quality of its researchers does not change. We find that, absent adjustments due to changes in the composition of the department, performance can vary over time. This suggests that research funding – as well as internal treatment of departments within their institution – might not want to be too sensitive to the results of assessment exercises, however accurate those exercises might be.

The main difference between the existing literature and our paper – and our main contribution – is the use of a number of external excellence benchmarks for validation. We also put special emphasis on the indicators’ ability to discriminate near the top of the research quality distribution. Top quality output has been considered specifically in a few papers. Gans and Shepherd (1994), focusing on Nobel and Clark prize winners, comment anecdotally that even discipline-based reviewers may not easily distinguish high quality work, a point echoed by Starbuck (2005, 2006). Abramo et al. (2009) add to this the concern that the procedures universities use to decide which outputs to submit to the research assessment (which may or may not be conducted primarily by experts in the field, depending on who conducts the selection and how) are the weakest stage of the evaluation process. On the other hand, Bertocchi et al. (2015) find that peer review and bibliometrics-based approaches are in relatively good agreement in their study of the Italian research evaluation exercise, leading them to suggest that bibliometrics-based procedures can work well in place of peer review. 7 We see these contributions as supporting a bibliometrics-based system where the only relevant ‘selection’ decision is the choice of a cut-off point for submission; here, subjective evaluation of submissions plays no role.

The rest of the paper is organized as follows. In Section 2, we briefly present the research assessment methods used across a large number of countries. Section 3 presents our three main benchmarks, and Section 4 describes our methodology. Our results are displayed and discussed in Section 5. Section 6 estimates the likely cost of running a research assessment exercise based on our suggested metrics. Section 7 discusses further aspects of implementation, including comparisons across research fields and the expected variability of a given department over time. Section 8 concludes.

2. ASSESSMENT METHODS

A digest of assessment methods was presented in a 2008 EC report as well as in a 2010 Organization for Economic Co-operation and Development (OECD) report on research assessment. 8 These reports included assessments conducted by private bodies, such as newspapers, for the purpose of informing students of institutional characteristics, alongside assessments carried out by higher education institutions (HEIs) and national scientific bodies for the purposes of ranking research quality. We focus in this section on the latter, where the main aim is to inform the national management of research activity. OECD (2010) reports that the first research evaluation system was the 1986 research assessment exercise in the United Kingdom. Other systems have evolved more recently.

We take the information in these two reports as an indicator of the range of systems that have been considered seriously, even if some of these systems have since been modified. Thirty-three countries currently rank their universities in some way. Table 1 summarizes the main features of a set of representative assessment systems at the time of those reports.

Table 1.

Representative evaluation methods for research

| Country | Level | Goal | Peer review | Bibliometric indicators | Other indicators | Period | Funding effects |
|---|---|---|---|---|---|---|---|
| Australia (Excellence in Research for Australia, ERA) | National | Quality/Competitiveness | External | Yes | Yes | 3–4 years | Yes |
| Belgium (Université Libre de Bruxelles) | Institution | Quality/Governance | External | Yes | Yes | 5 years | No |
| Finland (Aalto & Helsinki) | Institution & National | Quality/Governance/Reputation | External | Yes | Yes | Irregular | Yes |
| France (Agence d'évaluation de la recherche et de l’enseignement supérieur) | National | Quality | External | Yes | Yes | Unknown | Yes |
| Germany (Forschungsrating) | National | Quality | External | Yes | Yes | Pilot 2 years | No |
| Hungary | National | Quality/Competitiveness/Efficiency | No | Possibly | Yes | 3 years | Yes |
| Italy (Comitato di Indirizzo per la Valutazione della Ricerca, moving to the Agency for the Evaluation of University Systems and Research in 2009) | National | Quality | External | Yes | Yes | 4 years (up from 3) | Yes* |
| The Netherlands | National | Quality/Governance | External | Yes | Yes | 6 years | Yes |
| Norway (Sweden/Denmark similar) | National | Quality | External | Yes | Yes | 2 years | Yes |
| Spain | National | Quality | External | Yes | Yes | 6 years | No |
| UK (Research Excellence Framework) | National | Quality | External | No | Yes | 6 years | Yes |

Notes: For each country, listed in the left-hand column, the table lists whether the review is at the national or institutional level (e.g., in the case of the Université Libre de Bruxelles the review was purely internal); the stated purpose of the review (e.g., to increase the quality of research, to improve its management (governance), to improve the international reputation of an institution or a national innovation system, or to improve performance relative to other national systems – so-called ‘competitiveness’); whether peer review is involved and whether this involves outsiders; whether bibliometric measures or other measures (such as indicators of inputs, process, or impact) are used in the evaluation; the period of the evaluation; and finally whether there are any direct resource allocation implications for the bodies that are reviewed. *For Italy, the older system did not have direct funding implications, although there are some effects under the newer system that replaces it. Both are listed in the table.

Concentrating on the systems of research assessment, rather than overall ranking systems that include teaching and other dimensions, most systems include some form of output count over the review period. These outputs generally involve a quality and a quantity measure for publications. The number of publications included varies: there is no clear limit to the number of publications reviewed in the German Science Council system, whereas the Italian system has been quite selective in the past, including 0.5 outputs per full time equivalent over a 3-year review period. The Italian system has been less selective recently. Similarly, while most systems judge the quality of the outputs, the fineness of the quality rankings differs. For example, the Norwegian system includes only two levels of quality rankings, whereas the United Kingdom and Australia have a four tier quality ranking. Citations are often used as an indicator of quality to supplement impact factors or other measures of judging the quality of the outlet. The review window itself varies, but usually is at least three years.

Not all systems require all institutions to submit the same information. For example, in Hungary each institution can choose to report just a few measures out of a wide list of indicators. Even those systems that require the same categories of output to be submitted by all institutions often include a large number of categories, so that one institution could perform well on one measure and badly on another, yet end up with the same overall evaluation as an institution with the opposite profile. Finally, most systems allow for selective reporting within categories. For example, even if research outputs are required, the institution is allowed to select only a few of these to report. In this sense, quantity tends to be discounted in favour of multiple categories of output that are aggregated into a single index in either a formal or an informal way.

Many systems include some measure of the benefits of the research for the wider society, variously styled as ‘impact’ (UK, Australia, Belgium), ‘relevance’ (Netherlands and Italy), or ‘knowledge transfer’ (Germany). Measurement tends to be based on case studies or other reflective documents prepared by the institution undergoing review. Some documentary evidence of the research environment and activity is also present in most systems, including the human and monetary resources brought to bear upon research, the management system employed (input measures), and the associated human outputs (such as doctorates awarded). Most systems involve peer review alongside metrics, including bibliometrics. Clearly, the inclusion of such input measures next to output measures is controversial.

All but one of the systems include only retrospective measures, applying to the census period. The lone exception is the Netherlands, where a measure of innovative potential is included.

From this overview, we take away that bibliometrics are already widely used and belong mostly to two broad families: quality-weighted measures of publications and citations. However, there are large differences in the precise measures relied upon, as well as in the number of indicators that departments can – or can choose to – report. We further notice that impact or relevance assessments often involve a separate, metrics-free procedure.

The purpose of our paper is to check how far one can actually get by using a few simple bibliometrics on their own. Such an approach would be much more transparent – and likely much less costly – than most current procedures. If we can provide convincing evidence that simple bibliometrics lead to an appropriate ranking of the research produced by academic units, the case for moving to such a simpler system would be strengthened. Following the approach of most national authorities, we concentrate on metrics based on journal rankings and metrics based on citations.

3. BENCHMARKS

To validate an approach that relies on simple bibliometrics, we need recognized benchmarks of research excellence. We rely on three. Our first benchmark is the output of past winners of the Nobel Prize for Economics. Our implicit assumption here is that the laureates’ excellence in research is beyond doubt. We are aware that the Nobel Prize is – in principle at least – awarded for a contribution of special significance rather than for sustained performance over a career but our contention is that this is a type of research excellence that policymakers could easily accept. Moreover, the fact that the Prize is given for a unique contribution makes it more likely that the performance of laureates also reflects ‘impact/relevance’. How the Nobel Prize winners’ record is used and the further pros and cons of this benchmark are discussed in the following section.

Our second benchmark is the current performance of a set of generally acknowledged world-leading economics departments. For this benchmark, we pool four US economics departments that rank at the top of schools in essentially any ranking available. Indeed, we choose these four because the results are so consistent across rankings. We pool the departments to increase the number of individual faculty members who contribute to our benchmark, thereby reducing the possible impact of idiosyncratic events.

We use available reputational rankings as our third benchmark. The precise ranking that we use is discussed in more detail in Section 4. We should already point out however that this benchmark is rather different from the first two. With the first two benchmarks, the relevant question is whether applying simple bibliometrics to both our benchmark and real life departments seems to yield results that are consistent with the quality ranking that the benchmarking implies. With reputation, the question is how rankings based on simple bibliometrics compare to reputational rankings. Our working assumption is that reputational rankings are likely to be sufficiently accurate for top institutions: the ‘experts’ whose opinions generate the rankings have easy access to information about certain top schools and so are likely to base their opinion on some significant knowledge of the department. They are more likely to err for lesser known academic units. We would therefore hope that simple bibliometrics would track the top of the reputational distribution pretty closely but would expect the correlation between the two rankings to become significantly weaker as we move down the distribution.

The reputation benchmark also differs from the other two in that it provides a potential alternative to bibliometrics-based rankings: a well-designed reputational ranking can be thought of as ‘market evaluation’ of research. In our opinion, there are several reasons why such a pure market-based approach is not desirable.

First, we suspect that reputational rankings are likely to be more accurate at the top of the distribution than at the bottom. In this view, a good correlation with bibliometrics for high-ranking institutions but a weaker one lower down the reputational ranking would be an argument to prefer a bibliometrics-based approach, which need not suffer from lack of information at any level.

Using reputation surveys to allocate significant amounts of funding also raises the spectre of manipulation. Even if the set of experts involved is sufficiently large and diverse, the prospects for gaming the system would seem greater than under our approach, where journals would be ranked based on objective measures. Citations – through sources such as Google Scholar – number in the hundreds on average over the course of two review periods and so would seem hard to affect significantly through opportunistic citation behaviour. There are other difficulties with reputation measures. In particular, experts could be biased in favour of local schools and large institutions. The Quacquarelli Symonds (QS) ranking tries to control for the first bias by using a geographically varied set of assessors and distinguishing between local and non-local schools. However, as the degree of regional (as opposed to national) preference might vary significantly across areas, it is not clear that the problem can be completely eliminated. The second bias seems harder to handle. The opinion of experts is bound to depend on the individual researchers whose work they are aware of. The larger the department, the more likely it is that an expert will be aware of the work of some of its scholars. Experts are likely to have very little idea of the size of the departments involved, so asking them to apply a size correction in order to evaluate average performance would be unrealistic.

4. METHODOLOGY

We will compare actual departments to benchmarks in a system that would be based strictly on an assessment of publication quality, approximated by our chosen system of journal rankings. To perform this comparison, we collected publication data for three UK departments and one Italian department over the 6-year period from 2007 to 2012 inclusive. The fact that these years do not correspond to an actual UK assessment period is intentional: we use these departments simply as comparators and do not try to understand how the scores that they received in an actual assessment exercise were generated. The three UK institutions are University College London (UCL), the University of Warwick, and the University of Sheffield. One of the criteria for this choice is convenience: it was easier to identify individuals who were associated with the economics programme at these universities than at some other institutions (e.g. the London School of Economics) with more diffuse organizations. At the same time, these universities also represent a range of quality rankings: while UCL was ranked first in overall GPA in the Times Higher Education aggregation of the 2014 research assessment results, the University of Warwick was ranked fifth and the University of Sheffield twenty-fourth out of a total of 28 departments reviewed. Bocconi University was chosen as a prominent European university, although it clearly has not been ranked formally in the same system. As with our leading departments benchmark, we collected data only for individuals who were listed as associated with the economics department on the department’s website at these institutions. 9

The logic of this comparison is that, without questioning the high quality of the chosen departments, our benchmarks – as benchmarks – should receive a higher quality ranking. It would also be comforting if our bibliometrics-based assessment preserved the rankings established in the UK evaluation exercise which, albeit costly, is certainly carefully run.

In the remainder of this section, we specify the benchmarks we will use in our comparisons with the real departments. We will also outline a reputation measure, which we will hold up to the bibliometrics to observe whether reputation captures the same ranking information.

4.1. Nobel prizes benchmark

We compare the performance of real life departments to notional departments staffed only with Nobel Prize winners. To create our ‘Nobel’ departments, we start with all recipients of a Nobel Prize in Economics from its inception in 1969 up to 2015. From this set, we exclude a few outliers whose performance is known to have been affected by illness or exceptional events. This leaves us with 73 individuals for whom we collect lifetime publications, date of birth (and death when relevant), as well as the date at which they entered academic work. 10 We can then partition each career, up to a normal retirement age of 65, into intervals of 6 years each and allocate the laureate’s publications to each of these periods. 11 Perhaps because of the fairly advanced age of Nobel laureates so far, each individual in our set was professionally active for at least 6 periods of 6 years. However, most of our results will be based only on four ‘middle’ periods for each individual, roughly corresponding to ages from the early 30s to about 60. Ignoring the first period makes sense since many assessment systems have special provisions for faculty members who have graduated recently. We also ignore the last period as some individuals effectively retire at some point during this period, while others go on publishing even in normal retirement. In fact, this restriction has the added advantage that it limits us to periods before the individuals received their Nobel Prize in most cases. 12 We might expect that their publications changed markedly after this event. Our interest is not in the effect of the prize on publications but rather in the output of those who merit the prize. Hence, our focus on productive periods before winning the prize is appropriate for our purposes. Overall, this method gives us 414 six-year Nobel research periods. If we focus on the middle four periods for each laureate, we end up with 283 observations.
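
To make the construction concrete, the following minimal Python sketch shows how a laureate's career could be partitioned into 6-year review periods and publications allocated to them. The helper names and the exact rule for picking the four 'middle' periods are our own illustration; the paper does not spell these details out.

```python
from dataclasses import dataclass

@dataclass
class Publication:
    year: int
    quality: float  # outlet rating on the 4- or 5-point scale used later

def career_periods(first_year, last_year, length=6):
    """Split a career into consecutive review periods of `length` years,
    from the year academic work began up to normal retirement."""
    periods, start = [], first_year
    while start + length - 1 <= last_year:
        periods.append((start, start + length - 1))
        start += length
    return periods

def middle_periods(periods, keep=4):
    """Drop the first and last periods (early career and retirement) and
    keep up to four 'middle' periods; which four to keep when more are
    available is not specified in the text, so this choice is illustrative."""
    return periods[1:-1][:keep]

def allocate(publications, periods):
    """Assign each publication to the review period containing its year."""
    buckets = {p: [] for p in periods}
    for pub in publications:
        for lo, hi in periods:
            if lo <= pub.year <= hi:
                buckets[(lo, hi)].append(pub)
                break
    return buckets
```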

4.2. Leading departments

For this measure, we pooled together the economics faculty of MIT, Harvard, Princeton, and Stanford and focused on the period 2007–2012. We sampled the faculty members who are listed on the departmental website as of the beginning of 2016. Lecturers, visiting faculty, adjuncts, and courtesy appointees were not included. We select faculty members based on their career stage as in the case of the Nobel Prize department. As our main focus will be on the period from 2007 to 2012, we only consider faculty members who got their doctoral degrees (or equivalent) in 2001 or before. Similarly, to truncate late career contributions in a manner that is similar to our approach for the Nobel benchmark, we discard any faculty member with a doctoral degree obtained before 1983.

4.3. Reputation

There are two main sources of research-based reputational rankings for economics departments: the Times Higher Education rankings and the QS World University Rankings 2015. Because our interest is research output, we do not focus on measures that include research grants. What we are looking for is a ‘reputation’ ranking that reflects the overall ‘market view’ of university research. In this respect, the QS survey seems most suitable, as it offers a separate research ranking. The ranking we used is based on the opinion of more than 85,000 academic experts. Within their area of expertise, experts were asked to identify 30 institutions worldwide plus ten national institutions that are known for their ‘excellence in research’. We use the 2015 ranking. Since it seems reasonable to assume that reputation reacts to changing facts only with a lag, this late ranking makes it more likely that bibliometrics-based measures of performance over the 2007–2012 period might be based on the same factual evidence as the reputation assessment.

The QS Survey gives us research reputation scores for the HEIs which are in the top 200 overall, based on a combination of research and non-research criteria. We isolate the North American institutions and rank them based on research reputation alone. 13 Evaluating each entry on that list based on our bibliometrics would go well beyond the scope of this paper. We limit ourselves to assessing a number of universities at different points of the ranking. Although we aim at picking departments at regular intervals along the distribution, we also wish to ensure that there is enough of an absolute ‘reputation score’ gap between the chosen entries that we can be confident that the ranking we list is robust. Our choice is also affected by the availability of data, as we need both a sufficiently comprehensive list of economics faculty and access to their curricula vitae. Finally, where the number of faculty members in their ‘middle years’ is not sufficient, we pool two adjacent departments together. To facilitate comparison with the previous sections, we use our group of ‘top departments’ (MIT, Harvard, Princeton, and Stanford) as representative of the top of the distribution. The departments we have chosen based on these considerations are listed in Table 2.

Table 2.

Universities in our ‘reputation’ sample

| Universities | QS North America research rank | QS research reputation score | Number of mid-career faculty members |
|---|---|---|---|
| Top 4 | 1–5 | 100–87.1 | 61 |
| Columbia |  | 79.5 | 16 |
| Michigan | 14 | 71.5 | 19 |
| Boston U + Caltech | 19–20 | 68.8–68.5 | 14 |
| Carnegie Mellon + Texas | 26–27 | 60–59 | 15 |
| Michigan State | 33 | 54.2 | 17 |
| Boston College + Dartmouth | 45–36 | 49.2–48.9 | 25 |
| Indiana + UC Santa Barbara | 53–54 | 45.6–45.2 | 14 |
| Vanderbilt + Pittsburgh | 59–62 | 42.7–39.4 | 16 |
| Colorado + Arizona | 64–65 | 35.9–31.6 | 21 |

Notes: For each university, we list the QS North America research reputation rank and score, as well as the number of faculty members who fall within our boundaries for mid-career.

Source: QS World University Rankings (2015).

4.4. Metrics

The focus of our attention is a fictitious ‘2007–2012’ 6-year assessment period. However, we also compute most of our metrics for a previous period covering 2001–2006. Not only does this allow us to use some additional metrics like ‘cumulative citations’ (see below) but it also gives us some idea of how stable the metrics are likely to be over time.

4.4.1 Quality-weighted publications

Our first family of metrics uses publications weighted by the quality of the outlet in which they appear. We used standard indexing sources (such as JSTOR, RePEc, and Google Scholar), as well as curricula vitae available on departmental websites, to generate a list of economics publications for each individual. We also supplemented these indexing sources with an additional search for books and other non-standard publications. We rated each publication according to information contained in Hudson (2013). Where this ranking was not available, we used other ranking information. 14 We use a four-point scale, with four the maximum quality rating, although we also extend this to a five-point ranking, breaking out five economics journals (The American Economic Review, The Review of Economic Studies, Econometrica, The Quarterly Journal of Economics, and the Journal of Political Economy) into a separate top tier to examine the effect of further quality differentiation in outlets. While the list of publications in these sources is broad, it is limited to outlets with at least some direct link to economics or finance. Consequently, even some publications in prestigious ‘out of field’ journals (say Science or the Lancet) are not counted.

We then computed an average score for each individual in each period, using the top one publication, the top two, the top four publications, the top five, the top eight, as well as the total score over all publications, the total unweighted count of publications and the average score over all publications. This mimics systems that have fewer or more quality tiers, including those with no quality weighting, those systems that downplay quantity heavily, and those that value both quality and quantity. They are chosen to reflect and investigate the diversity we see in actual systems.
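
As an illustration, a minimal Python sketch of how such per-author scores could be computed is given below. The treatment of authors with fewer than k rated outputs, and the aggregation of author scores into a department average, are our assumptions; the paper does not fix these details.

```python
def top_k_average(scores, k):
    """Average quality rating of an author's k highest-rated outputs in the
    period (authors with fewer than k outputs are averaged over what they
    have; this handling of short lists is an assumption)."""
    best = sorted(scores, reverse=True)[:k]
    return sum(best) / len(best) if best else 0.0

def author_metrics(scores):
    """Per-author journal-based metrics mirroring the columns of Table 3."""
    return {
        "top1": top_k_average(scores, 1),
        "top2": top_k_average(scores, 2),
        "top4": top_k_average(scores, 4),
        "top5": top_k_average(scores, 5),
        "top8": top_k_average(scores, 8),
        "total": sum(scores),                 # total quality score
        "count": len(scores),                 # unweighted publication count
        "average": sum(scores) / len(scores) if scores else 0.0,
    }

def department_metrics(authors_scores):
    """Department score as the unweighted mean of per-author metrics."""
    per_author = [author_metrics(s) for s in authors_scores]
    return {k: sum(a[k] for a in per_author) / len(per_author)
            for k in per_author[0]} if per_author else {}
```

For example, an author with ratings [4, 4, 3, 3, 2] would obtain a Top 4 score of 3.5, a total score of 16 and an average of 3.2.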

4.4.2 Citations

There are many ways to undertake a citations analysis. First, there are many sources of citations, including relatively restricted databases that list only citations within a relatively narrow scientific community and others that gather citations from a much wider set of sources, including the discipline itself, other disciplines, and policy or other reports on top of purely academic publications. Google Scholar fits into this latter category and is the source we use. We chose it because we wished to go at least some way towards reflecting the concern that has been raised that the ‘impact’ of academic work on the broader society should be captured alongside its value in generating further academic output. 15

Because our Nobel benchmark covers publications over a long period of time, we must also worry about the fact that citation patterns have changed considerably over time. Card and DellaVigna (2013) trace the evolution of total citation counts for articles published from 1970 to 2012 in the ‘top five’ economics journals, noting the large increase in median total citation counts over time. This should not come as a surprise, as the overall size of the economics profession and the number of academic journals have both increased sharply. Moreover, the advent of the Internet has made finding relevant papers to cite much easier. To deal with this, we use Card and DellaVigna’s citation counts as our measure of citation propensities for each cohort of papers, and use the mean citation rate for each 6-year period as our adjustment factor for each research assessment period in our sample. 16

Card and DellaVigna find a marked decline in citations for the later years in their sample (roughly from 2002 to 2012). This decline may largely reflect truncation effects, as citation patterns have quite long tails. We are agnostic about whether the fall reflects a real trend or a truncation effect imposed on rising or constant numbers. Therefore, in an effort to take a neutral approach to more recent citation trends, we assume that citation levels, once truncation effects have faded, will remain the same from the mid-90s to the present. 17
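
A minimal sketch of the kind of cohort adjustment we have in mind is shown below; the cohort means are placeholders rather than Card and DellaVigna's actual numbers, and rescaling to a reference cohort is one plausible way of implementing the adjustment factor described above.

```python
# Placeholder citation propensities (mean citation rates) per 6-year cohort;
# these illustrative numbers are NOT the values from Card and DellaVigna (2013).
COHORT_MEAN_CITATIONS = {
    (1971, 1976): 30.0,
    (1977, 1982): 45.0,
    (1983, 1988): 60.0,
    (1989, 1994): 80.0,
    (1995, 2000): 100.0,
    (2001, 2006): 110.0,
    (2007, 2012): 120.0,
}

REFERENCE_COHORT = (2007, 2012)

def adjusted_citations(raw_citations, cohort):
    """Rescale a citation count so that cohorts with different citation
    propensities become comparable: express the count in units of the
    reference cohort's mean citation rate."""
    factor = COHORT_MEAN_CITATIONS[REFERENCE_COHORT] / COHORT_MEAN_CITATIONS[cohort]
    return raw_citations * factor
```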

We also need to select which publications to count in our citation measure. This includes how many publications we track and what type. Here, we have many possible choices: we can use citations of all outputs, citations of only the top journal outputs, or the top cited papers within some limited number of publications. We have chosen to report here the analysis of citations for the four most cited publications among the set of an author’s within-field academic papers and books. 18

Finally, we need to choose a time period for tracing citations. It is of little use to a research assessment that must evaluate the current state of a unit of assessment’s output to look at total citation counts over the entire history of an output: this output could be 30 or 40 years old and would not therefore reflect current activity or even the current strength of a department very well. Hence, we do not use career measures of impact and rather look at short-term measures of two types: first, we look at citation counts of outputs within the research assessment period for that period only (e.g., a 1992 publication would fall within the 1990–1996 assessment period, so we would count its citations during the 1990–1996 period only). Second, we look at citations within the research assessment period during which the output was published and in the subsequent period (for the same example as above, we would count citations during the period 1990–2002, or over the current and the subsequent research period). We call the first type of citation ‘instantaneous’ and the second ‘cumulative’. For our Nobel benchmark, we do not include any periods after the receipt of the Nobel Prize, as we are not interested in the effect of the Nobel Prize on citations.
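
The two citation windows can be made precise with a short sketch; the helper below assumes yearly citation counts (e.g. from Google Scholar) are available for each publication, which is an assumption about data availability rather than a description of the paper's data collection.

```python
def containing_period(year, periods):
    """Return the (start, end) assessment period containing `year`, if any."""
    for lo, hi in periods:
        if lo <= year <= hi:
            return lo, hi
    return None

def instantaneous_citations(pub_year, citations_by_year, periods):
    """Citations received during the assessment period in which the output
    appeared (e.g. a 1992 paper counted over its own period only)."""
    lo, hi = containing_period(pub_year, periods)  # assumes pub_year is covered
    return sum(c for y, c in citations_by_year.items() if lo <= y <= hi)

def cumulative_citations(pub_year, citations_by_year, periods, length=6):
    """Citations received during the output's own period and the following
    one (the 'cumulative' measure)."""
    lo, hi = containing_period(pub_year, periods)
    return sum(c for y, c in citations_by_year.items() if lo <= y <= hi + length)
```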

Instantaneous citations are affected both by lags in publication for citing outputs and by when the output was created within the research assessment cycle. For example, a paper published a few months before the census date for a research evaluation would have very few instantaneous citations. This would not, however, be an accurate reflection of its quality, even if there is no systematic bias across the Nobel group and the comparison departments in the timing of publications within the assessment period. Allowing citations to cumulate for one further period allows both of these effects of publication timing and publication lags to fade while at the same time keeping the evaluation quite close to the census date.

We did not earmark publications by subfields. In this respect, we believe that relying on a population of Nobel laureates is useful as, presumably, the distribution of individuals – or their publications – across fields is likely to reflect the profession’s opinion of ‘what matters’ at the time. This does not, of course, mean that field does not account for some variations in output across Nobel laureates but these variations should be seen as not reflecting any difference in the underlying quality of the research profiles. Pooling four of the top US departments also ensures that most fields are well represented in that second benchmark. This is not to say, of course, that specific departments would not gain from specializing in currently citation intensive fields of economic research. Instead, our assertion is that all our benchmarks reflect some balance of fields – and fashions – and associated citation pattern, but it is not clear that this differs across benchmarks or biases the comparisons in a systematic way.

5. RESULTS

5.1. Nobel prizes and leading US departments

We now evaluate how our real departments compare to our first two benchmarks: the simulated Nobel Prize departments and the constructed Top US department. Table 3 gives information about the number of faculty members considered in each of the actual departments. Similar information is provided for our set of Nobel Prize winners 19 and our constructed Top US department. The numbers in the table reflect averages on the four-point scale (which does not separate out the ‘top five’ journals) in all columns except the one labelled Sc5, where we list the ratings for four outputs evaluated on a five-point scale. 20

Table 3.

Scores for journal-based metrics in actual, Top US, and Nobel departments

| Institutions/benchmarks | Number of entries | Top 1 | Top 2 | Top 4 | Sc5 | Top 5 | Top 8 | Total | Number | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Top US | 61 | 3.95 | 3.9 | 3.7 | 4.27 | 3.61 | 3.49 | 38.3 | 11.18 | 3.39 |
| Nobel – All | 73 | 3.6 | 3.45 | 3.14 | 3.54 | 2.98 | 2.59 | 31.9 | 10.82 | 2.80 |
| UCL | 33 | 3.80 | 3.72 | 3.31 | 3.48 | 3.17 | 2.7 | 29.65 | 10.15 | 2.96 |
| Warwick | 25 | 3.74 | 3.58 | 2.93 | 3.06 | 2.70 | 2.20 | 22.59 | 7.52 | 2.99 |
| Bocconi | 22 | 3.18 | 2.96 | 2.48 | 2.59 | 2.23 | 1.73 | 18.06 | 7.09 | 2.50 |
| Sheffield | 16 | 3.03 | 2.87 | 2.54 | 2.54 | 2.37 | 1.91 | 21.27 | 9.38 | 2.33 |

Notes: For each row: the two benchmarks in the rows at the top of the table and several real departments listed below, we have shown the number of individuals on which we base our metric, and then a series of columns calculating averages across all individuals in each department or benchmark when each individual is rated by the average quality of their top one, two, four, five, or eight outputs, the total quality score for all their outputs taken together, the number of outputs that can be submitted with a rating of one or above, and the average quality of all the individual’s outputs taken together. Sc5 refers to the score over the Top 4 outputs, where we used a rating scale from 1–5 instead of 1–4.

We first look at the Top US benchmark. Every metric confirms our expectation that this benchmark should outperform even a leading UK department such as UCL. In this sense, there is little to choose among the various measures. We note, however, that measures based on a higher number of publications, or relying on a scale that gives more weight to a few elite publication outlets, magnify the difference between UCL and the benchmark. Given that all these ratings have an associated standard error, this suggests that more publications and a finer rating scale both improve the indicators’ discriminating power at the top of the distribution. By contrast, the number of publications included in the metrics has little impact on the relative ranking of the real departments or, indeed, on the respective gaps between these institutions, with the exception that increasing the number of publications to five pushes Sheffield ahead of Bocconi. 21 Overall, either increasing the number of publications considered or moving to a five-point scale increases the range of the rankings and their discriminating power.

The lessons that we draw from the Nobel benchmark are somewhat different. The striking difference is that our set of Nobel Prize winners performs (very marginally) better than UCL on only one indicator (total score) and lags behind on all others. Indeed, indicators based on few publications also put Warwick ahead of our Nobel laureates. As with the Top US benchmark, increasing the number of publications taken into account and moving from the four-point to the five-point scale improve the relative performance of the Nobel benchmark, putting it ahead of Warwick and in the same neighbourhood as UCL. In this sense, the general message is the same: increasing the number of publications and increasing the fineness of the scale both improve the discriminatory ability of the metrics.

On the other hand, it appears that all our metrics simply fail to capture some dimensions of the ‘excellence’ of our Nobel population. This is not surprising, as the Nobel Prize is supposed to reward a single outstanding contribution rather than a steady flow of quality output over a whole career. The ratings we have for outlets are much better at distinguishing the latter. Indeed, given this difference in emphasis, it is comforting that the Nobel benchmark still performs very well – even if not at the top – in terms of our simple bibliometrics. Still, the overall results for the Nobel benchmark should serve as a warning that putting too steep a gradient on any policy action – such as the allocation of funds – based on bibliometrics (or indeed, very likely, based on any current assessment methodology) might penalize a type of research that is of great value to society.

Table 4 shows the results for our citations-based metrics. We present results for the citations of the top four papers in this table. The within period citations for the real departments are computed for the period of 2007–2012 and for 2001–2006. There is, of course, a single measure of ‘instantaneous’ citations for Nobel Prize winners as these are computed as an average over all of the laureates’ four middle periods. 22

Table 4.

Scores for citation metrics in actual, top US, and Nobel departments

| Institutions/benchmarks | Within-period citations (actual departments: 2007–2012 only) | Within-period citations (actual departments: 2001–2006 only) | Cumulative citations (actual departments: 2001–2012 only) |
|---|---|---|---|
| Top US | 807 | 610 | 2,350 |
| Nobel Prize Department | 289 | – | 1,131 |
| UCL | 322 | 162 | 620 |
| Warwick | 221 | 149 | 470 |
| Bocconi | 228 | 210 | 734 |
| Sheffield | 89 | 72 | 251 |

Notes: For each benchmark, listed in the top two rows, or the following four real departments, we list the average for the entire department (using the number of entries from Table 3 ) of the total number of citations that are attributable to the four top-cited publications that appeared in 2007–2012 during the 2007–2012 years or the instantaneous citations for the Nobel group; the total number of citations attributable to the four top-cited publications of 2001–2006 during the 2001–2006 period only; the citations attributable to the four top-cited publications that appeared in 2001–2006, where citations are collected over the 2001–2012 period.

The results are quite striking. While the ‘Top US’ benchmark still dominates, our Nobel benchmark now performs significantly better than all real life departments on cumulative citations and, overall, on instantaneous citations. Nobel-quality research appears to be appreciated rapidly enough to yield sizeable numbers of citations over a fairly short period of time. In our view, this additional discriminating power of citations with respect to the Nobel benchmark is a strong argument for introducing citations-based measures alongside the more traditional assessment of output weighted by the quality of the outlet, as it allows a second of our benchmarks to distinguish itself where the pure outlet-based ranking did not. Indeed, we also note that while the ranking of UCL, Warwick and Sheffield remains unchanged, Bocconi changes its relative ranking considerably using cumulative citations. Hence, more than just the Nobel group is affected by incorporating a citations measure.

5.2. Reputation

Table 5 displays the metrics for each of the universities, or groups thereof, in our reputation sample, with the highest ranked at the top and the lowest ranked at the bottom; the various metrics are listed in the columns. ‘Top 1’ refers to the average rating of each of the mid-career faculty members on a scale of 1–4 for either the period 2001–2006 (Top 1a) or the period 2007–2012 (Top 1b). ‘Sc5’ refers to the average rating when we take the best four publications of each faculty member and evaluate them on a 1–5 scale. ‘Cite’ refers to the citations received during the corresponding period by the four papers that receive the most citations during that period, and ‘Cumul’ is the number of citations received from 2001 to 2012 by the top 4 most cited publications of the 2001–2006 period.

Table 5.

Metrics for reputation sample

| Department | Top 1a | Top 1b | Top 2a | Top 2b | Top 4a | Top 4b | Sc5a | Sc5b | Cite a | Cite b | Cumul |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Top US | 3.96 | 3.95 | 3.94 | 3.9 | 3.82 | 3.7 | 4.42 | 4.27 | 610 | 807 | 2,350 |
| Columbia | 3.9 | 3.83 | 3.86 | 3.65 | 3.71 | 3.37 | 4.23 | 3.78 | 521 | 445 | 1,907 |
| Michigan | 3.74 | 3.79 | 3.5 | 3.7 | 3.15 | 3.35 | 3.43 | 3.63 | 207 | 284 | 809 |
| Boston U + Caltech | 3.91 | 3.59 | 3.76 | 3.49 | 3.35 | 3.4 | 3.66 | 3.82 | 180 | 275 | 739 |
| Carnegie Mellon + Texas | 3.6 | 3.38 | 3.52 | 3.23 | 3.18 | 2.83 | 3.43 | 2.93 | 150 | 126 | 470 |
| Michigan State University | 3.27 | 3.62 | 3.15 | 3.44 | 2.75 | 2.88 | 3.1 | 2.98 | 72 | 179 | 316 |
| Boston College + Dartmouth | 3.6 | 3.55 | 3.41 | 3.38 | 3.08 | 3.11 | 3.29 | 3.3 | 273 | 248 | 915 |
| Indiana + UC Santa Barbara | 3.66 | 3.64 | 3.54 | 3.51 | 3.11 | 3.17 | 3.2 | 3.35 | 122 | 181 | 481 |
| Vanderbilt + Pittsburgh | 3.8 | 3.66 | 3.43 | 3.45 | 2.95 | 3.23 | 3.09 | 3.3 | 114 | 155 | 372 |
| Colorado + Arizona | 3.61 | 3.64 | 3.42 | 3.52 | 3.04 | 3.08 | 3.22 | 3.24 | 165 | 180 | 589 |

Notes: We list the sample university departments from top to bottom by reputation ranking, starting with the top four US departments aggregated together, and then list their scores on a variety of metrics: Top 1, Top 2, Top 4, Top 4 evaluated on the 1–5 point journal output quality scale (Sc5), within-period citations (Cite) and cumulative citations (Cumul). “a” corresponds to the period 2001–2006, and “b” to the period 2007–2012.

The most striking observation from this table is that there is a significantly larger range in the citation measures than in the ‘journal quality’ metrics. As before, the range of the journal-based metric increases with the number of papers per individual that we use and is also significantly higher when we rank the journals on a five-point scale. By contrast, there is no striking difference between the range we observe using ‘instantaneous’ citations (Cite a and Cite b) and cumulative citations.

The rankings implied by the ‘journal quality’ metrics are compared with the QS reputation-based rankings in Table 6, where the schools are ranked in rows by their reputation standing from top to bottom and the metrics are listed across the top. If the reputation ranking and the metric give the same answer, then the entries in each column should move from 1 to 2 to 3 and so on as one moves down the table.

Table 6.

Journal-based metrics rankings for the reputation sample

| Department | Top 1a | Top 1b | Top 2a | Top 2b | Top 4a | Top 4b | Sc5a | Sc5b |
|---|---|---|---|---|---|---|---|---|
| Top US |  |  |  |  |  |  |  |  |
| Columbia |  |  |  |  |  |  |  |  |
| Michigan |  |  |  |  |  |  |  |  |
| Boston University + Caltech |  |  |  |  |  |  |  |  |
| Carnegie Mellon + University of Texas |  | 10 |  | 10 |  | 10 |  | 10 |
| Michigan State University | 10 |  | 10 |  | 10 |  |  |  |
| Boston College + Dartmouth |  |  |  |  |  |  |  |  |
| Indiana + U.C. Santa Barbara |  |  |  |  |  |  |  |  |
| Vanderbilt + Pittsburgh |  |  |  |  |  |  | 10 |  |
| Colorado + Arizona |  |  |  |  |  |  |  |  |
| Spearman rank correlation (vs. reputation) | 0.31 | 0.407 | 0.673 | 0.4667 | 0.818 | 0.636 | 0.842 | 0.649 |
| t-test | 0.92 | 1.26 | 2.57 | 1.49 | 4.02 | 2.33 | 4.42 | 2.41 |

Spearman rank correlation between the 2001–2006 and 2007–2012 orderings of the same metric (t-statistic in parentheses): Top 1: 0.665 (2.52); Top 2: 0.534 (1.93); Top 4: 0.673 (2.57); Sc5: 0.649 (2.41).

Notes: We list the sample university departments from top to bottom by reputation ranking, starting with the top four US departments aggregated together, and then list the ranking we obtain by calculating their score on a variety of metrics: Top 1, Top 2, Top 4, and Top 4 using the 1–5 point journal output quality scale (Sc5). At the bottom of the table we calculate first the rank correlation between reputation and each metric (i.e., each column) and, on the last line, the rank correlation between the orderings of the same metric in two different time periods (e.g., between Top 1a and Top 1b). “a” corresponds to the period 2001–2006, and “b” to the period 2007–2012.

The next to last line of the table shows the Spearman rank correlation coefficient between the reputation ranking and the ranking implied by the corresponding metric. When a given metric is available for both the 2001–2006 and the 2007–2012 periods, we also compute the rank correlation between these orderings to get an idea of how stable the ranking might be over time. Clearly, given the small number of observation points, the standard errors involved are large. Differences in correlation coefficients should therefore be taken only as broadly informative.
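For concreteness, the following minimal Python sketch computes a Spearman rank correlation and the associated t-statistic for two orderings of ten departments; the scores below are illustrative placeholders, not the values underlying Table 6.

```python
from math import sqrt

def spearman_rho(x, y):
    """Spearman rank correlation between two score vectors.

    Scores are converted to ranks (1 = highest) and the Pearson formula
    is applied to the ranks; ties are ignored for simplicity.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: -v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean = (n + 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

def t_stat(rho, n):
    """t-statistic for the null of zero rank correlation with n observations."""
    return rho * sqrt((n - 2) / (1 - rho ** 2))

# Illustrative scores only: a reputation measure and a hypothetical metric.
reputation = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]      # higher = better reputation
metric = [9.5, 9.0, 7.1, 8.2, 5.0, 6.3, 4.4, 2.1, 3.3, 1.0]

rho = spearman_rho(reputation, metric)
print(round(rho, 3), round(t_stat(rho, len(reputation)), 2))
```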

We see that, although the metrics-based rankings closely track reputation at the top of the distribution, the discrepancies in the ranking of institutions below the twentieth place are such that the overall rank correlation is fairly weak when we consider only one or two publications per faculty member. The correlation is significantly stronger when we look at four publications on a four- or five-point scale, confirming the conclusions drawn from the other two benchmarks. Even in these cases, however, the metrics track the reputation-based ranking better at the top of the distribution. 23 As discussed in Section 5, we find these results comforting. They indicate that our metrics are in line with reputation over a range where reputation might well be accurate, but bring new information over the range where experts likely have fairly little knowledge of the institutions that they are asked to evaluate. The table also indicates that the rankings implied by journal-based metrics are only moderately stable over time. We will come back to this issue in Section 6.

Similar observations hold for Table 7, where the citation-based rankings are displayed against reputation-ranked schools. Again, reputational and metrics-based rankings are remarkably similar at the top of the distribution but diverge significantly farther down. Surprisingly, the ranking implied by within-period citations is rather insensitive to the citation window. The rank correlation between the ranking based on within-period citations for 2001–2006 and the ranking based on citations to the same 2001–2006 publications but counted over 2001–2012 is remarkably high at 0.9879 (t = 18), 24 suggesting that, in this sample at least, counting the cumulative citations of older publications might not be necessary to achieve a satisfactory ranking.

Table 7.

Citations-based rankings for the reputation sample

Department Cites 01–06 Papers 2001–2006 Cites 07–12 Papers 2007–2012 Cites 01–06 Papers 2001–2012 
Top US 
Columbia 
Michigan 
Boston University + Caltech 
Carnegie Mellon + University of Texas 10 
Michigan State University 10 10 
Boston College + Dartmouth 
Indiana + U.C. Santa Barbara 
Vanderbilt + Pittsburgh 
Colorado + Arizona 
Spearman rank correlation with reputation 0.673 0.721 0.636 
t-test 2.57 2.94 2.33 
Spearman rank correlation between columns 1 and 2 (t-test) 0.855 (4.65)

Notes: We list the sample university departments from top to bottom by reputation ranking, starting with the top four US departments aggregated together, and then list the ranking we obtain by calculating their score on a variety of metrics: the average, over all individuals in a department, of the citations of the four top cited papers published during 2001–2006, where citations are collected over 2001–2006 only; the same measure using publications appearing in 2007–2012 with citations collected over 2007–2012; and the citations of the four top cited papers published during 2001–2006, where citations are collected over 2001–2012. At the bottom of the table we report the rank correlation between reputation and each metric (in other words, each column). The final rank correlation is between the same metric in two different periods (i.e., columns 1 and 2).

The level of correlation between reputation and our bibliometrics-based rankings hides some striking variations at the level of individual institutions. For example, if we focus on Top 4 and citation measures, Carnegie Mellon and the University of Texas jointly underperform quite significantly compared with their reputational ranking, especially for the more recent 2007–2012 period. The gap between reputation and bibliometrics-based performance seems even wider for Michigan State. By contrast, Colorado and Arizona punch significantly above their reputational weight. Finally, the variation between rankings based on journal-based metrics and those based on citations confirms that there might be value in placing some weight on both types of measure. For example, Boston College + Dartmouth perform much better when citations are used than with journal-based bibliometrics.

We also want to know whether journal-based and citation-based bibliometrics yield near identical rankings or whether there are reasons to believe that each type of indicator captures somewhat different aspects of research excellence. Table 8 shows the Spearman rank correlation coefficients between journal-based measures for 2007–2012 on the one hand and, on the other, either within-period cites for 2007–2012 or cumulative cites for publications from 2001–2006 (which would be used in a 2007–2012 assessment exercise).

Table 8.

Rank correlation between journal-based and citation-based metrics

Journal-based metric Within period citations 2007–2012 Cumulative citations 2001–2012 
Top 1 0.632 0.474 
Top 2 0.721 0.624 
Top 4 0.83 0.685 
Sc5 0.894 0.748 

Notes: We list in the left-hand column four different metrics based on publication outlets: the Top 1, Top 2, and Top 4 publications rated on a four-point scale, and the Top 4 publications rated on a five-point scale (Sc5). The right-hand columns indicate two citations-based metrics: citations during 2007–2012 to the top four cited papers that also appeared in 2007–2012, and citations during 2001–2012 to the top four cited papers that appeared in 2001–2006. The cells report the rank correlations between the row and column metrics.

We see that the correlation between the rankings implied by journal-based metrics and the rankings implied by citations-based metrics increases as the number of submissions included in the first type of metric increases. This, however, might be expected: not only have we already seen that journal-based metrics with more publications perform better with respect to our two previous benchmarks, but our citations-based metrics are themselves based on four publications. The rank correlation is also stronger if the quality of output is rated on a five-point scale.

Although the rank correlation between the two types of metrics is high when each individual submits four publications, it is not perfect, suggesting again that there might be merit in relying on both types of measure: over the course of the discussion of the three benchmarks, we have seen that citations pick out certain benchmarks that outlet rankings may miss, although outlet rankings do a relatively good job of picking out benchmarks if enough publications are included. As the correlation between within-period and cumulative citations for the same publications is nearly one, the weaker correlation between the cumulative citations metric and a journal-based metric must be attributed to the fact that the cumulative citations measure refers to publications from a previous period. There is, in other words, implicit smoothing in the cumulative citations measure. Whether this is desirable depends in part on how stable our metrics are over time, an issue to which we return in the next section.

6. COST AND OTHER PRACTICAL CONSIDERATIONS

The previous sections have shown that simple journal-based or citation-based metrics do a good job of capturing research excellence compared with recognized benchmarks of excellence such as Nobel Prize winners or world-leading departments. However, this does not necessarily imply that a simple bibliometrics-based approach would do appreciably better than a process where papers are rated based on peer review, which we could think of as an implicit journal ranking corrected by an independent assessment of each submitted paper. To make a strong pro-bibliometrics argument, we must also show that the approach is sufficiently less costly than the current framework. In what follows we show that, although there is considerable uncertainty about how the costs of the current process should be allocated between the various facets of the assessment (e.g., publications versus impact case studies), it seems likely that an evaluation relying on bibliometrics would have significantly lower costs both for the assessing agency and for the academic departments concerned. In order to get a rough idea of cost, we need to specify a procedure. This serves only as a basis for our calculation and is not a recommendation of how to proceed: it is one way of getting a ballpark estimate. For many of the tasks, we use the time we needed for our own calculations as a guess at the time commitment involved. Again, this only gives a rough idea.

We briefly discuss a few important implementation issues: which metrics might be used and how they could be combined, the need to tailor the precise metrics used to each research field and whether the journal rankings on which the metrics rely should be fixed significantly in advance of the submission deadline.

6.1. Cost

Let us consider the likely cost of running a system where each faculty member is required to submit two (not necessarily disjoint) sets of (say four) publications. The first set will be evaluated based on a publicly available ranking of journals (as well as rules for books and book chapters), while the second set will be used to compute the author’s (Google Scholar) citations for the relevant period. The submissions are then checked for accuracy and departmental scores are computed along the lines that we have done in this paper using simple bibliometrics.
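As an illustration of the last step, the following minimal Python sketch shows one way departmental scores could be computed from such submissions; the data structure, names and numbers are hypothetical, and treating the journal-rated and citation sets symmetrically is an assumption made for the example, not a recommendation.

```python
from statistics import mean

# Hypothetical submission records: each faculty member submits a set of
# journal-rated publications (scores on a 1-5 scale) and a set of
# publications with their citation counts for the review period.
submissions = {
    "Smith": {"journal_scores": [5, 4, 4, 3], "citations": [120, 45, 30, 12]},
    "Jones": {"journal_scores": [4, 4, 3, 2], "citations": [60, 22, 15, 9]},
}

def department_scores(subs, n_outputs=4):
    """Average journal-quality and citation scores across submitted staff.

    Each person contributes the mean rating of their top `n_outputs`
    journal-rated papers and the sum of citations to their top
    `n_outputs` most cited papers; both are then averaged over the
    department.
    """
    journal_part, citation_part = [], []
    for person in subs.values():
        top_rated = sorted(person["journal_scores"], reverse=True)[:n_outputs]
        top_cited = sorted(person["citations"], reverse=True)[:n_outputs]
        journal_part.append(mean(top_rated))
        citation_part.append(sum(top_cited))
    return mean(journal_part), mean(citation_part)

print(department_scores(submissions))  # (3.625, 156.5) for the toy data above
```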

On the departmental side, such a process would involve two main sources of cost. To start with, each faculty member would have to determine the two sets of publications that would yield the highest score and enter them online. This would be unlikely to take more than an hour per faculty member (it takes us 20 min per author). Next comes the management time needed to decide which faculty members to submit – if such freedom is given – and then file the departmental submission on the assessing organization’s website. Counting on a committee of five people meeting for 5 h, plus 5 h to enter the submission, should be more than enough, as the only relevant decision to be taken is where to apply the submission cut-off. This gives us a cost per department of 30 h plus 1 h per faculty member.

The assessing institution would incur four types of cost. First, it would have to set up a website capable of handling submissions securely. Second, submissions would need to be verified. This could be done fairly easily by designing a program that checks sources such as Google Scholar, the Web of Science and journal websites. Still, to obtain an upper bound, let us assume that submissions are checked manually, taking 1 h for every three faculty members. The assessing organization must then compute departmental scores and announce them. This functionality could of course be automated; let us count 2 h per department for ex post checks. This leaves us with one crucial component: the journal ranking.

We have used publicly available rankings. Ensuring that the chosen ranking is updated every 6 years would cost £200,000 per field if we count quite generously the time needed either to build a new ranking or to borrow and update an existing one. If desired, such a ranking could of course be shared across a number of countries performing bibliometrics-based evaluations. We understand that each country might want to amend the type of ranking that we have used, for example to give more weight to publications in the national language, 25 but, given the possibility of sharing the cost of developing the basic ranking, £200,000 per field, per assessment period, seems a reasonable estimate.

In order to complete our rough estimate of the total cost of running a research assessment exercise based on our kind of metrics, we need an hourly wage, the number of departments to be reviewed and the average departmental size. We take an hourly wage of £40 (including overheads). This corresponds roughly to a total yearly salary (including all contributions) of £70,000 divided by 1680 h (48 weeks at 35 h a week). For the fact-checking and data handling by the research assessment organization we use a lower figure of £30 an hour.

Overall, then, each field involves a fixed cost of £200,000 to update the journal list, each department within a field adds a fixed cost of £1260 (30 h in the department at £40 plus 2 h at the reviewing organization times £30) and each faculty member within a department increases the cost by £50 (1 h at £40 of the faculty member’s time plus a third of an hour at £30 for the reviewing organization to check the data). The formula that we use to cost the bibliometric-based assessment for a given field is then:  

Total cost = £200,000 + (£50 × number of faculty members submitted) + (£1,260 × number of departments)
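A minimal Python sketch of this formula follows; the inputs used in the example are illustrative, not the REF submission figures behind Table A2.

```python
def field_cost(faculty_submitted, departments,
               journal_list_cost=200_000,
               cost_per_faculty=50,
               cost_per_department=1_260):
    """Per-field cost of a bibliometrics-based assessment.

    Implements the formula in the text: a fixed cost for updating the
    journal ranking, plus a per-faculty and a per-department component.
    """
    return (journal_list_cost
            + cost_per_faculty * faculty_submitted
            + cost_per_department * departments)

# Illustrative figures only: a field with 40 departments averaging
# 30 submitted faculty members each.
print(field_cost(faculty_submitted=40 * 30, departments=40))  # 310400
```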

The number of departments and faculty members involved in our estimates are obtained from the submission data for the latest UK REF exercise. 26 While Economics departments chose to submit only about 80% of faculty members, many other disciplines submitted 100%.

Of course, when performing this costing exercise, we are assuming that all fields would be amenable to the type of bibliometrics-based approach that we suggest; however, since we do not have separate estimates for the economics-related cost of the current system, comparing such implied total costs is the only possible approach for us.

Our cost estimates per field can be found in Table A2 in the Appendix. These costs do not include the cost of developing the appropriate website and software for the assessment organization or the relevant overhead of the organization entrusted with conducting the assessment exercise. We note, however, the limited scope of the task if a pure bibliometrics measure is used.

The final cost figure is £12,209,800. Our estimate of a total cost just in excess of £12 million per assessment exercise is of course only a rough guide. On the one hand, we are fairly confident that our estimates represent an upper bound if all fields could use the simple bibliometrics that we have explored. In particular, economics has a relatively large number of academic journals, so the renewable cost of £200,000 per exercise to update the journal rankings is likely to be on the high side. On the other hand, there are clearly some fields – say Arts and Design or Architecture – for which an exclusive reliance on simple bibliometrics would be inappropriate. The evaluation methods best suited to such fields might well be significantly costlier. Still, compared with the available estimates of the cost of the last UK REF, this figure seems appealingly small.

Of course, part of the total cost of the REF comes from elements that we have not considered, such as checking whether the rules governing ‘special circumstances’ have been followed or designing and evaluating a number of ‘impact’ case studies for each department. Still, the difference in the order of magnitude of the costs involved makes it hard to believe that significant savings could not be achieved by switching to a bibliometrics-based approach. We would add that, although we sympathize with the idea of measuring ‘impact’, case studies are only one way of doing so.

6.2. Other practical considerations

In the previous sections, we looked at a number of possible bibliometrics. Among the metrics based on the perceived quality of academic journals, we found that our benchmarks of excellence stood out more strongly from actual departments if each faculty member must submit a larger number of papers (four or more did better in our work) and if journals are graded on a five-point scale rather than on the four-point scale currently in use, at least in the United Kingdom. However, the discriminating power of journal-based metrics is only one of the considerations that should enter into the choice of the number of publications to be submitted for a given assessment period. Incentives also matter. Besides being likely to provide only a rather inaccurate picture of relative performance, an evaluation relying on one publication per faculty member per 6-year period would hardly motivate the non-intrinsically driven members of the profession to undertake significant research. At the other end of the spectrum, using the total quality-weighted output over the assessment period might easily lead faculty members to put excessive weight on research – which would be so clearly monitored – relative to other, harder to evaluate, tasks. Overall, then, we believe that our findings rather support the current UK approach of looking at four publications per period, an intermediate amount, although a five-point scale improves discrimination in our work.

For citations we found that, while citations measured within assessment periods and citations for the same publications but measured over two assessment periods appear to be closely correlated (Section 5.2), we also saw that cumulative citations perform significantly better as a metric when evaluated in light of the Nobel benchmark. In practice, however, a given research evaluation exercise would only have access to instantaneous citations for the papers published during the assessment period and the cumulative citations earned by publications from the previous period. This means that putting weight on cumulative citations has two effects. On the one hand, it seems that it would help achieve a more accurate rating of research performance but, on the other hand, it would also mean that a given assessment exercise would relate both to current output and to older output. If such implicit ‘smoothing’ of performance over time is thought to be desirable, then cumulative cites should be used, with or without instantaneous cites. It is only if policymakers want to focus narrowly on recent research performance that cumulative cites should be ignored.

We also saw that journal-based and citation-based metrics – while well correlated – can sometimes tell rather different stories, further suggesting that some weight should be put on each of these two types of indicator. In other words, different benchmarks stand out under the two different types of measure. Other work has investigated combining journal- and citation-based metrics into a single indicator; on this topic, the reader is referred to the careful study of Sgroi and Oswald (2013). We also note that there is a very large literature on how to design composite indicators. 27 Building a composite indicator based on just two or three easily available and understandable components should be well within the capabilities of specialists in this area. Our contribution is to point out which benchmarks are well picked up by a journal-based or a citation-based metric, and which of the various journal- or citation-based measures might be useful to include.
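To fix ideas, here is a minimal sketch of one such composite: a weighted average of a min–max normalized journal-based component and a citations-based component. The weights and normalization ranges are illustrative assumptions, not a recommended design.

```python
def composite_index(journal_score, citation_score, weights=(0.5, 0.5),
                    journal_range=(1.0, 5.0), citation_range=(0.0, 1000.0)):
    """Weighted average of two min-max normalized components.

    The weights and the normalization ranges are illustrative choices;
    the exact construction is left to the composite-indicator literature.
    """
    def normalize(x, lo, hi):
        return (x - lo) / (hi - lo)

    j = normalize(journal_score, *journal_range)
    c = normalize(citation_score, *citation_range)
    return weights[0] * j + weights[1] * c

# A department averaging 3.4 on the five-point journal scale and
# 450 within-period citations per head (hypothetical values).
print(round(composite_index(3.4, 450.0), 3))  # 0.525
```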

As we have already recognized, the specific, simple bibliometrics that we have examined might not be as suitable for other fields of research as for economics. In this sense, each subject area should undertake its own study to identify a simple set of bibliometrics that performs well compared with a relevant benchmark of excellence. This flexibility does not create any additional difficulty if the purpose of a research assessment is, for example, to allocate funds that have already been divided across fields to the departments operating in these fields. However, research assessments might also aim at comparing research performance across fields in order to guide the distribution of research funds across research areas. In such a case, having different metrics for different subjects is unhelpful. There is, however, a possible solution. Rather than compare subjects directly based on the chosen metrics, one could compare the distance between national departments and internationally recognized benchmarks of excellence. The closer national departments are to the ‘excellence frontier’, the stronger is the nation’s expertise compared with other research areas. The benchmarks that we propose here are suggestive but not the only ones that could be chosen.
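As a rough illustration of this normalization, the sketch below expresses each department's score as a fraction of its field's benchmark score, so that fields with different raw metrics become comparable; the field names, scores and benchmark values are hypothetical.

```python
def distance_to_frontier(department_scores, benchmark_score):
    """Express each department's score relative to its field's benchmark.

    A value close to 1 means the department roughly matches the
    international benchmark on the chosen field-specific metric.
    """
    return {dept: score / benchmark_score
            for dept, score in department_scores.items()}

# Hypothetical numbers: an economics field benchmarked against a top-US
# aggregate score of 3.8, and another field whose own metric runs on a
# 0-1 scale benchmarked at 0.92.
economics = distance_to_frontier({"Dept A": 3.1, "Dept B": 2.6}, benchmark_score=3.8)
other_field = distance_to_frontier({"Dept C": 0.81, "Dept D": 0.55}, benchmark_score=0.92)
print(economics, other_field)
```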

Finally, when setting up an actual bibliometrics-based process, one would have to decide how far in advance of the submission deadline the journal rankings on which part of the overall metric relies should be established and made available. Clearly, the ranking must be available by the time departments file their submissions. Not doing so would merely introduce guessing, which could increase costs significantly (as individuals and departments agonize about where to submit their work) without any obvious benefit. However, making the ranking available much earlier in the assessment period would also have significant drawbacks. First, research would not be assessed based on the most up-to-date ranking possible. Second, a list that is available from the start might have a disproportionate influence on submission behaviour and might indeed end up ‘ossifying’ the journal rankings, making it harder for new outlets to earn their stripes. Overall, then, we would favour using a ranking computed and released at an intermediate date, for example a year before the submission date.

6.3. Stability of rankings

As metrics-based research evaluations would likely be used to make significant policy decisions, such as the allocation of research funds, it is important to ensure that such an approach does not expose academic institutions to excessive risk. While one clearly wants a research assessment tool that is responsive to actual change in the research quality of various academic units, one must also consider that even research activities of constant quality might produce rather different ratings depending on the review period considered. This is especially true in economics, where the rate of publication per faculty member is not high. With relatively few publications to choose from, slight variations in the timing of research success and publication can result in quite different ratings, at least at the individual level. The relevant issue then is whether academic units are generally of sufficient size to ensure that such idiosyncratic variations are ‘averaged out’ within a single 6-year assessment period.

We saw in Section 5.2 that the rank correlation between the ordering implied by the same journal-based or citation metrics over two consecutive review periods was only moderately high. Unfortunately, this statistic does not allow us to differentiate between the effect of genuine changes in relative research performances and the effect of the type of accidental factors that we have just described.

To help separate these two effects, we generated 20 Nobel departments, determined their ratings, and then performed the same calculations on the same 20 departments ‘aged’ one period. Clearly, some members of our fictitious departments ‘retire’ between assessment exercises, so we must replace them with new members if we wish to maintain the same department size. If we replace the retired members with randomly selected juniors, so that the age distribution also remains the same across the original and the ‘aged’ department, we can illustrate the change in rating from one period to the next for 20 simulated departments, each with 48 randomly selected members. We did this for the ‘Top 4’ publications on a 1–4 scale. We then recorded the position that any given Nobel department would have occupied among the ten US institutions that we used in Section 5.2. The idea is to see what the effect of purely ‘accidental’ variation in research output would be on the ranking of a department if this department were compared with a representative set of real-life academic units whose research output is kept constant. The results are shown in Table 9.
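The following minimal Python sketch reproduces the logic of this exercise with synthetic data: members' per-period 'Top 4' scores are drawn at random, the retirement-and-replacement step is represented only crudely, and the ten benchmark scores are stand-ins rather than the values in Table 5.

```python
import random

random.seed(1)
N_PERIODS, DEPT_SIZE, N_DEPTS = 6, 48, 20   # six career periods, 48 members, 20 departments

def new_person():
    """A career: one synthetic 'Top 4' score (1-4 scale) per period, starting at period 0."""
    return ([round(random.uniform(2.0, 4.0), 2) for _ in range(N_PERIODS)], 0)

# Hypothetical 'Top 4' scores for the ten real comparison departments,
# ordered from strongest to weakest.
benchmark = [3.8, 3.7, 3.3, 3.3, 3.2, 3.1, 3.1, 3.0, 2.9, 2.8]

def rank(score):
    """Position a simulated department would take among the ten real ones (1 = top)."""
    return 1 + sum(1 for b in benchmark if b > score)

def dept_score(members):
    """Departmental 'Top 4' score: average of members' scores in their current period."""
    return sum(scores[period] for scores, period in members) / len(members)

for _ in range(N_DEPTS):
    members = [new_person() for _ in range(DEPT_SIZE)]
    # Start each member at a random point of their career.
    members = [(s, random.randrange(N_PERIODS)) for s, _ in members]
    before = dept_score(members)
    # Age everyone one period; members who run out of periods 'retire'
    # and are replaced by a fresh random junior.
    aged = [(s, p + 1) if p + 1 < N_PERIODS else new_person() for s, p in members]
    after = dept_score(aged)
    print(rank(before), rank(after))
```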

Table 9.

Variation in simulated Nobel department’s score and ranking over 20 repetitions

Original population, 48 member department Aged population, 48 member department 
Score Ranking Score Ranking 
3.21 3.13 
3.05 3.09 
2.52 11 3.3 
2.72 11 2.93 
2.48 11 2.84 10 
2.98 3.25 
2.91 3.06 
2.86 10 3.07 
2.97 2.98 
2.91 3.06 10 
2.93 2.94 
2.56 11 2.58 11 
2.72 11 2.71 11 
3.04 3.02 
2.84 10 2.87 10 
3.05 2.53 11 
3.07 3.01 
2.5 11 2.81 11 
2.82 11 2.82 11 
3.13 2.78 11 

Notes: In the left-hand column we list the ‘Top 4’ metric scores of 20 simulated departments. We list the ranking of each simulated department compared to our reputation sample in the next column. The third column takes the same individuals and ages them one 6-year period, then recalculates the ‘Top 4’ metric for this new simulated department. Any individual who becomes too old for the sample is replaced by a random new hire drawn from the youngest cohort in the sample, so that the total department size remains fixed. The new rankings are calculated for the 20 departments and are listed in the right-hand column.

We see that, when the Nobel departments are ranked against actual departments of uneven research quality, the variation in ranking due to ‘natural’ variation over time in the Nobel departments is not very large, with 15 out of 20 departments retaining the same rank or moving up or down by at most one place.

Still, the fact that bibliometrics-based rankings can change – albeit moderately – without an underlying change in the research quality of the assessed units should in principle foster a more relaxed attitude towards actual rankings, since it indicates that a department can easily lose a few places even in a world where, by definition, the department cannot possibly have done anything ‘wrong’ between successive assessments. Since such variance is hardly a special attribute of bibliometrics-based approaches, our findings suggest a degree of caution in how one acts on the results of research rankings, as material changes can occur even if there is absolutely no change in the underlying quality of the population. This variability is an additional argument in favour of putting some weight on a cumulative citation indicator which, as explained in Section 5, introduces some smoothing of recorded performance from one evaluation period to the next.

7. CONCLUSION

Research assessment exercises, and the incentives they create, can have a profound effect on the way researchers spend their time and how they choose to express their ideas. For this reason, it is important to get the design of research assessment right. Exactly what that entails depends on the aims of the system, which differ across countries and have differed across time. We take a particular view of these exercises, suggesting that one aim is to get high quality research publications as output. How to measure that quality is the specific subject of our paper: if researchers are to direct their work in response to the research assessments, and if high quality publications are a goal, we would want quality to be measured accurately and transparently so that effort is not wasted in generating outputs that do well by a faulty or difficult to understand metric.

A challenge is to find a set of outputs that we are willing to take as high quality and then to test the system we develop to see whether this set is identified as high quality by the metric we create. We address this challenge by suggesting three benchmarks as our standard of high quality. We use these to make a series of points about research assessment, using a mechanistic approach of evaluating research output with journal rankings and citation counts. In other words, we combine this set of benchmarks with a possible ‘low cost’ system of evaluating research, hoping to get a sense of how well this ‘low cost’ system works.

We obtain several results. First, we generally find that when we use a benchmark of the aggregation of four top US departments as our standard of quality, and when we use metrics involving publication outlet quality only, we are better able to distinguish this benchmark by using more quality rated outputs in each evaluation period and by using a finer ratings scale for each assessed output.

When we apply the same sets of bibliometrics to a second benchmark, a fictitious department composed solely of Nobel Prize winners, we observe that this population does not stand out from real departments on a metric that uses ratings of publication outlets only, regardless of how many outputs we use or how fine the ratings scale. On the other hand, if we use a cumulative citations measure of output, in other words a measure that cumulates citations to publications from a given evaluation period over a longer time span, this measure does distinguish the Nobel group. It also distinguishes the set of top US schools. If we use citations over a short period, the Nobel Prize group does not stand out, although the top US schools do.

Reputation-based measures are well correlated with the results of these two types of bibliometrics when we look only at the very top schools, but are only moderately correlated with them for institutions that are not at the top of the ranking, suggesting – unsurprisingly – that reputation measures only ‘mimic’ citation or publication outlet metrics well for those institutions that are relatively well known. In this sense, the ‘market’ may not have sufficient information to distinguish current quality at all levels of institutions. This would argue for a bibliometrics-based approach to research evaluation rather than simply relying on readily available reputation measures.

Finally, we observe that for a Nobel population that – by definition – does not change over time, we get moderate variance in the ranking of the fictitious department composed of members of this population over assessment cycles. In other words, if we create a fictitious department from this sample, rate it in one assessment period, then age the same population and apply the same metric to it, the ranking of the department does not stay the same. This variance in ranking, which is independent of any change in the population or its management – by definition – suggests a degree of caution in how one acts on the results of research rankings, since they can change without any material change in the underlying population generating the outputs. This sort of movement may argue for some smoothing of performance indicators, which we note a cumulative citation measure provides. It may also argue for funding bodies not reacting too strongly to small changes in rankings, as these do not necessarily indicate a true decrease (or increase) in underlying quality.

The metrics used in this paper are clearly cheaper to implement than a system of peer review. If the metrics are simple and transparent, they have the additional benefit of being easy enough to understand that they can affect behaviour. We have performed some back-of-the-envelope cost calculations here that suggest a substantial saving compared with some actual research assessments, although we did not attempt to replace with metrics, or fully cost, all the different elements of the most recent UK research evaluation framework. In particular, we only attempted to conduct an evaluation of research outputs, not research environment or impact, so only the cost of evaluating certain elements by a more mechanistic approach is reflected in our back-of-the-envelope figures.

Finally, the simple bibliometrics we use here may not do a good job of evaluating the performance of any given individual: our work has been at the level of aggregating a large number of individuals into a ‘department’ or ‘unit of assessment’ and observing whether bibliometrics do well at picking out benchmarks with these aggregate groups. While a journal ranking may do a poor job of reflecting the quality of any individual paper submitted to that journal, it may do a good job of reflecting overall quality of the journal. In the same way, the metrics we look at may do a good job of reflecting overall rankings of units of assessment but may do a poor job of reflecting the quality of any individual member of that unit of assessment. In this sense, any metrics-based approach may be a cheap way to get an overall research ranking for a country, but would be inappropriate to evaluate the work of an individual. The argument we make for bibliometrics is that they do a good enough job in the large to be used for overall rankings, not that they do a good job in any single specific case.

Discussion

Sascha O. Becker

University of Warwick

Rankings rule. We cannot and should not escape them. Attempts to fight against the existence of rankings are futile. Universities are supported by taxpayer money and/or tuition fees, and students and the public have a right to learn about the quality of departments. Of course, from the perspective of undergraduate students, the quality of teaching might seem to be the most important thing. The UK government recently announced the introduction of a ‘Teaching Excellence Framework’ to mirror the ‘REF’ to underline the importance of teaching excellence.

This paper focuses on research excellence, which matters to the extent that top researchers might be more likely to give better lectures. But of course, research excellence is extremely important in its own right. Research generates new knowledge, leads to innovations, and informs policy makers and the public.

Many countries have recently introduced research assessments. The most elaborate and costliest exercise is probably the UK’s REF. UK university departments have to submit four papers per researcher in a given six-year window, to be evaluated by a committee of wise (mostly senior) economists. Papers are assigned a rating of one to four stars. The idea of the REF is that a committee of experts will identify bad Econometrica papers and assign them fewer than four stars, and will find the odd four-star gem in the Journal of Bad Economics. Obviously, the REF is extremely costly because a group of excellent economists is kept busy for several months duplicating the work of the many referees who already refereed those papers before they were first accepted for publication. What if all this were just a complete waste of time and money because rankings based on journal quality or on citations of individual papers were as good as the work of the evaluation committee? Work by Bertocchi et al. (2015) shows that, in Italy, bibliometric evaluations yield rankings very similar to those from ‘informed peer review’.

This paper goes further and compares bibliometric indicators to three external benchmarks: the performance of a synthetic department of Nobel Prize winners; the performance of the top four US Economics departments (Harvard, MIT, Princeton, Stanford); and the performance of US departments according to ‘reputation rankings’. If bibliometric indicators do a good job of mimicking these three types of external benchmark, this gives further support to bibliometrics-based rankings of departments, which would cost only a tiny fraction of the cost of ‘informed peer review’.

This paper makes a convincing case that bibliometrics is indeed doing a good job. Bibliometrics will likely come out on top when considering the costs and benefits of the various alternative ways to construct rankings.

Given that bibliometrics comes out as a promising alternative to the REF and similar exercises around the world, I would have liked to see a little more discussion of how we can ensure that bibliometric indicators are not gamed. One important point in that regard is the role of journal impact factors. It strikes me as odd that the standard journal impact factor is, and remains, one that includes ‘journal self-citations’, that is, citations from journal X to journal X. This invites editors keen on increasing their impact factor to game the system, or it invites authors to cater to the presumed taste of the editor by citing the journal’s previous publications to show ‘good fit’ with that journal. The standard journal impact factor should be one excluding such ‘self-citations’. A journal’s work is arguably more relevant if it gets cited by authors in other journals. Chang, McAleer, and Oxley (2011) nicely document the difference between impact factors with and without journal self-citations.

This paper also argues that we should not base a department’s publication output only on journal rankings, but that we should also factor in citations to individual papers. This is absolutely key. There are far too many papers in Economics journals that get zero or close to zero citations, including in Top 5 Economics journals (see Laband 2013). If a set of referees found a paper good enough to pass the bar, but then no one cares to cite it, what use is the paper? Oswald (2007) argues quite succinctly: ‘… it is better to write the best article published in an issue of a medium-quality journal such as the OBES than all four of the worst four articles published in an issue of an elite journal like the AER. Decision-makers need to understand this.’

Another interesting measure of a department’s quality could be a department-level Hirsch index. While typically used at the level of individual researchers, it could be applied in the same way to a department’s research output: a department with an index of h has published h papers, each of which has been cited at least h times.
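A department-level Hirsch index of this kind is straightforward to compute from a pooled list of citation counts; a minimal sketch, with hypothetical citation counts, follows.

```python
def h_index(citation_counts):
    """Largest h such that at least h outputs have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Citations to every paper published by a (hypothetical) department
# during the review period, pooled across all faculty members.
department_citations = [210, 95, 60, 41, 33, 12, 9, 7, 3, 0]
print(h_index(department_citations))  # 7
```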

I would have been very curious to see a zero cost benchmark included in this paper. RePEc 28 is based on voluntary registrations of economists, but seems to have good coverage and generates a reasonable ranking of departments (within limits). The big advantage of a RePEc-type ranking would be that it gets updated on a more or less continuous basis. Would a compulsory registration of economists, combined with the algorithm-based rankings, be ‘good enough’ as a zero cost alternative?

Maia Güell

University of Edinburgh

Measuring the quality of academic performance is not an easy task and yet it is necessary for several reasons, including at least the allocation of public resources to research institutions. This is also crucially important because, in turn, resources can shape the way academics work and their output. Many countries in Europe have established different systems to evaluate the performance of academic institutions and some countries are seeking to reform them. In this paper, Pierre Régibeau and Katharine E. Rockett study different evaluation systems, which is highly relevant for both academics and policymakers.

A key aspect of this paper is that it highlights the costs of implementing different evaluation systems. In particular, it compares the well-known research assessment system in the United Kingdom to simple bibliometric systems. The system in the United Kingdom is attractive in several dimensions (e.g., it is peer-reviewed); however, the authors in this paper estimate a very high cost, over £1 billion in the last exercise, which raises the question of the efficiency of the system. This issue is highly relevant because some countries are seeking to implement systems that are similar to the one in the United Kingdom.

Bibliometric systems are attractive because they are readily available. However, their accuracy in measuring the quality of research is not obvious. In this paper, Régibeau and Rockett validate the accuracy of such bibliometric systems. Their idea is simple and neat: they apply the bibliometric system to real life departments as well as to three benchmarks of unquestionable high quality to see how these compare under their metric. These benchmarks are: (i) a fictitious department with only Nobel Prize winners, (ii) a leading department that reflects the top US departments by many rankings (pooling MIT, Harvard, Princeton, and Stanford), and (iii) top departments in terms of their reputation (based on the Times Higher Education and the Shanghai Academic Ranking of World Universities).

The authors use two families of bibliometrics: ranked publications and citations. They find that these work differently for the different benchmarks, highlighting different trade-offs. For instance, departments composed only of Nobel Prize winners do better than real EU departments in terms of long-term citations, but not in terms of publications or immediate citations. This may well reflect what the Nobel Prize rewards: one excellent idea, which could be a single publication that eventually attracts many citations.

The authors also find that top departments in the United States do better than departments of only Nobel Prize winners both in terms of publications and citations. It is worth reflecting on the US system. Obviously there are many things that work differently in the United States, but there is no explicit RAE, even if the National Science Foundation allocates a substantial amount of resources.

Overall, the simple and inexpensive bibliometric method proposed is attractive compared with the RAE in the United Kingdom. However, it is worth mentioning that when looking at the RAE more broadly, an additional positive aspect is its potential effect on the average quality of departments in the United Kingdom, as well as on the overall distribution.

Finally, in this paper the authors discuss publication practices over time and, in their analysis, they control for age and time periods. But the causality could also run the other way: when departments start implementing these assessments, it becomes more valuable to publish, especially in top journals. Again, this highlights the importance of designing evaluation systems carefully, as they have real effects on how researchers work and on their output.

Panel discussion

John Hassler made two points. First, he said that the Nobel committee’s way of thinking on what is award worthy has changed over time. According to him, compared with early decades, more recent awardees have narrower fields. Second, he said that the authors want to create an assessment system that provides incentives for departments to be like Nobel departments. He thought this is not in line with the current assessment system. Martin Ellison suggested using John Bates Clark medallists instead of Nobel medallists.

Jose Luis Peydro said he is not familiar with the British system but thinks that, in general, such systems are biased towards the top five journals in economics. The top three finance journals have higher impact factors than the majority of the top five economics journals, except the Quarterly Journal of Economics. He said there is a bias in economics against finance. Peter Egger also made two points. First, he said that it would be interesting to take into account the supply and demand potential for those top publications. He noted that the number of pages published in top journals has increased a lot in physics but much less so in economics. He suggested using this as the denominator and looking at the demand potential for the scarce space. Second, Egger said that if we think from the perspective of a social planner, there are creation and diversion effects, and the latter are much stronger for the United Kingdom. In particular, he pointed to the distinction between the creation of impact from top-journal publications and the reshuffling of people within the United Kingdom, which he thought of as a distortion.

Giulio Zanella said that the exercise would be more refined if the authors could create the synthetic Nobel department in a way that it matches the field composition of the target departments. Timo Goeschl said that Nobel Prizes are the inputs for the production function of the synthetic department. He criticized the assumption that the production function is a linearly separable combination of individual inputs.

APPENDIX

Data collection

Nobel Prize Sample: Much of the methodology is described in the text, so in this Appendix we restrict ourselves to information not provided there.

For the sample described in the text, we identify six periods of employment, finishing with the research evaluation period during which they reached 65 years of age. We then chose the middle four of these periods as the ‘middle years’ on which we base the work. Analysis for all six periods was discussed as part of the panel version of the paper, Régibeau and Rockett (2015).

We then collected for each Nobel Prize winner as complete a publication history as we could assemble, as described in the text, and rated each output. Once the RePEc, JSTOR, CV, and Google Scholar data were downloaded and assembled, we manually checked every entry to confirm authorship and eliminated any double-counting we could detect. In Google Scholar, this involved clicking on the entry, checking the first page of the article or chapter (or the front matter of the book), and then cross-checking with the other listings for any duplication. In contrast to our check of Google Scholar, we relied on JSTOR, the CV, and the RePEc listing to be accurate reflections of authorship.

We did not include in our count working or discussion papers, conference presentations or transcripts of speeches, reports that did not have significant academic content, newspaper and popular press output, internal documents, newsletters, and output that was not primarily directed at an academic and broadly economic audience. Hence, our focus was on economics output for academic consumption. Our method of judging whether an output was economic was to first check whether the output was listed in the Keele Ranking, on JSTOR using the Economics filter, or on RePEc. If the output was not included in any of these sources and if it did not have, at first look, primarily economics content, we did not count it, as we wished to count output that would broadly have contributed to a Nobel Prize in Economics as opposed to another discipline. We also wished to reduce the effect of differing citation or publication patterns across disciplines: we did not have a fully comparable rating system for many of the non-economics journals.

While this system was relatively mechanical for journals, for books we needed an additional filter since these are not ranked in standard journal rankings. We counted university-press books but excluded others. The only exception we made to this rule was if the Nobel Prize winner had cited a specific book as central to their contribution, where this book was not a university press book. In those cases, we included the output. There were very few such cases.

We used Hudson’s ‘chunky’ ranking listed in the Appendix to his 2013 paper, cited in the text. In other words, we classified journals as 4*, ‘probable 4*’, ‘possible 4*’, 3*, ‘probable 3*’, and so on down to a ranking of 1*. This required going beyond the published ranking to infer from the Keele Ranking a similar set of classifications. As a result, where a journal did not appear in Hudson (2013), we assigned rankings as follows: Keele ranking 3 = our ranking of 2; Keele ranking 2 = our ranking ‘probable 2’; Keele ranking 1 = ‘possible 2’. If the journal did not appear on the Keele ranking, we inferred a ranking based on the characteristics of the journal. Any journal that was not in English we ranked as a 1. New journals that did not appear on the Keele ranking were: American Economic Journal (ranked at 3), and Berkeley Electronic Press (ranked at 4 for Frontiers, 3 for Advances, 2 for Contributions, and 1 for Topics). Older journals that have evolved included Farm Economics and the Swedish Journal of Economics, both ranked at 3 as the closest match to the more recent forms of these journals. We ranked university-press books as 3. The Nobel Prize lecture/written output itself was ranked as ‘possible 4’.

If a journal has a shorter-articles section, such as the American Economic Review Papers and Proceedings, we reduced the mark. Similarly, we reduced the mark for notes, shorter discussions, and replies. In these cases, we reduced the output’s ranking by two levels (so that a 4 ranking for the American Economic Review would become ‘possible 4’). We excluded corrigenda.
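For concreteness, a minimal sketch of these rating rules follows, under the assumption that the 'chunky' scale can be laid out as an ordered list of levels; the journal names and lookup tables are placeholders rather than the rankings we actually used.

```python
# Ordered levels of the 'chunky' scale (higher index = better rating).
LEVELS = ["1", "possible 2", "probable 2", "2",
          "possible 3", "probable 3", "3",
          "possible 4", "probable 4", "4"]

# Fallback mapping from Keele ranks to our scale, as described in the text.
KEELE_TO_LEVEL = {3: "2", 2: "probable 2", 1: "possible 2"}

def base_rating(journal, hudson, keele):
    """Rating of a full-length article, following the fallback order in the text."""
    if journal in hudson:
        return hudson[journal]
    if journal in keele:
        return KEELE_TO_LEVEL[keele[journal]]
    return None  # would require a judgement call, as noted in the Appendix

def adjust_for_short_piece(rating, levels=LEVELS):
    """Notes, replies, and Papers-and-Proceedings items drop two levels."""
    idx = max(levels.index(rating) - 2, 0)
    return levels[idx]

# Placeholder lookups for illustration only.
hudson = {"American Economic Review": "4"}
keele = {"Some Field Journal": 2}

print(base_rating("American Economic Review", hudson, keele))   # 4
print(adjust_for_short_piece("4"))                               # possible 4
print(base_rating("Some Field Journal", hudson, keele))          # probable 2
```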

This system did not resolve all our rankings decisions, so we needed to rely on our own judgement in a few cases. In such cases, where an output appeared to have significant academic content for an economics audience but did not appear in any standard classification or in our modifications, we had to make a decision ourselves. Again, there were relatively few such cases.

For our five-point scale, we counted full-length articles in the ‘top five’ journals as quality level 5. When these were shorter articles (such as the Papers and Proceedings ), notes, discussions, and other less significant pieces, we did not adjust the quality to 5, despite the outlet. Instead, in these cases, we left the output at its ranking on the scale of 1–4.

Our citations were collected from Google Scholar. For each individual, we restricted our search to our target time period (1 January 2001–31 December 2006 or 1 January 2007–31 December 2012). This yielded the publications appearing in that time period. We then restricted the citations to the target period (either the same as the publication period or cumulated over the two periods) and chose the top outputs to include in our metric. Since the top cited articles when the citation period is restricted to the publication period are not necessarily the same as the articles that receive the most citations when cumulated, the totals across the two periods do not necessarily sum to the total for the cumulation.
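A minimal sketch of this selection step, using a placeholder data structure of publication years and per-year citation counts rather than our collected data:

```python
def top_cited(records, pub_window, cite_window, n=4):
    """Top-n citation totals among papers published in pub_window,
    counting only citations received during cite_window."""
    totals = []
    for pub_year, cites_by_year in records:
        if pub_window[0] <= pub_year <= pub_window[1]:
            totals.append(sum(c for y, c in cites_by_year.items()
                              if cite_window[0] <= y <= cite_window[1]))
    return sorted(totals, reverse=True)[:n]

# Each record is (publication year, {citation year: count, ...}); hypothetical numbers.
author = [
    (2002, {2003: 5, 2005: 12, 2009: 30}),
    (2004, {2005: 8, 2006: 9, 2010: 14}),
    (2008, {2009: 3, 2011: 6}),
]

within = top_cited(author, pub_window=(2001, 2006), cite_window=(2001, 2006))
cumulative = top_cited(author, pub_window=(2001, 2006), cite_window=(2001, 2012))
print(within, cumulative)  # [17, 17] [47, 31]
```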

Where working paper versions of papers also appeared we attempted to cumulate the citations across the working paper and the final published version of the paper. This required some manual matching, as the final paper did not necessarily carry exactly the same title. We performed this matching by looking at the output itself to discover if there was a working paper version that was acknowledged as the source of the final version.

To adjust for changes in citation practices, we applied an adjustment factor reflecting the work of Card and DellaVigna on citation patterns in a series of journals; however, the exact adjustment is elusive. First, Card and DellaVigna’s analysis does not extend before 1970, whereas much of our Nobel population was active in that period. Also, they only cover the five top journals, which all have very different citation patterns, none of which may correspond to the journals where the work we record appeared. Hence, we needed to make an assumption about how the citation patterns they uncover should be used to adjust our figures. We also needed to make an inference about whether the more recent dip in citations in their work was due to a truncation effect or to changes in publication practice.

As a result of all these uncertainties, we made very rough estimates based on their figures for citation patterns in the American Economic Review, a journal that was in the middle of their citation patterns. We assumed that publications before 1970 faced a citation propensity that was, at most, slowly declining. We also assumed that the latter portion of their sample where citations declined reflected truncation effects, and we do not attempt to infer anything about the trend after the citations level out: by making no further adjustment we effectively take the citation propensity to be constant after the late 1990s. The exact date at which these truncation effects appear in their sample and the ‘levelling out’ starts is somewhat unclear from the data, however, as one year, 1996, is very high compared with what precedes or follows it. Whether one chooses the high point of citations in 1996 as the level that would persist into the future, or instead takes an average over several years after 1996, makes quite a difference to the adjustment factor, varying the adjustment for the first period of our data from almost 8 to somewhat above 5. We chose a middle road in our adjustment factors for all periods between the possible extremes, but admit that the exact number is unclear. Allowing for considerable variance in the true factor does not change the basic message underlying our conclusions, however.
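A rough sketch of how such an era adjustment can be applied is given below; the cut-off years and multipliers are illustrative stand-ins, since the text above only pins the earliest-period factor to somewhere between roughly 5 and 8 and takes the propensity as constant from the late 1990s onward.

```python
# Illustrative era multipliers only: (latest year covered, factor).
ADJUSTMENT = [
    (1975, 6.5),   # earliest periods: large upward adjustment
    (1985, 4.0),
    (1995, 2.0),
    (9999, 1.0),   # late 1990s onward: citation propensity taken as constant
]

def adjusted_citations(raw_citations, publication_year):
    """Scale raw citation counts by the factor for the publication's era."""
    for cutoff, factor in ADJUSTMENT:
        if publication_year <= cutoff:
            return raw_citations * factor
    return raw_citations

print(adjusted_citations(40, 1972))   # 260.0
print(adjusted_citations(40, 2003))   # 40.0
```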

Top European and US Departments: The process was the same for the US and European departments, with the exception that we did not adjust the citations at all – since they refer to the same periods for all departments – and we did not have as much trouble locating and cleaning the sample of publications. Still, we used the same sources and used the same methodology, performing manual cross checks to obtain a complete list of publications and using Google Scholar to obtain citations for the target periods of 2001–2006 and 2007–2012.

Table A1.

Ranking of North American institutions by research reputation

Rank University Score 
1 Massachusetts Institute of Technology 100 
2 Harvard University 92.6 
3 Princeton University 90 
4 University of Chicago 87.2 
5 Stanford University 87.1 
6 University of California – Berkeley 86.4 
7 Yale University 84.1 
8 Northwestern University 80.5 
9 Columbia University 79.5
10 University of California –Los Angeles 78.5 
11 University of Pennsylvania 78.4 
12 New York University 77.9 
13 University of California – San Diego 74.3 
14 University of Michigan 71.5 
15 Duke University 70.5 
16 Cornell University 70.5 
17 University of Toronto 70.3 
18 University of British Columbia 69.7 
19 Boston University 68.8 
20 California Institute of Technology 68.5 
21 Brown University 67.4 
22 University of Minnesota 66.4 
23 University of Rochester 62.4 
24 University of California – Davis 62.1 
25 University of Wisconsin – Madison 61 
26 Carnegie Mellon University 60 
27 University of Texas – Austin 59 
28 University of Maryland – College Park 58.3 
29 McGill University 58.1 
30 Penn State University 55.1 
31 Queens University 54.9 
32 University of Illinois – Urbana Champaign 54.7 
33 Michigan State University 54.2 
34 Texas A&M University 52.3 
35 University of Virginia 51.7 
36 University of Western Ontario 51.6 
37 University of Illinois – Chicago 51.5 
38 Georgetown University 50.3 
39 Purdue University 50.2 
40 University of Washington 50 
41 George Mason University 49.7 
42 Washington University 49.6 
43 Johns Hopkins University 49.5 
44 Ohio State University 49.4 
45 Boston College 49.2 
46 Dartmouth College 48.9 
47 University of Southern California 48.5 
48 University of North Carolina – Chapel Hill 48.3 
49 Université de Montréal 48.2 
50 University of Massachusetts – Amherst 47.3 
51 Simon Fraser University 47 
52 George Washington University 45.8 
53 University of Indiana – Bloomington 45.6 
54 University of California – Santa Barbara 45.2 
55 McMaster University 44.7 
56 University of Calgary 44.5 
57 University of California – Irvine 44.4 
58 University of Alberta 42.9 
59 Vanderbilt University 42.7 
60 Arizona State University 42.2 
61 Georgia Institute of Technology 42.1 
62 University of Pittsburgh 39.4 
63 York University 37.8 
64 University of Colorado – Boulder 35.9 
65 University of Arizona 31.6 
Table A2.

Back of the envelope cost per field for UK departments, metrics estimate

Field Average size of departments Number of departments Cost (£)
Clinical Medicine 115 31 417,610 
Public Health, Health Services & Primary Care 43 32 308,020 
Allied Health Professions 29 94 455,840 
Psychology, Psychiatry, Neuroscience 31 82 429,320 
Biological Sciences 54 44 374,090 
Agriculture, Veterinary and Food Science 36 29 288,640 
Earth Systems and Environmental Sciences 31 45 325,700 
Chemistry 33 37 308,070 
Physics 42 41 336,860 
Mathematics 36 53 363,280 
Computer Sciences 23 89 414,340 
Engineering 1 48 24 287,840 
Engineering 2 29 37 300,170 
Engineering 3 28 14 237,140 
General Engineering 39 62 400,470 
Architecture 23 45 307,950 
Geography, Environmental Studies and Archeology 23 74 377,540 
Economics 27 28 273,080 
Business and Management Studies 33 101 493,260 
Law 23 67 362,070 
Politics 23 56 334,310 
Social Work and Social Policy 21 62 343,220 
Sociology 24 29 271,740 
Anthropology and Development Studies 22 25 259,600 
Education 19 76 367,860 
Sports, Leisure, Tourism 15 51 303,760 
Area Studies 21 23 253,130 
Modern Languages and Linguistics 24 57 341,120 
English Language and Literature 22 89 410,690 
History 22 83 393,880 
Classics 17 22 246,870 
Philosophy 15 40 279,950 
Theology and Religious Studies 13 33 262,230 
Arts and Design 19 84 386,040 
Music, Drama, and Dance 14 84 362,940 
Communications, Media Studies, Library 14 67 331,170 
Total   £12,209,800 
1 An earlier version of this paper was presented at the 62nd meeting of the Economic Policy panel in October, 2015. The authors thank Sascha O. Becker and Maia Güell for their discussions at the panel and Andrea Ichino for outstanding editorial work. We also thank three anonymous referees for their perceptive reports and the audience at the panel meeting for their discussion, ideas, and suggestions. Andrew Oswald provided encouragement and Liutauras Petrucionis provided helpful research assistance. The opinions expressed in this article are those of the authors and should not be taken to reflect the views of CRA. All remaining errors are our own.
The Managing Editor in charge of this paper was Andrea Ichino.
2
See Jump (2015a, b), with other estimates falling between these bounds. The cost is high for several reasons: the most recent UK review is very comprehensive, relying on peer review that stretches over many months and requires reading and evaluating a large number of papers. It also requires universities to prepare submissions, which they generally vet carefully. Information on the cost of other evaluation systems is sketchy. The European Commission Expert Group on Assessment of University-Based Research (2008) finds that Helsinki University's direct costs of running an evaluation office to conduct its 2005 research assessment were 896,000 euros; the German Science Council Rating pilot study in 2005–7 cost 1.1 million euros. On the other hand, some league tables included in the Commission's review were associated with trivial expenses: the Sunday Times Irish Universities League Table is characterized as a basic ranking system built on straightforward bibliometrics, with an estimated cost of only 7,000 euros per annum.
3
Their measure reflects the combination of references from within and outside the field, showing that more novel research tends to be more highly cited in the longer term, with higher variance, and tends to appear in outlets with lower impact factors. Incorporating this type of measure into our work would be an inappropriate use of their results: a reference list is easy to manipulate by broadening it to include a wider set of papers. As our concern is to create a measure that could potentially be used to allocate research funds, the incentive to manipulate would be present.
4
This does not contradict the further finding of Wang et al. (2016) that longer citation lags are important for picking up research contributions with higher eventual citation potential. This may be true at the individual level, but our interest is measurement at the level of the unit of assessment (often a department), which aggregates over individuals. At this level of aggregation, we do not find a large gain from considering longer lags, as we discuss below.
5
Bertocchi et al. (2015) study peer review in which the reviewers know something about the identity of the author and hence the outlets in which the publications appeared. They comment that this is quite different from uninformed peer review: the similarity they find between peer review and journal ranking methods may simply reflect the tendency of peer reviewers to accept the journal's ranking as a good proxy for quality. The issue of distinguishing high-quality research without such indicators is closer to the work of Gans and Shepherd (1994).
6
Frey and Osterloh (2011) provide an extensive review and evaluation of the literature on ranking systems for academic work.
7
See also extensive references within Bertocchi et al. (2015) and Wang et al. (2016) to other papers that discuss the limits of bibliometric analysis and to the use of bibliometrics as indicators.
8
See reference list for both reports, which provide a detailed discussion of a broad set of research evaluation systems. The information in this section is drawn heavily from both.
9
Clearly we have no contract information, so we count all individuals as ‘fully’ within the department. This over-counts any individual on a partial contract.
10
We exclude Michael Spence, Robert Mundell, and John Nash from our basic calculations. Other exclusions are detailed later as we undertake them.
11
We set research assessment periods at six-year intervals starting at 2020 and working backwards to 2014, 2008, and so on. The last period of the six for each individual is the one during which he or she reaches his or her 65th birthday.
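A minimal sketch of this period construction follows, in Python; the function names and the default of six periods per individual are ours, for illustration only.

```python
# Sketch of the six-year review-period grid described above.
# Periods end in 2020, 2014, 2008, ... and are constructed working backwards.

def containing_period(year, anchor_end=2020, length=6):
    """Return the (start, end) of the six-year period that contains `year`."""
    shift = (anchor_end - year) // length   # whole periods between `year` and the anchor
    end = anchor_end - shift * length
    return end - length + 1, end

def individual_periods(birth_year, n_periods=6):
    """The individual's periods; the last is the one containing the 65th birthday."""
    last_start, last_end = containing_period(birth_year + 65)
    # Step back six years at a time from the final period.
    return [(last_start - 6 * k, last_end - 6 * k) for k in reversed(range(n_periods))]

# Example: someone born in 1950 turns 65 in 2015, so the final period is 2015-2020.
print(individual_periods(1950))
# [(1985, 1990), (1991, 1996), (1997, 2002), (2003, 2008), (2009, 2014), (2015, 2020)]
```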
12
Indeed, for some we need to truncate in order to avoid contamination from an early Nobel Prize. For the rest, our data stop either in the six-year period of the Nobel award or before it. We do not observe a large change in the quality of outlet or in citations at the Nobel award, or even a year or two afterward, so we have taken this as an appropriate truncation. We have also repeated our exercises with a 'younger' version (where, instead of counting back from the retirement date to determine the middle four periods, we count up from the period after the one in which the PhD is awarded). This modification generates little change in the results and so is not included here.
13
See Table A1 in the Data Appendix for this ranking.
14
We include books, book chapters, and reports as output, but as a practical matter university press books were the main non-journal outputs that received weight. Many of these received a rating of 3, but those we identified from reviews or discussion in the period as more significant received a rating of 4. For this more detailed evaluation, we relied on a variety of contemporaneous resources, including discussions and reviews by other academics, book reviews in the popular press that gave information on the reception of the book, and the laureates' own Nobel addresses, which often isolate the significant contributions in their body of work from the laureate's own perspective. As our main concern is how output was received at the time, we attempt to disentangle hindsight from contemporaneous assessment by relying on reports from the period. We use the rough categories in Hudson's rankings, as these reflect the overall view from a variety of different journal rankings. This allows us to categorize output as 4, 'probable 4', 'possible 4', 3, 'probable 3', 'possible 3', and so on. For Hudson's 'probable' category, we use X.5 as a ranking (in other words, a 'probable 4' receives 3.5 in our exercise); for the 'possible' category, we use X.25 (a 'possible 4' receives 3.25). When we passed from Hudson's rankings to the Keele ranking, we adjusted the Keele ranking to reflect the 'downgrading' associated with the rest of the Hudson results. Other details of our implementation of the rankings are available from the authors.
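As a concrete illustration of this scoring rule, the following sketch (in Python; the helper name output_score is ours) maps the category labels to the numbers we use.

```python
# Illustrative mapping from the output categories above to numeric scores.
def output_score(label):
    """'4' -> 4.0, 'probable 4' -> 3.5, 'possible 4' -> 3.25, '3' -> 3.0, and so on."""
    qualifier, _, grade = label.rpartition(" ")
    base = float(grade)
    if qualifier == "probable":
        return base - 0.5      # a 'probable 4' receives 3.5
    if qualifier == "possible":
        return base - 0.75     # a 'possible 4' receives 3.25
    return base                # unqualified grades keep their face value

print([output_score(s) for s in ("4", "probable 4", "possible 4", "3", "probable 3")])
# [4.0, 3.5, 3.25, 3.0, 2.5]
```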
15
Impact is a broader concept than citations, however widely measured, as has been pointed out in King's College London and Digital Science (2015). This report analyses the impact case studies prepared for the UK's most recent research evaluation exercise and refers to the exercise's definition of impact as 'any effect on, change or benefit to the economy, society, culture, public policy or services, health, the environment or quality of life, beyond academia'. In the UK's 2014 exercise, impact was measured by short case studies written up by the relevant assessment units. We capture this type of impact with a measure that, admittedly, only partially reflects this broader view.
16
We choose the citation profile for the American Economic Review from their results, as this represents a ‘middle road’ in their citation profiles without some of the extreme movements that they detect in, for example, the Quarterly Journal of Economics . We do not, then, correct our citation adjustment for the specific patterns observed in different journals.
17
We use the same adjustment factor for publications from 1960 to 1969 as the one we observe in the Card and DellaVigna figures for 1970. We use an adjustment of 6.3 for citations in the earliest period, which falls to 2.2 for the 1980s and then to 1.5 for the early 1990s. After this, the adjustment is nil. These factors are averages: we do not attempt a separate factor for each individual year of publication.
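A minimal sketch of how these averaged factors might be applied, in Python. The factors are the ones quoted above; the exact year boundaries of each band are our assumption, since the text distinguishes only the earliest period, the 1980s, the early 1990s, and later years.

```python
# Sketch of applying the averaged citation adjustment factors quoted above.
# The year cut-offs of each band are assumed for illustration.
def citation_adjustment(pub_year):
    if pub_year < 1980:
        return 6.3    # earliest period (publications from the 1960s and 1970s)
    if pub_year < 1990:
        return 2.2    # the 1980s
    if pub_year < 1996:
        return 1.5    # the early 1990s
    return 1.0        # no adjustment thereafter

def adjusted_citations(raw_citations, pub_year):
    """Scale raw citation counts so early publications are comparable to recent ones."""
    return raw_citations * citation_adjustment(pub_year)

# Example: 40 raw citations to a 1975 publication count as 40 * 6.3 = 252 adjusted citations.
print(adjusted_citations(40, 1975))
```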
18
In an earlier version of this paper, we used citations of the top four outlet-ranked outputs for each individual. The results were quite similar to the measure we use here of the top four cited outputs, regardless of the outlet rank of those outputs. Hence, we present only one set of results.
19
We also computed results on a restricted sample of periods after the 1960 evaluation period for the Nobel group, but this made little difference to the results and so, again, is not reported.
20
The other metrics calculated using the five-point scale are available from the authors.
21
The 2015 QS World ranking for research places Bocconi 27th, while Sheffield is not ranked within the top 200.
22
We evaluate only the periods from 1960 for this calculation, using the adjustment factors we mentioned in the ‘Methodology’ section.
23
This remains true if we use a single institution at the top rather than our pool of Top 4 institutions.
24
Not shown as part of Table 7 .
25
Two countries which already maintain their own rankings are France (CNRS at https://www.gate.cnrs.fr/spip.php?/rubrique31&lang=en ) and Germany ( http://www.handelsblatt.com/downloads/9665428/1/journal-ranking.pdf ). Other countries are singled out in the EC Expert Group on Assessment of University-Based Research (2008) report as giving separate consideration to local language publication. Please see the report for details.
27
See, for example, Nardo et al. (2005) .

REFERENCES

Abramo, G., D'Angelo, C. and Caprasecca, A. (2009). 'Allocative efficiency in public research funding: can bibliometrics help?', Research Policy, 38, 206–15.

Bertocchi, G., Gambardella, A., Jappelli, T., Nappi, C. and Peracchi, F. (2015). 'Bibliometric evaluation vs. informed peer review: Evidence from Italy', Research Policy, 44, 451–66.

Card, D. and DellaVigna, S. (2013). 'Nine facts about top journals in Economics', Journal of Economic Literature, 51, 144–61.

Chang, C-L., McAleer, M. and Oxley, L. (2011). 'What makes a great journal great in economics? The singer not the song', Journal of Economic Surveys, 25, 326–61.

Clerides, S., Pashardes, P. and Polycarpou, A. (2011). 'Peer review vs. metric-based assessment: Testing for bias in the RAE ratings of UK Economics Departments', Economica, 78, 565–83.

European Commission Expert Group on Assessment of University-Based Research (2008). Assessing Europe's University-Based Research, EUR 24187 EN.

Frey, B. and Osterloh, M. (2011). 'Ranking games', Working Paper 39, University of Zurich Department of Economics.

Frey, B. and Rost, K. (2010). 'Do rankings reflect research quality?', Journal of Applied Economics, XIII, 1–38.

Gans, J. and Shepherd, G. (1994). 'How are the mighty fallen: Rejected classic articles by leading economists', Journal of Economic Perspectives, 8, 165–79.

Hudson, J. (2013). 'Ranking journals', The Economic Journal, 123, F202–22.

Jump, P. (2015a). 'Can you win by fielding the whole team?', Times Higher Education, 20–1.

Jump, P. (2015b). 'Guess which part of the 2014 REF came to £55 million…', Times Higher Education, 8.

King's College London and Digital Science (2015). 'The nature, scale and beneficiaries of research impact: an initial analysis of Research Excellence Framework (REF) 2014 impact case studies', Research Report 2015/01, Higher Education Funding Council for England.

Laband, D. (2013). 'On the use and abuse of economics journal rankings', The Economic Journal, 123, F223–54.

Nardo, M., Saisana, M., Saltelli, A., Tarantola, S., Hoffman, A. and Giovannini, E. (2005). Handbook on Constructing Composite Indicators, OECD, ISSN 1815-2031.

OECD (2010). Performance Based Funding for Public Research in Tertiary Education Institutions: Workshop Proceedings, OECD Publishing, Paris.

Oswald, A. (2007). 'An examination of the reliability of prestigious scholarly journals: evidence and implications for decision-makers', Economica, 74, 21–31.

Régibeau, P. and Rockett, K. (2015). A Tale of Two Metrics: Research Assessment vs Recognised Excellence, mimeo. Available at: http://www.economic-policy.org/62nd-economic-policy-panel/a-tale-of-two-metrics-research-assessment-vs-recognised-excellence/.

Sgroi, D. and Oswald, A. (2013). 'How should peer-review panels behave?', The Economic Journal, 123, F255–78.

Starbuck, W. (2005). 'How much better are the most prestigious journals? The statistics of academic publication', Organization Science, 16, 180–200.

Starbuck, W. (2006). The Production of Knowledge: The Challenge of Social Science Research, Oxford University Press, Oxford.

Times Higher Education Table of Excellence in RAE (2008). The Results. Available at: http://www.timeshighereducation.co.uk/404786.article.

Wang, J., Veugelers, R. and Stephan, P. (2016). 'Bias against novelty in science: a cautionary tale for users of bibliometric indicators', NBER Working Paper 22180.