Changing research on research evaluation: A critical literature review to revisit the agenda

The current range and volume of research evaluation-related literature is extensive and incorporates scholarly and policy/practice-related perspectives. This reflects academic and practical interest over many decades and trails the changing funding and reputational modalities for universities, namely increased selectivity applied to institutional research funding streams and the perceived importance of university rankings and other reputational devices. To make sense of this highly diverse body of literature, we undertake a critical review of over 350 works constituting, in our view, the ‘state-of-the-art’ on institutional performance-based research evaluation arrangements (PREAs). We focus on PREAs because they are becoming the predominant means worldwide to allocate research funds and accrue reputation for universities. We highlight the themes addressed in the literature and offer critical commentary on the balance of scholarly and policy/practice-related orientations. We then reflect on five limitations to the state-of-the-art and propose a new agenda, and a change of perspective, to progress this area of research in future studies.


Introduction
In this article, we undertake a critical review of over 350 relevant publications that together constitute, in our view, a diverse and wide-ranging literature 'state-of-the-art' on the performance-based research evaluation arrangements (PREAs) of universities and other public research organizations. These arrangements address systematic evaluation exercises aiming to introduce resource and reputational policy incentives aligned with dominant notions of research quality (Langfeldt et al. 2019). We believe our analysis is necessary to: (1) highlight major themes addressed by the literature; (2) provide a critical commentary on the balance of scholarly and policy/practice-related orientations in this literature; (3) identify limitations in this state-of-the-art; and finally (4) propose a novel research agenda to overcome these limitations.
Evaluations of policy and funding arrangements to support public research have been undertaken and studied for many decades. However, the number of studies on the details and effects of specific research evaluation arrangements globally increased considerably during the 1990s. This growing interest trails changing funding modalities for universities and public research organizations, with a rise of competitive, project grant funding, increased selectivity applied to institutional research funding streams (Paradeise and Thoenig 2015), and the perceived importance of global rankings. Once-pioneering research evaluation arrangements to allocate institutional funding, like Excellence in Research for Australia (ERA) and the UK Research Excellence Framework (REF), have also become established and seemingly intrusive enough to spur academic and policy concerns. This class of arrangements is becoming the predominant evaluative means to allocate public research funds and/or garner global reputation. It is, therefore, our central focus.
These PREAs have been discussed in an increasingly large body of both academic and grey literature sources, addressed via both scholarly and more policy/practice-related orientations. The scope of this literature varies widely. There are small-scale studies on peer judgement and dynamics of peer review panels operating inside broader national PREAs. There is also research on wider effects for behaviours and strategies of actors, organizations and institutions in national policy and funding 'research spaces', for example, for universities, funding agencies, and researcher career trajectories (see Nedeva 2013; see also Smith, Ward and House 2011; Waitere et al. 2011; Lee, Pham and Gu 2013; Aagaard, Bloch and Schneider 2015; Reale et al. 2018; Whitley, Gläser and Laudel 2018; Lind 2019).
Research attention has been paid to increasing selectivity, and increasing use of performance-based allocation approaches in institutional research funding in countries like Australia, the Netherlands, Sweden, and the UK (Organisation for Economic Co-operation and Development (OECD) 2009; Auranen and Nieminen 2010; Otley 2010; Wang and Hicks 2012; Tahar and Boutellier 2013; Leisyte and Westerheijden 2014; De Boer et al. 2015; Greenhalgh and Fahy 2015; Jonkers and Zacharewicz 2015; Arocena, Göransson and Sutz 2018; Canibano et al. 2018; Jonkers and Sachwald 2018; Woelert and McKenzie 2018). PREAs have also become of central importance in terms of research/epistemic governance. There is a perceived transition away from determination of research goals and orientations endogenously within universities and knowledge communities towards greater authority and influence from more strategic and managerial policy and university actors designing, deploying, or reacting to outcomes of PREAs (Whitley and Gläser 2007; Langfeldt et al. 2019).
Our critical review and analysis of this literature aims to identify thematic coverage, highlight limitations, and propose a new research agenda we believe is needed to move studies forward in this area. There have been previous surveys of research evaluation-related literature, for example, cross-sectional surveys and thematic reviews of evaluation practices and indicators (see De Rijcke et al. 2016). There have also been comprehensive studies correlating specific characteristics of differing national research evaluation arrangements to apparent national science system performance or excellence in international context (see Sandström and Van den Besselaar 2018; also Jonkers and Sachwald 2018). Whilst remaining within the confines of a critical review approach, our intent here is different and somewhat closer to meta-research motivations (c.f. Ioannidis 2018). We aim to analyse the themes, orientations, and limitations of research evaluation research itself, a review approach we believe has been overlooked in literature in this area to date.
In doing so we grapple with a messy reality. PREAs are dynamic, often politicized and are not 'scientific', static, standardized, or universal. They operate across multiple spatial levels and time horizons, use differing methods, involve varying degrees of transparency and costs, and are conducted by different kinds of organizations for various purposes (Galleron et al. 2017). They can be understood as socially constructed systems, their legitimacy and effectiveness can be disputed, and they blend multifaceted contextual, political, managerial, economic, and reputational elements (Bianco, Gras and Sutz 2016). We believe our critical review and analysis must therefore be purposive rather than trying to encompass all possible research on this vast topic.
To structure our article, first, we define our understanding of PREAs and use it to guide our approach. We describe our purpose in collecting and coding a bespoke dataset of 354 pieces of literature that we believe constitutes the most relevant 'state-of-the-art' on PREAs. Second, we present an analysis of five research themes we derive through inductive clustering of this state-of-the-art and provide critical commentary on the major arguments in this body of research. Third, we discuss five limitations to this PREA-related literature and suggest a novel research agenda to address them.

Approach
We understand PREAs as including 'organized sets of procedures for assessing the merits of research undertaken in publicly funded organizations that are implemented on a regular basis, usually by state or state-delegated agencies' (Whitley and Gläser 2007: 6). PREAs operate at multiple levels, as an 'ensemble of practices and institutional arrangements in a country' and/or locally in a university organization, mediating 'between scientific quality controls and research policies' (Cruz-Castro and Sanz-Menéndez 2007: 205). They are part of the 'organizational governance' of universities, in directing 'strategy, funding' and operations, and are a potential source of tensions (Luo, Ordóñez-Matamoros and Kuhlmann 2019: 1). They are also frequently 'intended to change science by improving its quality' and possibly even altering research 'content' (Gläser 2007: 245). They can be 'weak' and aim primarily at 'information-gathering' for benchmarking of research, researchers, and research organizations, or else 'strong' in performance-based 'national systems of research output evaluation' and be used as a basis 'to distribute research funding to universities' (Hicks 2012: 260).
To guide our critical review and analysis of the literature, we capture the most salient of these aspects by defining PREAs here as the institutionalized, or semi-institutionalized, practices and procedures aiming to assess the merit of the research output, research environment, and research engagement of research organizations with a view to incentivizing desired change or continued performance. PREAs may be conducted at different levels of social aggregation (for example, national research system, organization, etc.) and affect resource allocation and reputations.
So as not to conflate our definition of PREAs with other possible forms of evaluation, we draw upon an understanding of science dynamics as involving research fields and research spaces (Nedeva 2013). We thus distinguish between PREAs and two other commonly addressed types of research evaluation. Our critical review includes only literature on PREAs located in the research space (see Figure 1). We thereby exclude literature addressing research evaluation types performed by research organizations and research field-related knowledge claim assessment.
This PREA definition directed our focus to the field of science, technology, and innovation policy (STIP) studies, and mapped onto central and peripheral journals in this area. Our approach followed that of a critical narrative review; we wished to identify key contributions around our specified topic but not necessarily to address all evaluation-related material ever produced (c.f. Demiris, Oliver and Washington 2019). Our definition directed us to core STIP-related journals (e.g. Research Evaluation, Research Policy, Science and Public Policy, Scientometrics, and Minerva) and selected peripheral ones. We used the keywords 'research evaluation', 'institutional', and 'university(ies)' in searches of (1) Web of Science, (2) Scopus, and (3) Google Scholar. This resulted in 675 hits from numerous journals, books, and non-academic sources. We reviewed titles and abstracts at this stage to screen for duplicates and, guided by our PREA definition, ensured materials primarily addressed research space-related research evaluation. This was done using (1) our knowledge as active scholars in fields of research evaluation and research policy for several decades (c.f. Adler and Adler 1987); (2) our knowledge of research consultancies and their key reports; and (3) invited expert advice by email, telephone, and face-to-face from a small number of international research policy/evaluation academic and consultant colleagues (this last step introduced an element of consensus narrative review; c.f. Wilczynski 2017). Our final set thus also included grey literature from consultancies and funders like Technopolis, PA Consulting, the European Commission, the former Higher Education Funding Council for England (HEFCE), and select others. This critical narrative review process with an element of consensus review led to our final set of 354 full-text materials, including academic articles, books, and funder and policy reports, which we then inductively coded and analysed.
The earliest piece of literature that we retrieved was published in 1968. For convenience, we set 2018 as a cut-off publication year. Just over 85% of the literature we included and reviewed in this bespoke dataset was published between 2000 and 2015, reflecting increased attention as funding modalities and evaluation arrangements have been recently changing. A total of 179 items were primarily qualitative, 103 were quantitative, and 72 were mixed methods based. The literature in the dataset addressed PREAs related to 37 countries and territories, trans-national arrangements, and international surveys of these arrangements (e.g. by the European Union [EU] and the Organisation for Economic Co-operation and Development [OECD]). Following this highly selective, expert-informed, critical and consensus narrative review approach we cannot claim to have produced a comprehensive collection of all materials ever published on 'research evaluation'-related topics. However, we believe we captured enough breadth and depth of the 'state-of-the-art' on PREA-related topics to satisfy our purposive analysis, to highlight key limitations, and to underpin our proposition of a novel research agenda.
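As an illustrative check on the dataset composition reported above, the following minimal Python sketch tallies the three method categories and confirms that they sum to the 354 items in the final set. Only the counts come from the text; the variable names are assumptions introduced for illustration.

```python
# Counts of items by method type, as reported in the text.
# The dictionary and variable names are illustrative assumptions.
method_counts = {"qualitative": 179, "quantitative": 103, "mixed": 72}

# The three categories should account for the whole dataset.
total = sum(method_counts.values())

# Share of each method type, as a percentage of the full dataset.
method_shares = {
    name: round(100 * count / total, 1)
    for name, count in method_counts.items()
}

for name, share in method_shares.items():
    print(f"{name}: {method_counts[name]} items ({share}%)")
print(f"total: {total}")
```

On these figures, qualitative work makes up roughly half of the dataset, with quantitative and mixed-methods items accounting for the remainder.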
For every piece of literature in the dataset we manually read abstracts and full texts. From this reading, we wrote synopses summarizing the approach, coverage, findings, and conclusions of each piece of literature. We then analysed our database of synopses to produce an inductive clustering of all the literature into five major themes, shown in Table 1. All literature was assigned to a single major theme based upon primary message. This was based on our subjective reading of the literature content, what proportion of it addressed a given theme, and the prominence afforded that theme in the literature.
Our first inductive clustering theme, accounts of local PREAs, was where we assigned literature whose primary content provided 'thick descriptions'. This included case studies of PREAs specific to a national research system (e.g. ERA or REF), a trans-national regional bloc (e.g. EU-level arrangements), a sub-national region, a specific organization (e.g. university), or a sector or grouping of organizations (e.g. medical research in universities and research institutes). Our second theme was where we clustered comparative studies of PREAs, for instance, those comparing specific sets of countries or specific research fields. Our third theme captured literature providing discussions of rationales for (performance-based) research evaluation, for example, discussing the policy impetus for performance-based criteria and how they related to pursuit of excellence aims, efficiency, and other concerns. The fourth theme clustered appraisals of (performance-based) research evaluation methodologies; for example, debates around the relative merits of bibliometrics, altmetrics, and other indicators vis-à-vis peer review practices, essentially the detailed methods and machinery, technical parameters, and logistics of the design and deployment of PREAs. Our fifth and final theme clustered literature attempting studies of effects on the science system, for example, how PREAs interacted with science dynamics and researcher careers.
We found it helpful to characterize the literature further using limited additional coding: literature type (i.e. journal articles, books or book chapters, policy reports); literature content, that is, primary research (e.g. interviews, surveys, bibliometrics, mathematical models and simulations, mixed methods) or secondary (e.g. desk-based literature reviews and/or secondary sources); literature methods (quantitative, qualitative or mixed); literature approach, that is, thick descriptions of specific cases, critical analyses, and attempts at comparative analysis; and object of analysis, that is, organization level evaluations or sub-national, national, or trans-national levels. These further codes are shown in Table 2 and were included in our analytical approach. Inductively clustering these five themes and using our further coding we began our purposive analysis, where we posed five specific questions:
• What key themes have been addressed by this literature?
• What is the balance of research attention across all the themes?
• What are the analytical implications of the apparent balance between scholarly and policy/practice-orientations in this literature?
• What aspects have not been addressed?
• Given this state-of-the-art, what new research agenda might move PREA-related research forward?
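The coding scheme described above can be sketched as a simple record structure. The following minimal Python example is an illustration only, not the instrument actually used: the class names, field names, and the three placeholder records are hypothetical, and only the theme labels and code categories are taken from the text. Holding the theme in a single field mirrors the rule that each item was assigned to exactly one major theme.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Theme(Enum):
    """The five inductively derived themes named in the text."""
    LOCAL_ACCOUNTS = "Accounts of local PREAs"
    COMPARATIVE = "Comparative studies of PREAs"
    RATIONALES = "Discussions of rationales"
    METHODOLOGIES = "Appraisals of methodologies"
    EFFECTS = "Studies of effects on the science system"


@dataclass(frozen=True)
class CodedItem:
    """One coded piece of literature (field names are assumptions)."""
    label: str      # placeholder identifier, not a real citation
    theme: Theme    # a single field, so exactly one theme per item
    lit_type: str   # journal article, book or chapter, policy report
    content: str    # primary or secondary research
    methods: str    # quantitative, qualitative, or mixed
    approach: str   # thick description, critical analysis, comparative
    level: str      # organizational, sub-national, national, trans-national


# Hypothetical records, for illustration only.
items = [
    CodedItem("item-A", Theme.LOCAL_ACCOUNTS, "policy report",
              "secondary", "qualitative", "thick description", "national"),
    CodedItem("item-B", Theme.METHODOLOGIES, "journal article",
              "secondary", "quantitative", "critical analysis", "national"),
    CodedItem("item-C", Theme.LOCAL_ACCOUNTS, "journal article",
              "primary", "mixed", "thick description", "organizational"),
]

# Balance of research attention across themes.
theme_counts = Counter(item.theme for item in items)
for theme, count in theme_counts.items():
    print(f"{theme.value}: {count}")
```

A tally of this kind addresses the second question above, the balance of attention across themes, and the same pattern extends to cross-tabulating any of the further codes.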

Findings
We now present our analysis of the dataset of 354 pieces of literature. For each of the five themes, we provide a summary of key research arguments, brief critical commentary, and descriptive information using our further codes.

Theme 1: Accounts of local PREAs
Theme 1 grouped literature we determined to be primarily focused on providing descriptive accounts of local PREAs. Altogether we assigned 100 pieces of literature to this theme. A total of 77 pieces described PREAs at national level, for example, national evaluations like those in Australia, the Netherlands, and the UK. Within this theme, we also placed literature primarily describing arrangements at organizational (six pieces of literature), sub-national (five pieces), and trans-national levels (11 pieces). National level PREAs were described for countries where these practices were already well established, like the UK (Barker 2007; see also Martin and Whitley 2010; Morris 2010) and Australia (Butler 2008; Donovan 2008). These arrangements were also described in other literature, to show them as apparent exemplars for development and implementation of new arrangements in countries or regions that had previously not used such practices (Fiala 2013; Ancaiani et al. 2015 [...]
Literature we grouped in Theme 1 had often been commissioned by national or international organizations responsible for evaluating research outputs, environments and engagements of higher education institutions or other research organizations. Nearly half the material in Theme 1 (46 pieces of literature) was policy reports describing national-level arrangements, then benchmarking them against each other to provide an international overview. These kinds of policy reports were commissioned and (presumably) funded by ministries of education in different countries, the OECD and, say, the former HEFCE in the UK. We determined these bodies had funded these studies to enable policy learning about past experiences and/or arrangements used in other countries.
The bulk of Theme 1 literature we would call 'highly descriptive' (81 pieces of literature). We determined they used no explicit theoretical positions. A similar number used primarily qualitative and/or mixed methodologies (81 pieces). Ten pieces of Theme 1 literature had what we would consider more analytical approaches; 19 used quantitative methodologies, for example, Cattaneo, Meoli and Signori (2016) (see also Frølich 2008, 2011; Frølich, Schmidt and Rosa 2010; Wang and Hicks 2012; Frankel, Goddard and Ransow 2014; Hamann 2016); and 42 of the 100 pieces collected primary data. The others based their descriptive accounts on secondary research and sources.
The descriptions of PREAs across Theme 1 literature addressed the following: descriptions of national-level arrangements (broad and fine details); evaluation strategies (apparent purposes, economic and social rationales); funding mechanisms (i.e. whether and how much evaluation results were linked to funding streams); assessment methods and inclusion/exclusion criteria of what was assessed; how often assessment took place; what units were assessed (research themes, research organizations, etc.); and evaluation outcomes (e.g. apparent levels of research-related performance of organizations, regions or nations, based on indicators such as publication volumes, citations, number of patents, and/or university-industry links). Theme 1 literature primarily used case study research designs and detailed the (sometimes considerable) costs associated with (repeated) use of research evaluation. Some provided cost-benefit analyses of existing evaluation exercises (e.g. see Campbell and Boxall 2004; PA Consulting Group 2008; Technopolis 2009, 2010; see also Mahieu, Arnold and Kolarz 2013, 2014; Arnold et al. 2014; Mahieu and Arnold 2015). We classed these pieces of literature as largely 'user-driven'. They seemed designed to answer research questions or address research interests of policymakers and evaluation practitioners.
Turning a critical eye to Theme 1 literature, we found an absence of frameworks for theoretically or conceptually based study and analysis of PREAs. Theme 1 literature was primarily descriptive, both for the material published in academic journals and for 'grey literature', user-driven, policy reports. This potentially presents a problem and may not be an ideal basis to support robust policy learning. This literature in our critical opinion does not provide analysis and comprehension of social mechanisms around PREAs. However, it clearly does provide a source of rich empirical material and cases that could later be revisited for analytical purposes.

Theme 2: Comparative studies of PREAs
We assigned 40 pieces of literature to our clustering Theme 2. These were comparative studies of PREAs, comparing, for example, arrangements for specific sets of countries, or for particular research fields. Some undertook broad comparisons of institutional and other evaluation arrangements (Geuna and Martin 2003; Orr 2004; Hicks 2010; Arnold and Mahieu 2015; see also Frølich 2008; Geuna and Piolatto 2016; Sandström and Van den Besselaar 2018). Others compared selective research funding arrangements, effects for behaviours like research collaboration (Johnston 1994), actions of research funding agencies (Lepori et al. 2009), consequences of evaluation for university funding (Franzoni, Scellato and Stephan 2011; see also Sörlin 2007), or PREA-related criteria for assessing research quality in different fields (Hug, Ochsner and Daniel 2013).
Literature here provided accounts of PREAs in multiple different settings and countries, but crucially with few attempts at analytical comparison. Hicks (2010), for instance, compared specific research evaluation objectives and strategies used by EU countries, Australia, South Africa, and some Asian countries, but did not compare wholesale the design, operation, and effects of these arrangements within a comprehensive framework. Rebora and Turri (2013; see also Geuna and Piolatto 2016) compared how research funding of universities evolved over time to incorporate selectivity and evaluation elements, specifically in the UK and Italy. Similarly, Geuna and Martin (2003) compared specific methods of evaluation used in 12 countries in Europe and the Asia-Pacific region.
As with Theme 1, we considered the majority of Theme 2 literature to be user-driven policy reports (26 pieces of literature or 65% of this theme were policy reports; 14 were academic publications, i.e. journal articles, a book, and a book chapter). Some Theme 2 literature also compared PREA-related practices across different countries to support policy learning (Iorwerth 2005; Grant 2010) or as guidance for policymakers wishing to implement and institutionalize PREAs in new settings (see e.g. Arnold and Mahieu 2015). Theme 2 literature was largely based on secondary research (in 29 pieces or 73% of Theme 2) and used qualitative or mixed research methods (88% of literature in Theme 2).

Theme 3: Discussions of rationales for (performance-based) research evaluation
Literature we clustered into Theme 3 primarily provided discussions of rationales for research evaluation, for example, the policy impetus and rationales for using performance-based evaluation criteria or how policy concerns and performance criteria like excellence and efficiency were interrelated. Theme 3 was a very specific subset of the literature. It was our smallest cluster, at only 18 pieces. Some analytical frameworks were present in Theme 3 but no common or shared framework was used across different literature here.
A first key argument in the Theme 3 literature was that the introduction of PREAs requires that one also consider value-for-money and issues of research quality. Here, Theme 3 literature suggested policymakers' rationales included values like promoting knowledge-based economies and strongly overlapped with efforts to use public research systems in different national settings to revive and/or restructure the orientation and/or performance of whole national economies (e.g. Rip [...]
A second key argument was that PREAs are evolving in parallel with rationales asserting that more competitive allocation of research funding improves research performance, for example, as judged by measures like publication productivity, and other indicators of apparent 'excellence'. Theme 3 therefore seemed to include an emerging, critical research tradition moving close to addressing effects of competitive funding interventions as part of evolving PREAs. The interweaving of competitive funding and research evaluation was treated from research funders' perspectives, at national research system level and in some cases at the level of researchers (Benner and Sandström 2000; Smith, Ward and House 2011; see also Sørensen, Bloch and Young 2015).
Theme 3 literature suggested consideration of PREAs has to account for public research funding becoming more fine-grained over time. Previous research funding regimes generally treated most if not all aspects of the research system like a 'black box'. For instance, literature here described 'first generation' institutional research funding streams that did not address researchers, but simply took [...]
Sørensen, Bloch and Young (2015) concluded that when 'excellence' was discussed in the context of PREAs it had now moved from being a marker of purely scientific performance to a broader basket of additional research performance-related criteria, for example, potential commercialization of research outputs, and indeed anything 'commercializable'.
A third key argument in Theme 3 literature was the travel of global policy and economic competitiveness discourse into PREAs. The rise and diffusion of ideas (and ideology) around the global competition for knowledge, resource constraints, and resultant changing views of universities were chronicled, that is, a change from them being civic, public organizations to being more like corporations, and venues where performance must be audited. Theme 3 literature considered how conceptions of knowledge have shifted, and excellence has become a means within PREAs to reward 'winners' and punish 'losers'. This was described as a new 'strategic approach' to research policy and resource allocation through these arrangements, suggesting policymakers and governments have moved closer, in theory if not yet in practice, to selecting and affecting the types and topics of research, research content (methodologies, equipment), and even which specific researchers they believe can deliver 'excellence' within a particular research system (Benner and Sandström 2000; Sörlin 2007; see also Hicks 2012; Watermeyer 2014, 2016). An apparent merging was noted, of policymakers' search for 'excellence' and use of evaluation as a tool to measure research system effectiveness, with guiding and directing socio-economic investment decisions.
All bar one piece of literature in Theme 3 was published in academic journals. Theme 3 literature drew mainly on secondary data, used qualitative methods, and was the most analytical set, in our view. Literature here attempted to unpack varying, evolving rationales for PREAs, and to trace how they were now being seen as enablers of structural change, and as facilitating national systems that could compete more at an international level.

Theme 4: Appraisals of (performance-based) research evaluation methodologies
Nearly a third of all the pieces of literature in our database (103 pieces, 29% of the full dataset) addressed methods related to PREAs, for example, whether and which indicators were reliable measures or proxies to evaluate research performance, in terms of excellence and quality (Cozzens 1981; Donovan 2007; De Jong et al. 2011; Wunsch-Vincent 2012; Wilsdon et al. 2015; see also Aagaard 2015). These pieces we clustered in Theme 4. Literature here we judged as aiming to discover or design the 'best' methods for PREAs to assess subjective notions like research excellence and quality. Some favoured exclusive use of peer review or of bibliometrics. Others advocated mixed approaches, say, combining peer review and bibliometric techniques (Butler 2007 [...]
Theme 4 literature was very useful in highlighting two current dilemmas around design and deployment of differing PREAs. First, materials here considered which approach should be used, that is, predominantly qualitative or quantitative. Some literature addressed whether qualitative peer review was the most appropriate and/or cost-effective instrument to use or whether use of bibliometrics and other kinds of quantitative indicators was preferable. Other literature advocated use of blended or mixed approaches. Bertocchi et al. (2015), for instance, suggested research performance be evaluated using bibliometrics as an initial input for subsequent peer review. Still others proposed bibliometrics be used at national or local level to manage and/or monitor research performance within an evaluation, before feeding into later large-scale, peer review-based judgements, that is, so-called 'informed' peer review (see Neufeld and von Ins 2011; Wilsdon et al. 2015).
A second dilemma in Theme 4 literature was how current methodologies might be modified for use by policymakers and/or university managers to encourage, or at least not impede, sustainable research activity in specific fields (e.g. in social sciences and humanities, SSH) or to foster research with particular properties (e.g. breakthrough, frontier, long-term). For instance, in SSH 'informed' peer review was advocated to assess better the performance of research fields where publishing journal articles represents only part of research output activities (e.g. in political science, where books and policy engagement also occur; Donovan 2009). Other literature suggested the same approach be part of PREAs in fields where peer review was dominated by reviewers representing only specific subfields (e.g. all denominations of economists being evaluated only by neoclassical/mainstream economists; Lee and Harley 1998; Lee, Pham and Gu 2013). Theme 4 literature advocated or designed new field-specific, more 'inclusive' quantitative indicators (e.g. social media-related 'altmetrics') to account for societal effects, broader or 'alternative' research outputs, interactions, exchanges, and outcomes (Bozeman, Dietz and Gaughan 2001; Kaufmann and Kasztler 2009; Kenna and Berche 2011; Ochsner, Hug and Daniel 2012; Kwok 2013; Sastry and Bekhradnia 2014).
Theme 4 literature predominantly featured material published in academic journals (83 of the pieces or 80% of Theme 4), relied on secondary data (80 pieces) and used quantitative methodologies (58 pieces). The predominant object of analysis was PREAs at national level (in 63 pieces of literature).

Theme 5: Studies of effects on the science system
Our final Theme 5 covered studies of effects on the science system from PREAs. Here, we clustered 93 pieces of literature, addressing effects at multiple spatial levels (regional, national, trans-national) and analytical levels (system, organization, researcher, research topics and content). Some literature instead took a cross-cutting view across these levels. Effects of PREAs on universities specifically were a dominant focus. Other literature combined this with attention to a general shift away from institutional/block funding towards proportionally more competitive, project-based research funding allocation. Few pieces of literature addressed effects of PREAs upon additional parts of the science system beyond universities, say, effects for global research fields or aggregate effects at global level of multiple differing arrangements operating in parallel at national and/or regional levels. Some Theme 5 literature argued specific PREAs have generated effects at the 'macro' level of changing how science, universities, and scientists/researchers are perceived by society. The critical view was that strategic use by policymakers and university managers of particular arrangements (with perhaps disproportionate emphasis here upon the UK's Research Assessment Exercises (RAEs) and REF) had significantly changed organizational conditions for, and authority relations around, knowledge creation (Himanen et al. 2009; see also De Jong et al. 2011; Kallerud et al. 2011; Whitley, Gläser and Laudel 2018).
At 'meso' level, literature observed that publicly funded research universities had become vulnerable to, and at risk of, being transformed by what certain exogenous stakeholders (e.g. politicians, policymakers, research funding agencies, corporate actors) considered 'best' for them. They were portrayed as losing autonomy, scholarly leadership, and ability to generate new and/or critical academic ideas. Universities and their researchers were framed as forced to abandon Mertonian notions of autonomy, disciplinarity, and freedom (c.f. Merton 1968) and expected to adopt values and quality standards shaped by outside demands (Frølich, Schmidt and Rosa 2010; Harland et al. 2010; see also Luukkonen 1997; Van der Meulen 1998; Ferlie, Musselin and Andresani 2008). Universities were diagnosed as no longer doing what they were 'best' at, and as complying with exogenous quality and excellence standards imposed by PREAs, or else forced to suffer consequences of reduced research revenue and/or national and global reputation in local and world rankings/league tables (Knowles and Burrows 2014; see also Elton 2000; Luukkonen and Thomas 2016).
Other effects on universities included university management practices described as moving away from traditional 'academic' values (Linkova 2014; Agyemang and Broadbent 2015), changed university hiring, probation, and promotion strategies, allied to university strategic objectives and management practices becoming strongly coupled to criteria derived from evaluation-related goals and targets (see also Henkel 1999). Universities were also framed as embracing competition rather than resisting it and using PREAs at 'micro' level, to develop and deploy incentives, and ever more granular research information systems, monitoring and auditing mechanisms, to foster, reward, or sanction particular kinds of research productivity by research groups and at individual researcher level (Nedeva et al. 2012). Other reported 'meso' level effects were university management game-playing, particularly within 'strong' PREAs directly linked to resource allocation (Whitley, Glaser and Laudel 2018). Universities, their leaders, and managers were reported as developing and using deliberate strategies to incentivize and direct types of research, researchers, and external university-stakeholder relationships that painted them in the most favourable light within PREAs so as to maximize research funding capture (again, particularly relating to the UK's RAEs/REF).
This behaviour reportedly has led to: undesirable concentration of resources by funders and universities to support short-term 'safe' rather than long-term risky research; allocation of resources to meet lay stakeholder/proxy indicators of excellence irrespective of knowledge community/substantive judgements about research quality; favouring competition over collaboration, thus risking fragmentation of academic/professional collegiality and reciprocity within and across universities; and direct or indirect promotion of 'salami slicing' publication practices to reward publication of a greater quantity of perhaps less comprehensive research works rather than focus on fewer but potentially more significant publications of 'higher' quality (Butler 2003; Leisyte and Westerheijden 2014; see also Abramo, D'Angelo and Di Costa 2011).
Further effects were reported to be: increased short-termism generally at universities; superficial attention to what in some quarters are seen as spurious markers of university reputation/excellence in national and global league tables for universities 'playing the game'; erosion of creativity; reduced diversity of the research topics, methods and approaches researchers pursue; and strategy and management level distortions in resource allocations that undermine previous synergies between teaching and research (Whitley, Glaser and Laudel 2018; see also Paradeise and Thoenig 2015). Some authors even felt 'strong' PREAs (i.e. coupled to funding allocation) and audit cultures 'dehumanized' researchers and harmed traditional, more liberal, long-standing purposes and roles of universities in wider society (Hare 2003; Harland et al. 2010; Olssen 2016; see also Geuna and Martin 2003; Martin and Whitley 2010).
Some Theme 5 literature addressed effects at the 'micro' level of researchers and their research work processes: apparent loss of academic work-life balance and freedom; downgrading of teaching relative to research/publications; loss of intellectual curiosity; and a debasing of the general character of academic scholarship (Court 1999; Roberts 2007; Linkova 2014; Vincent 2015). Reported centralization of authority towards organizational elites like university managers, using expanding research data systems and information sourced from national/external and local/internal PREAs, was considered an avenue of (negative) control over research content (Gläser et al. 2010; see also Aagaard 2015). PREAs were also reported to increase administrative burdens for researchers and decrease research time and productivity (Martin 2011, 2016). Other Theme 5 literature indicated a fundamental transformation cutting across macro/meso/micro levels that had reportedly changed: university (research) culture; the nature, remit, processes, and practices of universities' objectives and goals; the relevance of university research; and research topic coverage and diversity. These effects were linked to changing university strategies to mobilize the outcomes of PREAs to improve positioning in university rankings (Martin 2011; Holmes 2015). Academia and knowledge were described as being reconceptualized as commodities, driven by economic efficiency and value-for-money concerns. A shift towards performativity was reported, with universities and academics assigned and/or adopting new purposes within these changing authority relations (Harland et al. 2010; Whitley 2011). These relations included policymakers, and university managers, administrators, and field elites in universities using their newfound authority to attempt to 'steer' science systems even at the expense of marginalizing input from academics and other voices.
Some authors here sounded a 'wake-up call' for academics to resist supposedly harmful use of PREAs and fight to retain long-held values that give meaning to 'the academy' (Martin and Whitley 2010; Martin 2011, 2016; Waitere et al. 2011; see also Bence and Oppenheim 2005; Murphy and Sage 2014). Authors contended PREAs should prove their usefulness in improving research culture, financial sustainability, research capacity, and so on in universities, rather than academics having to bow and bend to fit better the parameters of these arrangements. Some authors here foreshadowed an 'end' to universities as places for reflection and creative thinking, extinguished by the utilitarian influence of PREAs, even those PREAs that advocate and incentivize seemingly more positive societal 'impact' from research (Knowles and Burrows 2014; see also Claeys-Kulik and Estermann 2015).
Other Theme 5 literature reported changes to the global communication system of science. Academic journal editors were reported as developing strategies to inflate their own journal rankings and citation counts to pander to use of PREAs and thus to become more attractive to authors (Gibson, Anderson and Tressler 2014). Journal editors were criticized for apparently seeking fewer path-breaking, critical research ideas and methods to publish (that reportedly accumulate citations more slowly), instead favouring more immediately citable, fashionable topics and approaches that can quickly inflate journal impact factors. Some Theme 5 literature described academic editors, publishers, reviewers, universities, government, and funding agencies as collectively adapting here to PREAs (Macdonald and Kam 2010; Watermeyer 2016).
We make two main critical points about this Theme 5 literature. First, little is known about causal relationships between PREAs and many if not all of these reported changes and apparent effects (see also Gläser 2019). This holds true for micro-level changes in research topic selection and researchers' pursuit of research programmes/lines and for other levels (Waitere et al. 2011; Hammarfelt and de Rijcke 2015; De Rijcke et al. 2016; see also Laudel 2005; Whitley and Gläser 2007). There are inherent methodological difficulties in measuring and attributing PREA-related change here within and across heavily mediated, multi-level, multi-actor, regional, national, and trans-national research funding and policy 'spaces' and global 'research fields' (Nedeva et al. 2012; Whitley, Glaser and Laudel 2018).
Second, this literature may be biased by over-representation of both scholarly and more personal accounts/normative responses to the UK RAEs/REF. The UK's primary PREA is globally influential, but we must remember it is not necessarily 'best practice', has not travelled to many other regions of the world, and analytically the UK is an outlier or 'unique' (Sivertsen 2017). Reported effects there cannot be taken to be representative of effects of differing arrangements in other contexts (this criticism of course also ties in with the lack of comparative analytical frameworks across the literature state-of-the-art). There are few attempts to distinguish analytically the RAEs/REF from other PREAs or to make theory-based assumptions and arguments to link causally particular arrangements to specific effects.
In overview, most Theme 5 literature was published in academic journals (82 pieces of literature or 88% of this theme). Many arguments were built on either primary (43 pieces) or secondary data (50 pieces) and used qualitative approaches (in 61 pieces of literature). We considered most Theme 5 literature to be predominantly analytical in approach (54 pieces).

Cross-cutting issues
Looking across all five clustering themes, most literature seemed to share the view that, whatever the specific arrangements, PREAs are 'here to stay' (e.g. Martin and Whitley 2010; League of European Research Universities (LERU) 2012). There was resigned acceptance that although PREAs remain contentious, and evidence about their operation is uneven, they nevertheless are considered useful for multiple purposes. They enable governments to map, prioritize, and capitalize (better) upon research and researcher capacity within a science system. They are an accepted means to allocate research funding and infrastructure resources based upon such maps, prioritizations, and investment plans and strategies (e.g. Strehl, Reisinger and Kalatschan 2007; European Commission 2009; Hicks 2010; Olson and Rapporteurs 2011; Organisation for Economic Co-operation and Development (OECD) 2011; Cunningham, Salavetz and Tuytens 2012; Mahieu, Arnold and Kolarz 2013; Higher Education Funding Council for England (HEFCE) 2014; Arocena, Göransson and Sutz 2018).
Literature often neither sought nor found standardization or 'best practice' of PREAs. There remain open questions and unresolved debates, for example, on how to improve design and deployment of PREA-related strategies, research funding mechanisms, performance assessment methods, and key criteria; how often to conduct evaluation; whether to evaluate academic and/or non-academic research; whether to distinguish between researchers and research environments; and how to determine the most appropriate unit(s) and subject(s) of assessment (e.g. Wooding and Grant 2003; Organisation for Economic Co-operation and Development (OECD) 2009, 2010a,b; Ministry of Education 2012; Reale et al. 2018; see also Sivertsen 2017; Regan and Henchion 2019).
Despite this agnosticism regarding 'best' arrangements, there were fears of isomorphism, particularly of widespread diffusion of the UK's RAE/REF arrangements, either in entirety or specific elements, like arrangements to evaluate research 'impact'. Patterns of exploration, testing, and learning by various stakeholders (e.g. research funders, policymakers) were seen as enabling such adoption, translation, travel, and/or transplantation of PREAs from one country, region, or university context to another. Similarly, pathways were observed for 'trickle down' of national arrangements into bespoke (and sometimes highly contentious) local arrangements inside particular universities and other public research organizations.

Limitations of literature on research evaluation arrangements?
Our analysis suggests five limitations across this set of PREA-related literature. First, there are many user-driven, policymaker/funder-commissioned reports and primarily descriptive approaches. A total of 28% of our literature set was explicitly policy/practice-oriented (i.e. policy report format) and 48% provided primarily thick descriptions of specific PREAs. Such literature is useful. However, user-oriented, thick descriptions alone seem insufficient to allow more critical perspectives and predictions regarding, say, effects of arrangements and/or reactions (strategies, behaviours) of different organizational actors subjected to them (e.g. research funding agencies, universities, localized and more global knowledge communities). Similarly, descriptive accounts, even when oriented towards policy learning, may in fact hinder it because of a lack of analytical comparative foundations (and make it difficult to achieve 'mutual learning' across PREAs, as recommended by Sivertsen 2017). Descriptions of PREAs may make them appear comparable, transferable, or generalizable. Such comparisons are, however, often superficial. Adopting whole or partial arrangements without critical understanding of their use could lead to wide-ranging unintended and unexpected effects.
A second limitation is the pervasive, methodologically intractable unknowns in the literature concerning whether PREAs do produce, promote, or hinder research with specific performance-related properties (e.g. excellence, novelty, breakthrough, long-term focus, societal relevance, or impact). This is linked to a third limitation: the literature is inconclusive in answering whether (particularly after seeming early gains in using certain PREAs in specific countries) there are now increasing or diminishing returns for policymakers and universities to develop and deploy seemingly ever more expensive, extensive, and potentially intrusive arrangements.
A fourth limitation is that research on effects of PREAs has primarily focused on (self-)reported changes in universities. Reported effects (let alone causally attributable changes) to structures and organizations of national, trans-national, and trans-organizational research fields (knowledge communities, knowledge properties) have received much less attention. Most research has focused upon micro-level changes to research topics or topic portfolios pursued by researchers in specific universities, fields, and/or national systems.
It is clear that design and deployment of PREAs does not take place in a vacuum. PREAs are parts of and are strongly 'coupled' to a wider universe of path-dependent, dynamic activities, and exercise of power, authority, resources, politics, and policy machinery (Whitley 2016). And yet a fifth limitation here is the absence of comparative frameworks to account for these aspects across the many and various development and use contexts of PREAs.

A novel research agenda on PREAs?
We believe four elements for a novel research agenda on PREAs emerge from our critical review and analysis of the state-of-the-art. First, very few, if any, analytical frameworks exist to study and compare research evaluation arrangements. There are examples of comparative frameworks (Geuna and Martin 2003) but most, with a possible exception in Whitley, Glaser and Laudel (2018), use descriptive not analytical characteristics. This reduces analytical capacity and availability of heuristics and theory to explain the many interacting mechanisms present in and across micro, meso, and macro levels of the global science system. A novel research agenda could therefore first include development and testing of comparative analytical frameworks.
Literature on rationales for PREAs has predominantly dealt with efficiency concerns. These determine, we argue, whether arrangements have achieved what they set out to achieve. Studying efficiency of research evaluation as a policy instrument is a worthy pursuit. However, there are practical and analytical limitations inherent in this delineation of a research agenda. A second element of a novel research agenda would be to incorporate effectiveness concerns, that is, are the 'right' things being done in the science system? This should trace beyond localized conditions for research (e.g. at universities) to incorporate treatment of potential changes in the structure of global 'research fields' (c.f. Nedeva 2013).
Literature studying effects of PREAs on the science system has also largely focused on ad hoc associations between effects and measures. The arrangement under discussion is commonly taken to be a universal or singular enabler of the observed effects. A third element of a novel research agenda could be to attempt to add causal attribution to verify such assumptions.
Finally, we see from the literature that PREAs typically target research organizations in national policy and funding spaces. Correspondingly, studies seemingly rarely examine effects beyond those for universities in their own local context. 13 A fourth and final element of a novel research agenda would seem to be to include effects on the structure of global knowledge communities and bodies of knowledge. A summary of these five limitations and four novel agenda elements is provided in Table 3.

Conclusion
Our critical review and purposive analysis of 354 pieces of literature we feel addresses the state-of-the-art on PREAs. It spanned works published from 1968 to 2018 and encompassed both scholarly and policy/practice-related research orientations. We believe our analysis satisfied our research aims, that is, to enable us to highlight key arguments, analyse limitations, and to suggest how to progress the research agenda in this area.
From our review we can conclude, first, analytical comparative frameworks are needed to study PREAs. Second, not only efficiency but also effectiveness concerns should be considered for PREAs. Third, studies should be devised and conducted on science system-level effects of PREAs and how global research fields are affected rather than just particular studies of local settings. Fourth, methodologies need to be advanced to measure and attribute these effects of PREAs on the (global) science system.
All four elements of this novel research agenda seem both necessary and challenging. There are numerous levels of mediation of effects and inherent complexities to unpack layer upon layer of research-related conditions here. We limited our article's aims to (re-)opening the research agenda on PREAs and their effects on the science system by means of a critical, purposive, inductive examination of PREA-related research themes, and identification of agenda gaps. Developing analytical frameworks for PREAs, perhaps even outlining 'ideal' types of PREAs, and stretching studies of effects to include research fields appear essential. Similarly, learning to cope better with effectiveness, measurement and attribution issues seem necessary next steps, to take studies of PREAs further, to the benefit of both academic and practitioner interests.

Notes
1. We provide our full definition of 'performance-based research evaluation arrangements' in the following section, and distinguish it from research evaluation 'systems'.
2. We focused our attention on journals publishing on topics of higher education studies, higher education policy, higher education management, sociology of science and science and technology policy studies, as well as fields like health policy and studies where research evaluation is addressed as a side issue in larger discussions (e.g. on priority setting We did not try to access private, commercially sensitive, or confidential evaluations of specific research performers or funders. The entire set of academic and grey literature is heterogeneous, even though we confined our search to publicly available, English-language materials. This is likely due to significant involvement of funders in sponsoring research say, to audit their resource allocation processes and evaluate the outcomes of their funded research.
4. Our full country coverage includes Australia, Austria, Belgium/Flanders, Brazil, Bulgaria, Canada, China, Czech Republic, Denmark, Estonia, Finland, Germany, Hong Kong, Hungary, India, Ireland, Italy, Japan, Korea, Latvia, Lithuania, Mexico, Morocco, Netherlands, New Zealand, Norway, Poland, Romania, Slovak Republic, Slovenia, South Africa, Spain, Sweden, Switzerland, UK, USA, and Uruguay.
5. Correspondingly we cite some literature in multiple theme sections of our later findings, when their secondary message(s) are relevant, denoted by 'see also' in our citations. Our choice to allocate by primary theme rather than cluster in multiple themes by coverage of all issues is of course contentious. However, we believe this subjective approach provides a more useful thematic clustering for our purposes than exhaustively cataloguing by primary, secondary, tertiary, etc. themes.
6. We considered coding our subjective judgement of the apparent quality of literature. We decided against this step, in case it influenced our later analysis.
7. For ease of reference, we also included in our database columns for author name(s), title of the work, and publication year.
8. We provide numbers indicatively to show how much literature clustered into each theme, and the surveyed balance of approaches and content (e.g. scholarly, policy/practice orientations). Our numbers and percentages do not constitute general impressions about the broader universe of evaluation-related research that exists outside our specific analytical boundaries for literature on PREAs.
9. One piece of literature we coded 'other'; it was more abstract in its descriptive approach.
10. We considered 'rationales' for research evaluation to be within our analytical remit because they were present in the PREA-related literature. Our inductive clustering of themes reflects that these issues were being discussed in material within the scope of our PREA definition.
11. Of all the clustering themes, literature in Theme 3 was the most what we would call 'synthetic', in that primary messages often combined aspects of one or more of our analytical themes.
12. This 'strategic approach' concerns selectivity and concentration of research resources to research areas, researchers and teams, and universities displaying characteristics associated with excellence: share of highly cited publications, citations/impact, external grants capture, industry links, and patents. New tools and data to measure this notion of excellence have been associated with pressures on research systems to adapt to dominant ideas around value for money, steering and control, accountability, and measurement (Butler 2003; see also Debackere and Glanzel 2003; Geuna and Martin 2003; Linkova 2014).
13. Other effects considered do include researcher careers, but predominantly just 'organizational' careers, still constraining analysis within the policy/funding/university 'space' and not on to 'cognitive' or 'knowledge community' careers in research fields (c.f. terminology from Laudel 2017).