Evaluating health systems strengthening interventions in low-income and middle-income countries: are we asking the right questions?

the characteristics of complex adaptive systems such as non-linearity of effects or interactions between the HS building blocks. While we do not argue that all evaluations should be comprehensive, there is a need for more comprehensive evaluations of the wider range of the intervention’s effects, when appropriate. Our findings suggest that the full range of barriers to more comprehensive evaluations need to be examined and, where appropriate, addressed. Possible barriers may include limited capacity, lack of funding, inadequate time frames, lack of demand from both researchers and research funders, or difficulties in undertaking this type of evaluation.


Introduction
It is now well accepted that strong health systems are paramount to achieving health systems goals (Evans et al. 2008). Consequently, several new interventions or initiatives have been launched, at global and national levels, to address some of the bottlenecks to scaling up essential health interventions and to strengthen some components of the health system (van Etten et al. 2006). At the global level, the GAVI Alliance, the Global Fund to Fight AIDS, Tuberculosis and Malaria, and other major funders have been explicitly encouraging the inclusion of health systems strengthening interventions in grant applications in recent years, and several international initiatives dedicated to strengthening health systems have been established, e.g. the Implementation Research Platform hosted by the Alliance for Health Policy and Systems Research, the International Health Partnership (IHP+) and the High-level Taskforce on Innovative Financing for Health Systems (Bennett et al. 2008; Alliance for Health Policy and Systems Research 2011).
At the same time, there has been increasing recognition of the need for more rigour in designing and evaluating the effects of global health initiatives and interventions that aim, at least in part, to strengthen health systems in order to improve the population's health (Evans et al. 2008; Mills et al. 2008; de Savigny and Adam 2009; Swanson et al. 2009).
Acknowledging this need, several recent publications have sought to define health systems and their boundaries (World Health Organization 2007), what is meant by health systems strengthening (World Health Organization 2007; Swanson et al. 2010), the nature of health systems research (Bennett et al. 2011; Mills 2012), and the importance of systems thinking in designing, implementing and evaluating health systems strengthening interventions (Leischow et al. 2008; Shiell et al. 2008; de Savigny and Adam 2009).
In a rapidly evolving field and amid continuing calls for more rigorous health policy and health systems research (World Health Organization 2005; Mills et al. 2008; Swanson et al. 2009; Bennett et al. 2011), what has been the practice in the field of health system evaluations in recent years and how well do such evaluations address contemporary issues and recommendations? More specifically, our recent publication on systems thinking and its value for the design, implementation and evaluation of health systems strengthening interventions argued for the need for a more comprehensive and systematic approach in thinking through the underlying causes of health systems problems, and in designing new interventions and their evaluations (de Savigny and Adam 2009). In our background review of the peer-reviewed and grey literature up to 2008, we found a very limited number of evaluations assessing the wider impact of complex health interventions on health systems. Those that did were conducted in high-income countries and on health-outcome oriented interventions such as tobacco control, obesity or cancer (Best et al. 2007; Butland et al. 2007). This paper seeks to understand what has happened since then. It does so through a review of the peer-reviewed and grey literature to map out current practices in evaluating health systems strengthening interventions in low- and middle-income countries (LMICs) and to assess the extent to which health systems effects have been explored.
For the purpose of this review, health systems strengthening (HSS) interventions include those system-level interventions that are directly targeting one or more of the six health system building blocks and their sub-components as defined by the World Health Organization (WHO) (World Health Organization 2007); or disease-specific interventions or programmes that also have important system-wide effects, e.g. scale-up of antiretroviral therapy for HIV/AIDS (de Savigny and Adam 2009). This approach implicitly reflects the relationship between the different health system components as well as the interests and power of its different actors and beneficiaries (both supply and demand side).
The overall objective of this paper is to assess the scope and research questions explored in recent HSS evaluations. More specifically, it assesses whether the research questions attempted to explore the intervention's effects across multiple health system building blocks and actors and, if they did, to what extent and with what methodological approaches.
The objective is not to appraise the quality of evidence, e.g. whether the evaluation used an appropriate study design or methods. It is rather to assess whether the evaluations ask a broader set of questions relevant for policy making: for example, whether or not the intervention worked as intended, and which elements contributed to its success or failure and could influence the replication of its impact in other settings. If the intervention has broader implications across the health system, the question is what these are with respect to both intended and unintended effects (Trochim et al. 2006; de Savigny and Adam 2009; Paina and Peters 2011).
Without such deep analysis of the process and context around which the intervention worked, and, as relevant, its broader effects on the health system as a whole, evaluations may over- or under-estimate the actual impact of the intervention, or overlook important effects on the system itself or other interventions already in place (Rychetnik et al. 2002; Savedoff et al. 2006).

Literature search
We conducted a systematic search in Medline and Embase as well as the individual websites of 36 institutions including research funders, think tanks, academic and research institutions, and partnerships and alliances identified through a web-based search or known to conduct and publish evaluations of public health interventions.
The search strategy builds on previous literature reviews with similar objectives (Bennett et al. 2008; Lewin et al. 2008; Ritz et al. 2010; Adam et al. 2011), with further refinements and iterative testing of individual search terms (see Supplementary Data Web Annex 1). Articles were included if they met the following criteria: (a) Evaluation studies defined as studies that report on the output, outcome or impact of an intervention on the health system or studies assessing if and how a programme worked. (b) Health system strengthening intervention defined as 'system-level' interventions directly targeting one or more of the six health system building blocks and their sub-components (see Table 4); or disease-specific interventions expected to have large system-wide effects (de Savigny and Adam 2009). (c) Low- and middle-income countries based on the World Bank classification (World Bank 2011).
For the second criterion above, we sought to assess the general relevance of interventions and to exclude those that would only weakly impact the performance of health systems, or the values and interests of its actors or beneficiaries. Our common interpretation considered this to involve: (i) interventions with system-level changes as opposed to changes at the organizational level (e.g. interventions involving changes to patient access to care have system-level repercussions and were included, but not those focusing on modifications of patient flow within a health facility, which are unlikely to have system-wide impact); (ii) a need for a systems approach or complex interventions that require identifying and evaluating interactions between health system building blocks/sub-systems (e.g. a systems approach for evaluating training aimed at improving the quality of care provided, but not for evaluating the relevance of the training material); and (iii) evaluations of the cost-effectiveness of system-level interventions, but not simple costing analyses (e.g. cost-effectiveness analyses of task-shifting, but not a costing study of malaria case management).
Given our prior publication on systems thinking and evaluation (de Savigny and Adam 2009), the search focused on articles published in 2009-10 in order to assess the most recent practice in the field since the prior publication, and to get a good picture of the most recent evaluations. There was no language restriction.

Literature screening
Two independent raters (JH and TA) screened the first 100 articles against the inclusion criteria to determine inter-rater agreement. Any disagreement in article selection was discussed until consensus was reached. We then calculated the level of inter-rater agreement using a simple Kappa analysis (Cohen 1960); at least substantial agreement (i.e. kappa exceeding 0.6) was desired for a decision to continue with a single rater (Landis and Koch 1977). The calculated kappa score was 0.81, classifying the level of agreement as 'almost perfect' (Landis and Koch 1977). We therefore continued the screening for article selection with one rater (JH).
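To make the agreement statistic concrete, the sketch below computes Cohen's kappa for two raters making include/exclude decisions and maps the result to the Landis and Koch (1977) bands. The study reports only the resulting kappa (0.81) for the first 100 articles, not the underlying agreement table, so the counts used here are hypothetical and chosen purely for illustration; the function names are likewise assumptions, not part of the original analysis.

```python
def cohen_kappa(both_include, only_a, only_b, both_exclude):
    """Cohen's kappa for two raters making a binary include/exclude decision."""
    n = both_include + only_a + only_b + both_exclude
    if n == 0:
        raise ValueError("no rated items")
    # Observed agreement: proportion of items on which the raters agree.
    p_o = (both_include + both_exclude) / n
    # Marginal probability that each rater says "include".
    a_inc = (both_include + only_a) / n
    b_inc = (both_include + only_b) / n
    # Agreement expected by chance from the marginals.
    p_e = a_inc * b_inc + (1 - a_inc) * (1 - b_inc)
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


def landis_koch_label(kappa):
    """Map kappa to the descriptive bands of Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical counts for 100 screened articles (NOT the study's data).
kappa = cohen_kappa(both_include=10, only_a=2, only_b=2, both_exclude=86)
print(round(kappa, 2), landis_koch_label(kappa))
```

With these illustrative counts the raters agree on 96 of 100 articles, yet kappa is noticeably lower than 0.96 because most of that agreement is expected by chance when the large majority of articles are excluded.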

Data abstraction and analysis
We retrieved the full text of all articles that met the inclusion criteria. Data abstraction included the following variables: country where the evaluation was conducted; type of intervention, i.e. system-level or disease-specific (see definition above); name and brief description of the intervention; primary health system building blocks targeted by system-level interventions, as defined in the studies (see Table 4 and Supplementary Data Web Annex 2 for a description); whether the scope of the evaluation was narrowly or broadly defined; whether a conceptual framework was described; whether process and context evaluations were conducted; and finally the types of impact assessed (see Table 1).
Process evaluation is defined as an evaluation which examines the extent to which the intervention was implemented as intended, including the distribution and coverage of its input components, such as availability of medicines, training of health workers, quality of care, as well as the acceptability of the intervention to the parties involved (Hawe et al. 2004; Oakley et al. 2006). It therefore helps to determine the internal validity of the evaluation, i.e. whether the intervention was adequately implemented and therefore the observed effects can be attributed to the intervention. In case of failure of an intervention, it helps to explain if the failure is due to an inherent problem with the intervention itself, i.e. the theory behind how it should work, or insufficient or inadequate 'dose' of implementation (Rychetnik et al. 2002; Schellenberg et al. 2004). Simply describing the intervention and the implementation process was therefore not considered a process evaluation, but rather information on the process (Rychetnik et al. 2002). This differentiation between process evaluation and information on the process was captured by two separate variables (see Table 1).
Context evaluation is defined as systematic documentation of naturally occurring events in the settings where the intervention was evaluated that might influence either positively or negatively the uptake of the intervention or the level of its impact. They are normally conducted throughout the evaluation period, or before and after, and are usually collected through key informant interviews, or logs of relevant events, or interventions likely to affect the impact of the intervention in question (Hawe et al. 2004; Schellenberg et al. 2004). If some information on the context was provided but not in a systematic pre-conceptualized manner, this information was captured separately (see Table 1).
For those evaluations considering the intervention's impact across three or more building blocks, a deeper assessment of the nature of these evaluations was conducted, including choice of study design, methodological approaches and what impact was assessed with what measures. It also included whether they took into account any of the characteristics of complex adaptive systems (CAS) in their research design or methods, such as non-linearity of effects, time delays or feedback between the different health system components (de Savigny and Adam 2009; Paina and Peters 2011). This was done by reading the methods section of these evaluations and screening them for any mention of CAS or approaches to account for them, as described in de Savigny and Adam (2009) and Paina and Peters (2011).
Data coding was done separately by the two raters and any discrepancies were discussed until consensus was achieved.
A database of abstracted data was developed in Excel. Cross tabulation and frequencies were performed as well as an in-depth assessment of the nature of evaluations considering system-wide effects, as described above.

Study selection
The search in Medline and Embase resulted in a total of 2212 unique articles after removal of duplicates between the two databases, which accounted for 13%. Almost 60% of the articles were retrieved from Embase. Titles and abstracts were screened against the inclusion criteria and 91 articles were kept for data abstraction. Full text could not be located for 6 of those articles, leaving 85 articles for further analyses. The grey literature search resulted in 21 articles that met our inclusion criteria, retrieved from 7 out of 36 institutional websites (Table 2).
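As a quick consistency check, the selection counts reported above can be tallied in a few lines. This is only a sketch: the variable names are illustrative and not from the original study, and the figures are those stated in the text.

```python
# Counts as reported in the study selection paragraph.
unique_records = 2212            # Medline + Embase after de-duplication
kept_after_screening = 91        # met inclusion criteria on title/abstract
full_text_unavailable = 6        # full text could not be located
grey_literature_included = 21    # from 7 of 36 institutional websites

# Derived quantities.
excluded_at_screening = unique_records - kept_after_screening
peer_reviewed_included = kept_after_screening - full_text_unavailable
total_included = peer_reviewed_included + grey_literature_included

print(excluded_at_screening, peer_reviewed_included, total_included)
```

The total of 106 evaluations is the denominator used in the results that follow.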
The majority of exclusions concerned studies that did not evaluate the output, outcome or impact of an intervention, but were situational analyses or cross-sectional surveys. Also excluded were evaluations whose research objective was not concerned with the intervention's impact on the health system but on a clinical (was the treatment effective), technical (was a costing or monitoring tool effective) or operational aspect (was the training material applicable). Eighty per cent of the evaluations were in low-income or lower-middle-income countries, with almost half of the evaluations conducted in sub-Saharan Africa (48%) followed by East Asia and the Pacific (19%) and Latin America and the Caribbean (11%) (see Figure 1).

Table 1 (excerpt): definitions of variables
Broadly defined scope of the evaluation: If the evaluation objective was not explicitly restricted, e.g. in the title or abstract, to a confined question, e.g. effect of the intervention on equity, or waiting time.
Types of impact assessed:
1 Health outcomes only? (yes or no) If the evaluation was limited to assessing the impact of the intervention on individuals' or populations' health (e.g. morbidity or mortality), and did not look at the impact on any of the health system (HS) building blocks.
2 HS effects on one targeted building block? (yes or no) If the evaluation explored the impact on one HS building block targeted by the intervention. Health outcomes may or may not have been assessed.
3 HS effects on two targeted building blocks? (yes or no) If the evaluation explored the impact on two HS building blocks targeted by the intervention. Health outcomes may or may not have been assessed.
4 HS impact across three or more building blocks, whether or not targeted by the intervention? (yes or no) If the evaluation explored the possible impact on three or more HS building blocks, whether or not they were targeted by the intervention. Health outcomes may or may not have been assessed.
5 HS effects on other building blocks not targeted by the intervention? (yes or no) This variable only looks at whether the evaluation explored the impact of the intervention on other HS building blocks not targeted by the intervention, regardless of which other effects from the above were also assessed.
6 HS effects on other sectors? (yes or no) This variable only looks at whether the evaluation explored the impact of the intervention outside the health sector, regardless of which other effects from the above were also assessed.
7 Complex adaptive systems characteristics? (yes or no) This variable looks at whether the evaluation attempted to capture effects linked to any of the characteristics of complex adaptive systems such as non-linearity of intervention effects or interaction between HS building blocks.

Nature of the interventions
Out of the 106 evaluations included in this analysis, 91 were system-level interventions, targeting one or more of the health system building blocks; the remaining 15 were evaluations of the large scale-up of disease-specific interventions. Table 3 shows the types of interventions assessed by these evaluations and the most frequent examples within each type. Interventions centred around 11 major groups, with financing interventions being the most frequent, followed by models of service delivery, human resource strategies and scaling up of a health programme. HIV/AIDS was the most common disease explored followed by malaria. In general, even when interventions were classified as system-level, the entry point was often a disease rather than strengthening of a particular aspect of the system across various health services. It is worth noting that interventions with the same name and overall objective varied substantially in the way they were defined (by the studies) with respect to their degree of complexity. For example, task shifting, voucher schemes and pay-for-performance involved 1-3, 3-5 and 1-3 building blocks, respectively (data not shown). The majority of interventions targeted the supply side; only a few focused on the demand side, e.g. using voucher schemes.
Table 4 shows the health system building blocks targeted by system-level interventions. For example, 60 studies addressed service delivery, of which 20 focused on service delivery and one other building block. Of these 20 studies, 16 examined access, availability, timeliness, responsiveness or satisfaction; three evaluated public/private partnerships in service provision and four examined quality and safety of care. Most interventions were complex, targeting two or more building blocks. All building blocks were involved to a varying extent, although the most common intervention components were around service delivery, financing, health workforce and governance issues around service delivery. Information systems were the building block least targeted by system-level interventions.

Nature of the evaluations
With respect to the nature of the evaluations, 43% have chosen broadly defined questions allowing for an assessment of a wider range of the intervention's effects (Table 5). Only half of the evaluations presented or referred to a conceptual framework, often linked to multivariate regression analyses of the intervention's impact on specific outcomes. The other half either listed a small set of questions or hypotheses that they aimed to answer or went directly to describe their data sources and findings without a prior description of which outcomes they chose to explore and why.
Around 60% of the evaluations provided information on the process of implementing the intervention and 20% provided contextual information to be able to situate the intervention and the observed effects within the context in which it was being implemented. With respect to process and context evaluations, 24% and 9% have included or referred to these components in their evaluations, respectively. Most of these evaluations assessed the scaling of HIV/AIDS services or the impact of global health initiatives on health systems and most were obtained from grey literature. Despite the high proportion of studies that involved complex interventions (i.e. interventions which addressed multiple building blocks), the nature of the evaluations and the type of impact assessed did not reflect that complexity. Six evaluations looked only at health outcomes, e.g. mortality rates or treatment outcomes, while they evaluated interventions addressing up to five health system building blocks. More than half assessed the intervention's effects on one building block only, while the interventions involved were mostly complex with components covering two or more building blocks. Only seven explored the intervention's impact across three or more building blocks (Table 5).
Of the 19 evaluations that explored the intervention's impact on other building blocks, all except one looked at one other building block, most often service delivery. Only one study looked at the impact outside the health sector, in the form of household behaviour related to child labour and schooling, and employment of adults. It also looked at health outcomes but did not look at any of the building blocks targeted by the intervention (Rocha and Soares 2010). Finally, none of the evaluations explored system effects that reflect the complex adaptive nature of health systems.
In-depth assessment of evaluations that explored impact on three or more health system building blocks
We now turn our attention to the seven evaluations that explored the intervention's impact across multiple building blocks to explore the full range of system-wide effects they considered. Table 6 describes their main characteristics and methodological approaches. Six evaluations used mixed methods and one used only quantitative methods. In most cases, a limited set of commonly used effects measures was used. Plausibility designs with historical controls were the most frequent design choices (Victora et al. 2004). Among these evaluations, Loevinsohn et al. However, in most cases, the range of effects explored was limited, perhaps linked to the fact that conceptual frameworks were not always elaborate or comprehensive and often only limited to hypothesis testing (Loevinsohn et al. 2009; Witter et al. 2010). Interestingly, the two most comprehensive evaluations, in our view, involved evaluations of interventions using participatory approaches in designing, monitoring and continuously improving the intervention, using data-driven and

Journals publishing peer-reviewed studies
Finally, we also analysed the nature of the peer-reviewed journals that published the evaluations included in our study. Overall, journals that accepted evaluations looking at a wider range of impact have also accepted those with a narrowly defined focus. However, evaluations that did not explore, or explored a limited set of, the intervention's impact on the health system, were mostly published in medical or specialized journals, while most evaluations that explored impact across multiple building blocks were published in journals focusing on health policy, public health or social sciences.

Discussion
In this paper, we reviewed recent evaluations of health systems strengthening interventions in LMICs to assess whether they explored the intervention's effects across multiple health system building blocks, and if they did, to what extent, and with what methodological approaches. Most of the evaluated interventions were complex, with 75% of them involving two or more health system building blocks. However, less than half of the evaluations asked a broad set of research questions to allow for a wider assessment of the intervention's impact on the health system. Only half presented or referred to a conceptual framework to guide the assessment of the intervention's impact. Less than a quarter included process evaluation and 9% included context evaluations.
Among those who conducted process evaluation, most have used classic indicators of the intervention's coverage or implementation rates, e.g. number or percentage of health workers trained, education sessions held, medicines kits distributed, etc. As Hawe et al. (2002) argued, while it is logical to measure if what was promised actually happened, a more prudent approach is to also think through and examine the intervention's causal assumptions that may have led to the measured degree of implementation and impact, which may or may not match the hypothesized theories that guided the intervention's design (Hawe et al. 2004).
Table footnote: These do not include the 15 evaluations of large scale-up of disease-specific interventions, as the primary focus of the intervention is not a health system or sub-system building block.
Assessing the degree to which context evaluation has been adequately performed was much harder to undertake. This may be partly a reflection of the lack of guidance on how to take the impact of context into account and on how to report it, which made several studies stop at listing what else is happening, without attempting to evaluate their likely impact on affecting the course of the intervention and its applicability to other contexts (Rychetnik et al. 2002). Some studies only mentioned contextual factors in the discussion section to explain ex-post why the intervention did not work as intended or why the results were not as expected (Arifeen et al. 2009).
With respect to health systems impact, half of the evaluations assessed the intervention's impact on one targeted building block. Only seven explored the impact on three or more building blocks. One evaluation assessed the intervention's impact on other sectors. None explored the relationship and interconnectedness between the different building blocks or other characteristics of complex systems such as non-linearity of effects or time delays (Shiell et al. 2008).
Interestingly, the two most comprehensive evaluations, in our view, involved evaluations of interventions using participatory approaches in designing, monitoring and continuously improving the intervention (Doherty et al. 2009; Youngleson et al. 2010). This may be something inherent to participatory evaluations that led to a more comprehensive 'system-wide' approach to assessing the intervention's impact. For example, involving stakeholders early on in the design process and engaging them in assessing and solving implementation barriers is at the heart of systems thinking, where the intervention's effects, anticipated or not, can be explored, discussed and considered in designing and evaluating health interventions (de Savigny and Adam 2009).
Our findings are consistent with other similar studies. For example, Paina and Peters (2011) did not identify any examples where models of scaling up health interventions have been examined through the lens of complex adaptive systems. Our findings are also consistent with our previous analysis, which could only identify a few examples of comprehensive evaluations that considered the complexity and dynamic nature of health systems, all of which targeted specific diseases or conditions, e.g. tobacco control or obesity (de Savigny and Adam 2009).
We do not argue that all evaluations should be comprehensive. Indeed, in our report on systems thinking and its role in evaluations, we argued that not all interventions require a systems thinking approach. However, we argued that interventions can be seen as a continuum, where the more complex the interventions are, the more the need for systems thinking and comprehensive assessment of system-wide effects (de Savigny and Adam 2009).
Table footnotes: a See Table 1 for a definition of the variables. b The remaining two studies explored the intervention's impact on one other building block but not that targeted by the intervention. They are included in the variable on health system impact on other building blocks.
HEALTH POLICY AND PLANNING
This study highlights the need to understand the possible barriers to more comprehensive evaluations, when they are appropriate. A recent study eliciting the views of a wide range of stakeholders in the Eastern Mediterranean Region identified a range of barriers to more comprehensive evaluations. They included lack of technical capacity to undertake such evaluations; limited awareness and appreciation of the value of adopting a more comprehensive, systems thinking approach, in designing and evaluating health systems interventions; as well as a perceived notion of their costliness, combined with limited support from, and investments by, research funders. Respondents also highlighted the importance of generating awareness among policy makers to provide the necessary support and demand for such comprehensive evaluations (El-Jardali et al. forthcoming).
Our analysis has a number of limitations. First, it only includes evaluations available from two literature databases, Medline and Embase, and a limited number of web-based grey literature. Second, the analysis only focused on evaluations published in 2009-10. However, our aim was not to take stock of all evaluations undertaken on this topic, rather to have a general understanding of how the field of evaluations has been developing, particularly in response to recent calls for more rigorous and comprehensive assessment of efforts to strengthen health systems, including the application of systems thinking concepts and tools in conceptualizing and evaluating health interventions.

Conclusion
Very few evaluations attempted to conceptualize the possible effects of interventions on multiple health system building blocks. While we do not argue that all interventions require a comprehensive evaluation of the system-wide impact, we argue for the need for more evaluations that explore the wider range of impact on the health system as a whole, and even beyond the health sector, as appropriate.
There are several untapped resources that could make a significant contribution to this field, including consideration of the underlying concepts of complex adaptive systems; systems thinking concepts, tools and approaches; as well as adopting and learning from social sciences and policy analysis perspectives, both involving complex social and political phenomena constructed and influenced by human action, all very relevant to health systems and the field of evaluations (de Savigny and Adam 2009; Gilson et al. 2011; Paina and Peters 2011).
Finally, this study highlights the need to strike a balance between identifying easy-to-answer research questions and asking more difficult but important research questions. The latter would require adopting a more problem-solving attitude to research and being more flexible and innovative in employing research strategies that are deemed appropriate for the research questions (Paina and Peters 2011).