Comparing the use of direct observation, standardized patients and exit interviews in low- and middle-income countries: a systematic review of methods of assessing quality of primary care

Abstract Clinical records in primary healthcare settings in low- and middle-income countries (LMIC) are often lacking or of too poor quality to accurately assess what happens during the patient consultation. We examined the most common methods for assessing healthcare workers’ clinical behaviour: direct observation, standardized patients and patient/healthcare worker exit interview. The comparative feasibility, acceptability, reliability, validity and practicalities of using these methods in this setting are unclear. We systematically review and synthesize the evidence to compare and contrast the advantages and disadvantages of each method. We include studies in LMICs where methods have been directly compared and systematic and narrative reviews of each method. We searched several electronic databases and focused on real-life (not educational) primary healthcare encounters. The most recent update to the search for direct comparison studies was November 2019. We updated the search for systematic and narrative reviews on the standardized patient method in March 2020 and expanded it to all methods. Search strategies combined indexed terms and keywords. We searched reference lists of eligible articles and sourced additional references from relevant review articles. Titles and abstracts were independently screened by two reviewers and discrepancies resolved through discussion. Data were iteratively coded according to pre-defined categories and synthesized. We included 13 direct comparison studies and eight systematic and narrative reviews. We found that no method was clearly superior to the others—each has pros and cons and may assess different aspects of quality of care provision by healthcare workers. All methods require careful preparation, though the exact domain of quality assessed and ethics and selection and training of personnel are nuanced and the methods were subject to different biases. 
The differential strengths suggest that individual methods should be used strategically based on the research question or in combination for comprehensive global assessments of quality.


Background
Improving healthcare quality is a major global public health challenge, particularly in low- and middle-income countries (LMICs) (United Nations, 2015), and a recent report argues that quality of care has overtaken access to healthcare as the largest problem facing health systems in LMICs (Kruk et al., 2018). High-quality healthcare is an essential pillar of Universal Health Coverage and a target of the United Nations' (UN) Sustainable Development Goal (SDG) 3 (United Nations, 2015). Most care is delivered in primary care and a large proportion of secondary care is based on referral from primary care. Poor quality of care provision by healthcare workers (doctors, pharmacists) in primary care in LMICs has been evidenced in many studies (Das and Sohnesen, 2007; Das et al., 2008, 2012, 2015; Daniels et al., 2017; Kwan et al., 2018). Improving the quality of primary healthcare in LMICs is a current priority (World Health Organisation, 1978; Chabot, 1988; World Health Organisation, 2018, 2019). It is difficult to assess the quality of primary care in an LMIC setting. In high-income countries (HICs), clinical records or databases are often used for this purpose, but in LMICs these data can be of poor quality or incomplete and, depending on where patients consult, may be lacking entirely (Lilford et al., 2007; Brown et al., 2008; Luna et al., 2013). Donabedian (1966) suggests that quality of care can be assessed in terms of structure, process and outcome, and described a causal chain linking structure to process and hence outcome. In this paper, we concentrate on process, which can be broken down into processes carried out at the system level, such as use of audit and feedback or improving staff morale, and clinical processes impacting directly on patients, such as questions asked to make a diagnosis or prescribe a treatment (Lilford et al., 2010).
We refer to the latter as the technical quality of care, corresponding with the definition provided by Donabedian (1988).
A number of methods have been used to assess the technical quality of healthcare. Miller (1990) argued that there are differences between what providers know, know how or show how to do in an examination setting and what they actually do in a real-life clinical encounter. The use of vignettes (written case descriptions) alone can only provide an assessment of the former; we will instead focus on three methods that assess the real-life delivery of care:
• Exit interviews/questionnaires: patients/carers/healthcare workers asked post-consultation about the provision of care in the consultation (Franco et al., 2002; Schoen et al., 2004);
• Direct observation: clinical practice is observed first-hand during consultations or via video- or audio-recording (Stojan et al., 2016); and
• Standardized patients: individuals trained to act as patients and simulate a set of symptoms/problems to portray a particular clinical case (Peabody et al., 2000).
These methods have been used extensively in medical education for training medical students and postgraduate and practising doctors in a variety of settings in HICs for decades (Beullens et al., 1997;Overeem et al., 2007;Rethans et al., 2007;Hrisos et al., 2009). There is now a growing evidence base of their application in LMICs (Watson et al., 2006;Xu et al., 2012;King et al., 2019;Kwan et al., 2019). While many of the references cited above are systematic reviews of one of these three methods, a systematic examination of the relative merits and drawbacks between these methods in LMIC settings is lacking.
In this paper, we review studies that have directly compared two (or more) of these methods 'head to head' and synthesize existing systematic and narrative review evidence on each method. We present a comparative overview of the feasibility, acceptability, validity, reliability, ethics, resources and costs involved in using these methods in the LMIC primary care setting. Our goal is to compare and contrast the pros and cons of using these methods to provide a resource to guide the future use of these methods in this context.

Methods
We carried out two systematic reviews. The first review focuses on primary studies carried out in LMICs that compare two or more of direct observation, standardized patients and exit interviews head to head (hereafter termed Direct Comparison Studies). The second review supplements these data with an overview of the existing systematic and narrative review evidence on each of the different methods (hereafter termed Overview of Reviews). The reviews were conducted in accordance with best practice guidelines from the Cochrane Collaboration (Higgins and Green, 2011), and have been reported using the guidance published in the PRISMA statement (Moher et al., 2009).

Protocol and registration
The systematic review protocol is registered on the Prospero register (CRD42018088226). 1980), CINAHL (from 1981), ASSIA (from 1987) and the Cochrane Library (from 1995). We first carried out searches to collate the Direct Comparison Studies in November 2018 and updated these searches in November 2019. We carried out the Overview of Reviews search in February 2018 and initially focused on the standardized patient method. The search was updated and expanded to all methods of interest in March 2020.
The search strategies used both indexed terms and keywords relating to important concepts of the review, including general terms related to healthcare quality and specific terms related to each of the three methods of assessing care quality. We tailored searches to the individual requirements of each database and applied an LMIC filter from the Cochrane Effective Practice and Organisation of Care (EPOC) review group (https://epoc.cochrane.org/lmic-filters) for the Direct Comparison Studies' search. We used truncations, wildcards and proximity operators where appropriate in all searches. The searches for the Overview of Reviews were restricted to review articles. Detailed search strategies can be found in the Supplementary Appendix.

Eligibility criteria and study selection
Titles and abstracts retrieved were assessed independently by two reviewers against the inclusion criteria. The inclusion criteria for the Direct Comparison Studies review were as follows:
• Primarily concerns the technical quality of healthcare;
• Involves at least one comparison between direct observation, standardized patients or exit interview;
• Method has been applied to a primary or outpatient care encounter in a real-life rather than educational setting; and
• Reports on a primary research study carried out in an LMIC setting.
The inclusion criteria for the Overview of Reviews were:
• Primarily concerns the technical quality of healthcare;
• Involves direct observation, standardized patients or exit interview;
• Method has been applied to a primary or outpatient care encounter in a real-life rather than educational setting;
• Systematic or narrative review; and
• Provides empirical evidence on feasibility, acceptability, validity, reliability, ethics, resources and/or costs of the method(s).
While we focus on studies in which the quality-of-care assessment methods were directly compared in LMIC settings, we intentionally include review articles that have summarized literature related to application of these methods in both LMICs and HICs in order to cover a wider evidence base, as many features, strengths and weaknesses of each method hold true across different settings. An English language restriction was applied during study selection. No other restrictions were applied. Reference lists of included papers and other published reviews were hand searched to identify additional references. Duplicate references were removed. Discrepancies between reviewers' decisions were resolved through discussion, with access to full-text papers available where necessary.

Data extraction and synthesis
Data from studies confirmed to be eligible following the study selection process described above were extracted and coded according to a thematic framework covering several categories, which were established a priori and refined during the data collection process. We extracted data separately for the Direct Comparison Studies and the Overview of Reviews but used the same thematic framework. The final categories were: country, location and setting; study design and sampling of patient, healthcare provider and healthcare facility; recruitment method and sample sizes (i.e. number of patients, healthcare providers, facilities and clinical encounters); sample characteristics; medical conditions or services involved; method of assessing care quality (including data collection tools); and training of study personnel. We also recorded information on feasibility; acceptability from the patient and provider perspective; practicality (including ethical considerations, costs and resources required); inter- and intra-rater reliability; content validity; criterion validity (measures of agreement between different methods or measures of accuracy of one method judged against another method/reference standard); and detection rate for the standardized patient method. Data extraction was undertaken by one reviewer and checked by a second reviewer, who together with a third reviewer derived the main themes for each of the data categories, which we used to construct summary tables and inform the narratives in this paper. In order to establish the level of agreement between methods, different methods should ideally be deployed for the same consultation, so that findings from different methods can be compared with all other things being held equal.
However, we noticed that in some of the included studies, measures of quality of care were taken using different methods during different consultations, and then the findings from the methods (based on different consultations) were compared using healthcare worker as the unit of analysis. In these cases, measurements obtained by different methods could be influenced by differences in the nature of individual consultations (e.g. patient's presenting symptoms, health literacy, expectation, etc.). Consequently, it is difficult to attribute any observed disagreements to either the characteristics of the methods or the characteristics of individual consultations. We therefore made a clear distinction between these two types of studies, with more emphasis placed on the former which we term within-consultation comparisons (with individual consultation as the unit of analysis). Where a method did not share features with the other methods examined such as ethics of standardized patients or intrusiveness of an observer, the differences between the methods were highlighted in our descriptive analysis but were not possible to compare head to head.
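The distinction between the two types of comparison can be illustrated with a minimal Python sketch. The record layout, the method labels and the function names are assumptions made for illustration only and are not taken from any included study. The first function pairs findings from two methods on the same consultation; the second summarizes one method per healthcare worker, which is all that can be compared when the methods were applied to different consultations.

```python
from collections import defaultdict

# Each record is a hypothetical tuple:
# (consultation_id, provider_id, method, item_performed)


def within_consultation_agreement(records, method_a, method_b):
    """Share of consultations, assessed by BOTH methods, on which they agree."""
    by_consultation = defaultdict(dict)
    for consultation_id, _provider_id, method, performed in records:
        by_consultation[consultation_id][method] = performed
    pairs = [
        (found[method_a], found[method_b])
        for found in by_consultation.values()
        if method_a in found and method_b in found
    ]
    if not pairs:
        return None  # no consultation was assessed by both methods
    return sum(a == b for a, b in pairs) / len(pairs)


def provider_level_rates(records, method):
    """Proportion of consultations per provider in which the item was recorded."""
    counts = defaultdict(lambda: [0, 0])  # provider_id -> [performed, total]
    for _consultation_id, provider_id, m, performed in records:
        if m == method:
            counts[provider_id][0] += int(performed)
            counts[provider_id][1] += 1
    return {pid: done / total for pid, (done, total) in counts.items()}
```

With provider-level rates, a difference between two methods cannot be separated from differences between the consultations that each method happened to sample, which is why within-consultation pairing is given more weight.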

Study selection
The study selection process for each review is illustrated in Figures 1 and 2. Of 1455 records identified in the Direct Comparison Studies review, we removed 416 duplicates and screened 1039 titles and abstracts for eligibility. Thirteen studies met the pre-defined criteria for inclusion and are summarized in Table 1. Of 393 records identified in the Overview of Reviews, we screened 391 for eligibility after removing two duplicates. Eight reviews met the pre-defined criteria for inclusion and are summarized in Table 2.

Characteristics of included studies and reviews
Direct comparison studies
The characteristics of studies that directly compared quality of care assessment methods are summarized in Table 1. The studies were conducted in many LMICs worldwide, though 10 out of the 13 took place in Sub-Saharan Africa. The healthcare settings included four family planning, antenatal and post-natal care; three community care; and five outpatient care services. One study covered both family planning and outpatient care (Hermida et al., 1999). Outpatient services provided care for fever, malaria, diarrhoea, malnutrition, cough and pneumonia. Healthcare providers included doctors, nurses and nursing auxiliaries, midwives and community health workers. Six studies were carried out with adult patients and five with children. Two studies included both adult and child patients (Leonard and Masatu, 2006; Pulford et al., 2014). Included studies covered around 3600 healthcare settings and just over 21 000 clinical encounters overall. The number of healthcare providers included was not reported in 5 out of 13 papers, though the remainder included 651 healthcare providers.
Overview of reviews
We included six systematic and two narrative reviews, which are summarized in Table 2. Studies included in four of the six systematic reviews took place in HICs (USA, Canada, the Netherlands, Australia, Norway and UK). Studies included in the remaining two systematic reviews covered both HICs and LMICs and overall most were conducted in Asia or Central and South America. The six systematic reviews included 227 papers overall and these covered routine care mostly in general practice or pharmacy settings with family doctors/general practitioners, pharmacists or pharmacy staff and drug sellers. All six systematic reviews examined the use of the standardized patient method and three of these also examined direct observation and patient and provider exit interviews. Both narrative reviews examined the use of the standardized patient method in the LMIC context. They provide very detailed descriptions of issues and recommendations to be considered for adopting this method, drawing from extensive empirical evidence. Of the eight systematic and narrative reviews, quantitative comparisons between different methods were examined in one review (Hrisos et al., 2009) and we consolidated these with the direct comparison studies included in our review. We first present data from our analysis of the Direct Comparison Studies: these data summarize the quantitative comparisons between the different care quality assessment methods based on quantitative measures of agreement between the methods.

Types of head-to-head methodological comparisons made
The most common comparison was between the direct observation and patient or healthcare worker exit interview methods (n = 8 studies). Two studies compared all three methods head to head (Franco et al., 1997; Tumlinson et al., 2014). A further two studies compared different types of direct observation: Miller et al. (2015) compared direct observation with repeat examination by a third party against direct observation alone and Cardemil et al. (2012) compared direct observation with repeat examination by expert examiners vs trained observers.

Assessment tools/instruments employed in directly compared studies
A typical primary healthcare consultation can be broken down into the following processes: history taking, physical examination, diagnosis, treatment/management, advice/counselling and preventive measures (Byrne and Long, 1976). Each method can assess each of these parts of the clinical encounter and included studies typically employed checklists to facilitate these assessments. The checklists captured the required or desirable actions one would expect a healthcare worker to perform during a clinical encounter (such as asking history questions, checking a symptom, ordering a test and prescribing a medication) for a given symptom or condition. Most of the criteria had been selected in accordance with accepted local and international clinical standards. Most studies created their own scoring algorithms to score checklist criteria.

Quantitative comparisons between different methods compared head to head
Here we examine quantitative measurements of agreement between the different methods described above. Comparisons between different methods can be viewed from two perspectives. The first perspective is to assume that one method is more accurate than the other method(s), and thus the former is used as a reference standard against which the 'performance' of the other methods is judged. The second perspective is to assume that different methods are broadly similar in terms of their validity, and therefore agreement between methods is measured to inform whether one method can be used in place of the other methods. Studies included in this review adopt either or both of these perspectives and these comparisons are summarized in Table 3.
As shown in the table, many studies reported measures of 'accuracy' of one method against a reference standard, such as sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV), receiver operating characteristic (ROC) curve, and positive (LR+) and negative (LR−) likelihood ratios. Most of the studies also reported measures of agreement between methods such as percentage agreement or kappa statistics. Irrespective of the methods compared and measures reported, a common finding is that the levels of agreement between methods vary widely depending on the nature of the quality item (e.g. whether it relates to history taking, physical examination, diagnosis or giving advice) and the specific context (e.g. disease/service area, availability of medicines and diagnostic tests, patient's condition, presentation, needs and health literacy). For example, the reported agreements typically ranged from 30-60% at the lower end to over 90% at the upper end for different quality items within individual studies (Table 3). Agreement between different methods could also be influenced by methodological issues, such as the wording of survey questions and level of 'probing' when conducting the interview (Franco et al., 1997). For example, Franco et al. (1997) reported different rates for performing required tasks from exit interviews with healthcare workers when only spontaneous answers were counted compared with inclusion of answers both offered spontaneously and after probing. The latter often resulted in higher rates of reported acts (e.g. for the item 'advised (the patient) to finish treatment' (for sexually transmitted diseases): 33% using spontaneous answers, 100% using both spontaneous and probed answers, vs 65% recorded in direct observation). However, the effect of probing and discrepancies between exit interviews and direct observations also appear to be item-specific and were not uniformly observed for all items.
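For readers less familiar with these measures, the following Python sketch shows how they are conventionally derived from a two-by-two table for a single quality item, where one method is treated as the reference standard. The class name, function name and the counts used in the usage note are hypothetical and serve only as an illustration; the included studies may have used different software or variants of these statistics.

```python
from dataclasses import dataclass
import math


@dataclass
class TwoByTwo:
    tp: int  # item recorded as performed by both index method and reference
    fp: int  # index method: performed; reference: not performed
    fn: int  # index method: not performed; reference: performed
    tn: int  # item recorded as not performed by both


def _ratio(numerator, denominator):
    # Return NaN instead of raising when a measure is undefined.
    return numerator / denominator if denominator else math.nan


def agreement_measures(t):
    n = t.tp + t.fp + t.fn + t.tn
    sensitivity = _ratio(t.tp, t.tp + t.fn)
    specificity = _ratio(t.tn, t.tn + t.fp)
    p_observed = _ratio(t.tp + t.tn, n)
    # Agreement expected by chance, from the marginal totals (Cohen's kappa).
    p_expected = _ratio(
        (t.tp + t.fp) * (t.tp + t.fn) + (t.fn + t.tn) * (t.fp + t.tn), n * n
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": _ratio(t.tp, t.tp + t.fp),
        "npv": _ratio(t.tn, t.tn + t.fn),
        "lr_pos": _ratio(sensitivity, 1 - specificity),
        "lr_neg": _ratio(1 - sensitivity, specificity),
        "percent_agreement": 100 * p_observed,
        "kappa": _ratio(p_observed - p_expected, 1 - p_expected),
    }
```

For example, with hypothetical counts `TwoByTwo(tp=40, fp=10, fn=20, tn=30)` the percentage agreement is 70% while kappa is 0.4, which illustrates why raw agreement and chance-corrected agreement can give different impressions of the same item.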
We present below findings from pairwise comparisons between the methods and highlight pertinent methodological issues. In points (1) and (2) below, we first describe attempts to validate direct observation by comparing this approach with a reference standard perceived to be superior (in at least some aspects, such as a more accurate diagnosis through re-examination of the same patient by a more experienced/better-qualified person, or removing potential Hawthorne effect by using standardized patients). This is followed by comparison of patient/carer/healthcare worker exit interviews with these reference standards [points (3) and (4)].
h Sensitivity and specificity were defined in an opposite way in this study (e.g. sensitivity was defined as performance failures detected by the assessment compared with reference standard) compared with the definitions adopted in this review as described in the footnotes above; figures presented in this table have been inverted on this basis to reflect the standard definitions adopted in this review.
i Calculated based on data reported in the original paper.
j Coefficient of correlation.
(1) Validation of direct observation with re-examination and/or more experienced observers
Three studies provided evidence to attempt to validate direct observation by using an 'improved' version of this approach (see Table 3). Two of these three studies (Cardemil et al., 2012; Miller et al., 2015) presented what we term within-consultation comparisons, i.e. comparisons based on the same clinical encounter (with the same healthcare worker-patient dyad) where quality of care is assessed using the same indicators. These two studies compared direct observation without re-examination against direct observation with re-examination, and the remaining study compared direct observation made by neophyte physician observers with observation made by an experienced expert as the reference standard (Hermida et al., 1999). Detailed findings are presented in Supplementary Appendix Table SA1.
The agreement between direct observation and reference standards (as noted in Table 3) was reported in one study and ranged from 43% to 97%, with kappa statistics spanning from −0.15 to 0.92. Judged against the reference standards, and focusing only on studies reporting within-consultation comparisons, direct observation demonstrated a sensitivity between 20% and 100% and a specificity between 30% and 100%, with the area under the ROC curve ranging from 0.54 to 0.90. Direct observation showed good agreement overall with reference standards, but a pattern consistent across studies was that its performance against reference standards tended to be much lower with respect to recognition and management of severe acute illness (Supplementary Appendix Table SA1).
Overall the prevalence of appropriate/correct responses for quality items reported by standardized patients tended to be similar to or lower than that recorded through direct observation (see Supplementary Appendix Table SA1). However, the interpretation of findings from the studies requires great caution as in both studies the consultations assessed by standardized patients were not the same consultations being directly observed (i.e. the unit of analysis was providers rather than consultations), and therefore the observed discrepancies could be attributed to features of the consultations rather than the methods of assessment. One further study (Rowe et al., 2012) compared the two methods using data aggregated across consultations. The analyses did not quantify agreement between the methods and were mainly undertaken to estimate the magnitude of Hawthorne effect (described later).
(3) Patient/carer exit interview vs direct observation or standardized patient
Nine out of 13 included studies assessed exit interview of service users (Hermida et al., 1999; Bessinger and Bertrand, 2001; Franco et al., 2002; Leonard and Masatu, 2006; Onishi et al., 2011; Pulford et al., 2014; Tumlinson et al., 2014; Assaf, 2018; McCarthy et al., 2018). Seven of these studies reported within-consultation comparisons. The reference standard was direct observation in eight studies and standardized patients in Tumlinson et al. (2014). Findings of these studies are presented in Supplementary Appendix Table SA2.
Levels of agreement ranged from 23% to 99% and kappa ranged from −0.28 to 0.99. A wide range of sensitivity (33-100%), specificity (0-99%), PPV (0-96%), NPV (0-100%) and area under the ROC curve (0.61-0.77) was reported. One study (Onishi et al., 2011) examined whether the performance of exit interviews varied by types of healthcare worker (doctors, nurses or midwives) but did not find any major differences. The study by McCarthy et al. (2018) examined the potential effect of women's sociodemographic characteristics on the performance of exit interviews but also did not find an association.
We found that patients tend to remember some elements of the consultation better than others: they are more likely to remember things that are easily discernible from the encounter, such as being asked about a particular bothersome symptom (e.g. 'Have you noticed blood in your stool?'). They are also more likely to recall actions that were done to them, such as the healthcare worker asking for a stool sample or listening to their chest. Patients are much less likely to recall, or even recognize, the very technical or more abstract aspects of care, such as whether the healthcare worker washed their hands or respected their confidentiality; these are elements more accurately picked up through observation of care (Bessinger and Bertrand, 2001). Patients might also remember the working diagnosis if shared with them by the healthcare worker and if they were given any counselling or specific advice, such as coming back immediately if breathing becomes difficult, or if the healthcare worker was rude or treated them disrespectfully (Onishi et al., 2011). A further issue is that patient/carers' responses might be influenced by the wording of the questions and their understanding of the procedures carried out/advice given to them by healthcare workers (Hermida et al., 1999; McCarthy et al., 2018), and could be confounded by knowledge that they already possessed or gained elsewhere outside the consultation (Bessinger and Bertrand, 2001).
(4) Healthcare worker exit interview vs direct observation or standardized patients
Three studies assessed healthcare worker interview: two compared with direct observation (Franco et al., 1997, 2002) and one compared with standardized patients (Tumlinson et al., 2014). Findings from these studies are presented in Supplementary Appendix Table SA3. Levels of agreement ranged from 31% to 96%, with reported kappas between −0.08 and 0.60. Only Tumlinson et al. (2014) reported sensitivity (50-98%), specificity (6-83%), PPV (8-100%), NPV (5-96%), LR+ (0.6-1.0) and LR− (0.9-4.0). In all three studies, the interview with healthcare workers might not have been directly linked to the specific consultations assessed by direct observation or standardized patients; these instead seemed to be carried out with healthcare workers after a set of observations took place. Without encouraging healthcare workers to reflect on what happened with a particular patient, a healthcare worker exit interview may be providing an assessment of knowledge of care rather than actual behaviour. Therefore, this approach may actually be equivalent to asking healthcare workers to complete a vignette about the clinical case.

Descriptive comparisons between different methods
Now that we have compared methods based on measures of agreement, we turn to issues of feasibility, acceptability and practical considerations (ethics, resource use and cost) relevant to each method. We derived these data from all papers included across both elements of the review: the Direct Comparison Studies and the Overview of Reviews. Two themes emerged, which we name: method preparation and implementation, covering issues such as ethics, resources required and clinical case/selection of illnesses; and methodological issues covering validity/bias. We summarize the key issues for each of these themes in Table 4 and organize the issues according to whether they are advantages or disadvantages in the use of direct observation, standardized patients or patient/carer/healthcare worker exit interviews.
Details on the acceptability of the different methods were absent from the papers included in this review. Cost information was available in Rowe et al. (2012) and two of the reviews (Overeem et al., 2007;Kwan et al., 2019), Table 4 Pros and cons of each quality of care assessment method to guide use in LMICs

Direct observation
Standardized patients a Patient/carer/provider exit interview Pros þ Flexible-used in-person or via audio or video-recording. þ Easily transportable. þ Used for both child and adult consultations. þ Canvass a breadth of conditions. þ Structured checklists with objective criteria can remove subjectivity when coding observations. þ Reliable with either expert or trained neophyte observers.
þ Non-intrusive. þ Assesses knowledge-do gaps. þ Used extensively in pharmacies and primary care clinics in LMICs-comprehensive guidance and toolkit available to guide use in these settings. þ Not affected by Hawthorne effect. þ No social desirability bias. þ Immediate post-visit completion of assessment checklists minimizes recall bias. þ Low detection rate (<1% or 0-0-5% in recent LMIC studies). þ Low false positive rate-providers report real patients as being standardized patients-(1-6% in recent LMIC studies). þ Reliable. þ 'In-principle' consent can avoid ethical concerns. þ Used in a breadth of both common and relatively rate outpatient symptoms/conditions possible to mimic. þ Can be used with adults and for selected child conditions (e.g. malaria) with or without child present þ Flexible-data collection via questionnaire or interview. þ Not affected by Hawthorne effect. þ Straightforward to implement. þ Can be brief. þ Minimal intrusion to health facility. þ Supplements data collected using other methods. þ Reliably provides information on quality of care from the patient/spouse/carer and provider perspectives. þ Canvass a breadth of conditions. þ Easily transportable. þ Used for both child and adult consultations. þ Used across full range of primary healthcare settings. þ Patients good at recalling disrespectful treatment.

Cons
- Intrusive.
- Requires significant buy-in from a range of stakeholders.
- Limited information on acceptability amongst providers in LMICs.
- In-person observation may be impractical to use in pharmacies.
- Hawthorne effect, but people do habituate.
- Resource intensive: requires multiple highly trained observers independent of the health facility.
- Time-consuming to code observations.
- Equipment failures possible.
- Tends to assess only what the healthcare provider recommends, instead of effectiveness or appropriateness of care (but this may be possible with repeat examination).
- Need to observe high numbers to ensure enough observations to compute quality scores for relatively rare symptoms or conditions.

- Ethical debate around prior consent from healthcare providers.
- Requires significant buy-in from a range of stakeholders.
- Initial set-up resource (time, effort, finance) intensive.
- Cases require careful selection: technically feasible, ethically acceptable, and suitable to local context.
- Cannot be used for illnesses with physical signs (e.g. trauma, pregnancy) that cannot be mimicked.
- Cannot be used where there are intimate, invasive or surgical procedures.
- Requires carefully selected and highly trained standardized patients; particularly challenging if
- Standardized patients must represent 'typical' patients for the specific context to ensure credibility and thus face and content validity.
- 'First-visit' bias, leading to underestimated performance from one-time interactions; not suitable for assessing follow-up consultation of chronic conditions.
- Limited information on acceptability amongst providers in LMICs.
- Visits sometimes made to the wrong premises and healthcare providers.
- Samples of healthcare providers can be self-selected.
- Visits capped at three per day to maintain reliability of post-visit checklist.
- Need to purchase all drugs offered.

- Requires skilled field workers.
- Self-reported: affected by recall bias, social desirability bias and courtesy bias.
- Patients much less likely to recall or even recognize the very technical or more abstract aspects of care.
a The longer list of pros and cons for the standardized patient approach does not display a preference for this approach over the others; the literature on use of this method in the context of this review is significantly more comprehensive and detailed than it is for the direct observation and exit interview approaches.

but one of them (Overeem et al., 2007) only focused on HICs. Rowe et al. (2012) reported similar costs per consultation of $73.67 and $70.19 for direct observation (with re-examination) and standardized patients, respectively. Kwan et al. (2019) offered detailed discussions on budgetary considerations for planning standardized patient methods in LMICs in their comprehensive Supplementary data. There are inevitable, substantial variations in cost estimates from previous studies depending on countries, settings and type of costs included (e.g. costs of out-of-country research/advisory teams), but they highlighted that the scale and complexity of individual projects have a major impact on estimated costs per patient-provider interaction, ranging from 60-150 US dollars in a project involving 8000 interactions in India to 900-1000 US dollars in a smaller project involving around 400 interactions in South Africa. They further noted that the average cost per interaction decreased over time (in the above study in India) because the teams became more efficient as they accumulated experience and the initially higher set-up costs were divided across more subsequent interactions.
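The effect of spreading set-up costs over more interactions can be illustrated with a minimal sketch. It assumes a simple two-part cost model (one fixed set-up cost plus a constant cost per visit); the function name and all monetary inputs are hypothetical and are not taken from Kwan et al. (2019) or any other included study.

```python
def average_cost_per_interaction(setup_cost, variable_cost_per_visit, n_interactions):
    """Average cost per interaction when a fixed set-up cost is shared by all visits."""
    if n_interactions <= 0:
        raise ValueError("n_interactions must be positive")
    return setup_cost / n_interactions + variable_cost_per_visit


# Hypothetical inputs for illustration only (not from any cited study).
SETUP_COST = 200_000.0   # one-off recruitment, training and case development (US dollars)
VARIABLE_COST = 50.0     # fees, travel and purchased drugs per visit (US dollars)

for n in (400, 2000, 8000):
    cost = average_cost_per_interaction(SETUP_COST, VARIABLE_COST, n)
    print(f"{n:>5} interactions: about {cost:.0f} US dollars each")
```

Under this simplified model the average cost falls as the number of interactions grows, which is the direction of the pattern described above; real projects will also differ in their per-visit costs.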

Method preparation and implementation
Preparatory work is required for all methods before implementation in clinical care. The standardized patient method benefits from the recently published comprehensive guidance and toolkit which describes how to implement this approach in practice in LMICs and covers all of the important considerations alongside exemplars and templates (King et al., 2019; Kwan et al., 2019). Comparable guidance is not available for the direct observation or exit interview methods. Patient/carer/healthcare worker exit interviews are by far the most straightforward to implement in practice. The other approaches are complex to administer and resource intensive. Authors stressed the importance of carefully selecting and training field staff, which is key for the standardized patient method in particular. There is some debate around the ethics of the standardized patient method, though the recommended approach is to seek 'in-principle consent' from healthcare workers before visits take place, i.e. permission to be visited by a standardized patient but not being told when it will happen. Rowe et al. (2012) highlighted some ethical and practical challenges when using the standardized patient method involving children, such as minimising potential harm and discomfort for them and dealing with the relatively common occurrence of (real) acute illnesses in young children.
There is a clear trade-off between direct observation and exit interviews on the one hand and standardized patients on the other. Standardized patients have a distinct advantage over other methods because it is not necessary to wait for a case with one of the conditions of interest to present. For example, it may be necessary to screen large numbers of consultations to find one with a presenting feature such as loss of weight or a persistent cough, while each standardized patient encounter would already include a condition of interest. But the price to pay is that suitable conditions are limited to those that can be represented by a standardized patient (i.e. non-emergency conditions that do not require invasive or intimate examinations or interventions, and that do not require sequential visits to or long-term/continual care with a specific provider). Direct observation and exit interviews canvass a larger range of conditions, but these methods are likely to capture too small a number of relatively rare conditions to allow reliable assessment of quality of care.
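The screening burden behind this trade-off can be sketched with simple arithmetic. The example below assumes that consultations are independent and that the condition of interest presents with a constant probability per consultation; the function names, the 2% presentation rate and the target of 30 cases are hypothetical and do not come from any included study.

```python
def expected_consultations_needed(target_cases, prevalence):
    """Expected number of consultations to observe before target_cases are captured."""
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    return target_cases / prevalence


def prob_at_least_one_case(n_consultations, prevalence):
    """Probability that n_consultations include at least one case of interest."""
    return 1.0 - (1.0 - prevalence) ** n_consultations


# Hypothetical inputs: 2% of consultations present with the condition of interest.
PREVALENCE = 0.02
print(expected_consultations_needed(30, PREVALENCE))
print(round(prob_at_least_one_case(50, PREVALENCE), 3))
```

With a standardized patient, every visit carries the condition of interest, so under the same assumptions the number of visits equals the number of cases required.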

Methodological issues
The Hawthorne effect, which describes a change in behaviour as a result of being observed (Sommer, 1968), is a concern in direct observation. The suggestion when examined in five papers (Leonard and Masatu, 2006; Tumlinson et al., 2014; Miller et al., 2015; McCarthy et al., 2018; Rowe et al., 2012) is that it could lead to bias of the result in an upward direction (i.e. better performance than usual). Three studies attempted to quantify the Hawthorne effect (Leonard and Masatu, 2006; Rowe et al., 2012; Miller et al., 2015). Rowe et al. (2012) found a median difference of 16.4 percentage points higher (range 1.7% lower to 61.1% higher) for quality indicators assessed by direct observation compared with standardized patients. Miller et al. (2015) calculated the differences in point estimates of care quality indicators obtained from medical record review between children whose consultations were observed vs those not observed. The authors found only small differences between many of the quality indicators, most of which were not statistically significant, and concluded that the effect of being observed was negligible. However, the validity of the finding partly relies upon the accuracy of medical record review, which was found to have generally high sensitivity but low specificity in the same study. In contrast, Leonard and Masatu (2006) quantified the Hawthorne effect by comparing quality of care measures obtained through patient exit interviews that took place either before or after the research team arrived in clinic to observe care. The authors found an increase of 13 percentage points in quality of care (from baseline scores of just over 50%) at the beginning of direct observation (i.e. a Hawthorne effect). However, the initial improvements in quality gradually dissipated over time and returned to their baseline level after 10-15 observations. Therefore, one way to mitigate the Hawthorne effect might be to carry out multiple days of observations at healthcare facilities to help individuals habituate to being observed (McCarthy et al., 2018).
Exit interviews and standardized patient approaches are not affected by the Hawthorne effect, but as exit interview data are self-reported, patients/carers/healthcare workers' responses can be affected by social desirability bias, courtesy bias and recall bias (Tumlinson et al., 2014). This could again skew the data towards higher perceived quality of care. The unannounced design of standardized patient visits can reduce the risk of introducing the Hawthorne effect and/or social desirability bias.
While it might seem that healthcare workers could detect standardized patients, this has been shown to happen rarely (<1%), especially when the standardized patients blend in with the local patient demographic (Tumlinson et al., 2014). That being said, some conditions have a higher risk of discovery and Franco et al. (1997) stressed that standardized patients should be given clear instructions on when to abscond to maintain their cover. The authors used standardized patients in the context of sexually transmitted disease management and involved those who did not have the symptom (urethral discharge) for the diseases they were simulating. The danger here is that non-symptomatic standardized patients may be treated differently by healthcare workers compared with symptomatic patients. Standardized patients in this study absconded in 5 out of 20 consultations.

Discussion
Improving the quality of primary healthcare provision is an important goal for many LMICs and a current WHO priority. While recent widespread efforts have been made to assess the quality of primary healthcare in LMICs, the measurement of consultation quality remains a challenge. We reviewed the most common methods for assessing healthcare workers' clinical behaviour: direct observation, standardized patients and exit interviews. Our goal was to compare and contrast the pros and cons of each method and provide a resource to guide the selection of methods in this context in the future.

Review limitations
The findings are limited by the small number of available studies, which limits the generalizability of our quantitative comparisons. While we have focused on studies that directly compared at least two assessment methods, we are aware that there is a large body of literature in which individual methods were used singly to assess the quality of care in LMIC settings. While not providing direct comparative evidence, these may have described valuable practical lessons related to the planning and implementation of individual methods that may not have been captured in this review. This is partially compensated by our inclusion of two comprehensive narrative reviews on standardized patients, but we did not find similar reviews for direct observation and exit interviews.
The direct comparison studies we found were highly heterogeneous. Different measures were used to characterize the performance of different methods of assessment, which hinders the comparison of findings between studies. The studies were also diverse in the types of comparisons made and often did not compare the same clinical encounters or domains. We report within-consultation comparisons where available, i.e. comparisons made on the same clinical encounter where quality of care is assessed using the same indicators. Alternatively, comparisons may be made using different patients but assessing the same indicator of quality or different patients and different indicators of quality.
In this review, we have focused on comparing the fidelity of individual methods to capture what happens in individual consultations and the practical considerations in choosing between the methods. The ultimate goal in applying these methods is to ensure that quality of care can be reliably measured across the healthcare system, and that any deficiencies in the care can be detected and addressed. Evaluating the quality of care of individual consultations is therefore an essential building block but may not be sufficient on its own to achieve this goal. In order to produce reliable and comprehensive assessments, data on technical quality of care gathered from individual consultations will need to be supplemented by data describing the variation in average encounter quality at provider, facility and higher levels for any population targeted for measurement and potentially used alongside other data such as accessibility and patient experience.
Our review did not include studies investigating the use of vignettes because vignettes measure the healthcare workers' knowledge rather than actual practice, which is the focus of this review. Vignettes nevertheless remain a very important tool to establish knowledge-do gaps where problems with clinical practice are identified [see Mohanan et al. (2015)], and should therefore be considered alongside the methods considered in our review when planning a programme or research to evaluate and/or improve quality of care in the primary care setting (Peabody et al., 2000; Das and Hammer, 2005; Leonard and Masatu, 2005).
Direct observation and standardized patients are commonly considered to be 'gold standard' methods (Akachi and Kruk, 2017), though we did not find this to be the case. We found that no single method was superior to the other methods across the different contexts in our review. Each method may assess different aspects of quality of care provision and their differential strengths and weaknesses from a methodological and practical standpoint will most likely guide decisions on method selection.
We found that the accuracy and validity of an individual method for assessing quality of care are by no means fixed and may depend on the nature of the aspect/item of technical quality being assessed, but more crucially also rely on careful planning and implementation before and during the application of each method. The exact reasons behind the discrepancies in the accuracy and validity of these methods observed between different studies are not always clear and need to be investigated in further research. Until we have a better understanding, it is important that any chosen methods are cross-validated, possibly against at least one other method, in the setting in which they are to be deployed.
When comparing the accuracy of different methods in recalling what happened in consultations, within-consultation comparisons may provide the best evidence, as confounding arising from differences in the case mix and characteristics of patients between consultations is avoided. Nevertheless, there are inherent methodological challenges related to the difficulty in isolating the influence of one method (such as direct observation) from measurements made by another method. Potential interactions between methods of assessment and patient and healthcare workers' behaviours (e.g. how they react during the consultation and what they recall after the consultation) therefore need to be taken into account when interpreting data from within-consultation comparisons. One particular concern is the Hawthorne effect that may be induced by direct observation. Findings from studies included in this review suggest that the Hawthorne effect associated with direct observation of patient consultations is likely to be small or moderate and tends to dissipate over time (Leonard and Masatu, 2006; Miller et al., 2015). Approaches to minimizing a potential Hawthorne effect [such as having a longer period of observation until the care providers habituate to the presence of observers/recording mentioned earlier (McCarthy et al., 2018)] may alleviate this problem, but these will inevitably increase the resources required to undertake the observations. The standardized patient method has several advantages, including avoidance of the Hawthorne effect in direct observation and various biases associated with responses given by healthcare workers or patients. Standardized patients also overcome the difficulty in establishing a 'correct diagnosis' and hence the uncertainty in judging the appropriateness of subsequent decisions made by the healthcare worker for encounters with real patients.
In addition, standardized patients provide a means to standardize patient characteristics during the consultations being assessed, thereby alleviating or abolishing the problem of confounding. Nevertheless, a limitation inherent to this method is the type of conditions and nature of the clinical problems to which it can be applied and the many practical challenges and costs described earlier.
The acceptability of each method from the perspective of relevant stakeholders (healthcare workers, patients, health facilities, etc.) was not considered in any of the papers included in our review but is crucial for ensuring 'buy-in' and the smooth running of quality of care projects. Rethans et al. (2007) suggest that combining performance feedback with quality of care assessments may enhance perceived acceptability. Performance feedback was considered in only one of the systematic reviews we included in our review (Overeem et al., 2007) and in only a third of their included studies. However, audit combined with the provision of feedback (so-called 'audit and feedback') is a well-established and effective strategy for improving healthcare workers' clinical behaviour by making them aware of where the inconsistencies are in their clinical practice (Hysong, 2009; Ivers et al., 2012; Rowe et al., 2012), although there may be an additional cost implication and a need for a skilled facilitator if feedback is to be optimally effective.