Christopher T. Naugler, Maggie Guo, Mean Abnormal Result Rate: Proof of Concept of a New Metric for Benchmarking Selectivity in Laboratory Test Ordering, American Journal of Clinical Pathology, Volume 145, Issue 4, April 2016, Pages 568–573, https://doi.org/10.1093/ajcp/aqw041
Abstract
Objectives: There is a need to develop and validate new metrics to assess the appropriateness of laboratory test requests.
Methods: The mean abnormal result rate (MARR) is a proposed measure of ordering selectivity, the premise being that higher mean abnormal rates represent more selective test ordering. As a validation of this metric, we compared the abnormal rate of lab tests with the number of tests ordered on the same requisition. We hypothesized that requisitions with larger numbers of requested tests represent less selective test ordering and therefore would have a lower overall abnormal rate.
Results: We examined 3,864,083 tests ordered on 451,895 requisitions and found that the MARR decreased from about 25% if one test was ordered to about 7% if nine or more tests were ordered, consistent with less selectivity when more tests were ordered. We then examined the MARR for community-based testing for 1,340 family physicians and found both a wide variation in MARR and an inverse relationship between the total tests ordered per year per physician and the physician-specific MARR.
Conclusions: The proposed metric represents a new utilization metric for benchmarking relative selectivity of test orders among physicians.
There has been considerable interest in reducing unnecessary laboratory tests. In the United States and Canada, the Choosing Wisely initiative has acted as a catalyst in the discussion of low-value and inappropriately ordered laboratory tests. 1,2 A commonly encountered challenge, however, is the lack of guidance on defining tests that may be ordered “inappropriately.” Systematic reviews of this question show that inappropriate use has generally been defined as that in violation of published clinical practice guidelines (CPG), 3,4 although authors generally conclude that more work on this definition is needed. The application of CPG would appear to be the most objective assessment of appropriateness, but it often requires knowledge of patient characteristics not available through secondary laboratory data, thus limiting its use in utilization audits. A second objective measure of inappropriate laboratory utilization involves quantifying redundant test orders. Minimum reasonable retest intervals can be defined based either on practice guidelines or on physiologically reasonable time periods in which retesting should occur. 5 This metric has the advantage of relying only on secondary data within the laboratory information system (LIS). We refer to these two utilization metrics (compliance with CPG and redundant test ordering) as primary utilization metrics.
Both of these metrics are limited to certain relevant tests, highlighting the need to develop new metrics. We propose a second tier of utilization metrics, which we refer to as secondary metrics, comprising test-ordering patterns that may be benchmarked among physicians. These secondary metrics include total costs attributable to individual physicians 6 and practice variation among physicians. Interphysician variation in test ordering has not been extensively explored in laboratory utilization management but has been identified as a marker of inappropriate use in other clinical settings. 7‐12 Indeed, unexplained variation in laboratory test utilization appears to be widespread. 13‐19
Benchmarking of utilization metrics among physicians may facilitate physician profiling and audit and feedback initiatives, 20 but to date no secondary metrics can provide information on the relative appropriateness of testing. The purpose of this paper is to introduce a third secondary laboratory utilization metric, mean abnormal result rate (MARR), and show how this metric could be used in utilization management initiatives.
Material and Methods
The premise underlying the proposed metric is that many laboratory tests have an expected abnormal result rate of 5% in healthy patients. This is because the reference range of many tests is determined by calculating the central 95% interpercentile interval (or the mean ± 2 standard deviations for Gaussian distributions). 21 Therefore, if these laboratory tests were randomly ordered in healthy patients, we would expect the baseline abnormal rate to be in the 5% range. We would also expect, however, that laboratory tests are ordered preferentially in patients with a higher than random pretest probability of a positive result. Therefore, the actual abnormal rate for individual physicians would be expected to be somewhat higher than 5%. If, for example, an individual physician had a MARR of 10%, we might expect that half of the abnormal results represent true-positive results and half represent false-positive results. Likewise, if a physician had a MARR of 6%, 5/6 or 83% of positive results might be expected to be false positives. Broadly, the MARR can be expressed in numerical terms as follows:
MARR = sum of abnormal results/sum of total tests ordered
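The formula and the false-positive reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the `BASELINE_ABNORMAL_RATE` constant are assumptions introduced here, and the 5% baseline holds only for tests whose reference interval is the central 95% of a healthy population.

```python
BASELINE_ABNORMAL_RATE = 0.05  # assumed: central 95% reference interval


def marr(abnormal_results: int, total_tests: int) -> float:
    """Mean abnormal result rate: abnormal results / total tests ordered."""
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    return abnormal_results / total_tests


def expected_false_positive_share(rate: float,
                                  baseline: float = BASELINE_ABNORMAL_RATE) -> float:
    """Rough share of abnormal results expected to be false positives.

    Follows the reasoning in the text: about `baseline` of all tests are
    abnormal by chance alone, so the share is baseline / rate, capped at 1.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return min(1.0, baseline / rate)


# The two worked cases from the text: a MARR of 10% and a MARR of 6%.
print(round(expected_false_positive_share(marr(10, 100)), 2))
print(round(expected_false_positive_share(marr(6, 100)), 2))
```

The cap at 1 covers a MARR at or below the baseline, where this simple reasoning would attribute every abnormal result to chance.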
Certainly, there are times when a negative result is just as valuable clinically as a positive result. However, if the average is calculated from a large number of tests, a very low MARR (close to the expected false-positive rate in healthy volunteers) is likely to represent overall ordering of tests with a lower pretest probability. If we are comparing physicians of the same specialty group in similar practice settings, it is then reasonable to use the abnormal rate as a proxy measure for selectivity of test ordering, with the expectation that higher overall abnormal rates will represent ordering with a higher pretest probability. The novelty of this metric lies in the fact that it assesses only the pretest probability of the tests being ordered and should be insensitive to factors such as patient volume.
In developing this metric, we first needed to consider how the reference ranges of different tests were determined. Although we could likely include all laboratory tests, for the data presented here, we chose to exclude tests with reference ranges defined by disease-specific cut-offs (such as troponin or glucose). The included tests are listed in Table 1. For each of these tests, we considered a given result to be abnormal if an abnormal flag had been generated in the LIS.
Table 1. Tests included in the analysis
Alanine aminotransferase
Albumin
Alkaline phosphatase
Aspartate aminotransferase
C-reactive protein
CA 125
Calcium
Carcinoembryonic antigen
Chloride
CO2
Complete blood count
Creatine kinase
Creatinine
D-dimer
Direct bilirubin
Estradiol
Ferritin
Follicle-stimulating hormone
γ-Glutamyl transferase
Iron
Lactate dehydrogenase
Lipase
Luteinizing hormone
Magnesium
Phosphate
Potassium
Progesterone
Rheumatoid factor
Sedimentation rate
Sodium
T3 free
T4 free
Thyroid-stimulating hormone
Total bilirubin
Total iron-binding capacity
Total testosterone
Transferrin saturation
Urate
Urea
As a test of the external validity of this metric, we hypothesized that laboratory requisitions with larger numbers of test requests would represent less selective test ordering and should therefore show lower mean rates of abnormal results. To test this we examined test-ordering data from Calgary Laboratory Services, the sole provider of laboratory services to Calgary and surrounding areas of south-central Alberta, Canada (population of approximately 1.4 million). This analysis was considered to constitute quality assurance and did not require formal research ethics approval. We restricted our analysis to outpatient laboratory test requests ordered by family physicians. We first searched our laboratory information system for test results for the 39 tests in Table 1 that were reported between April 1, 2013, and March 31, 2014. To reiterate, these tests were chosen because they have a reference interval defined by the healthy population interpercentile interval and not a disease-specific cut-off. For complete blood counts, we considered the entire test to be abnormal if any of the constituent indices were abnormal. To avoid pseudoreplication, we included only the first testing presentation for any individual patient during the study period. We then calculated the abnormal rate for all tests ordered on the same requisition and plotted this against the total number of tests ordered on the requisition. Additional tests not meeting our inclusion criteria may have been ordered on the same requisition, but for the purposes of this evaluation we only counted tests meeting the inclusion criteria.
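The requisition-level analysis described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the study's actual pipeline: the `Result` record layout is hypothetical, requisition IDs are assumed to be unique across patients, and each record is assumed to carry a test-level abnormal flag (so a complete blood count has already been marked abnormal if any of its indices was abnormal).

```python
from collections import defaultdict
from typing import NamedTuple


class Result(NamedTuple):
    patient_id: str
    requisition_id: str
    collected: str      # ISO date, so string order equals date order
    test: str
    abnormal: bool


def marr_by_requisition_size(results, included_tests):
    """Return {tests on requisition: abnormal rate} for first presentations."""
    # Only tests meeting the inclusion criteria are counted.
    rows = [r for r in results if r.test in included_tests]

    # First testing presentation per patient: the earliest requisition.
    first = {}
    for r in rows:
        key = (r.collected, r.requisition_id)
        if r.patient_id not in first or key < first[r.patient_id]:
            first[r.patient_id] = key

    # Count tests and abnormal results on each retained requisition.
    per_req = defaultdict(lambda: [0, 0])  # requisition -> [tests, abnormal]
    for r in rows:
        if first[r.patient_id][1] == r.requisition_id:
            per_req[r.requisition_id][0] += 1
            per_req[r.requisition_id][1] += int(r.abnormal)

    # Pool requisitions that have the same number of tests.
    pooled = defaultdict(lambda: [0, 0])
    for tests, abnormal in per_req.values():
        pooled[tests][0] += tests
        pooled[tests][1] += abnormal
    return {n: abn / tot for n, (tot, abn) in sorted(pooled.items())}


# Made-up records for illustration only.
demo = [
    Result("p1", "r1", "2013-04-02", "Ferritin", True),
    Result("p1", "r2", "2013-06-01", "Ferritin", False),  # later visit, dropped
    Result("p2", "r3", "2013-05-10", "Sodium", False),
    Result("p2", "r3", "2013-05-10", "Potassium", False),
    Result("p2", "r3", "2013-05-10", "Glucose", True),    # excluded test
    Result("p3", "r4", "2013-07-01", "Urate", False),
]
included = {"Ferritin", "Sodium", "Potassium", "Urate"}
print(marr_by_requisition_size(demo, included))
```

Excluded tests are filtered out before counting, so the requisition size reflects only tests meeting the inclusion criteria, as the text specifies.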
Secondly, we hypothesized that there would be variation in the MARR among family physicians ordering laboratory tests in the outpatient setting. To test this, we used the same data to generate a distribution of MARRs for 1,340 family physicians practicing in Calgary.
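A per-physician MARR distribution of the kind described above could be built as in the sketch below. The input shape, one `(physician_id, abnormal)` pair per test, and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def marr_by_physician(results):
    """results: iterable of (physician_id, abnormal) pairs, one per test."""
    counts = defaultdict(lambda: [0, 0])  # physician -> [tests, abnormal]
    for physician_id, abnormal in results:
        counts[physician_id][0] += 1
        counts[physician_id][1] += int(abnormal)
    return {doc: abn / tot for doc, (tot, abn) in counts.items()}


# Made-up data: physician A orders 4 tests, physician B orders 10.
demo_tests = ([("A", True)] + [("A", False)] * 3
              + [("B", True)] + [("B", False)] * 9)
rates = marr_by_physician(demo_tests)
print(rates)
print("median MARR:", median(rates.values()))
```

The resulting values can then be binned into a histogram to inspect the spread among physicians.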
Finally, we hypothesized that physicians ordering larger numbers of tests overall (and not just larger numbers of tests on a single requisition) would have lower overall MARRs. This assessment is complicated by the fact that both the number of tests per requisition and practice volume affect the total number of tests ordered. However, an inverse relationship between MARR and total tests ordered would argue against the possibility that physicians with high MARRs and low numbers of tests per requisition are simply sending their patients to the lab more frequently for small numbers of tests.
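The relationship between yearly test volume and physician-specific MARR can be quantified with a Pearson correlation coefficient. The sketch below implements the textbook formula directly; the function name and the sample figures are made up for illustration and are not study data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical physicians: yearly test volume and yearly MARR.
yearly_tests = [500, 1000, 2000, 4000]
yearly_marr = [0.12, 0.10, 0.09, 0.07]
print(pearson_r(yearly_tests, yearly_marr))  # negative: inverse relationship
```

A negative coefficient means that higher test volumes go with lower MARRs, which is the pattern the hypothesis predicts.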
Results
Between April 1, 2013, and March 31, 2014, our laboratory reported 3,864,083 individual tests meeting the inclusion criteria (tests ordered by family physicians and collected in the community), performed on 451,895 individual patients. Of these tests, 330,452 were abnormal (outside the reference range), giving an overall MARR of 8.55%. Figure 1 shows that requisitions with only a single test ordered were abnormal just over 25% of the time. The abnormal rate of individual tests decreased in a roughly linear fashion with increasing number of tests until levelling off near the baseline after about 9 tests (Figure 1). Overlaid on the MARR in Figure 1 is the distribution of the number of tests per requisition.

Figure 1. MARRs (test value outside of the reference range) for requisitions requesting up to 29 tests (open circles). Vertical lines are the 95% error bars. Also shown on the Y-axis is the total number of requisitions with a given number of tests requested (closed circles). The data represent all tests ordered by 1,340 family physicians in Calgary on outpatient (community) patients over a one-year period. Requisitions requesting more than 9 tests had low overall average abnormal result rates, suggesting that these tests were being requested less selectively.
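The caption does not say how the 95% error bars were computed. One common choice, assumed here purely for illustration, is the normal-approximation (Wald) interval for a proportion. The sketch applies it to the overall counts reported in the Results; the function name is hypothetical.

```python
from math import sqrt


def wald_interval(abnormal: int, total: int, z: float = 1.96):
    """Normal-approximation 95% interval for an abnormal result rate."""
    if total <= 0 or not 0 <= abnormal <= total:
        raise ValueError("need 0 <= abnormal <= total and total > 0")
    p = abnormal / total
    half_width = z * sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] because a rate cannot fall outside that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Overall counts reported in the Results section.
low, high = wald_interval(330_452, 3_864_083)
print(f"{low:.4%} to {high:.4%}")
```

With millions of tests the interval is very narrow. For requisition sizes with few observations, the Wald interval becomes unreliable and a Wilson interval would be a safer choice.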
Figure 2 shows that there was a wide variation in MARR among individual physicians with a roughly normal distribution but with a long tail of physicians with higher MARRs. Finally, Figure 3 shows that there was a statistically significant inverse relationship between the total yearly number of tests ordered by individual family physicians and the yearly MARR of that physician.

Figure 2. Distribution of yearly MARRs for all outpatient (community) tests ordered by 1,340 family physicians in Calgary, Canada.

Figure 3. Comparison of total yearly outpatient (community) test volumes vs yearly MARR for 1,340 family physicians in Calgary, Canada. There is a statistically significant inverse relationship (Pearson r = −0.13, P < .001).
Discussion
An average abnormal rate of 8.55% was observed in a large sample of over 3 million individual tests. We observed a striking relationship between the abnormal test rate and the number of tests ordered. As predicted, the abnormal rate decreased with increasing number of tests ordered on the same requisition. Requisitions with greater than 9 tests ordered showed an abnormal rate only marginally above that expected in healthy volunteers. Furthermore, in a large sample of tests ordered by family physicians on community patients, we observed a wide variation in MARR. Finally, there was an inverse relationship between total yearly test volume ordered by family physicians and the yearly MARR.
The high interphysician variation in MARR reported in this study is consistent with previously reported unexplained practice variation. 6,17 For example, among the same group of family physicians, we recently reported that the total cost of laboratory testing attributable to individual physicians varies widely, with a coefficient of variation of 110%. 6 The underlying reasons for this variance are unclear and will be the subject of future research by our group.
Our results may have implications for future laboratory utilization management initiatives. One of the necessary steps in any utilization management initiative is the identification of low-value testing. As discussed earlier, this may be done directly by measuring adherence to CPG and identifying instances of redundant test ordering, or it may involve secondary measures such as identification of practice variance or overall test volumes. We suggest that the MARR could be used as an additional benchmark to compare the relative selectivity of test ordering by individual physicians. This could be accomplished by calculating MARRs for individual providers and providing these data to ordering physicians as a form of audit and feedback. Additional data, however, are needed to determine specialty-specific “normal” MARRs and to demonstrate how MARR varies geographically. The fact that a larger number of tests on a given requisition is related to lower MARRs suggests that MARR may in fact be related to overall test volumes. Therefore, utilization management initiatives designed to reduce unnecessary testing may also increase MARR as a consequence. It should also be noted that while a higher MARR may indicate more selective test ordering, there is likely an upper limit to the MARR above which indicated tests may not be ordered. Therefore, further work in multiple jurisdictions will be needed to define the expected MARRs for different physician and patient groups.
Our data also suggest a simpler approach: As requisitions with greater than 9 tests showed a MARR only marginally above the expected baseline, requisitions with tests above this number could serve as a marker of lower-value testing and could form the basis for utilization management initiatives. A simple example would be to append a laboratory comment to the results of requisitions with a large number of tests, advising the physician of the low abnormal rate when large numbers of tests are ordered.
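The comment-appending idea above amounts to a simple threshold rule, sketched below. The threshold of 9 tests comes from the text; the comment wording and the function name are hypothetical, and any real deployment would depend on the capabilities of the LIS in use.

```python
LARGE_REQUISITION_THRESHOLD = 9  # from the text: more than 9 tests

# Hypothetical wording, for illustration only.
COMMENT = ("Note: requisitions with a large number of tests have a low "
           "abnormal result rate, close to that expected by chance.")


def comment_for_requisition(test_count: int,
                            threshold: int = LARGE_REQUISITION_THRESHOLD):
    """Return the advisory comment for large requisitions, else None."""
    if test_count < 0:
        raise ValueError("test_count must not be negative")
    return COMMENT if test_count > threshold else None


for count in (3, 9, 10):
    print(count, comment_for_requisition(count) is not None)
```

Only tests meeting the inclusion criteria would be counted toward the threshold, to match how the requisition sizes were measured in this study.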
The proposed approach suffers from the weaknesses inherent in other forms of secondary metrics, namely that the metric can act as a marker of potentially lower-quality ordering but cannot definitively identify low-quality practices. There are several other important caveats to the proposed metric. The first is that, like any form of benchmarking, the physicians must be as similar in practice characteristics as possible (eg, same specialty, same practice setting). In this paper we compared family medicine practitioners ordering tests on community patients. Ideally it would be best to also exclude family physicians with restricted or specialized practices, but we were unable to do that in this preliminary analysis. The second important point to consider is that we have not defined what the most appropriate MARR would be in a given setting. For example, it is reasonable to assume that lower MARRs represent less selective test ordering, with a concomitant increase in false-positive results, but it is also likely that very high MARRs may also represent overly selective test ordering with a failure to detect clinically significant positive results. So again, the best current use of this metric may be to provide information to practicing physicians as to their ordering practices relative to their peer group.
Future work should address further validation of this metric. This could include practice audits to identify optimal test ordering practices and derive optimal MARRs based on the subsequent results. Analysis is also needed to determine if one or a small number of specific tests could easily be targeted for analysis of the MARR. In this manuscript, we chose commonly ordered tests with a reference range determined by a central 95% interval of the healthy population. However, different definitions could be used, especially by incorporating tests known to be overused. For example, tests identified by Choosing Wisely such as vitamin D could be included. The exact mix of tests used is therefore somewhat arbitrary. As the MARR is intended to be used as a benchmark to compare ordering practices among physicians, the important point is that all physicians be compared using the same list of tests.
Mention should also be made here of the handling of multiple tests contained within test panels. Again, it must be stated that the MARR is intended to be used to benchmark physicians or larger entities such as group practices or communities. Within a given laboratory, certain test groups may be available as panels. However, the panel compositions will vary among laboratories. If we consider just the several hundred most commonly ordered tests, the number of possible combinations of these into panels is effectively unlimited. Even within common panels such as “liver function,” the specific tests included will vary among laboratories. We therefore felt that the most reasonable way to address panel tests was to consider them as individual tests for the purpose of calculating MARR.
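Treating panel tests as individual tests can be sketched as an expansion step applied before counting. The panel composition below is hypothetical, since the text notes that compositions vary among laboratories, and the choice to count a test only once when it is ordered both alone and within a panel is an assumption of this sketch.

```python
# Hypothetical panel definition; real compositions differ between laboratories.
PANELS = {
    "Liver function": [
        "Alanine aminotransferase",
        "Aspartate aminotransferase",
        "Alkaline phosphatase",
        "Total bilirubin",
    ],
}


def expand_order(ordered, panels=PANELS):
    """Expand panels into their constituent tests, keeping the first
    occurrence of each test and preserving order."""
    expanded = []
    for item in ordered:
        expanded.extend(panels.get(item, [item]))
    # dict.fromkeys removes repeats while keeping insertion order.
    return list(dict.fromkeys(expanded))


order = ["Liver function", "Alanine aminotransferase", "Ferritin"]
print(expand_order(order))
```

Each test in the expanded list then contributes one result to the numerator and denominator of the MARR, regardless of how the laboratory bundles its panels.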
Despite these shortcomings, the proposed metric represents the first opportunity to directly assess the relative selectivity of laboratory testing without the need to resort to chart audits to determine the pretest probability of a positive result. Further work on this metric is planned to determine if audit and feedback will alter test-ordering practices.
Acknowledgments
We wish to thank Dr Leland Baskin for comments on an earlier version of this manuscript. This work was supported by CIHR Foundation Scheme funding to Dr Naugler.
References