“The art of the practice of medicine is to be learned only by experience; ‘tis not an inheritance; it cannot be revealed. Learn to see, learn to hear, learn to feel, learn to smell, and know that by practice alone can you become expert.”
Sir William Osler (1)
The accuracy of screening mammography relies on a human interpreter. It seems intuitive that high levels of accuracy would occur for radiologists who routinely interpret high volumes, who have many years of experience, and who audit the outcomes of the breast biopsies performed at their recommendation. This statement, although based on common sense, is not supported by the analyses presented by Beam et al. (2) in this issue of the Journal.
Beam et al. studied the association between self-reported annual interpretation volume and radiologist accuracy in screening mammography. The authors asked 110 U.S. radiologists to interpret a test set of 148 mammograms (64 of the women had breast cancer) in a controlled and timed environment. “Recent reading volume” and other radiologist-specific information such as “years of practice” were collected via surveys. Surprisingly, Beam et al. found no statistically significant association between volume and accuracy.
Just 1 year ago, Esserman et al. (3) reported that both sensitivity and specificity are better among high-volume readers (>5000 mammograms per year) than among low-volume readers. In the accompanying editorial “Does Practice Make Perfect When Interpreting Mammography?” (4), we commented that practice probably does improve accuracy, but it may not make us perfect. Efforts to optimize the human interpretation of screening mammography examinations should consider the entire context of clinical medicine, including levels of ambiguity in clinical decision making, fear of medical malpractice, financial rewards, and characteristics of the population screened and of overall health care systems.
The study by Beam et al. and similar studies identify important issues that warrant further discussion, including how to define and measure the experience of a radiologist, whether performance using a test set of films reflects what occurs in clinical practice, and how to appropriately model the relationships between accuracy and volume.
What Defines an Experienced Mammographer?
How should we define the “experience” of physicians who interpret mammograms? Should the self-reported “recent volume” or the number of years spent reading mammograms be used, as was done by Beam et al.? Should the lifetime number of mammograms interpreted be used? What about continuing medical education hours obtained or the auditing and evaluation of one’s own performance? All these factors likely play a role, because experience is a multidimensional construct not adequately described by a single measure such as annual volume. For example, a radiologist might have read more than 5000 films in the “recent” year but may have only 1 year of practice and might not perform auditing. How does this radiologist’s experience compare with that of a radiologist with 20 years of experience as a specialist in mammography who has read more than 5000 films in every year except the “recent” one, when he or she changed from full-time to part-time practice and read only 500 films? These are complex questions for which there may be no perfect answer.
Do Artificial Test Environments Reflect the Real World?
Test sets often contain a higher prevalence of disease than is seen in usual practice, which can bias the assessment of accuracy (often referred to as context bias) (5). As Beam et al. indicated, two to six cases of breast cancer are typically detected per 1000 mammograms in a screened population, which is quite different from the test set studied by Beam et al., in which breast cancer was present in 64 of the 148 films (rates of <1% versus 43%, respectively). Regardless of how the films were selected for this study (randomly identified from a population-based sample stratified by disease state and age), the approach does not represent a population-based study, and thus the results cannot be translated to mammography as it is practiced in the United States (or elsewhere). Study radiologists were instructed that the test set did not have the mixture of mammograms expected from a typical screening population. Beam et al. state that “this instruction adequately controls for context bias” and refer to unpublished data. Our review of these data does not allay our concerns about context bias.
Participating radiologists knew of the higher prevalence of cancers in the test set, so they may have called more films positive, leading to more correct diagnoses (i.e., a higher sensitivity or cancer detection rate) than would occur in the community. This possibility is suggested by the reported sensitivity of 91%, which is much higher than recently published figures from community sites across the United States (75%) (6). In a test situation, there may also be less concern about false-positive readings, because no women will undergo unnecessary diagnostic evaluations and biopsy examinations. Calling more films positive would lead to a lower specificity, which was also noted in the study by Beam et al. (specificity of 70% versus 92% in the community). Although some differences in accuracy between the two settings are expected because of differences in patient populations (i.e., age or breast density) and analytic definitions, the large differences observed here suggest that context bias may be a problem.
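The practical consequence of this prevalence gap can be illustrated numerically. The short Python sketch below applies standard Bayes’ rule (this calculation is ours, not from Beam et al.) to the sensitivity and specificity figures quoted above, showing how different the predictive value of a positive reading is at test-set prevalence (43%) versus screening prevalence (~5 per 1000):

```python
def ppv(sens, spec, prev):
    """Positive predictive value of a positive reading via Bayes' rule."""
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    return true_pos / (true_pos + false_pos)

# Test-set performance reported by Beam et al.: sens 0.91, spec 0.70,
# prevalence 64/148 ~ 0.43
test_set_ppv = ppv(0.91, 0.70, 0.43)      # ~0.70

# Community performance (6): sens 0.75, spec 0.92, prevalence ~5 per 1000
community_ppv = ppv(0.75, 0.92, 0.005)    # ~0.045
```

At 43% prevalence, roughly 7 of 10 positive calls are correct; at screening prevalence, fewer than 1 in 20 are, which is one reason a reader aware of a cancer-rich test set may rationally shift toward calling more films positive.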
Perhaps the best way to test for context bias and clinical applicability of test set findings would be to compare accuracy indices identified in the test situation with accuracy indices in actual practice settings. This approach has been previously taken in the field of mammography and, interestingly, no association was found between accuracy in the artificial test environment and accuracy in actual practice (7).
Complex Statistical Models—Should We Believe Them?
Analysis of the data collected by Beam et al. is methodologically challenging because radiologists read the same films; thus, the correlation within both women (films) and radiologists must be taken into account for proper inference. The authors used a creative approach of reducing the repeated measures within each radiologist to a single observation, the area under the receiver operating characteristic (ROC) curve, and used this value as the outcome in a multiple linear regression model. Bootstrapping was used to estimate confidence intervals. That some of the resulting 95% confidence intervals do not include the point estimate illustrates the complexity of the models and data and raises some concerns about the methods.
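The general approach of collapsing one reader’s film-level ratings to a single area under the ROC curve, with a percentile bootstrap for the confidence interval, can be sketched as follows. This is a minimal illustration with hypothetical 1–5 suspicion ratings; Beam et al.’s actual resampling scheme and model are more elaborate and may differ (e.g., in what is resampled):

```python
import random

def auc(pos_scores, neg_scores):
    """AUC as the Mann-Whitney probability that a cancer film
    outranks a cancer-free film (ties count one half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def bootstrap_ci(pos, neg, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for one reader's AUC,
    resampling films with replacement within each disease state."""
    rng = random.Random(seed)
    stats = sorted(
        auc(rng.choices(pos, k=len(pos)), rng.choices(neg, k=len(neg)))
        for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Hypothetical suspicion ratings for a single reader:
cancers = [4, 5, 3, 5, 4]     # films with cancer
normals = [1, 2, 2, 3, 1]     # cancer-free films
point = auc(cancers, normals)
lo, hi = bootstrap_ci(cancers, normals)
```

With a correctly implemented percentile bootstrap of this kind, the interval should bracket the point estimate except in pathological samples, which is why intervals excluding the estimate draw attention to the modeling.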
Although it is important to adjust for potential confounders, Beam et al. may have overfit their models by including 20 variables (8,9). In addition, multicollinearity may be a problem because many of the variables are likely to be highly correlated (e.g., years reading mammograms and years since residency). These factors could diminish the statistical power needed to address study questions.
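One simple diagnostic for the multicollinearity concern is to inspect pairwise correlations (or variance inflation factors) among candidate predictors before fitting. A minimal sketch, using hypothetical values standing in for two of the survey variables named above:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical survey responses: the two "experience" measures move together.
years_reading_mammograms = [3, 8, 12, 20, 25, 6, 15]
years_since_residency = [4, 10, 13, 22, 28, 7, 16]
r = pearson_r(years_reading_mammograms, years_since_residency)
# |r| near 1 signals that entering both predictors is largely redundant
# and will inflate the variance of the estimated coefficients.
```

When predictors are this strongly correlated, dropping one, or combining them into a single composite, preserves degrees of freedom in a model fit to only 110 radiologists.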
Additional analyses of the relationship between accuracy and volume, the focus of this study, would have provided stronger support for the main study finding. For example, classifying volume into a small number of meaningful categories and reporting the sensitivity and specificity by volume category would have provided a simple way to test for associations. Using a categorical approach to further assess volume may have also revealed an undetected association with accuracy, because the relationship between accuracy and volume may not be linear, and a few outlier physicians who reported interpreting more than 5000 mammograms per year may have influenced the results. It is possible that there is a clinically relevant threshold effect of volume (i.e., a minimum volume may be necessary to achieve high levels of accuracy with no additional improvement possible after a certain volume).
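The categorical analysis proposed above is straightforward to carry out. The sketch below uses hypothetical per-radiologist data and bin boundaries of our choosing (the 480-film U.S. floor and the 5000-film high-volume cutoff discussed in this editorial); a threshold effect would appear as a jump between the lowest category and the rest, followed by a plateau:

```python
from statistics import mean

# Hypothetical per-radiologist data: (annual volume, test-set sensitivity).
readers = [(480, 0.82), (900, 0.85), (2500, 0.90), (3000, 0.89),
           (5200, 0.91), (7000, 0.90), (1200, 0.87), (4800, 0.92)]

def volume_category(v):
    """Collapse annual volume into a few clinically meaningful bins."""
    if v < 1000:
        return "<1000"
    if v < 5000:
        return "1000-4999"
    return ">=5000"

by_cat = {}
for vol, sens in readers:
    by_cat.setdefault(volume_category(vol), []).append(sens)

# Mean sensitivity per volume category; a plateau above some bin
# would be consistent with a threshold rather than a linear effect.
summary = {cat: round(mean(vals), 3) for cat, vals in sorted(by_cat.items())}
```

Reporting sensitivity and specificity this way also blunts the influence of a few outlier readers with very high self-reported volumes, since they cannot pull on a fitted slope.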
What Should Women Do?
What can women do to ensure that they obtain the most accurate mammograms? If possible, women undergoing screening mammography should go to the same facility and/or ensure that prior films are available for comparison, because this can reduce unnecessary recalls (10,11). For a menstruating woman, timing her mammographic examination during the follicular phase of her menstrual cycle (first and second week) may improve accuracy (12,13). Women taking hormone replacement therapy (HRT) should be aware of possible increases in breast density that could reduce mammographic accuracy and result in the need for additional imaging or breast ultrasound (6,14). These women might want to reconsider the possible benefits of HRT in light of this and other possible risks (15). Women should understand that compression of the breast is important (even though this can be uncomfortable and sometimes painful) and that holding still during the examination will reduce motion artifact.
What Should Radiologists Do?
Currently, the minimum volume of mammographic interpretation required in the United States is 480 mammograms per year. This requirement is vastly different from those in other countries, such as the United Kingdom, which requires more than 5000 interpretations per year (16). Although the findings of Beam et al. certainly do not support the need to increase the volume requirement in the United States, it is too early to take comfort in our current interpretation volume standard.
In light of these complex issues and conflicting results, additional studies are warranted. We still suspect that reading high volumes of films annually in conjunction with auditing and continuing education programs is the best approach to obtain and maintain radiologist expertise and thereby increase the accuracy of mammography.