fMRI promises to uncover the functional structure of the brain. I argue, however, that pictures of ‘brain activity’ associated with fMRI experiments are poor evidence for functional claims. These neuroimages present the results of null hypothesis significance tests performed on fMRI data. Significance tests alone cannot provide evidence about the functional structure of causally dense systems, including the brain. Instead, neuroimages should be seen as indicating regions where further data analysis is warranted. This additional analysis rarely involves simple significance testing, and so justified skepticism about neuroimages does not provide reason for skepticism about fMRI more generally.
Neuroimages Are Statistical Maps
The Skeptical Argument
Evidence and neuroimages
The problem of causal density
The problem of arbitrary thresholds
The problem of vague alternatives
Skepticism Is Due to NHST
Neuroimages versus Neuroimaging
Neuroimages (colorized pictures of ‘brain activity’) are the best-known products of fMRI experiments.1 They are often taken to be evidence for functional hypotheses: that is, evidence that a given brain region plays a particular causal role during the performance of a cognitive task.2
I will argue that neither neuroimages nor what they depict provides evidence for functional hypotheses. Further, I will argue that skepticism about neuroimages can be grounded in well-known problems with the use of null hypothesis significance testing (NHST). The problems with neuroimages are thus conceptual, rather than merely practical, and cannot be easily avoided. In this sense, I am adding to a long-established skeptical tradition in the philosophical literature on neuroimaging.3
Yet this does not mean that we should be skeptical about neuroimaging—that is, about fMRI and the associated techniques. The overwhelming majority of contemporary fMRI experiments present more evidence than is presented in neuroimages. This evidence rarely consists of simple NHSTs. As such, this further evidence is not touched by skepticism about neuroimages. In most cases, fMRI provides precisely the sort of evidence that opponents of NHSTs would urge us to seek.
I conclude that we should view neuroimages as auxiliaries to evidence, rather than evidence proper. Neuroimages indicate brain regions in which further analysis may provide warranted and fruitful evidence for functional hypotheses. It is this further analysis that provides the evidence, rather than the neuroimages themselves. Neuroimaging may thus remain a fruitful technique even if the status often attributed to neuroimages is unjustified.
Neuroimages Are Statistical Maps
Differences in brain activity when subjects perform different cognitive tasks might be thought, all things being equal, to provide evidence for the functional role of brain areas. It is this insight that drives much contemporary neuroimaging, and many take neuroimages to provide evidence for functionally relevant brain activity.
fMRI works by tracking the changes in blood oxygenation that occur after increased local brain activity. These changes in local oxygenation can be detected by a properly sequenced MRI scanner, and provide an indirect measure of increased neural activity.4 Changes in this blood-oxygen-level dependent (BOLD) MR signal are the primary data produced by fMRI.
That fMRI is an indirect measure is in itself unremarkable, and should not engender skepticism. Neuroimages are not simple pictures of BOLD signal differences, however. Quantitative signal magnitudes are effectively uninterpretable on their own, as there is no general mapping from BOLD signal to functional significance of neural activity.5 Further, the BOLD differences associated with brain activity are small, noisy, and temporally complex. In lieu of quantitative information, neuroimages instead show maps of regions where there was a statistically significant difference in BOLD signal between task conditions.
To produce such maps, the BOLD signal in each subregion is subjected to an NHST between conditions of interest. NHSTs consist of two steps. First, one computes the probability of observing data at least as extreme as the data actually obtained if the null hypothesis were true. The null hypothesis is the proposition that an experimental condition had no real effect on the observed MR signal, and so that the neural activity in a region remained unchanged while the subject performed different cognitive tasks. This value, representing the probability of such data conditional on the null hypothesis, is referred to as the p-value. In the second step, one compares the p-value to a predetermined significance level $\alpha$. If the p-value is lower than $\alpha$, the data are statistically significant. A significant result is one that would be unlikely to be observed if the null hypothesis were true: with an $\alpha$ of 0.01, for example, one could expect to see significant data in only about 1 of 100 observations in regions where the null hypothesis was true.
Neuroimages are produced by performing NHSTs in each three-dimensional subregion (or voxel) of the data.6 The results are plotted as statistical parametric maps (SPMs). SPMs show those voxels in which the p-value for that region was significant, i.e., less than or equal to the significance level $\alpha$. Typically, a range of colors is used to represent the magnitude of the p-value calculated for a region, with brighter colors indicating lower p-values. SPMs thus summarize the results of thousands of simultaneous significance tests, showing the areas in which our data permit us to reject the null hypothesis of no difference in activation between conditions.
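The voxel-wise procedure just described can be illustrated with a small simulation. What follows is a minimal sketch rather than a real SPM pipeline: it assumes Gaussian noise of known variance and a simple two-sample z-test in place of the general linear model used in practice, and the names `two_sided_p` and `statistical_map` are hypothetical. With no real effect anywhere, roughly a fraction equal to the significance level of the voxels should still pass the threshold.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(task, rest, sigma=1.0):
    """Two-sided z-test p-value for a difference in means.

    Assumption for illustration: the noise standard deviation sigma is known.
    """
    se = sigma * (1.0 / len(task) + 1.0 / len(rest)) ** 0.5
    z = (mean(task) - mean(rest)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def statistical_map(n_voxels, n_scans, effect, alpha, rng):
    """Run one significance test per voxel; True marks a suprathreshold voxel."""
    flags = []
    for _ in range(n_voxels):
        task = [rng.gauss(effect, 1.0) for _ in range(n_scans)]
        rest = [rng.gauss(0.0, 1.0) for _ in range(n_scans)]
        flags.append(two_sided_p(task, rest) <= alpha)
    return flags


# Every simulated voxel here satisfies the null hypothesis (effect = 0).
rng = random.Random(0)
null_map = statistical_map(n_voxels=5000, n_scans=20, effect=0.0, alpha=0.01, rng=rng)
fraction_significant = sum(null_map) / len(null_map)
print(f"fraction of null voxels passing threshold: {fraction_significant:.4f}")
```

Setting `effect` to a nonzero value in the call shows how the same map changes when a real difference is present.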
Neuroimages are SPMs overlaid on anatomical images of subjects' brains. Neuroimages are not maps of activation per se, but rather maps of places where we may be confident that the resemblance between data and a stereotyped pattern of activation is unlikely to be the result of chance fluctuations from a true zero signal. So, neuroimages do not show differential activity. They show places where (ceteris paribus) the data warrant confident assertion of a pattern of differential activity.
Many people, especially nonspecialists, take neuroimages to be especially good evidence for functional claims (Dumit ; McCabe and Castel ). Working scientists are typically more cautious. Nevertheless, I argue that they usually take neuroimages (and what they represent) to be at least weak evidence for functional claims (see Mole and Klein [forthcoming] for a defense and discussion of this claim). I argue that this is mistaken: neuroimages do not provide even weak support for functional hypotheses.
The Skeptical Argument
Evidence and neuroimages
fMRI evidence results from a chancy sampling of the world, and requires a probabilistic analysis. I will assume that an updating of odds on a functional hypothesis Ha relative to a null hypothesis of functional unimportance H0 given some evidence D is rational just in case it accords with Bayes' theorem in its odds form:

$$\frac{P(H_a \mid D)}{P(H_0 \mid D)} = \frac{P(D \mid H_a)}{P(D \mid H_0)} \times \frac{P(H_a)}{P(H_0)}$$

The likelihood ratio $P(D \mid H_a)/P(D \mid H_0)$ gives a measure of the degree to which D supports Ha over H0. A likelihood ratio greater than 1 indicates confirmation, while a ratio less than 1 indicates infirmation. Whether neuroimages are appropriate for confirming functional hypotheses thus requires consideration of three factors: the nature of the evidence D, and facts about the conditional probabilities $P(D \mid H_a)$ and $P(D \mid H_0)$. We will rarely be able to put precise numbers on the latter probabilities, but we can say useful things about the rough relationship between them.
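As a rough illustration of how the likelihood ratio drives this kind of updating, consider the following sketch. The function names and all numerical values are hypothetical, chosen only to show the arithmetic; nothing here is estimated from imaging data.

```python
def likelihood_ratio(p_d_given_ha, p_d_given_h0):
    """Ratio of the probability of D under Ha to its probability under H0."""
    if p_d_given_h0 <= 0.0:
        raise ValueError("P(D | H0) must be positive for the ratio to be defined")
    return p_d_given_ha / p_d_given_h0


def posterior_odds(prior_odds, p_d_given_ha, p_d_given_h0):
    """Posterior odds on Ha over H0: prior odds times the likelihood ratio."""
    return prior_odds * likelihood_ratio(p_d_given_ha, p_d_given_h0)


# Hypothetical values: D is much more probable under Ha than under H0.
informative = posterior_odds(prior_odds=1.0, p_d_given_ha=0.8, p_d_given_h0=0.1)

# Hypothetical values: D is about equally probable under both hypotheses.
uninformative = posterior_odds(prior_odds=1.0, p_d_given_ha=0.9, p_d_given_h0=0.8)

print(informative, uninformative)
```

When both conditional probabilities are high, the ratio stays near 1 and the odds barely move, however striking D may look.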
Skepticism about neuroimages amounts to the proposition that the likelihood ratio of a functional hypothesis to its null is always very low when we treat an SPM as data. The precise form of this skepticism depends on which of the three ways of construing D we choose. First, D could be the fact that there was increased activity: that is, the fact that there was more brain activity in one condition than in another. Second, D could be the fact that there was a statistically significant difference in activity: not just difference, but difference that was statistically detectable. Third, D could be associated with the actual time-course of some statistically significant data: not just the fact of significance, in other words, but the fact that the difference was significant and took thus-and-such shape. Each of the three ways of reading D makes neuroimages problematic as evidence.
The problem of causal density
Suppose D is the fact that there was task-related differential brain activity. The problem is that there is good reason to believe that any task will have widespread effects on the brain. These effects will be small and functionally insignificant, but they will nevertheless be present. This means that both $P(D \mid H_a)$ and $P(D \mid H_0)$ are high in each area of the brain, and the likelihood ratio is close to 1. Given this, D is uninformative.
fMRI is relatively insensitive: everyone agrees that there are real differences in brain activity that get lost in noise. But suppose we were able to make our fMRI experiments arbitrarily sensitive, so that even small differences in brain activity became detectable. There is a good argument that, were we able to do so, we should expect to find differential activity across the entire brain for any task. This is because brains are causally dense systems: systems in which there is a causal path between changes in any explanatory variable and most other variables. As Savoy notes, the brain is a densely interconnected system, one in which
…there are only about five synapses between any two neurones in the brain. It is reasonably likely that the activity in any one neurone (or collection of neurones, given the spatial resolution of our non-invasive imaging techniques) is going to influence almost any other neurone, albeit weakly. (Savoy , p. 30)
This is not to say, of course, that these widespread differences in activity will be functionally important. The point is merely that they are likely to be there.7 But if differences are likely to be widespread, then the observation of difference is uninformative.
This is not an abstract worry. There is good evidence that fMRI experiments that look more carefully find more activity. Studies looking at the effect of increased sensitivity confirm that improving the signal-to-noise ratio of fMRI dramatically increases the number and extent of activated regions at the same $\alpha$ level. This is apparent in studies that increase the number of subjects (Savoy , p. 30; Thirion et al. ), the number of trials within a study (Huettel and McCarthy ), and the field strength of the main magnet (Huettel et al. , p. 237).
Put another way: if $P(D \mid H_0)$ is high, then the subthreshold activation simply indicates a failure of our instrument to detect a signal.8 The fact that an imaging experiment now differentiates between activated areas thus seems like a fluke of instrumentation. In this case, as Hardcastle and Stewart complain, ‘brain imaging seems to support localist assumptions because we aren't very good at it yet.’ (Hardcastle and Stewart , p. S78)
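The dependence of detected activation on sensitivity can be made concrete with a back-of-the-envelope power calculation. The sketch below is an idealization under stated assumptions: a two-sided z-test with unit noise variance, a hypothetical standardized effect of 0.05, and `n` observations per condition. It is meant only to show that a fixed, tiny difference goes from almost always missed to almost always detected as the amount of data grows.

```python
from statistics import NormalDist


def power_two_sided_z(effect, n_per_condition, alpha):
    """Probability that a two-sample z-test reaches significance at level alpha.

    effect is the true difference in means in units of the noise standard
    deviation (assumed known and equal to 1 in both conditions).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Under the alternative, the test statistic is centered on this shift.
    shift = effect * (n_per_condition / 2.0) ** 0.5
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


tiny_effect = 0.05  # hypothetical: small and perhaps functionally unimportant
sample_sizes = [20, 200, 2000, 20000, 200000]
powers = [power_two_sided_z(tiny_effect, n, alpha=0.01) for n in sample_sizes]

for n, power in zip(sample_sizes, powers):
    print(f"n = {n:>6}: probability of a significant result = {power:.3f}")
```

On this idealization, whether a weakly affected region appears 'active' depends on the amount of data collected and not on any change in the region itself.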
The problem of arbitrary thresholds
Suppose D is the fact that there was a statistically significant difference in the data. Claims of statistical significance are always relative to the choice of $\alpha$. But, the skeptic argues, there is no rationally justifiable choice for $\alpha$. The argument again relies on the causal density of the brain. The theoretical justification for choosing an $\alpha$ level depends on the desirability of reducing false positives. The actual rate of false positives is the product of $\alpha$ and the base rate of true null hypotheses.9 If everything in the brain is weakly connected to everything else, then every task should be expected to result in some difference in neural activation.10 This means that the null hypothesis H0 is always strictly false. But if there are no true nulls, then it is trivially impossible to have a false positive, no matter which $\alpha$ you choose. So if brains are causally dense, then any $\alpha$ will do, and the choice of one is arbitrary.11
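The arithmetic behind this argument is simple enough to spell out. In the sketch below, `expected_false_positive_rate` is a hypothetical helper and the base rates are illustrative assumptions rather than empirical estimates. It treats the expected proportion of tests that are false positives as the significance level multiplied by the proportion of tested regions in which the null hypothesis is true.

```python
def expected_false_positive_rate(alpha, true_null_fraction):
    """Expected share of all tests that are false positives.

    alpha is the significance level; true_null_fraction is the base rate of
    regions in which the null hypothesis actually holds.
    """
    if not 0.0 <= alpha <= 1.0 or not 0.0 <= true_null_fraction <= 1.0:
        raise ValueError("both arguments must lie between 0 and 1")
    return alpha * true_null_fraction


alphas = [0.05, 0.01, 0.001]

# Hypothetical sparse system: the null is true in 90% of regions.
sparse = [expected_false_positive_rate(a, 0.9) for a in alphas]

# Causally dense system as described above: the null is never strictly true.
dense = [expected_false_positive_rate(a, 0.0) for a in alphas]

print(sparse)
print(dense)
```

In the sparse case, tightening the threshold buys a lower false positive rate. In the dense case the rate is zero at every threshold, so that consideration gives no reason to prefer one threshold over another.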
Huettel et al. further note that the test statistics for individual voxels often change in a graded manner as one moves from region to region (Huettel et al. , p. 246). Thresholding at any $\alpha$ inevitably creates artificially sharp barriers between ‘active’ and ‘inactive’ regions. This is why variation in $\alpha$ can result in such dramatic differences in extent of activation: any choice of $\alpha$ imposes a sharp distinction on what is typically a continuous variation in the underlying p-values. A conservative threshold shows very small activated areas, and a liberal threshold much larger ones.
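The effect of thresholding on a graded statistic can be sketched as follows. The profile of z-statistics is invented for illustration (a smooth bump along a hypothetical line of 61 voxels), as are the function names. The point is only that the same continuous data yield different apparent extents of activation at different significance levels.

```python
import math
from statistics import NormalDist


def graded_z_profile(peak=6.0, width=10.0, half_length=30):
    """Hypothetical z-statistics that fall off smoothly from a central peak."""
    return [peak * math.exp(-((d / width) ** 2))
            for d in range(-half_length, half_length + 1)]


def activation_extent(z_values, alpha):
    """Count voxels whose two-sided p-value is at or below alpha."""
    std_normal = NormalDist()
    p_values = [2.0 * (1.0 - std_normal.cdf(abs(z))) for z in z_values]
    return sum(1 for p in p_values if p <= alpha)


profile = graded_z_profile()
thresholds = [0.05, 0.001, 0.00001]  # liberal to conservative
extents = [activation_extent(profile, a) for a in thresholds]

for alpha, extent in zip(thresholds, extents):
    print(f"alpha = {alpha:g}: {extent} of {len(profile)} voxels marked active")
```

Nothing in the underlying profile changes between the three runs; only the cut-off moves.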
Complaints about arbitrary thresholding are common in critiques of functional imaging (Hardcastle and Stewart ; Uttal , pp. 167–9; Roskies , p. 870). Uttal, for example, complains about thresholds that ‘a conservative assignment could hide localized activity and a reckless one suggest unique localizations that are entirely artifactual’ (Uttal , p. 168).12 If choice of threshold is really arbitrary, then $P(D \mid H_a)$ and $P(D \mid H_0)$ will be similarly arbitrary. This means that there is never any rationally compelling way to fix the likelihood ratio and so no way to settle disputes about how strongly the data confirm a hypothesis.
Threshold choice can have theoretically important consequences. Savoy provides a graphical illustration of the point with data collected from subjects looking at flickering checkerboards (Savoy , p. 28). Different thresholds generate images that show different patterns of activation. With a relatively high threshold, the map appears to indicate focal activity in V5/MT, a visual area associated with motion processing. At lower thresholds, all early visual areas (along with other regions of extrastriate cortex) show supra-threshold levels of significant activity. The debate between distributed and modular models of face recognition in part hinges on what to do with small activations in regions outside of fusiform face area (FFA). As Haxby et al. () note, there are subthreshold activations outside of FFA that nevertheless contain enough information to recover whether a subject was looking at a face or a house. So it is consistent with the data that even small subthreshold activations might play a functionally important role in facial recognition.
The problem of vague alternatives
Suppose D is the actual difference in observed BOLD signal in a voxel. This is perhaps the most promising interpretation. Assume for a moment that functionally important areas will show a hemodynamic response, and that the data from some area do show such a canonical response. Then, one might argue, $P(D \mid H_a)$ is well defined and reasonably high: it will be equal to the statistical power of the experiment. The likelihood $P(D \mid H_0)$ will equal the p-value computed for the voxel. In activated regions, that will be orders of magnitude lower than $P(D \mid H_a)$. One may therefore conclude that the likelihood ratio is high, and that Ha is strongly confirmed by the data.
This reasoning is mistaken, though. The problem lies in the move from a p-value to a low $P(D \mid H_0)$: there has been a tacit slide between two different, nonequivalent null hypotheses. The p-value at a voxel is the probability of seeing data like D if there was no BOLD response at all. The likelihood $P(D \mid H_0)$, on the other hand, is the probability of seeing D if the relevant voxel is functionally unimportant. The two will be equivalent only if alternative theories predict that functionally unimportant voxels show no differential BOLD response at all. That is, it must be the case that D would appear not just when Ha is correct, but only when Ha is correct. As Cacioppo et al. note, we rarely have evidence for the second half of that claim (Cacioppo et al. ).
Consider, for example, the oft-cited work of Greene et al., which showed significant differences in activation in the angular gyrus when subjects were presented with emotionally laden moral dilemmas rather than impersonal ones (Greene et al. ). Theories that attribute no role to emotion in moral decision-making need not be particularly threatened by these data. It is perfectly consistent with such alternative hypotheses that the angular gyrus activation was part of a functionally unimportant reaction to the content of the moral dilemma. So, let D be the observed BOLD response in the angular gyrus, Ha be the hypothesis that the angular gyrus plays a functionally important role in moral reasoning, and H0 the hypothesis that it plays no functionally important role. The likelihood $P(D \mid H_a)$ is high, of course: one would expect to see increased activity from a functionally important area. But $P(D \mid H_0)$ is also reasonably high: one would expect to see activation in the angular gyrus in response to emotionally laden scenarios, regardless of the functional role of the angular gyrus. Thus, the likelihood ratio is relatively low, and D does not provide especially compelling evidence. To generalize the argument, the poor temporal resolution of fMRI means that the evidence that a region is activated by a task can almost always instead be taken as evidence that the region is activated as a functionally unimportant byproduct of task performance. As such, $P(D \mid H_0)$ will always be relatively high, and the likelihood ratio relatively low.
The problem of vague alternatives can manifest itself in various ways. Functional hypotheses rarely commit themselves to claims about activity in other, functionally unimportant areas. This means that $P(D \mid H_0)$ will be undefined in particular cases, and the strength of evidence impossible to determine (Mole and Klein [forthcoming]). It also means that most experiments do not subject hypotheses to severe tests (Aktunç [unpublished]), and do not establish tight structure–function mappings (Cacioppo et al. ). If alternative functional hypotheses are simply indifferent about the response of functionally unimportant areas, then $P(D \mid H_0)$ may well be reasonably high, and at worst is undefined. So long as that is the case, even clean data need not be especially compelling.
Skepticism Is Due to NHST
The three forms of skepticism about neuroimages share a common root. SPMs present, at root, the results of numerous simultaneous NHSTs. That may seem like the least problematic fact about them. NHST is widely used in contemporary psychology. Those who attack the use of functional imaging typically do so in order to defend some other way of doing psychology, not because they are skeptical about psychology itself.13 Yet NHST is theoretically controversial.14 The controversy over NHST is unnecessarily polarized: I will assume that there are clear cases where NHST provides evidence, and clear cases where it does not. When we delineate the conditions in which NHST does not give good evidence, those conditions tend to obtain in cases where SPMs are used to test functional hypotheses.
First, NHST is uninformative for testing hypotheses about systems (or parts of systems) in which null hypotheses are usually false. This is a common complaint about the use of NHST in experimental psychology. Meehl, for example, notes that any null hypothesis of no effect must be false in dense systems, as there will always be minuscule but significant correlations between any two variables of interest (Meehl , p. 108). Lykken similarly notes that, ‘In psychology, everything is likely to be related at least a little bit to everything else, for complex and uninteresting reasons’ (Lykken , p. 31). Null hypotheses are rarely true in causally dense systems, making significance tests uninformative.15 Causal density of the studied systems is a common feature of other disciplines in which NHST has been controversial.16
For the same reason, thresholding p-values in causally dense systems is also problematic. Causally dense systems typically show continuous variation in p-values, depending on the resolution of the test; picking a point at which a p-value changes from evidence against to evidence for a hypothesis is theoretically arbitrary. As Abelson put it, ‘We act foolishly when we celebrate results with but wallow in self-pity when ’ (Abelson , p. 12). In causally dense systems, again, no $\alpha$ can be compelling because false positives are extremely rare.
Finally, simple significance tests are most plausibly evidence for or against ordinal hypotheses: hypotheses that state that one parameter is larger than another (Frick , p. 379). The evidence that univariate significance tests give is about the direction (positive, negative, or indeterminate) of an effect; they do not, on their own, provide information about the size of an effect or about the form of a relationship between variables of interest (Tukey , p. 100). Functional hypotheses are not ordinal hypotheses, and building a functional theory requires more than information about the direction of differential performance (Newell , p. 290). To say that some brain area A contributes to the performance of E is not merely to say that it does something when E is performed. Instead, it is a claim about what A does: namely, that it does something particular that contributes to the performance of E. It is perfectly possible for A to do something that makes a tiny but necessary contribution to E, or for A to be extremely active but have no effect on E.
These problems with NHSTs are, of course, just the problems outlined in Section 3. Skepticism about neuroimages is therefore simply a specific instance of a more general skepticism about NHST. NHST provides poor evidence for functional hypotheses about causally dense systems, in the brain or elsewhere.
Neuroimages versus Neuroimaging
Neuroimages are only one product of fMRI. Contemporary fMRI experiments produce more data than is summarized in neuroimages, and the data can be analyzed in a variety of ways. Can skepticism about neuroimages be extended to skepticism about neuroimaging more generally?
I think not. First, most modern imaging experiments also present more sophisticated statistical analyses of fMRI data (Sarty ). These more sophisticated analyses do not rely on the simplistic logic employed by NHSTs, and so are mostly immune from the critiques above. Second, information about neural anatomy can help constrain the implications of functional hypotheses in ways that permit more detailed testing. Experimenters likely do this in an informal way already when they choose regions of interest or interpret their data.17 Information about neural connectivity can be formally integrated into experimental analysis via structural equation modeling and related techniques, and there is some reason to believe that the resulting evidence for functional hypotheses is more sensitive and compelling than that given by SPMs. Third, converging evidence from single-cell recordings and computational modeling can be used to give functional interpretations of quantitative measures of signal change (Logothetis ; Bartels et al. ).
The use of convergent information from modeling and anatomy is worthy of special note. Those who attack NHST typically argue that quantitative information is required to establish functional claims in causally dense systems. Quantitative information about the BOLD signal allows for more sophisticated hypotheses about the functional relationship between distinct brain regions, and is more likely to provide compelling evidence for functional hypotheses. In this regard, a comparison to structural MRI is telling. The production of structural MRIs is technically similar to the production of neuroimages;18 the main difference is that structural MRIs show quantitative facts about the MR signal rather than the results of NHSTs. The images produced by structural MRI are routinely used to settle diagnostic disputes, and do not attract the general skepticism that attaches to fMRI.19 Neuroimages are problematic because significance tests are not an adequate substitute for interpretable quantitative information.
Neuroimages are not worthless, however. The literature on NHST also allows us to offer an alternative interpretation of the role of significance tests, and so of neuroimages. On this alternative view, significance testing provides a first-pass sanity check on experimental data.20 Finding statistical significance is never enough to confirm a hypothesis, but it does provide warrant for taking the data seriously and performing further analysis upon it. Similarly, neuroimages do not confirm functional hypotheses, but they do show brain areas in which the imaging data might be further used to confirm functional hypotheses. It requires more and different evidence to confirm functional hypotheses, but that evidence is something we can reasonably collect. This means that the production of neuroimages may be a necessary, though not sufficient, step in confirming functional hypotheses.
The evidence of neuroimaging is thus not in the images it produces. Nor do further data turn those images into evidence. Instead, neuroimages point us to where the evidence for functional hypotheses might be. Pictures of ‘brain activity’ are essentially uninterpretable without further analysis. Skepticism about neuroimages, however, does not provide grounds for general skepticism about the results of these further analyses.
Thanks to David Hilbert, Esther Klein, Chris Mole, Adina Roskies, Don Ross, audiences at University of Illinois at Chicago, and two anonymous reviewers for helpful discussion and comment.