Abstract

To identify and categorize complex stimuli such as familiar objects or speech, the human brain integrates information that is abstracted at multiple levels from its sensory inputs. Using cross-modal priming for spoken words and sounds, this functional magnetic resonance imaging study identified 3 distinct classes of visuoauditory incongruency effects: visuoauditory incongruency effects were selective for 1) spoken words in the left superior temporal sulcus (STS), 2) environmental sounds in the left angular gyrus (AG), and 3) both words and sounds in the lateral and medial prefrontal cortices (IFS/mPFC). From a cognitive perspective, these incongruency effects suggest that prior visual information influences the neural processes underlying speech and sound recognition at multiple levels, with the STS being involved in phonological, AG in semantic, and mPFC/IFS in higher conceptual processing. In terms of neural mechanisms, effective connectivity analyses (dynamic causal modeling) suggest that these incongruency effects may emerge via greater bottom-up effects from early auditory regions to intermediate multisensory integration areas (i.e., STS and AG). This is consistent with a predictive coding perspective on hierarchical Bayesian inference in the cortex where the domain of the prediction error (phonological vs. semantic) determines its regional expression (middle temporal gyrus/STS vs. AG/intraparietal sulcus).

Introduction

To form a coherent and unified percept, the human brain combines information from multiple senses (Stein and Meredith 1993). At the behavioral level, multisensory integration of congruent information facilitates detection, identification, and categorization of objects or novel events in our environment. Electrophysiological and functional magnetic resonance imaging (fMRI) studies in human and nonhuman primates have started investigating where, when, and how the human brain integrates different types of sensory information at multiple levels of the cortical hierarchy. Multisensory convergence effects have been found in a distributed subcortical and cortical neural system encompassing presumptive unimodal (or early) sensory areas (Calvert et al. 1999; Foxe et al. 2000; Molholm et al. 2002; Schroeder and Foxe 2002; Fu et al. 2003; Kayser et al. 2005) and higher order association areas such as the superior temporal sulcus and intraparietal sulcus (IPS), the anterior cingulate (AC), and the prefrontal cortex (for review, see Calvert and Lewis 2004; Amedi et al. 2005; Schroeder and Foxe 2005; Ghazanfar and Schroeder 2006). It has been proposed that these various integration sites may support the integration of different stimulus features or parameters that are abstracted at multiple levels from the sensory inputs. In particular, recognition of complex audiovisual stimuli such as familiar objects (Gottfried and Dolan 2003; Laurienti et al. 2003, 2004; Molholm et al. 2004; Beauchamp, Argall, et al. 2004; Beauchamp, Lee, et al. 2004), actions (Barraclough et al. 2005), or speech (Calvert et al. 2000; Raij et al. 2000; Olson et al. 2002; Wright et al. 2003; Callan et al. 2004; Macaluso et al. 2004; van Atteveldt et al. 2004; Ghazanfar et al. 2005; Saito et al. 2005) may involve audiovisual integration at multiple processing stages ranging from early sensory to phonological, semantic, and higher conceptual (or decisional) processes.

Multiple experimental paradigms and analyses have been used to characterize audiovisual interactions. Classically, multisensory integration areas have been identified by superimposition of auditory and visual activations (e.g., using implicit masking or conjunction analyses, Friston et al. 2005), audiovisual interaction effects, and congruency manipulations (Calvert 2001; Calvert et al. 2001). Complementary insights into the variety of audiovisual interactions have been obtained from visuoauditory matching (Taylor et al. 2006), recognition (Nyberg et al. 2000; Gottfried et al. 2004; Lehmann and Murray 2005; Murray et al. 2005), association learning (Gibson and Maunsell 1997; Fuster et al. 2000; Gonzalo and Büchel 2003; Tanabe et al. 2005), and priming (Badgaiyan et al. 1999) paradigms. Despite a degree of convergence in the results, these diverse paradigms are likely to highlight distinct aspects of multisensory processing: visuoauditory matching tasks require explicit access to unimodal percepts, multisensory interaction paradigms involve the integration of sensory features into a unified percept, and recognition paradigms invoke additional memory components.

In the current study, we employed immediate visuoauditory priming (for review, see Henson 2003; Henson and Rugg 2003; Grill-Spector et al. 2006) to investigate the effect of prior visual information on categorization of complex stimuli such as environmental sounds and spoken words in terms of behavioral interference/facilitation and associated activation changes.

Categorization of spoken words and of environmental sounds (i.e., source sounds, such as a cat's meow) engages phonological, semantic, and higher conceptual processes, but to different degrees: recognition and categorization of spoken words or speech (i.e., verbal stimuli) relies primarily on the interaction between perceptual and phonological processes, that is, the processing of speech sounds (Potter and Faulconer 1975; Plaut et al. 1996; Binder et al. 2000, 2004). By contrast, recognition and categorization of environmental sounds (i.e., nonverbal stimuli) is accomplished primarily through the interaction of perceptual and semantic processes (Humphreys and Forde 2001; Lewis et al. 2004; Rogers et al. 2004; Ikeda et al. 2006). This is, however, a continuous rather than a categorical distinction. For instance, auditory word recognition may also activate semantic representations related to the meaning of the word. Conversely, sound object recognition may involve implicit name retrieval. Furthermore, categorization of sounds or words will involve higher level conceptual or decisional processes that do not depend on the particular stimulus format but are elicited irrespective of stimulus material (verbal, nonverbal) or modality (auditory, visual, etc.).

Incongruent prior visual information will interfere with, and thus place more demands on, the processes involved in sound and speech recognition. Hence, categorization of auditory stimuli preceded by incongruent visual stimuli may be associated primarily with phonological incongruency for spoken words (e.g., the spoken word cat) and with semantic incongruency for sounds (e.g., the meowing sound of a cat). Both incongruent spoken words and incongruent sounds may additionally elicit higher level conceptual (or decisional) incongruency effects.

Combining visuoauditory priming for environmental sounds and spoken words may thus enable us to dissociate visuoauditory incongruency effects that may emerge at the phonological, semantic, and higher conceptual level. At the neuronal level, these incongruency effects are thought to be associated with activation increases for incongruent trials—possibly reflecting a prediction error signal (Rao and Ballard 1999; Friston and Price 2001)—in regions sustaining phonological, semantic, or higher conceptual processes.

These differential contributions of phonological, semantic, and conceptual (or decisional) elements to the categorization of sounds and spoken words provide the rationale for our visuoauditory priming paradigm: subjects were presented with a brief (100 ms) visual prime (i.e., a picture or a written word) that was followed, after an additional 100 ms, by a congruent or incongruent auditory target (i.e., a sound or a spoken word). Both visual primes and auditory targets could be either verbal (i.e., written and spoken words) or nonverbal (i.e., pictures and sounds). Subjects passively attended to the visual prime and categorized the auditory targets, that is, the spoken words and sounds, according to their weight (heavier than 4 kg?).

Using this fully balanced multifactorial design (see Fig. 1), we first identified regions that were influenced by visuoauditory (in)congruency. Within these regions, we investigated whether the (in)congruency effects depended on the target material and were different for spoken words and sounds. This allowed us to segregate incongruency effects into 3 classes: visuoauditory incongruency effects that were 1) selective for spoken words, 2) selective for sounds, or 3) common to spoken words and sounds. Following our initial rationale, we related these 3 types of visuoauditory incongruency effects to multisensory interactions at the 1) phonological, 2) semantic, and 3) conceptual/decisional level.

Figure 1.

Study design and example stimuli. (A) 2 × 2 × 3 factorial design with the factors:

  1. Congruency: (C) congruent identity and response (e.g., cat paired with meow), (II) incongruent identity and congruent response (e.g., razor paired with bee, both items < 4 kg), (II+R) incongruent identity and incongruent response (e.g., car paired with owl, only one item < 4 kg),

  2. Prime material: written words, pictures, and

  3. Target material: spoken words, sounds.

(B) Example run and timing of 3 trials from the 3 levels of (in)congruency.

Using dynamic causal modeling (DCM; Friston et al. 2003) with Bayesian model selection (Penny et al. 2003), we then investigated the neural mechanisms underlying these visuoauditory incongruency effects. In particular, we asked whether the incongruency effects are better understood as a bottom-up error signal or as top-down effects from a general “cognitive control device.” Hence, we compared 2 alternative models that implemented these competing neural mechanisms in a 3-level cortical hierarchy: In the first, bottom-up, model, the incongruency effects emerge in a material-dependent fashion (i.e., selectively for sounds or spoken words) via changes in forward connections from early auditory to intermediate multisensory areas. This model embodies the idea of predictive coding, whereby the human brain learns to predict stimulus attributes on successive exposures to congruent stimuli (i.e., priming) and fails to suppress the prediction error elicited by unpredictable or incongruent bottom-up visuoauditory input, which is manifest in an increase in forward connectivity. In the second, top-down, model, the incongruency effects emerge irrespective of stimulus material through interactions among higher cognitive control regions and propagate down the cortical hierarchy to lower areas. Here, higher cognitive control regions such as the AC/medial prefrontal cortex (mPFC) and the lateral prefrontal cortex (IFS) may act as a general “conflict monitoring and cognitive control device” (Duncan and Owen 2000; Botvinick et al. 2001; Paus 2001; Kerns et al. 2004; Brown and Braver 2005) that modulates activation in intermediate multisensory convergence areas.

Materials and Methods

Subjects

Seventeen healthy right-handed English native speakers (5 females, median age 25) gave informed consent to participate in the study. The study was approved by the joint ethics committee of the Institute of Neurology and University College London Hospital, London, UK.

Experimental Design

The paradigm was a 2-choice forced semantic categorization of auditory stimuli that were preceded by visual stimuli. The activation conditions conformed to a 3 × 2 × 2 factorial design manipulating

  1. Congruency (3 levels): 1) congruent identity and response (=congruent), 2) incongruent identity and congruent response (=incongruencyI), 3) incongruent identity and incongruent response (=incongruencyI+R),

  2. Prime material (2 levels): written words, pictures (i.e., verbal vs. nonverbal), and

  3. Target material (2 levels): spoken words, sounds (i.e., verbal vs. nonverbal).

At the beginning of each trial, a visual prime (i.e., a written word or a color picture) was presented for 100 ms, followed by the auditory target (i.e., a spoken word or a sound) after an additional 100 ms. This very short prime–target asynchrony (200 ms) was selected because we were interested in automatic priming and aimed to reduce any strategic components (see Neely 1977); the rapid successive presentation was perceived as "nearly synchronous" by subjects. The trial onset asynchrony was 3.25 s. Subjects passively attended to the visual primes and performed a semantic decision on the auditory targets (Is the target stimulus heavier than 4 kg?). Fifty percent of the stimuli weighed more than 4 kg and 50% weighed less. Altogether, there were 64 stimuli: 32 animals and 32 tools (duration, mean ± standard deviation: sounds 0.8 ± 0.2 s; spoken words 0.76 ± 0.2 s). These 2 distinct categories were selected to enable incongruent pairings between categories and thus induce strong and reliable incongruency effects. As a consequence, category-selective activations, which have been characterized by numerous previous studies (Chao et al. 1999; Lewis et al. 2004, 2005; Noppeney et al. 2006), are difficult to evaluate here (i.e., half of the compound trials are mixtures of both categories) and are not the focus of this communication.

Fifty percent of the trials were identity congruent, that is, prime and target referred to the same object (e.g., a picture of a dog followed by the barking sound of a dog). The remaining 50% of trials were identity incongruent (i.e., visual prime and auditory target referred to different objects). In half of the identity incongruent trials (i.e., 25% of the total trials), both prime and target weighed less than 4 kg or both weighed more than 4 kg (e.g., a picture of an elephant followed by the sound of a car). In the other half of the identity incongruent trials (i.e., 25% of the total trials), only one of the objects weighed more (or less) than 4 kg (e.g., a picture of a fly followed by the sound of a car). In summary, 50% of the trials were identity congruent, 25% identity incongruent and response congruent, and 25% identity incongruent and response incongruent. This allowed us to dissociate the effect of identity incongruency from response incongruency.

Each stimulus (e.g., bear, see Appendix) was presented 16 times, 8 times as prime (i.e., 4 times as picture and 4 times as written word) and 8 times as target (i.e., 4 times as sound and 4 times as spoken word), amounting to 512 cross-modal trials (i.e., 64 × 8 = 512 trials). In the congruent trials, each stimulus was presented once in each of the following pairings: 1) written word–spoken word, 2) written word–sound, 3) picture–spoken word, and 4) picture–sound. Similarly, in the incongruent trials, each stimulus was equally often presented in each modality pairing. However, here a target stimulus (e.g., bear) was presented with 4 different primes (see Appendix). Presenting the stimuli only once in each pairing and thus changing the surface features ensured that subjects did not engage in prime–target association learning. Furthermore, it ensured that the stimuli were rotated and fully counterbalanced across conditions within and between subjects.
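To make the counterbalancing scheme concrete, the sketch below (in Python, with a handful of hypothetical item names standing in for the 64 stimuli; it is not the actual stimulus list, which is given in the Appendix) rotates each target item once through every prime–target modality pairing in both a congruent and an incongruent pair, yielding 8 target presentations per item as described above.

```python
import itertools
import random

PRIME_MODALITIES = ["picture", "written word"]
TARGET_MODALITIES = ["sound", "spoken word"]

# Hypothetical stand-ins for the 64 animal and tool items used in the study.
items = ["bear", "cat", "car", "razor"]

def build_crossmodal_trials(items, rng):
    trials = []
    for target_item in items:
        for prime_mod, target_mod in itertools.product(PRIME_MODALITIES, TARGET_MODALITIES):
            # Congruent pair: prime and target refer to the same object.
            trials.append(dict(prime=(prime_mod, target_item),
                               target=(target_mod, target_item), congruent=True))
            # Incongruent pair: prime drawn from a different object (the II vs. II+R split
            # would additionally depend on the items' weights, which is omitted here).
            foil = rng.choice([item for item in items if item != target_item])
            trials.append(dict(prime=(prime_mod, foil),
                               target=(target_mod, target_item), congruent=False))
    rng.shuffle(trials)
    return trials

trials = build_crossmodal_trials(items, random.Random(0))
print(len(trials))  # 8 trials per target item; with 64 items this yields 512 cross-modal trials
```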

Additionally, 48 intramodal visual trials (i.e., picture–picture, picture–written word, written word–picture, written word–written word) were included to maintain subjects' attention to the visual primes that were response irrelevant. Fifty percent of the trials required a yes response. Yes/no responses to all conditions were indicated (as quickly and as accurately as possible) by a 2-choice key press. The activation conditions were interleaved with 6 s fixation. The stimuli and order of conditions were randomized.

Functional Magnetic Resonance Imaging

A 3-T Siemens Allegra system was used to acquire both T1-weighted anatomical volume images and T2*-weighted axial echo-planar images with blood oxygenation level–dependent contrast (GE-EPI, Cartesian k-space sampling, echo time = 30 ms, repetition time = 2.47 s, 38 axial slices acquired sequentially in descending order, matrix 64 × 64, spatial resolution 3 × 3 × 3.4 mm³ voxels, slice thickness 2.0 mm, interslice gap 1.4 mm). To minimize Nyquist ghost artifacts, a generalized reconstruction algorithm was used for data processing (Josephs et al. 2000). There were 2 sessions with a total of 473 volume images per session. The first 6 volumes of each session were discarded to allow for T1 equilibration effects (Table 1).

Table 1

Behavioral data: accuracy and reaction times

                              Target
Prime            C/I     Sound           Word (spoken)
(a) Accuracy (proportion correct)
    Picture        C     0.93 (0.07)     0.95 (0.05)
                   II    0.87 (0.08)     0.91 (0.09)
                   II+R  0.83 (0.12)     0.92 (0.07)
    Word (written) C     0.94 (0.06)     0.95 (0.06)
                   II    0.87 (0.11)     0.95 (0.05)
                   II+R  0.86 (0.11)     0.91 (0.08)
(b) RT (ms)
    Picture        C     719 (230)       681 (211)
                   II    943 (220)       928 (199)
                   II+R  954 (241)       903 (170)
    Word (written) C     774 (238)       679 (189)
                   II    967 (220)       932 (188)
                   II+R  949 (196)       915 (184)

Note: Values are across-volunteer means (standard deviations). C = congruent, II = incongruencyI, II+R = incongruencyI+R.

Conventional SPM Analysis

The data were analyzed with statistical parametric mapping (using SPM2 software from the Wellcome Department of Imaging Neuroscience, London; www.fil.ion.ucl.ac.uk/spm; Friston et al. 1995). Scans from each subject were realigned using the first as a reference, spatially normalized into Montreal Neurological Institute standard space (Talairach and Tournoux 1988; Evans et al. 1992), resampled to 3 × 3 × 3 mm³ voxels, and spatially smoothed with a Gaussian kernel of 8 mm full-width at half-maximum. The time series in each voxel was high-pass filtered to 1/128 Hz and globally normalized with proportional scaling. The fMRI experiment was modeled in an event-related fashion using regressors obtained by convolving each event-related unit impulse with a canonical hemodynamic response function and its first temporal derivative. In addition to the 12 conditions in our 2 × 2 × 3 factorial design (only correct trials included), the statistical model included intramodal trials, errors, and nonresponses. Nuisance covariates included the realignment parameters (to account for residual motion artifacts). Condition-specific effects for each subject were estimated according to the general linear model and passed to a second-level analysis as contrasts. This involved creating contrast images of all cross-modal conditions > fixation (averaged over the 2 sessions) for each subject and entering them into a second-level one-sample t-test. In addition, the response for each of the 12 conditions (summed over the 2 sessions) was estimated and entered into a second-level analysis of variance (ANOVA) that modeled the 12 effects in our 2 × 2 × 3 factorial design.
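As a rough illustration of this modeling step (a minimal numpy/scipy sketch, not the SPM2 code actually used; the HRF parameterization is a common SPM-style approximation and the onsets are placeholders), one condition regressor can be built by convolving a stick function at the event onsets with a canonical double-gamma hemodynamic response function and adding its first temporal derivative:

```python
import numpy as np
from scipy.stats import gamma

TR, n_scans = 2.47, 473                      # repetition time (s) and scans per session

def canonical_hrf(dt, duration=32.0):
    """Double-gamma HRF sampled at resolution dt (an SPM-style approximation)."""
    t = np.arange(0, duration, dt)
    peak = gamma.pdf(t, 6)                   # positive response peaking around 5-6 s
    undershoot = gamma.pdf(t, 16) / 6.0      # late undershoot
    h = peak - undershoot
    return h / h.sum()

def make_regressor(onsets_s, dt=0.1):
    """Convolve unit impulses at the event onsets with the HRF, then resample to scan times."""
    n_fine = int(n_scans * TR / dt)
    sticks = np.zeros(n_fine)
    sticks[(np.asarray(onsets_s) / dt).astype(int)] = 1.0
    conv = np.convolve(sticks, canonical_hrf(dt))[:n_fine]
    scan_samples = (np.arange(n_scans) * TR / dt).astype(int)
    reg = conv[scan_samples]
    deriv = np.gradient(reg)                 # first temporal derivative regressor
    return reg, deriv

reg, deriv = make_regressor(onsets_s=[10.0, 22.0, 40.5])   # placeholder onsets (s)
design_columns = np.column_stack([reg, deriv])
```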

Inferences were made at the second level to allow a random effects analysis and inferences at the population level (Friston et al. 1999).

The random-effects ANOVA tested for the effects of incongruency. Pooling over picture and written word primes, we tested for incongruency effects that were 1) selective for spoken words, 2) selective for sounds (i.e., the interactions between congruent vs. incongruent and sounds vs. spoken words), or 3) common to sounds and spoken words.

Search Volume Constraints

The search space (i.e., volume of interest) was constrained using orthogonal contrasts: the search space for the main and simple main effects of (in)congruency was limited to voxels that were activated for cross-modal stimuli > fixation at a threshold of P < 0.01 uncorrected (extent threshold > 15 voxels). The search space for the interaction effects was limited to voxels that were activated for cross-modal stimuli > fixation at P < 0.01 uncorrected (extent threshold > 15 voxels) and exhibited a main effect of congruency (i.e., incongruent > congruent stimuli at P < 0.001, uncorrected; extent threshold > 15 voxels). To identify conceptual (or decisional) congruency effects that were common to sound and spoken word targets, each effect was tested within a search volume mutually constrained by the other contrast (see Friston et al. 2005). This approach is equivalent to a conjunction analysis testing the conjunction null (i.e., a logical AND). Unless otherwise stated, we only report activations that are significant (P < 0.05) corrected for the search volume.
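The masking logic can be summarized in a small sketch (plain numpy on hypothetical, flattened z-maps; the actual analysis used SPM's small-volume correction and a cluster-extent criterion, both omitted here):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_voxels = 100_000

# Hypothetical voxelwise z-maps from the second-level analysis.
z_crossmodal_vs_fixation = rng.normal(0.5, 1.0, n_voxels)
z_incongruent_vs_congruent = rng.normal(0.0, 1.0, n_voxels)

# Search space for the main/simple main effects: cross-modal > fixation at P < 0.01 uncorrected.
mask_activation = z_crossmodal_vs_fixation > norm.isf(0.01)

# Search space for the interaction effects: additionally require a main effect of
# congruency (incongruent > congruent) at P < 0.001 uncorrected.
mask_interaction = mask_activation & (z_incongruent_vs_congruent > norm.isf(0.001))

print(mask_activation.sum(), mask_interaction.sum())
```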

Effective Connectivity Analysis: DCM

DCM treats the brain as a dynamic input-state-output system. The inputs correspond to conventional stimulus functions encoding the experimental manipulations. The state variables are neuronal activities, and the outputs are the regional hemodynamic responses measured with fMRI. The idea is to model changes in the states, which cannot be observed directly, using the known inputs and outputs. Critically, changes in the states of one region depend on the states (i.e., activity) of other regions. This dependency is parameterized by effective connectivity. There are 3 types of parameters in a DCM: 1) input parameters that describe how strongly brain regions respond to experimental stimuli, 2) intrinsic parameters that characterize effective connectivity among regions, and 3) modulatory parameters that characterize changes in effective connectivity caused by experimental manipulations. This third set of parameters, the modulatory effects, allows us to explain fMRI incongruency effects by changes in coupling among brain areas. Importantly, this coupling (effective connectivity) is expressed at the level of neuronal states. DCM employs a forward model relating neuronal activity to the fMRI data; this model is inverted during model fitting. Put simply, the forward model is used to predict the outputs from the inputs, and the parameters are adjusted (using gradient descent) so that predicted and observed outputs match.
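In the standard DCM for fMRI formulation (Friston et al. 2003), these 3 parameter sets appear explicitly in the bilinear neuronal state equation, where z denotes the neuronal states of the modeled regions and u the experimental inputs:

$$\dot{z} = \Big(A + \sum_{j} u_j\, B^{(j)}\Big)\, z + C\, u$$

Here A encodes the intrinsic (fixed) connections, each B^{(j)} encodes the change in coupling induced by input u_j (in this study, the incongruency factors), and C encodes the direct, extrinsic influence of the stimuli on the regions.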

For each subject, 2 DCMs (Friston et al. 2003) were constructed that entailed our 2 alternative hypotheses. In the first “bottom-up model,” the incongruency effects emerge in a material-dependent fashion (i.e., selective for sounds or spoken words) via changes in forward connections from early auditory to intermediate multisensory areas. In the second “top-down model,” they emerge irrespective of stimulus material through interactions among higher cognitive control regions and propagate down the cortical hierarchy to lower areas. Here, higher cognitive control regions such as the AC/mPFC and the lateral prefrontal cortex (IFS) may act as a general conflict monitoring and cognitive control device (Duncan and Owen 2000; Botvinick et al. 2001; Paus 2001; Kerns et al. 2004; Brown and Braver 2005) that modulates activation in intermediate multisensory convergence areas.

Each DCM (Fig. 5) included 6 regions that formed a 3-level cortical hierarchy: 1) a left superior temporal area that was activated by cross-modal stimuli relative to fixation (superior temporal gyrus [STG]; x = −63, y = −24, z = 9), 2) a left fusiform region that was activated by cross-modal stimuli relative to fixation (fusiform gyrus [FG]; x = −45, y = −60, z = −21), 3) a region in the left superior temporal sulcus exhibiting an incongruency effect that was selective for spoken words (STS/middle temporal gyrus [MTG]; x = −66, y = −27, z = −3), 4) a region in the left angular gyrus (AG)/IPS exhibiting an incongruency effect selective for sounds (AG/IPS; x = −30, y = −75, z = 42), 5) the AC/mPFC (x = 0, y = 18, z = 48), and 6) the left inferior frontal sulcus (IFS; x = −42, y = 12, z = 24), the latter two showing nonselective incongruency effects. The stimuli entered as extrinsic inputs to STG and FG, separately for picture–sound, picture–word, word–sound, and word–word trials, to account for material-selective activation differences. Holding the number of parameters and the intrinsic and extrinsic connectivity structure constant, the 2 DCMs differed only in where the congruency effects were exerted: In the bottom-up DCM, the incongruency factor increased the forward connections from STG and FG to AG/IPS and STS/MTG in a material-dependent fashion. In the top-down DCM, it increased the connections between AC and IFS in a material-independent manner. Thus, these models encode either a greater sensitivity of AG/IPS and STS/MTG to incongruent bottom-up inputs or a greater sensitivity to incongruent top-down inputs. Comparing these models allowed us to distinguish between bottom-up and top-down mediation of the incongruency effects.
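Purely as a schematic summary of the 2 specifications (a numpy sketch, not SPM's DCM interface; only the modulated connections stated above are shown, the full intrinsic connectivity of the 3-level hierarchy is given in Fig. 5, and the routing of the extrinsic inputs to both entry regions is an assumption made for illustration):

```python
import numpy as np

regions = ["STG", "FG", "STS/MTG", "AG/IPS", "IFS", "AC/mPFC"]
idx = {name: i for i, name in enumerate(regions)}
n = len(regions)

def connections(pairs):
    """Binary matrix M[target, source] = 1 marking the listed source -> target connections."""
    M = np.zeros((n, n))
    for source, target in pairs:
        M[idx[target], idx[source]] = 1.0
    return M

# Bottom-up model: incongruent spoken words modulate the forward connections onto STS/MTG,
# incongruent sounds those onto AG/IPS.
B_bottom_up = {
    "incongruent words":  connections([("STG", "STS/MTG"), ("FG", "STS/MTG")]),
    "incongruent sounds": connections([("STG", "AG/IPS"),  ("FG", "AG/IPS")]),
}

# Top-down model: all incongruent stimuli, irrespective of material, modulate the
# connections between AC/mPFC and IFS at the top of the hierarchy.
B_top_down = {
    "incongruent stimuli": connections([("AC/mPFC", "IFS"), ("IFS", "AC/mPFC")]),
}

# Extrinsic inputs: one input per prime-target pairing, here assumed to drive both
# entry regions (STG for the auditory and FG for the visual stimulus component).
inputs = ["picture-sound", "picture-word", "word-sound", "word-word"]
C = np.zeros((n, len(inputs)))
C[[idx["STG"], idx["FG"]], :] = 1.0
```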

The regions were selected using the maxima of the relevant contrasts from our random effects analysis. Region-specific time series (concatenated over the 2 sessions and adjusted for confounds) comprised the first eigenvariate of all voxels within a 4-mm radius centered on each peak identified in the random effects analysis.
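The first eigenvariate of such a voxel-by-time data matrix can be obtained via a singular value decomposition; below is a minimal sketch with a placeholder data matrix (the actual extraction used SPM's volume-of-interest tools):

```python
import numpy as np

# Placeholder: confound-adjusted time series (scans x voxels) of all voxels within the
# 4-mm sphere, concatenated over the 2 sessions (2 x 467 retained scans here).
Y = np.random.randn(934, 19)

def first_eigenvariate(Y):
    """Dominant temporal mode of the region's voxel time series."""
    Yc = Y - Y.mean(axis=0)                          # remove voxel-wise means
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    eig = U[:, 0] * s[0]                             # temporal expression of the first mode
    if np.corrcoef(eig, Yc.mean(axis=1))[0, 1] < 0:  # resolve the arbitrary sign of the SVD
        eig = -eig
    return eig

region_timeseries = first_eigenvariate(Y)
```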

For each model, the subject-specific modulatory effects were entered into t-tests at the group level (see Fig. 4). This allowed us to summarize the consistent findings from the subject-specific DCMs using classical statistics.

Bayes factors (i.e., ratios of model evidences; Kass and Raftery 1995) were used for model comparison, that is, to decide whether the bottom-up or the top-down DCM was the better model (Penny et al. 2004). In brief, given the measured data y and 2 competing models, the Bayes factor is the ratio of the evidences of the 2 models. A Bayes factor of 1 represents equal evidence for the 2 models; a Bayes factor above 3 is considered positive evidence for one of them. The model evidence depends not only on model fit but also on model complexity. Here, we limited ourselves to the bottom-up and top-down models, which were equated for the number of parameters (i.e., model complexity), and did not design a third, more complex model endowed with both bottom-up and top-down effects.
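Formally, for the measured data y and the 2 models considered here,

$$BF = \frac{p(y \mid m_{\text{bottom-up}})}{p(y \mid m_{\text{top-down}})},$$

so that BF = 1 favors neither model, BF > 3 constitutes positive evidence for the bottom-up model, and BF < 1/3 constitutes positive evidence for the top-down model.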

Finally, a group analysis was implemented by taking the product of the subject-specific Bayes factors over subjects (equivalent to exponentiating the sum of the subject-specific differences in log model evidence). However, we also report the Bayes factors for each individual subject (see Fig. 5, right column) to provide an intuition of the consistency over subjects. As the Bayes factors for some subjects were very large, we selected a cutoff of 8 to focus on the consistency across subjects in Figure 5.
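A sketch of this group-level computation from subject-specific (approximate) log model evidences, with arbitrary placeholder numbers:

```python
import numpy as np

# Placeholder approximate log evidences (one value per subject and model).
log_evidence_bottom_up = np.array([-1203.4, -998.7, -1150.2, -1076.9])
log_evidence_top_down  = np.array([-1210.1, -999.0, -1158.9, -1075.8])

log_bf = log_evidence_bottom_up - log_evidence_top_down   # subject-specific log Bayes factors
subject_bf = np.exp(log_bf)

# Group Bayes factor: product of the subject-specific Bayes factors,
# i.e., the exponentiated sum of the subject-specific log Bayes factors.
group_bf = np.exp(log_bf.sum())

# For display (as in Fig. 5), very large subject-specific values can be capped at 8.
plotted_bf = np.minimum(subject_bf, 8.0)
```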

Results

In the following, we report 1) the behavioral results, 2) the fMRI results of the conventional analysis, focusing on regionally selective activations, and 3) the DCM results, which provide insight into the potential neural mechanisms mediating the observed regional activations.

Behavioral Results

For performance accuracy, a 3-way ANOVA with congruency (congruent vs. incongruentI vs. incongruentI+R), prime material (picture vs. written word), and target material (sound vs. spoken word) identified a significant main effect of congruency (F(1.4, 21.7) = 10.5, P < 0.01) and of target material (F(1, 16) = 32, P < 0.001) after Greenhouse–Geisser correction. In addition, there was a significant interaction between congruency and target material (F(2, 31) = 9.2, P = 0.001). For reaction times (RTs; limited to correct trials only), a 3-way ANOVA identified main effects of congruency (F(1.8, 28.6) = 129.1, P < 0.001) and target material (F(1, 16) = 11.8, P < 0.01) following Greenhouse–Geisser correction. RTs were shorter for spoken words than for sounds. The absence of any significant interactions of congruency with prime material (written words vs. pictures) or target material (spoken words vs. sounds) in the RT data suggests that the prime duration (100 ms) allowed pictures and written words to elicit comparable priming effects irrespective of the target material (sounds or spoken words).

Post hoc comparisons (Bonferroni corrected) for accuracy and RTs revealed a significant incongruency effect of identity but not of response. Overall, these behavioral results suggest that incongruency may affect processes of stimulus recognition and categorization.

Conventional SPM Analysis

The conventional SPM analysis was performed in 2 steps: First, we identified regions that showed increased activation for incongruent > congruent stimuli (within the system of regions activated relative to fixation, see Materials and Methods). Second, within this system, pooling over prime, we tested for incongruency effects that were 1) common to sounds and spoken words, 2) selective for spoken words, or 3) selective for sounds (i.e., the interaction between congruent vs. incongruent and sounds vs. spoken words). For completeness, pooling over target, we tested for incongruency effects that were selective for pictures or written words (i.e., the interaction between congruent vs. incongruent and pictures vs. written words). In other words, we used the factorial character of our experimental design and pooled over one factor to increase the power when investigating the effect of the other factor.
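To make this pooling explicit, the contrast weights over the 12 condition estimates can be assembled factor by factor; the sketch below assumes a hypothetical condition ordering (congruency varying slowest, then prime, then target) and is not the exact contrast specification used in SPM:

```python
import numpy as np

# Factor-level weights (hypothetical ordering: congruency [C, II, II+R] x
# prime [picture, written word] x target [sound, spoken word]).
congruency  = np.array([-1.0, 0.5, 0.5])   # incongruent (both levels) > congruent
prime_pool  = np.array([1.0, 1.0])         # pool over the two prime materials
target_pool = np.array([1.0, 1.0])         # pool over the two target materials
target_diff = np.array([1.0, -1.0])        # sounds vs. spoken words

# Main effect of incongruency, pooled over prime and target material (12 weights).
c_incongruency = np.kron(congruency, np.kron(prime_pool, target_pool))

# Interaction: incongruency effect for sounds vs. spoken words, pooled over prime.
c_interaction = np.kron(congruency, np.kron(prime_pool, target_diff))

print(c_incongruency)
print(c_interaction)
```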

Main Effect of Identity and Response Incongruency

Incongruent stimuli increased activations relative to congruent stimuli in the AC/mPFC, bilateral IFS, left insula, left IPS/AG, left MTG/STS, and right cerebellum. None of these regions showed an effect of response incongruency (P > 0.05 uncorrected at the peak coordinates). In other words, the activation in these areas did not depend on whether prime and target objects required the same response but was driven primarily by whether the visual prime and auditory target referred to the same object. This suggests that the activation increases might, at least in part, be due to incongruency at the level of object processing and categorization rather than only response selection and preparation.

No increased activation was observed for congruent relative to incongruent trials within the system of regions activated relative to fixation (see Materials and Methods) (Table 2).

Table 2

Visuoauditory congruency effects averaged over sounds and words

Region Coordinates z-score P value (VOI) 
Incongruent > congruent    
    AC medial superior frontal gyrus 0, 18, 48 >7 <0.001 
    Left IFS −42, 12, 24 7.6 <0.001 
    Left insula −33, 24, −3 5.8 <0.001 
    Right cerebellum 30, −66, −27 5.5 <0.001 
    Right IFS 51, 18, 21 5.2 <0.001 
    Left IPS extending into AG −27, −63, 45 5.1 <0.001 
    Left anterior MTG/STS −60, −21, −6 4.9 <0.001 

Note: Spatial extent >10 voxels; VOI = activations for visuoauditory > fixation at P < 0.01 uncorrected, spatial extent > 15 voxels.

Modulatory Effect of Target Material: Incongruency Effects Selective for Spoken Words, Sounds, or Both

Within the incongruency system identified above, the medial prefrontal region and the left IFS exhibited incongruency effects common to sounds and spoken words. Critically, pooling over primes, we observed a significant interaction between incongruency and target material: the left MTG/STS showed an enhanced incongruency effect for spoken words relative to sounds. In contrast, the left AG (extending into IPS) showed an increased incongruency effect for sounds relative to spoken words. Following the rationale of this experiment, the incongruency effects in mPFC/IFS may relate to higher conceptual/decisional processes, those in AG/IPS to semantic processes, and those in STS/MTG to phonological processes. In addition, we observed an incongruency effect selective for sounds in a more dorsal medial prefrontal region. Although only correct trials were included in our fMRI analysis, we note that sound trials were still associated with a greater error probability. Hence, the increased mPFC activation for incongruent sound trials may be related to their inherent ambiguity (cf. recent studies associating the mPFC/AC with error probability prediction rather than error detection per se, Brown and Braver 2005) (Tables 3 and 4, Figs 2 and 3).

Figure 2.

Row 1: Increased activations for incongruent relative to congruent stimuli separately for sounds (left) and spoken words (right) are rendered on a template of the whole brain. Height threshold: P < 0.05 corrected for multiple comparisons within the search space. Row 2: Congruency effects are rendered on a template of the whole brain: red = common for sounds and spoken words, green = sounds > spoken words, blue = spoken words > sounds. Height threshold: P < 0.001 uncorrected. Extent threshold > 10 voxels (for illustration purposes).

Figure 3.

Left: Increased activations for incongruent relative to congruent visuoauditory stimuli on axial and coronal slices of a mean EPI image created by averaging the subjects' normalized EPI images. Height threshold: P < 0.001 uncorrected for illustration purposes. Extent threshold: >1 voxel. Common for sounds and spoken words (Rows 1 and 2). Interactions: Sounds > Spoken words (Row 3); Spoken words > Sounds (Row 4). Right: Parameter estimates for Congruent (c, grey) and Incongruent (i, black) visuoauditory trials relative to fixation. Prime: Pictures or Written Words. Targets: Sounds (S) or Spoken Words (W). The bar graphs represent the size of the effect in nondimensional units (corresponding to percent whole-brain mean). These effects are activations pooled (i.e., summed) over the appropriate conditions. Row 1: x = −42, y = 12, z = 24. Row 2: x = 0, y = 18, z = 48. Row 3: x = −30, y = −75, z = 42. Row 4: x = −66, y = −27, z = −3.

Table 3

Visuoauditory congruency effects for sounds and words

Region Coordinates Number of voxels P value (VOI) 
Medial superior frontal gyrus 0, 18, 48 70 <0.05 
 −3, 12, 54   
Left IFS −42, 12, 24 154 <0.05 
 −45, 27, 21   

Note: z-values are not appropriate given the nature of the conjunction analysis.

Table 4

Interactions: target-dependent visuoauditory congruency effects

Region Coordinates z-score P value (VOI) 
Sounds > words (spoken)    
    Left AG/IPS −30, −75, 42 3.7 0.068 
    Medial superior frontal gyrus 0, 18, 54 3.7 0.068 
Words (spoken) > sounds    
    Left MTG/STS −66, −27, −3 4.0 0.03 

Note: VOI = activations for incongruent > congruent at P < 0.001 and visuoauditory > fixation at P < 0.01, both uncorrected, spatial extent > 15 voxels.

Modulatory Effect of Prime Material (Written Words vs. Pictures)

For completeness, pooling over target material, we tested for incongruency effects that were modulated by prime material (i.e., the interaction between incongruent vs. congruent and pictures vs. written words). However, no regions exhibited a significant interaction between congruency and prime material. The absence of a significant modulatory effect of prime material may be related to several factors: 1) the prime was presented very briefly (100 ms); 2) it was task and response irrelevant; and 3) at the time of target presentation (i.e., 200 ms after prime onset), both phonological and semantic information may already be available irrespective of the prime material (cf. Rahman et al. 2003; Schiller et al. 2003; Moscoso del Prado et al. 2006).

Effect of Performance on fMRI Incongruency Effects

To further characterize the common incongruency effects in the mPFC/AC and left IFS, we investigated their relationship to the subjects' performance measures. For this, we performed a second-level multiple regression analysis in which the subject-specific behavioral interference effects (i.e., RT and accuracy differences for incongruent > congruent) served as predictors of the fMRI incongruency effects (i.e., increased activation for incongruent > congruent stimuli) in the mPFC/AC and left IFS. As the RT and accuracy differences for incongruent relative to congruent trials were strongly negatively correlated over subjects (correlation coefficient = −0.7), we orthogonalized the accuracy regressor with respect to the RT regressor. Given our a priori interest in the role of the AC/mPFC and left IFS in incongruency effects, the results of this analysis are reported corrected for multiple comparisons within spheres (10 mm radius) centered on the peaks identified in the previous conjunction analysis (this does not bias our inference because the effects of RT and accuracy are orthogonal to the incongruency effects).
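A sketch of the orthogonalization and regression steps (plain numpy with placeholder per-subject values; the actual analysis was run voxelwise in SPM):

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 17

# Placeholder subject-specific interference measures and regional fMRI incongruency effects.
rt_interference  = rng.normal(230.0, 40.0, n_subjects)   # RT difference, incongruent - congruent (ms)
acc_interference = rng.normal(-0.05, 0.02, n_subjects)   # accuracy difference (proportion correct)
fmri_effect      = rng.normal(1.0, 0.3, n_subjects)      # incongruent > congruent parameter estimate

def orthogonalize(x, wrt):
    """Remove from x (mean-centered) the component explained by wrt (mean-centered)."""
    x, wrt = x - x.mean(), wrt - wrt.mean()
    return x - wrt * (x @ wrt) / (wrt @ wrt)

acc_orth = orthogonalize(acc_interference, rt_interference)

# Second-level regression of the fMRI incongruency effect on RT and (orthogonalized) accuracy.
X = np.column_stack([np.ones(n_subjects),
                     rt_interference - rt_interference.mean(),
                     acc_orth])
betas, *_ = np.linalg.lstsq(X, fmri_effect, rcond=None)
```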

RT interference positively predicted the fMRI incongruency effects in the mPFC/AC (x = 0, y = 24, z = 48; z-score = 3.51; P(svc) = 0.04) and in the lateral prefrontal cortex (x = −45, y = 6, z = 24; z-score = 3.5; P(svc) = 0.04). In addition, the incongruency effect on accuracy negatively predicted the fMRI incongruency effect in the lateral prefrontal cortex (x = −51, y = 9, z = 24; z-score = 3.9; P(svc) = 0.01). In other words, strong fMRI incongruency effects were associated with relatively longer RTs and lower accuracy for incongruent relative to congruent trials. Thus, consistent with current theories that implicate the mPFC/AC–IFS circuitry in conflict monitoring and cognitive control, mPFC/AC and IFS activation may be associated with stronger interference, as indicated by longer processing times and less accurate performance on incongruent trials (Fig. 4).

Figure 4.

Effects of subjects' behavioral interference (RT and accuracy) on fMRI incongruency effects in AC/mPFC and left IFS. Left: Regional incongruency effects that were predicted by RT and accuracy (proportion correct) differences between incongruent and congruent trials across subjects, shown on coronal and axial slices of a mean EPI image created by averaging the subjects' normalized EPI images. Height threshold: P < 0.001 uncorrected for illustration purposes (see Materials and Methods for further details). Extent threshold: >40 voxels. Right: Scatter plots depict the regression of the regional (adjusted) fMRI signal on RT (ms) and accuracy (proportion correct) interference (see Results for further details). The ordinate represents the (adjusted) fMRI signal; the abscissa represents each subject's mean interference effect (after mean correction), that is, the RT and accuracy differences for incongruent versus congruent stimuli averaged over all types of compound stimuli.

Summary of the Results from the Conventional SPM Analysis

In summary, our results demonstrate that 1) the left MTG/STS shows an increased incongruency effect for spoken words relative to sounds, 2) the left AG/IPS exhibits an increased incongruency effect for sounds relative to spoken words, and 3) a medial prefrontal region and the left IFS are activated for incongruent relative to congruent stimuli for sounds as well as spoken words. Furthermore, the incongruency effects in mPFC/AC and left IFS were predicted by the behavioral interference effects across subjects.

DCM Analysis

At the group level, there was strong evidence for the bottom-up relative to the top-down model, suggesting that the incongruency effects may emerge in a material-dependent fashion (i.e., selectively for spoken words or sounds) via modulation of the forward connections from early auditory regions to STS/MTG and AG/IPS. In other words, STS/MTG and AG/IPS showed a greater response to bottom-up inputs when these were incongruent. Figure 5 (right column) shows the Bayes factors (evidence for the bottom-up model relative to the top-down model) for each subject, to provide an intuition of the consistency over subjects. A cutoff of 8 was used to highlight the fact that, despite intersubject variability in the magnitude of the Bayes factors, the bottom-up model provided a better explanation of the data than the top-down model in all subjects (apart from one showing equal model evidences). As the 2 DCMs were equated for the number of modulatory effects and for intrinsic as well as extrinsic connectivity structure, the difference in model evidence reflects differences in model fit rather than model complexity.

Figure 5.

Dynamic causal models. Left: bottom-up DCM, in which incongruency effects are mediated via forward connections selectively for verbal (spoken words) and nonverbal (sounds) material. Middle: top-down DCM, in which incongruency effects are mediated by interactions between mPFC/AC and IFS. Values are the across-subject means (standard deviations) of the changes in connection strength (values significant at P < 0.05 in bold). These parameters quantify how the experimental manipulations change the values of the intrinsic connections. In dynamic systems, the strength of a coupling can be thought of as a rate constant or, equivalently, the reciprocal of a time constant. Typically, regional activity has a time constant on the order of 1–2 s (rate of 1–0.5 s⁻¹); a modulatory effect of 0.05 s⁻¹ therefore corresponds to a 5–10% increase in coupling. AC = anterior cingulate/medial prefrontal cortex; IFS = inferior frontal sulcus; STS = superior temporal sulcus; AG = angular gyrus; STG = superior temporal gyrus; FG = fusiform gyrus. Black: intrinsic connections; purple: extrinsic inputs; green: modulatory effects. I Stimuli = all incongruent stimuli, I Words = incongruent spoken words, I Sounds = incongruent sounds. Right: bar chart of Bayes factors for the bottom-up relative to the top-down model for each subject. A cutoff of 8 was selected to highlight that, despite intersubject variability, all subjects (apart from one) showed Bayes factors in favor of the bottom-up model.

The numbers by the connections are the change in coupling (i.e., responsiveness of the target region) induced by incongruency or material (sounds vs. spoken words) effects averaged across subjects. Note that in both models, 3 modulatory effects are significant across subjects (Fig. 5).
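As a worked example of the rate-constant interpretation given in the figure caption, a modulatory effect of 0.05 s⁻¹ set against typical intrinsic rates of 0.5 to 1 s⁻¹ corresponds to

$$\frac{0.05\ \mathrm{s^{-1}}}{1\ \mathrm{s^{-1}}} = 5\% \quad\text{to}\quad \frac{0.05\ \mathrm{s^{-1}}}{0.5\ \mathrm{s^{-1}}} = 10\%$$

of the baseline coupling.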

Discussion

This visuoauditory priming study demonstrates the effect of prior visual information on the recognition and categorization of environmental sounds and spoken words at the neural and behavioral levels. Subjects were slower and less accurate on incongruent relative to congruent trials. Consistent with this behavioral interference effect, incongruent relative to congruent visuoauditory trials increased activation in a large distributed neural system encompassing the AC/mPFC, IFS, AG/IPS, and MTG/STS. These effects were observed for incongruent trials irrespective of additional response incongruency. Critically, while the behavioral interference (as measured by longer RTs) was equivalent for sounds and spoken words, our functional imaging results revealed that it was mediated by distinct neural systems. Combining visuoauditory priming for spoken words and environmental sounds enabled us to test for the interaction between congruency and target material (i.e., sounds vs. spoken words) and to segregate the incongruency effects into 3 classes: visuoauditory incongruency effects were enhanced for 1) spoken words in the left anterior MTG/STS, 2) sounds in the left AG/IPS, and 3) both words and sounds in the mPFC/AC and left IFS.

From a cognitive perspective, these distinct classes suggest that prior visual information modulates the categorization of complex auditory stimuli at multiple stages. Based on our initial rationale that processing auditory-visual stimuli relies more on phonology for spoken words and on semantics for environmental sounds, these regionally selective responses may implicate 1) the MTG/STS in phonological, 2) the AG/IPS in semantic and associated recognition, and 3) the mPFC/IFS in higher conceptual or “conflict monitoring” processes. In terms of neural mechanisms, our DCM results suggest that these incongruency effects may emerge in a material-dependent fashion, that is, selectively for spoken words and environmental sounds, via a greater influence of forward connections from early auditory regions on MTG/STS and AG/IPS.

The selective response enhancement in STS/MTG for incongruent spoken words is consistent with its established role in auditory speech processing (Mummery et al. 1999; Binder et al. 2000; Scott et al. 2000; Giraud and Price 2001; Price et al. 2003; Scott and Johnsrude 2003). Furthermore, activation in multiple STS regions has been shown for visuoauditory integration and congruency of 1) seen mouth movements and heard speech during speech reading (Calvert et al. 2000; Macaluso et al. 2004) as well as 2) spoken and written letters (van Atteveldt et al. 2004). Interestingly, when presenting written and spoken phonemes synchronously during passive listening and viewing, only congruent auditory-visual speech that allows successful binding is associated with increased STS activation relative to unimodal speech (Calvert et al. 2000; van Atteveldt et al. 2004). In contrast, in our visuoauditory priming paradigm, the task-irrelevant but incongruent visual prime induces behavioral interference and increases STS activation for categorization of the subsequent spoken word. This demonstrates that different multisensory paradigms can point to the same anatomical locus, but nevertheless be very distinct in their multisensory interaction. During a passive viewing–listening task, both stimulus components are task relevant enabling integration into a coherent percept. In contrast, visuoauditory priming can be considered a selective attention task, where task-irrelevant incongruent visual information needs to be suppressed or overcome by amplification of the task-relevant auditory information. Collectively, both types of visuoauditory interaction effects suggest that the anterior MTG/STS region may be the locus of neuronal processes that underpin visuoauditory interactions or incongruency effects that are conveyed phonologically.

In the AG/IPS, the incongruency effect was selective for the categorization of environmental sounds. As recognition and categorization of environmental sounds are accomplished through interactions between perceptual and semantic processing, this suggests that the incongruency effect arises primarily at the level of semantic representations and converges with recent functional imaging results implicating the IPS in semantic rather than response conflict (van Veen and Carter 2005). Furthermore, the AG/IPS is part of a frontotemporoparietal semantic retrieval system that is generally activated for semantic relative to perceptual or phonological tasks (Vandenberghe et al. 1996; Noppeney and Price 2003, 2004; Binder et al. 2005; Sabsevitz et al. 2005). However, its precise role in semantic processing has remained elusive: only large but not focal parietal lesions are associated with semantic retrieval deficits as measured by standard neuropsychological tests (e.g., the Pyramids and Palm Trees test, Alexander et al. 1989). One interesting possibility that arises from our findings is that the AG/IPS is involved in controlling, accessing, and combining semantic information from multiple senses. This hypothesis needs to be investigated further by 1) comparing visuoauditory priming with unimodal priming (the limited number of unimodal trials in our experiment did not allow that comparison) and with audiovisual priming (i.e., auditory prime and visual target) using fMRI and 2) testing patients with left-lateralized parietal lesions on nonverbal cross-modal (e.g., sound–picture) matching or priming tasks.

In contrast to STS/MTG and AG/IPS, activation in the mPFC/AC and left IFS was increased for both incongruent sounds and incongruent spoken words. The behavioral relevance of these effects was highlighted by their significant correlations with subjects' increases in RT and decreases in accuracy for incongruent relative to congruent trials. In other words, subjects who spent relatively more time on, and showed larger accuracy reductions for, incongruent trials exhibited stronger mPFC/AC and IFS incongruency effects. These results extend the role of the mPFC/AC–IFS circuitry in cognitive control processes such as conflict monitoring or predicting error probability to the multisensory domain (Duncan and Owen 2000; Botvinick et al. 2001; Paus 2001; Noppeney and Price 2002; Laurienti et al. 2003; Kerns et al. 2004; Brown and Braver 2005). Thus, the AC/mPFC and IFS may be engaged in evaluating and integrating higher level conceptual information abstracted from different stimulus materials (i.e., verbal vs. nonverbal) and modalities (auditory vs. visual).

The multiple incongruency effects raise the question of the level of the cortical hierarchy at which they emerge (Mesulam 1990; McIntosh 2000; Horwitz 2003). More specifically, are the incongruency effects mediated via an increased sensitivity of STS/MTG and AG/IPS to forward inputs from early auditory regions, or do they emerge through greater top-down influences from the AC–IFS circuitry, indicating increased cognitive control? Bayesian model comparison provided strong evidence for the bottom-up model, in which the influence of early auditory regions on STS and AG/IPS is increased during incongruent trials. Critically, the forward connectivity is selectively modulated by the different classes of incongruency: phonological incongruency increases the forward connections to STS/MTG and semantic incongruency those to AG/IPS. The proposed bottom-up mechanism converges with recent electroencephalography results demonstrating auditory-visual incongruency and category-specific effects for tools and animals as early as 100 ms poststimulus (Molholm et al. 2004; Hauk et al. 2006; Murray et al. 2006). Hence, the human brain may be able to distinguish rapidly between tools and animals and to detect higher level incongruencies between auditory and visual stimulus components. Collectively, these results are consistent with predictive coding hypotheses, in which prediction errors at all levels of a cortical hierarchy guide perceptual inference (Rao and Ballard 1999). In our case, the failure to suppress prediction error, in the context of unpredictable or incongruent bottom-up cross-modal inputs, is manifest as an increase in forward connectivity. Furthermore, the nature of the prediction error (phonological vs. semantic) determines where it is expressed (MTG/STS vs. AG/IPS).
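Schematically, in a predictive coding hierarchy of the kind invoked here (a simplified formulation in the spirit of Rao and Ballard 1999; the notation is ours, not the original authors'), each level i compares its input x_i with a top-down prediction generated from the level above, and the forward connections convey the residual prediction error:

$$\varepsilon_i = x_i - g_i(\hat{x}_{i+1})$$

On this reading, an incongruent visual prime yields a poor prediction of the auditory input, the error is not suppressed, and its forward transmission appears as increased coupling from STG/FG to STS/MTG or AG/IPS, with the level at which the mismatch arises (phonological vs. semantic) determining whether it is expressed in MTG/STS or in AG/IPS.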

Our findings suggest that prior visual information influences the neural processes underlying speech and sound recognition at multiple levels with the left anterior MTG/STS being involved in phonological, the AG/IPS in semantic, and the mPFC/IFS in higher conceptual processes. In terms of neural mechanisms, effective connectivity analyses indicate that the incongruency effects emerge via a failure to suppress incongruent, bottom-up inputs from early auditory regions to MTG/STS or AG/IPS. This is consistent with a predictive coding perspective on hierarchical Bayesian inference in the cortex where the domain of the prediction error (phonological vs. semantic) determines its regional expression (MTG/STS vs. AG/IPS).

Funding

The Deutsche Forschungsgemeinschaft; the Max-Planck Society; and the Wellcome Trust.

Conflict of Interest: None declared.

Appendix

Example trials: Each stimulus (e.g., bear) was presented 4 times as a target in congruent pairs and 4 times as a target in incongruent pairs. Across subjects, each stimulus was counterbalanced across stimulus modalities and conditions.

References

Alexander MP, Hiltbrunner B, Fischer RS. 1989. Distributed anatomy of transcortical aphasia. Arch Neurol. 46:885–892.

Amedi A, von Kriegstein K, van Atteveldt NM, Beauchamp MS, Naumer MJ. 2005. Functional imaging of human crossmodal identification and object recognition. Exp Brain Res. 166:559–571.

Badgaiyan RD, Schacter DL, Alpert NM. 1999. Auditory priming within and across modalities: evidence from positron emission tomography. J Cogn Neurosci. 11:337–348.

Barraclough NE, Xiao D, Baker CI, Oram MW, Perrett DI. 2005. Integration of visual and auditory information by superior temporal sulcus neurons responsive to the sight of actions. J Cogn Neurosci. 17:377–391.

Beauchamp MS, Argall BD, Bodurka J, Duyn JH, Martin A. 2004. Unraveling multisensory integration: patchy organization within human STS multisensory cortex. Nat Neurosci. 7:1190–1192.

Beauchamp MS, Lee KE, Argall BD, Martin A. 2004. Integration of auditory and visual information about objects in superior temporal sulcus. Neuron. 41:809–823.

Binder JR, Frost JA, Hammeke TA, Bellgowan PS, Springer JA, Kaufman JN, Possing ET. 2000. Human temporal lobe activation by speech and nonspeech sounds. Cereb Cortex. 10:512–528.

Binder JR, Liebenthal E, Possing ET, Medler DA, Ward BD. 2004. Neural correlates of sensory and decision processes in auditory object identification. Nat Neurosci. 7:295–301.

Binder JR, Westbury CF, McKiernan KA, Possing ET, Medler DA. 2005. Distinct brain systems for processing concrete and abstract concepts. J Cogn Neurosci. 17:905–917.

Botvinick MM, Braver TS, Barch DM, Carter CS, Cohen JD. 2001. Conflict monitoring and cognitive control. Psychol Rev. 108:624–652.

Brown JW, Braver TS. 2005. Learned predictions of error likelihood in the anterior cingulate cortex. Science. 307:1118–1121.

Callan DE, Jones JA, Munhall K, Kroos C, Callan AM, Vatikiotis-Bateson E. 2004. Multisensory integration sites identified by perception of spatial wavelet filtered visual speech gesture information. J Cogn Neurosci. 16:805–816.

Calvert GA. 2001. Crossmodal processing in the human brain: insights from functional neuroimaging studies. Cereb Cortex. 11:1110–1123.

Calvert GA, Brammer MJ, Bullmore ET, Campbell R, Iversen SD, David AS. 1999. Response amplification in sensory-specific cortices during crossmodal binding. Neuroreport. 10:2619–2623.

Calvert GA, Campbell R, Brammer MJ. 2000. Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Curr Biol. 10:649–657.

Calvert GA, Hansen PC, Iversen SD, Brammer MJ. 2001. Detection of audio-visual integration sites in humans by application of electrophysiological criteria to the BOLD effect. Neuroimage. 14:427–438.

Calvert GA, Lewis JW. 2004. Hemodynamic studies of audio-visual interactions. In: Calvert GA, Spence C, Stein BE, editors. The handbook of multi-sensory processes. Cambridge (MA): MIT Press. p. 483–502.

Chao LL, Haxby JV, Martin A. 1999. Attribute-based neural substrates in temporal cortex for perceiving and knowing about objects. Nat Neurosci. 2:913–919.

Duncan J, Owen AM. 2000. Common regions of the human frontal lobe recruited by diverse cognitive demands. Trends Neurosci. 23:475–483.

Evans AC, Collins DL, Milner B. 1992. An MRI-based stereotactic atlas from 250 young normal subjects. Soc Neurosci Abstr.

Foxe JJ, Morocz IA, Murray MM, Higgins BA, Javitt DC, Schroeder CE. 2000. Multisensory auditory-somatosensory interactions in early cortical processing revealed by high-density electrical mapping. Brain Res Cogn Brain Res. 10:77–83.

Friston KJ, Harrison L, Penny W. 2003. Dynamic causal modelling. Neuroimage. 19:1273–1302.
Friston
KJ
Holmes
A
Worsley
KJ
Poline
JB
Frith
CD
Frackowiak
R
Statistical parametric mapping: a general linear approach
Hum Brain Mapp
 , 
1995
, vol. 
2
 (pg. 
189
-
210
)
Friston
KJ
Holmes
AP
Price
CJ
Buchel
C
Worsley
KJ
Multisubject fMRI studies and conjunction analyses
Neuroimage
 , 
1999
, vol. 
10
 (pg. 
385
-
396
)
Friston
KJ
Penny
WD
Glaser
DE
Conjunction revisited
Neuroimage
 , 
2005
, vol. 
25
 (pg. 
661
-
667
)
Friston
KJ
Price
CJ
Generative models, brain function and neuroimaging
Scand J Psychol
 , 
2001
, vol. 
42
 (pg. 
167
-
177
)
Fu
KM
Johnston
TA
Shah
AS
Arnold
L
Smiley
J
Hackett
TA
Garraghty
PE
Schroeder
CE
Auditory cortical neurons respond to somatosensory stimulation
J Neurosci
 , 
2003
, vol. 
23
 (pg. 
7510
-
7515
)
Fuster
JM
Bodner
M
Kroger
JK
Cross-modal and cross-temporal association in neurons of frontal cortex
Nature
 , 
2000
, vol. 
405
 (pg. 
347
-
351
)
Ghazanfar
AA
Maier
JX
Hoffman
KL
Logothetis
NK
Multisensory integration of dynamic faces and voices in rhesus monkey auditory cortex
J Neurosci
 , 
2005
, vol. 
25
 (pg. 
5004
-
5012
)
Ghazanfar
AA
Schroeder
CE
Is neocortex essentially multisensory?
Trends Cogn Sci
 , 
2006
, vol. 
10
 (pg. 
278
-
285
)
Gibson
JR
Maunsell
JH
Sensory modality specificity of neural activity related to memory in visual cortex
J Neurophysiol
 , 
1997
, vol. 
78
 (pg. 
1263
-
1275
)
Giraud
AL
Price
CJ
The constraints functional neuroimaging places on classical models of auditory word processing
J Cogn Neurosci
 , 
2001
, vol. 
13
 (pg. 
754
-
765
)
Gonzalo
D
Büchel
C
Crossmodal associative learning modulates fusiform face's areas response to sound
 , 
2003
Geneva
3rd Annual Meeting International Multi-Sensory Research Forum
Gottfried
JA
Dolan
RJ
The nose smells what the eye sees: crossmodal visual facilitation of human olfactory perception
Neuron
 , 
2003
, vol. 
39
 (pg. 
375
-
386
)
Gottfried
JA
Smith
AP
Rugg
MD
Dolan
RJ
Remembrance of odors past: human olfactory cortex in cross-modal recognition memory
Neuron
 , 
2004
, vol. 
42
 (pg. 
687
-
695
)
Grill-Spector
K
Henson
R
Martin
A
Repetition and the brain: neural models of stimulus-specific effects
Trends Cogn Sci
 , 
2006
, vol. 
10
 (pg. 
14
-
23
)
Hauk
O
Shtyrov
Y
Pulvermuller
F
The sound of actions as reflected by mismatch negativity: rapid activation of cortical sensory-motor networks by sounds associated with finger and tongue movements
Eur J Neurosci
 , 
2006
, vol. 
23
 (pg. 
811
-
821
)
Henson
RN
Neuroimaging studies of priming
Prog Neurobiol
 , 
2003
, vol. 
70
 (pg. 
53
-
81
)
Henson
RN
Rugg
MD
Neural response suppression, haemodynamic repetition effects, and behavioural priming
Neuropsychologia
 , 
2003
, vol. 
41
 (pg. 
263
-
270
)
Horwitz
B
The elusive concept of brain connectivity
Neuroimage
 , 
2003
, vol. 
19
 (pg. 
466
-
470
)
Humphreys
GW
Forde
EM
Hierarchies, similarity, and interactivity in object recognition: “category-specific” neuropsychological deficits
Behav Brain Sci
 , 
2001
, vol. 
24
 (pg. 
453
-
476
)
Ikeda
M
Patterson
K
Graham
KS
Ralph
MA
Hodges
JR
A horse of a different colour: do patients with semantic dementia recognise different versions of the same object as the same?
Neuropsychologia
 , 
2006
, vol. 
44
 (pg. 
566
-
575
)
Josephs
O
Deichmann
R
Turner
R
Trajectory measurement and generalized reconstruction in rectilinear EPI
ISMRM Meeting
 , 
2000
, vol. 
151
 pg. 
7
 
Kass
RE
Raftery
AE
Bayes factors
J Am Stat Assoc
 , 
1995
, vol. 
90
 (pg. 
773
-
795
)
Kayser
C
Petkov
CI
Augath
M
Logothetis
NK
Integration of touch and sound in auditory cortex
Neuron
 , 
2005
, vol. 
48
 (pg. 
373
-
384
)
Kerns
JG
Cohen
JD
MacDonald
AW
3rd
Cho
RY
Stenger
VA
Carter
CS
Anterior cingulate conflict monitoring and adjustments in control
Science
 , 
2004
, vol. 
303
 (pg. 
1023
-
1026
)
Laurienti
PJ
Kraft
RA
Maldjian
JA
Burdette
JH
Wallace
MT
Semantic congruence is a critical factor in multisensory behavioral performance
Exp Brain Res
 , 
2004
, vol. 
158
 (pg. 
405
-
414
)
Laurienti
PJ
Wallace
MT
Maldjian
JA
Susi
CM
Stein
BE
Burdette
JH
Cross-modal sensory processing in the anterior cingulate and medial prefrontal cortices
Hum Brain Mapp
 , 
2003
, vol. 
19
 (pg. 
213
-
223
)
Lehmann
S
Murray
MM
The role of multisensory memories in unisensory object discrimination
Brain Res Cogn Brain Res
 , 
2005
, vol. 
24
 (pg. 
326
-
334
)
Lewis
JW
Brefczynski
JA
Phinney
RE
Janik
JJ
DeYoe
EA
Distinct cortical pathways for processing tool versus animal sounds
J Neurosci
 , 
2005
, vol. 
25
 (pg. 
5148
-
5158
)
Lewis
JW
Wightman
FL
Brefczynski
JA
Phinney
RE
Binder
JR
DeYoe
EA
Human brain regions involved in recognizing environmental sounds
Cereb Cortex
 , 
2004
, vol. 
14
 (pg. 
1008
-
1021
)
Macaluso
E
George
N
Dolan
R
Spence
C
Driver
J
Spatial and temporal factors during processing of audiovisual speech: a PET study
Neuroimage
 , 
2004
, vol. 
21
 (pg. 
725
-
732
)
McIntosh
AR
Towards a network theory of cognition
Neural Netw
 , 
2000
, vol. 
13
 (pg. 
861
-
870
)
Mesulam
MM
Large-scale neurocognitive networks and distributed processing for attention, language, and memory
Ann Neurol
 , 
1990
, vol. 
28
 (pg. 
597
-
613
)
Molholm
S
Ritter
W
Javitt
DC
Foxe
JJ
Multisensory visual-auditory object recognition in humans: a high-density electrical mapping study
Cereb Cortex
 , 
2004
, vol. 
14
 (pg. 
452
-
465
)
Molholm
S
Ritter
W
Murray
MM
Javitt
DC
Schroeder
CE
Foxe
JJ
Multisensory auditory-visual interactions during early sensory processing in humans: a high-density electrical mapping study
Brain Res Cogn Brain Res
 , 
2002
, vol. 
14
 (pg. 
115
-
128
)
Moscoso del Prado
MF
Hauk
O
Pulvermuller
F
Category specificity in the processing of color-related and form-related words: an ERP study
Neuroimage
 , 
2006
, vol. 
29
 (pg. 
29
-
37
)
Mummery
CJ
Ashburner
J
Scott
SK
Wise
RJ
Functional neuroimaging of speech perception in six normal and two aphasic subjects
J Acoust Soc Am
 , 
1999
, vol. 
106
 (pg. 
449
-
457
)
Murray
MM
Camen
C
Gonzalez Andino
SL
Bovet
P
Clarke
S
Rapid brain discrimination of sounds of objects
J Neurosci
 , 
2006
, vol. 
26
 (pg. 
1293
-
1302
)
Murray
MM
Foxe
JJ
Wylie
GR
The brain uses single-trial ultisensory memories to discriminate without awareness
Neuroimage
 , 
2005
, vol. 
27
 (pg. 
473
-
478
)
Neely
JH
Semantic priming and retrieval from lexical memory: roles of inhibitionless spreading activation and limited-capacity attention
J Exp Psycho Gen
 , 
1977
, vol. 
106
 (pg. 
226
-
254
)
Noppeney
U
Price
C
Retrieval of abstract semantics
Neuroimage
 , 
2004
, vol. 
22
 (pg. 
164
-
170
)
Noppeney
U
Price
CJ
A PET study of stimulus- and task-induced semantic processing
Neuroimage
 , 
2002
, vol. 
15
 (pg. 
927
-
935
)
Noppeney
U
Price
CJ
Functional imaging of the semantic system: retrieval of sensory-experienced and verbally-learnt knowledge
Brain Lang
 , 
2003
, vol. 
84
 (pg. 
120
-
133
)
Noppeney
U
Price
CJ
Penny
WD
Friston
KJ
Two distinct neural mechanisms for category-selective responses
Cereb Cortex
 , 
2006
, vol. 
16
 (pg. 
437
-
445
)
Nyberg
L
Habib
R
McIntosh
AR
Tulving
E
Reactivation of encoding-related brain activity during memory retrieval
Proc Natl Acad Sci USA
 , 
2000
, vol. 
97
 (pg. 
11120
-
11124
)
Olson
IR
Gatenby
JC
Gore
JC
A comparison of bound and unbound audio-visual information processing in the human cerebral cortex
Brain Res Cogn Brain Res
 , 
2002
, vol. 
14
 (pg. 
129
-
138
)
Paus
T
Primate anterior cingulate cortex: where motor control, drive and cognition interface
Nat Rev Neurosci
 , 
2001
, vol. 
2
 (pg. 
417
-
424
)
Penny
WD
Stephan
KE
Mechelli
A
Friston
KJ
Comparing dynamic causal models
Neuroimage
 , 
2004
, vol. 
22
 (pg. 
1157
-
1172
)
Plaut
DC
McClelland
JL
Seidenberg
MS
Patterson
K
Understanding normal and impaired word reading: computational principles in quasi-regular domains
Psychol Rev
 , 
1996
, vol. 
103
 (pg. 
56
-
115
)
Potter
MC
Faulconer
BA
Time to understand pictures and words
Nature
 , 
1975
, vol. 
253
 (pg. 
437
-
438
)
Price
CJ
Winterburn
D
Giraud
AL
Moore
CJ
Noppeney
U
Cortical localisation of the visual and auditory word form areas: a reconsideration of the evidence
Brain Lang
 , 
2003
, vol. 
86
 (pg. 
272
-
286
)
Rahman
RA
van Turennout
M
Levelt
WJ
Phonological encoding is not contingent on semantic feature retrieval: an electrophysiological study on object naming
J Exp Psychol Learn Mem Cogn
 , 
2003
, vol. 
29
 (pg. 
850
-
860
)
Raij
T
Uutela
K
Hari
R
Audiovisual integration of letters in the human brain
Neuron
 , 
2000
, vol. 
28
 (pg. 
617
-
625
)
Rao
RP
Ballard
DH
Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects
Nat Neurosci
 , 
1999
, vol. 
2
 (pg. 
79
-
87
)
Rogers
TT
Lambon Ralph
MA
Garrard
P
Bozeat
S
McClelland
JL
Hodges
JR
Patterson
K
Structure and deterioration of semantic memory: a neuropsychological and computational investigation
Psychol Rev
 , 
2004
, vol. 
111
 (pg. 
205
-
235
)
Sabsevitz
DS
Medler
DA
Seidenberg
M
Binder
JR
Modulation of the semantic system by word imageability
Neuroimage
 , 
2005
, vol. 
27
 (pg. 
188
-
200
)
Saito
DN
Yoshimura
K
Kochiyama
T
Okada
T
Honda
M
Sadato
N
Cross-modal binding and activated attentional networks during audio-visual speech integration: a functional MRI study
Cereb Cortex
 , 
2005
, vol. 
15
 (pg. 
1750
-
1760
)
Schiller
NO
Bles
M
Jansma
BM
Tracking the time course of phonological encoding in speech production: an event-related brain potential study
Brain Res Cogn Brain Res
 , 
2003
, vol. 
17
 (pg. 
819
-
831
)
Schroeder
CE
Foxe
J
Multisensory contributions to low-level, ‘unisensory’ processing
Curr Opin Neurobiol
 , 
2005
, vol. 
15
 (pg. 
454
-
458
)
Schroeder
CE
Foxe
JJ
The timing and laminar profile of converging inputs to multisensory areas of the macaque neocortex
Brain Res Cogn Brain Res
 , 
2002
, vol. 
14
 (pg. 
187
-
198
)
Scott
SK
Blank
CC
Rosen
S
Wise
RJ
Identification of a pathway for intelligible speech in the left temporal lobe
Brain
 , 
2000
, vol. 
123
 (pg. 
2400
-
2406
Pt 12
Scott
SK
Johnsrude
IS
The neuroanatomical and functional organization of speech perception
Trends Neurosci
 , 
2003
, vol. 
26
 (pg. 
100
-
107
)
Stein
BE
Meredith
MA
Merging of the senses
1993
Cambridge (MA)
MIT Press
Talairach
J
Tournoux
P
Co-planar stereotaxic atlas of the human brain
1988
Stuttgart (Germany)
Thieme
Tanabe
HC
Honda
M
Sadato
N
Functionally segregated neural substrates for arbitrary audiovisual paired-association learning
J Neurosci
 , 
2005
, vol. 
25
 (pg. 
6409
-
6418
)
Taylor
KI
Moss
HE
Stamatakis
EA
Tyler
LK
Binding crossmodal object features in perirhinal cortex
Proc Natl Acad Sci USA
 , 
2006
, vol. 
103
 (pg. 
8239
-
8244
)
van Atteveldt
N
Formisano
E
Goebel
R
Blomert
L
Integration of letters and speech sounds in the human brain
Neuron
 , 
2004
, vol. 
43
 (pg. 
271
-
282
)
van Veen
V
Carter
CS
Separating semantic conflict and response conflict in the Stroop task: a functional MRI study
Neuroimage
 , 
2005
, vol. 
27
 (pg. 
497
-
504
)
Vandenberghe
R
Price
C
Wise
R
Josephs
O
Frackowiak
RS
Functional anatomy of a common semantic system for words and pictures [see comments]
Nature
 , 
1996
, vol. 
383
 (pg. 
254
-
256
)
Wright
TM
Pelphrey
KA
Allison
T
McKeown
MJ
McCarthy
G
Polysensory interactions along lateral temporal regions evoked by audiovisual speech
Cereb Cortex
 , 
2003
, vol. 
13
 (pg. 
1034
-
1043
)