Abstract

In this study we investigate previous claims that a region in the left posterior superior temporal sulcus (pSTS) is more activated by audiovisual than unimodal processing. First, we compare audiovisual to visual–visual and auditory–auditory conceptual matching using auditory or visual object names that are paired with pictures of objects or their environmental sounds. Second, we compare congruent and incongruent audiovisual trials when presentation is simultaneous or sequential. Third, we compare audiovisual stimuli that are either verbal (auditory and visual words) or nonverbal (pictures of objects and their associated sounds). The results demonstrate that, when task, attention, and stimuli are controlled, pSTS activation for audiovisual conceptual matching is 1) identical to that observed for intramodal conceptual matching, 2) greater for incongruent than congruent trials when auditory and visual stimuli are simultaneously presented, and 3) identical for verbal and nonverbal stimuli. These results are not consistent with previous claims that pSTS activation reflects the active formation of an integrated audiovisual representation. After a discussion of the stimulus and task factors that modulate activation, we conclude that, when stimulus input, task, and attention are controlled, pSTS is part of a distributed set of regions involved in conceptual matching, irrespective of whether the stimuli are audiovisual, auditory–auditory or visual–visual.

Introduction

There has been growing interest in how the brain integrates information from different sensory modalities into a unified concept. A number of studies have associated a region of the posterior superior temporal sulcus (pSTS) with crossmodal binding of auditory and visual information (Calvert et al. 2000; Wright et al. 2003; Beauchamp, Argall, et al. 2004; Beauchamp, Lee, et al. 2004; van Atteveldt et al. 2004; Kreifelts et al. 2007; van Atteveldt, Formisano, Blomert, et al. 2007; van Atteveldt, Formisano, Goebel, et al. 2007). The focus on this neural region has stemmed primarily from early neuroanatomical and electrophysiological data in nonhuman primates that demonstrated the convergence of afferents from different senses within the superior temporal polysensory region—the primate homolog of human pSTS (e.g., Seltzer and Pandya 1978; Leinonen et al. 1980; Desimone and Ungerleider 1986).

Here we consider evidence that a specific region in the human pSTS actively integrates auditory and visual inputs into an amodal representation. According to this hypothesis, human pSTS is a crossmodal “binding site” (Calvert et al. 2000; Calvert 2001; Sekiyama et al. 2003; Beauchamp, Argall, et al. 2004; Beauchamp, Lee, et al. 2004; van Atteveldt et al. 2004; van Atteveldt, Formisano, Blomert, et al. 2007; van Atteveldt, Formisano, Goebel, et al. 2007). The alternative hypothesis is that pSTS activation reflects amodal processing that is independent of the sensory input modality (e.g., auditory, visual or both), for example, learned conceptual or speech production processes that are subsequent to the processing stage where bottom-up audiovisual inputs are integrated (van Wassenhove et al. 2005; Skipper et al. 2007). Figure 1 illustrates the anatomical location of our pSTS region of interest (ROI) relative to more anterior STS (aSTS) sites associated with audiovisual processing and more ventral posterior middle temporal areas associated with action and tool processing. The anatomical co-ordinates of these regions are listed in Table 1.

Table 1

Anatomical co-ordinates of temporal regions

 Study Visual stimuli Auditory stimuli Co-ordinates 
(a) Audiovisual effects in pSTS 
    Audiovisual > unimodal 
        Calvert et al. 2000 Faces Speech streams −49 −50 9 
        Wright et al. 2003 Faces Speech (single words) y = −40 to −55 
        Beauchamp et al. 2004b Pictures Object sounds −50 −55 7 
        van Atteveldt et al. 2004 Letters Speech sounds −54 −48 9 
        van Atteveldt et al. 2007a Letters Speech sounds −54 −43 13 
        Kreifelts et al. 2007 Faces Speech (single words) −54 −51 18 
        Saito et al. 2005 Faces Speech sounds Not significant 
        Taylor et al. 2006 Pictures Sounds and words Not significant (−46, −76, 22) 
    Difficult > easy AV 
        Sekiyama et al. 2003 Faces Speech (sounds) −56 −49 9/−43 −55 17 
    Congruent > not 
        Calvert et al. 2000 Faces Speech streams −49 −50 9 
        Ojanen et al. 2005 Faces Speech sounds Not significant 
        van Atteveldt et al. 2007a Letters Speech sounds Not significant 
        van Atteveldt et al. 2007b Letters Speech sounds Not significant 
        Hein et al. 2007 Pictures Object sounds Not significant 
        Taylor et al. 2006 Pictures Sounds and words Not significant 
    Synchronous > not 
        Olson et al. 2002 Faces Speech (words) Not significant 
        Macaluso et al. 2004 Faces Speech (words) Not significant (−64 −58 0) 
        Miller and D'Esposito 2005 Faces Speech sounds Not significant 
        van Atteveldt et al. 2007a Letters Speech sounds Not significant 
(b) Audiovisual effects in aSTS 
    Audiovisual > unimodal 
        Calvert et al. 1999 Faces Numbers (1–10) −46 −25 13/57 −22 13 
        van Atteveldt et al. 2004 Letters Speech sounds −46 −19 2 
    Congruent > not 
        van Atteveldt et al. 2007a Letters Speech sounds −52 −31 15/60 −20 16 
        van Atteveldt et al. 2007b Letters Speech (sounds) −47 −20 7/−59 −33 12 
    Words > sounds 
Noppeney et al. 2007 Words/pictures Speech/sounds −66 −27 −3 
    Recognition > location 
        Sestieri et al. 2006 Pictures Object sounds −58 −18 −3 
(c) Semantic and sentence effects in temporal regions 
    Sentences 
        Narain et al. 2003  Auditory sentences −52 −54 14 
        Vandenberghe et al. 2002 Written sentences  −52 −54 12 
    Tools 
        Phillips et al. 2002 Pictures and words  −58 −64 4 
        Noppeney et al. 2007 Words and pictures Auditory words −51 −66 −6 
        Lewis et al. 2005  Environmental sounds −51 −57 3/49 −51 5 
    Actions 
        Phillips et al. 2002 Pictures and words  −58 −60 4 
        Kellenbach et al. 2003 Pictures only  −48 −62 0 
        Emmorey et al. 2004 Body gestures  −49 −59 0 
        Noppeney et al. 2005 Written words Auditory words −57 −63 6 

The methodological principles used to identify audiovisual binding areas have been derived in the main from studies of subcortical structures, in particular the superior colliculus (Stein and Meredith 1993; Stein et al. 1993). These principles include sensitivity to temporal and spatial correspondence, response enhancement and depression, and the rule of inverse effectiveness, whereby responses to crossmodal inputs are maximal when responses to the individual stimuli are minimally effective. Although initial functional imaging studies of humans followed the principle that the audiovisual response should be superadditive compared with the unimodal response (e.g., Calvert et al. 2000), the relevance of this rule to functional magnetic resonance imaging (fMRI) data has been questioned by subsequent investigators (e.g., Beauchamp, Argall, et al. 2004; Beauchamp, Lee, et al. 2004; Beauchamp 2005) who have identified audiovisual integration areas on the basis of an enhanced response to bimodal audiovisual stimuli relative to either auditory or visual stimuli alone.
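To make the two statistical criteria concrete, the sketch below contrasts the superadditive test (AV > A + V) with the bimodal-enhancement test (AV greater than the most effective unimodal condition). The per-trial effect sizes are invented for illustration and come from no cited study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-trial effect sizes for one voxel (arbitrary units).
n = 40
A = rng.normal(1.0, 0.5, n)    # auditory-only trials
V = rng.normal(1.2, 0.5, n)    # visual-only trials
AV = rng.normal(1.9, 0.5, n)   # bimodal audiovisual trials

# Superadditive criterion (Calvert et al. 2000): AV > A + V.
t_super, p_super = stats.ttest_ind(AV, A + V)

# Enhancement criterion (Beauchamp and colleagues): AV > max(A, V).
best_unimodal = V if V.mean() > A.mean() else A
t_enh, p_enh = stats.ttest_ind(AV, best_unimodal)

# Halving the two-tailed p assumes the effect lies in the predicted direction.
print(f"superadditive: t = {t_super:.2f}, one-tailed p = {p_super / 2:.3f}")
print(f"enhancement:   t = {t_enh:.2f}, one-tailed p = {p_enh / 2:.3f}")
```

With these illustrative numbers, the voxel comfortably passes the enhancement test while failing superadditivity, which is why the two rules can select different sets of regions.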

In this paper, we investigate the role of the pSTS by examining its response characteristics. If the pSTS actively binds auditory and visual inputs then we would expect the response to audiovisual inputs to be greater than the response to unimodal inputs (Calvert et al. 2000; Beauchamp, Lee, et al. 2004; van Atteveldt et al. 2004; Beauchamp 2005). In contrast, if pSTS activation reflects amodal processing that is independent of the stimulus modality, then we would expect the response to depend on the task demands but not the stimulus modality. Comparing pSTS activation to different types of verbal and nonverbal stimuli and tasks also helps to identify the level of audiovisual convergence (e.g., Sestieri et al. 2006; Noppeney et al. 2007). Early perceptual integration has been investigated with stimuli such as tones and circles that have no a priori relationship (e.g., Giard and Peronnet 1999; Bushara et al. 2001; Degerman et al. 2007). This differs from phonetic or conceptual stimuli that involve top-down processing from prior knowledge (van Wassenhove et al. 2005; Skipper et al. 2007). In studies of continuous and meaningful speech (e.g., Calvert et al. 2000), audiovisual information converges at both a phonetic and a conceptual level. Phonetic without conceptual convergence can be studied using temporally brief speech sounds (e.g., “ta” or “ba”) that are heard while viewing mouths articulating the same or different sounds (Saito et al. 2003; Sekiyama et al. 2003; Miller and D'Esposito 2005; Ojanen et al. 2005; Skipper et al. 2007) or written letters presented with their auditory speech sounds (van Atteveldt et al. 2004; van Atteveldt, Formisano, Blomert, et al. 2007; van Atteveldt, Formisano, Goebel, et al. 2007). In contrast, conceptual without phonetic convergence can be studied using pictures or objects and their associated auditory sounds (Beauchamp, Lee, et al. 2004; Sestieri et al. 2006; Taylor et al. 2006; Hein et al. 2007).

The initial claims for pSTS as an audiovisual binding site came from Calvert et al. (2000), who contrasted audiovisual speech to each modality in isolation (i.e., heard words or silent lip-reading). This revealed a superadditive response in left pSTS when the audiovisual input was congruent but a subadditive response when the audiovisual input was incongruent. Subsequently, pSTS activation has been associated with bimodal audiovisual stimuli in studies using single words (Wright et al. 2003; Kreifelts et al. 2007), phonetic stimuli (Sekiyama et al. 2003; van Atteveldt et al. 2004; van Atteveldt, Formisano, Goebel, et al. 2007), and conceptual stimuli (Beauchamp, Lee, et al. 2004). However, there are 2 points of inconsistency. The first is that other studies (e.g., Saito et al. 2003; Taylor et al. 2006) did not report enhanced activation in our pSTS ROI for bimodal audiovisual relative to unimodal stimuli. The second is that the super- and subadditive effects in the pSTS for congruent and incongruent bimodal stimuli (Calvert et al. 2000) have not been replicated (see Table 1a for a summary).

Below we reconsider the evidence that has been used to associate human pSTS with audiovisual binding. This evidence includes enhanced responses for 1) audiovisual relative to auditory or visual inputs alone and 2) congruent relative to incongruent audiovisual stimuli. On the basis of this review, we argue that the current data do not allow us to reject the alternative hypothesis: that is, that pSTS activation is not selective for audiovisual processing but instead reflects amodal processing that is independent of stimulus modality. In this context, we discuss a number of criteria that need to be controlled in fMRI studies of audiovisual integration. We then describe 2 new experiments that investigate the role of pSTS in audiovisual processing after controlling for stimulus input, attention, and task.

Bimodal versus Unimodal Stimuli

The association of pSTS with audiovisual binding has been supported by observations that the response to bimodal audiovisual stimuli is greater than the response to the unimodal parts. The problem here is that if the bimodal response is not superadditive, then there is no clear way to distinguish the binding process from subsequent (downstream) amodal processing. This is because both explanations predict higher activation when there are 2 stimuli (i.e., bimodal) than when there is only 1 stimulus (i.e., unimodal). This was demonstrated in several early functional imaging experiments that observed linear and nonlinear activation increases in perceptual and semantic processing regions when the stimulus input rate increased (Fox 1989; Price et al. 1992, 1996; Binder et al. 1994). The interpretation of enhanced or superadditive activation for bimodal relative to unimodal stimuli therefore depends on whether the total stimulus input has been controlled on a trial-by-trial basis, for example by comparing bimodal audiovisual stimuli to trials that present 2 unimodal visual stimuli or 2 unimodal auditory stimuli.
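To illustrate why the number of stimuli per trial matters, the following minimal simulation uses the standard linear convolution model of the BOLD signal; the double-gamma parameters are commonly used SPM-style defaults assumed for illustration, not values taken from this study. It shows that a trial carrying 2 stimuli predicts a larger response than a trial carrying 1, without any binding process.

```python
import numpy as np
from scipy.stats import gamma

dt = 0.1                        # simulation resolution in seconds
t = np.arange(0.0, 30.0, dt)

def hrf(t):
    # Canonical double-gamma hemodynamic response (assumed SPM-style defaults).
    h = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return h / h.max()

def predicted_bold(n_stimuli, duration=1.0):
    # Under pure linearity, n simultaneous stimuli act as a boxcar of height n.
    boxcar = np.zeros_like(t)
    boxcar[t < duration] = float(n_stimuli)
    return np.convolve(boxcar, hrf(t))[: len(t)] * dt

unimodal = predicted_bold(1)    # trial with a single (unimodal) stimulus
bimodal = predicted_bold(2)     # trial with 2 stimuli (auditory + visual)

print(f"peak response, 1 stimulus: {unimodal.max():.3f}")
print(f"peak response, 2 stimuli : {bimodal.max():.3f}")   # twice as large
```

Under this simplified model the 2-stimulus trial evokes twice the response of the 1-stimulus trial, so enhanced bimodal activation on its own cannot distinguish binding from a mere increase in total stimulation.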

In our review of the literature on audiovisual phonetic and conceptual processing (see Table 1), the only studies that compared bimodal audiovisual stimuli to dual presentation of 2 auditory or 2 visual stimuli were also the only ones that did not observe enhanced activation in our pSTS ROI (Saito et al. 2003; Taylor et al. 2006). Instead, Saito et al. (2003) discuss the importance of parietal regions, and Taylor et al. (2006) discuss the importance of the perirhinal cortex and an occipito-temporal area that lies 2.5 cm posterior to our ROI (see Table 1a). Put another way, none of the experiments reporting pSTS activation for audiovisual stimuli controlled for the total stimulus input per trial. Instead, they compared bimodal stimuli with trials that presented only a single unimodal stimulus.

In addition to stimulus confounds, some of the studies reporting pSTS activation for audiovisual stimuli were also confounded by task differences. For example, Beauchamp, Lee, et al. (2004; Experiment 2) compared a same/different matching task on audiovisual stimuli with a semantic decision task on the unimodal stimuli (e.g., 4 or 2 legs? with true/false response). Increased activation in left pSTS during audiovisual matching may therefore have been driven by task as well as stimulus rate differences.

The Effect of Congruency and Synchrony

The effect of congruency on audiovisual processing in the pSTS was highlighted by Calvert et al. (2000) who observed a superadditive effect for congruent audiovisual stimuli (AV > A + V) and a subadditive effect of incongruent audiovisual stimuli (AV < A + V). An effect of audiovisual congruency has also been reported by van Atteveldt, Formisano, Blomert, et al. (2007) and van Atteveldt, Formisano, Goebel, et al. (2007) but their effects were located in aSTS, not pSTS (see Table 1b). In fact, we were unable to find any studies that have replicated the Calvert et al. (2000) study showing enhanced pSTS activation for congruent relative to incongruent bimodal stimuli. Nor did we find any study that observed enhanced pSTS activation for synchronous versus asynchronous speech. For example, in Macaluso et al. (2004), synchronous speech, referring to animals or tools, increased activation in the left middle temporal area associated with tool processing (see Table 1 and Fig. 1).

We suggest 3 possible reasons to explain why the effect of congruency in pSTS as reported by Calvert et al. (2000) has not been replicated. The first is that the stimuli used by Calvert et al. (2000) were speech streams (stories) whereas the other studies comparing congruent and incongruent stimuli used temporally brief stimuli (e.g., single speech sounds, words or environmental sounds) that did not offer a sufficiently long time frame to enhance audiovisual integration (see Calvert and Lewis 2004). The problem with this explanation is that it does not explain why pSTS has been associated with audiovisual binding in studies using temporally brief sounds and words (Wright et al. 2003; Beauchamp, Lee, et al. 2004; van Atteveldt et al. 2004; Kreifelts et al. 2007; van Atteveldt, Formisano, Blomert, et al. 2007).

The second (but not mutually exclusive) explanation is that the congruency effects reported in Calvert et al. (2000) arose from stimulus interference effects. For example, when synchronous audiovisual speech streams are congruent, speech comprehension benefits from both auditory and visual processing. In contrast, when audiovisual inputs are incongruent, speech comprehension is impaired because the visual information conflicts with the auditory information (see Calvert and Lewis 2004). Reduced pSTS activation for incongruent speech streams may therefore reflect reduced comprehension. Although this explanation is consistent with studies showing enhanced pSTS activation for written and auditory sentence comprehension (Vandenberghe et al. 2002; Narain et al. 2003; see Table 1c), it does not explain why pSTS is activated by nonsemantic speech stimuli that have no meaning or syntax (Sekiyama et al. 2003).

The third explanation for inconsistent congruency effects arises from previous observations that attention to 1 modality only during bimodal presentation elicits subadditive effects (Talsma and Woldorff 2005; Talsma et al. 2007). It is therefore possible that, to minimize interference during incongruent audiovisual speech streams, subjects may automatically or attentionally reduce visual processing (Deneve and Pouget 2004; Ernst and Bulthoff 2004), particularly in the study by Calvert et al. (2000) where congruent and incongruent conditions were presented in separate experiments with no instructions to attend to the visual stimuli. This would explain the absence of congruency effects in studies that presented brief stimuli or forced subjects to attend to the visual input during incongruent audiovisual conditions.

Summary of Key Points that Motivate our Study

To summarize, several studies have highlighted pSTS as an area that is important for the binding of audiovisual conceptual and phonetic inputs even when temporally brief phonetic or conceptual stimuli are used. However, none of these studies equated the number of stimuli per trial or observed an effect of temporal synchrony on the audiovisual stimuli. Moreover, only 1 study observed super- and subadditive effects of audiovisual congruency and this was during a passive task that was susceptible to audiovisual interference effects in the incongruent condition. The role of this pSTS region therefore requires further investigation.

In the present study we investigate the role of pSTS during audiovisual conceptual processing, while keeping the task and number of stimuli constant. In 2 complementary experiments, the task required same–different object decisions on pairs of stimuli that combined pictures of objects, their associated sounds, their written names or their auditory names. In Experiment 1, we compared bimodal audiovisual stimuli with 2 unimodal auditory or 2 unimodal visual stimuli. The inclusion of pairs of unimodal stimuli in this experiment meant that the individual stimuli within a pair had to be presented sequentially (one after the other). Simultaneous presentation was not used because when 2 auditory stimuli are presented at the same time, they interfere with one another at both perceptual and attentional levels (Jancke and Shah 2002; Lipschutz et al. 2002). To investigate the effect of temporal presentation (simultaneous vs. sequential) and its interaction with congruency, Experiment 2 was conducted using the same bimodal stimuli as Experiment 1 but with simultaneous rather than sequential presentation. Comparing the effects in Experiments 1 and 2 thus allowed us to look for any differences in pSTS activation for audiovisual pairs that were congruent or incongruent when presented simultaneously or sequentially. In addition, Experiment 2 included audiovisual stimuli that were both verbal (written and auditory object names) and nonverbal (pictures of objects and their associated sounds). This allowed us to investigate whether pSTS activation was differentially sensitive to whether the stimulus matching depended on the conceptual or phonetic content of the stimuli.

Materials and Methods

Subjects

There were 18 subjects in Experiment 1a (11 women, mean age 26), 8 subjects in Experiment 1b (4 women, mean age 30) and 18 subjects in Experiment 2 (12 women, mean age 26). All were right-handed native English speakers with normal or corrected-to-normal vision. None had neurological or audiological problems. The study was approved by the joint ethics committee of the Institute of Neurology and University College London Hospital, London, UK.

Experimental Design and Stimuli

There were 2 experiments. Experiment 1 compared bimodal with unimodal pairs of stimuli that were sequentially presented. Experiment 2 compared verbal and nonverbal bimodal stimulus pairs that were simultaneously presented. Across experiments, task and stimuli were held constant. The task required a left or right hand key pad response to indicate whether 2 stimuli referred to the same object or not. This task was chosen to ensure attention to all stimuli and access to abstract internal object representations built on prior knowledge.

There were 4 types of stimuli in both experiments: color photographs of objects, their written names, their auditory names and their associated environmental sounds (i.e., 2 visual and 2 auditory; 2 verbal and 2 nonverbal). All stimuli referred to the same set of 108 items: 36 animals, 36 objects, and 36 musical instruments. A complete list of stimuli, with examples, is provided in Supplementary Materials 1. Photographs were obtained from the Hemara Photo Objects CD collection; environmental sounds were downloaded from the Internet, with the majority obtained from the website www.sounddogs.com. Spoken words were recorded by a female English speaker in a soundproof room. Visual stimuli subtended a viewing angle of 2.0°–7.0° (width) × 1.7° (maximum height) and were presented using a rear projector viewed via a mirror mounted on the head coil. All sounds were presented in mono via MRI-compatible electrostatic headphones (sampled at 44.1 kHz, 32 bit) and normalized and low-pass filtered using a fourth-order Butterworth filter with a 5000-Hz cut-off. All stimuli were 1000 ms in duration, except for spoken words, which ranged from 650 to 1000 ms. Examples of the auditory stimuli are provided in Supplementary Materials. See Figure 2 for a schematic overview of the experiments.
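The auditory preparation described above can be approximated as in the following sketch. This is not the authors' actual script: the file name is hypothetical, and the RMS-equalization step is one plausible reading of "normalized", which the text does not spell out.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

# Hypothetical file name; the actual stimuli are listed in Supplementary Materials 1.
fs, sound = wavfile.read("dog_bark.wav")       # expecting fs = 44100 Hz, mono
sound = sound.astype(np.float64)

# Fourth-order low-pass Butterworth filter with a 5000-Hz cut-off, applied
# forwards and backwards (filtfilt) so that no phase distortion is introduced.
b, a = butter(4, 5000, btype="low", fs=fs)
filtered = filtfilt(b, a, sound)

# Equalize loudness across the stimulus set by scaling to a common RMS level
# (an assumption: the paper says "normalized" without giving the procedure).
target_rms = 0.1
filtered *= target_rms / np.sqrt(np.mean(filtered ** 2))
```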

Figure 2.

Stimulus trials for Experiments 1a, 1b, and 2. In Experiments 1a and 1b, each trial consisted of 2 simultaneously presented audio and visual stimuli, followed immediately by 2 further audio and visual stimuli, with a fixation cross presented between trials. In Experiments 1a and 2, a key-press response was made as soon as subjects could determine whether the stimuli referred to the same concept or not. In Experiment 1b, subjects alternated left and right key-press responses. All experiments had 6 trials per block. Note that stimuli were presented in their appropriate colors (not shown in this grayscale figure).


Experiment 1a: Bimodal Relative to Unimodal Object Matching

The focus of this experiment was the comparison of bimodal audiovisual stimulus pairs to unimodal visual–visual and audio–audio stimulus pairs. To avoid differences in divided attention within and across modality, presentation of stimuli within a pair was sequential with no interstimulus interval (i.e., the onset of the second stimulus corresponded to the offset of the first stimulus). In total there were 8 different stimulus conditions that presented the same stimuli in different combinations:

Each crossmodal trial had either

  • 1) One photograph followed by one spoken name,

  • 2) One spoken name followed by one photograph,

  • 3) One written name followed by one environmental sound,

  • 4) One environmental sound followed by one written name.

Each intramodal trial had either

  • 5) One photograph followed by one written name,

  • 6) One written name followed by one photograph,

  • 7) One spoken object name followed by one environmental sound,

  • 8) One environmental sound followed by spoken object name.

The 8 stimulus conditions were blocked with 6 trials per block. Three trials were congruent (both referred to the same object, requiring a "Yes—they match" response) and 3 were incongruent (referring to 2 different objects and requiring a "No—they do not match" response). Congruent and incongruent trials were randomized within block. Trial duration was 3.24 s (1 s for each sequential stimulus followed by 1.24-s fixation to allow for the response). This resulted in a total block time of 19.44 s (6 trials × 3.24 s). Blocks were followed by a period of fixation which alternated between 2.7 and 13.5 s.
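As a quick worked check of the timing described above (a sketch; the 1.24-s fixation is the value implied by the stated 3.24-s trial and 19.44-s block durations):

```python
# Verify that the Experiment 1a timing parameters are internally consistent.
stimulus_s = 1.0           # each of the 2 sequential stimuli
fixation_s = 1.24          # fixation period implied by the stated trial length
trials_per_block = 6

trial_s = 2 * stimulus_s + fixation_s   # 3.24 s per trial
block_s = trials_per_block * trial_s    # 19.44 s per block

assert abs(trial_s - 3.24) < 1e-9
assert abs(block_s - 19.44) < 1e-9
```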

Over the experiment, each subject was presented with 9 blocks of each of the 8 stimulus combinations. This resulted in a total of 72 blocks, which we split into 4 different scanning sessions (18 blocks per session). Within each session, there were 6 blocks of crossmodal audiovisual matching, 6 blocks of intramodal visual matching, and 6 blocks of intramodal auditory matching. Over sessions, the order of conditions was counterbalanced within and between subjects. Stimuli within a block were always from the same object category, and the 3 object categories were fully counterbalanced across conditions, sessions, and subjects. However, as predicted on the basis of Figure 1, there were no differential effects of category in our pSTS ROI; therefore, in the analyses described below we sum over the effect of category.

The use of unimodal and sequential stimuli is subject to gross differences in visual and auditory attention across conditions. To reduce these differences, each object stimulus (i.e., a photograph, word or sound referring to an object) was presented with a meaningless stimulus in the opposite modality. Thus, each environmental sound was presented with a scrambled photograph (created using the “scatter pixel” function in Corel Photo-paint v.11, Corel Corporation, Ottawa, Canada) that removed all recognizable structure. Each spoken word was presented with a row of XXXs (matched to the number of letters in the corresponding written word). Each photograph was simultaneously presented with a scrambled environmental sound and each written name was simultaneously presented with a scrambled spoken word. The scrambled auditory stimuli were created by transforming the environmental sounds and spoken words with a Fast Fourier Transform to scramble their frequency content, as sketched below. This resulted in meaningless auditory stimuli, which sounded like white noise with no phonetic content. Examples are provided in Supplementary Materials.
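The text states only that a Fast Fourier Transform was used to scramble the frequency content of the sounds. The sketch below shows one plausible implementation (randomly permuting the FFT bins and inverting), not necessarily the authors' exact procedure.

```python
import numpy as np

def scramble_sound(sound, seed=0):
    """Scramble a mono sound by shuffling its frequency bins.

    One plausible reading of the method described above: take the FFT,
    randomly permute all non-DC frequency bins, and invert. Total energy
    is preserved, but the result sounds like structureless noise.
    """
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(np.asarray(sound, dtype=float))
    order = rng.permutation(len(spectrum) - 1) + 1   # leave the DC bin in place
    spectrum[1:] = spectrum[order]
    return np.fft.irfft(spectrum, n=len(sound))
```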

Experiment 1b: Baseline Task for Bimodal Matching

To further reduce the impact of nonconceptual sensorimotor processing, we created an additional baseline condition. Because this condition involved only the meaningless audiovisual stimuli from Experiment 1a, it was not possible to use the object-matching task used in Experiments 1a and 2. Instead, we instructed subjects to make an alternating key-press response (right–left) at the end of the second stimulus. This subsidiary experiment was conducted on a different day from Experiment 1a.

The pairings of the 4 different meaningless stimulus types resulted in 4 different conditions:

  • 1) Scrambled photograph followed by scrambled spoken word,

  • 2) Scrambled spoken word followed by scrambled photograph,

  • 3) Row of XXXs followed by scrambled environmental sound,

  • 4) Scrambled environmental sound followed by row of XXXs.

The timing of stimulus presentation and fixation was identical to Experiment 1a. With half the number of conditions (4 in Experiment 1b vs. 8 in Experiment 1a), each subject needed to participate in only 2 scanning sessions (as opposed to 4 in Experiment 1a). The number of trials per condition was therefore held constant across experiments.

Experiment 2: Simultaneous Presentation of Bimodal Objects

Subjects were presented bimodally with 2 simultaneously presented stimuli, 1 in the visual modality and 1 in the auditory modality. This resulted in 4 different crossmodal conditions:

  • 1) One photograph with one spoken name,

  • 2) One written name with one environmental sound,

  • 3) One written name with one spoken name (i.e., verbal only stimulus),

  • 4) One photograph with one environmental sound (i.e., nonverbal only stimulus).

As in Experiment 1, the stimulus conditions were blocked with 3 congruent and 3 incongruent trials per block. Trial length was 2.7 s (1-s stimulus duration followed by 1.7-s fixation) and block length was 16.2 s (6 trials × 2.7 s). Fixation after each block alternated between 1.62 and 16.2 s, and there were a total of 24 blocks in each of 4 different scanning sessions.

Data Acquisition

All data were acquired on a Siemens 1.5-Tesla scanner (Siemens, Erlangen, Germany). Functional images used a T2*-weighted echo-planar sequence for blood oxygen level–dependent contrast with 3 × 3 mm in-plane resolution, 2-mm slice thickness and a 1-mm slice interval. Thirty slices were collected in Experiments 1a and 1b (resulting in an effective repetition time [TR] of 2.7 s/volume) and 36 slices in Experiment 2 (TR = 3.24 s). After the functional sessions, a T1-weighted anatomical volume image was acquired from all subjects to ensure normal neurological status.

Data Analysis

Functional data were analyzed with statistical parametric mapping (SPM2, Wellcome Department of Imaging Neuroscience, London, UK) implemented in Matlab 7.1 (Mathworks, Sherborne, MA). Preprocessing included realignment and unwarping using the first volume as the reference scan (after excluding the first 4 dummy scans to allow for T1 equilibration effects), spatial normalization to a standard Montreal Neurological Institute (MNI) template (Friston et al. 1995), and spatial smoothing using a 6-mm full width at half maximum isotropic Gaussian kernel. Two subjects were removed from the analysis (one from Experiment 1a, one from Experiment 2) due to excess head movement.

First-level statistical analyses (single subject, fixed effects) modeled each trial type independently by convolving the onset times with the hemodynamic response function. The data were high-pass filtered using a set of discrete cosine basis functions with a cut-off period of 128 s. Parameter estimates were calculated for all voxels using the general linear model, by computing a contrast image for each condition relative to fixation and for congruent relative to incongruent crossmodal trials in Experiments 1a and 2. The parameter estimates were then fed into 3 different second-level analyses.
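To make the first-level model concrete, here is a simplified NumPy analogue of the steps named above (HRF convolution, a 128-s discrete-cosine high-pass set, least-squares parameter estimates, condition contrasts). This is a sketch, not SPM2: the double-gamma HRF parameters, scan count, and onset times are illustrative placeholders.

```python
import numpy as np
from scipy.stats import gamma

TR, n_scans = 2.7, 200                        # Experiment 1a TR; scan count assumed
frame_times = np.arange(n_scans) * TR

def hrf(t):
    # Canonical double-gamma hemodynamic response function (assumed defaults).
    h = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return h / h.max()

def regressor(onsets, duration=3.24):
    # Boxcar per trial, convolved with the HRF and resampled at the TR.
    dt = 0.1
    hires_t = np.arange(0.0, n_scans * TR, dt)
    boxcar = np.zeros_like(hires_t)
    for onset in onsets:
        boxcar[(hires_t >= onset) & (hires_t < onset + duration)] = 1.0
    conv = np.convolve(boxcar, hrf(hires_t))[: len(hires_t)] * dt
    return np.interp(frame_times, hires_t, conv)

def dct_highpass(n, TR, cutoff=128.0):
    # Discrete-cosine drift regressors for periods slower than the cut-off.
    k = int(np.floor(2 * n * TR / cutoff))
    scans = np.arange(n)
    return np.column_stack(
        [np.cos(np.pi * (scans + 0.5) * j / n) for j in range(1, k + 1)])

# Design matrix: one column per trial type, plus drift terms and a constant.
# The onset times below are placeholders, not the actual experimental timings.
X = np.column_stack([
    regressor([10.0, 60.0, 110.0]),           # e.g., congruent crossmodal trials
    regressor([35.0, 85.0, 135.0]),           # e.g., incongruent crossmodal trials
    dct_highpass(n_scans, TR),
    np.ones(n_scans),
])

y = np.random.default_rng(1).normal(size=n_scans)  # stand-in voxel time course
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
contrast = beta[1] - beta[0]                       # incongruent > congruent
```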

ANOVA 1: Crossmodal versus Intramodal

The first analysis investigated the effect of crossmodal versus intramodal matching and its interaction with congruency. This ANOVA included 17 conditions: 16 parameter estimates from each subject in Experiment 1a (8 stimulus conditions × congruent/incongruent) and 1 from each subject in Experiment 1b. This allowed us to test for the main effects of presentation modality (crossmodal vs. intramodal) and congruency (congruent vs. incongruent), the interactions between these variables, and the interactions between these variables and the order of presentation (word–picture vs. picture–word). In addition, we identified effects that were common to all 16 trial types in Experiment 1a; when computing this contrast, we excluded any voxels that were also activated in Experiment 1b (exclusive mask thresholded at P < 0.5) to remove activation related to meaningless sensorimotor processing.

ANOVA 2: The Effect of Congruency during Simultaneous and Sequential Crossmodal Matching

This analysis was based on the effect of congruent versus incongruent trials in 1) the 2 crossmodal conditions in Experiment 1a and 2) the corresponding crossmodal conditions in Experiment 2 (photograph with spoken word; written word with environmental sound). This allowed us to test for the main effect of congruency (congruent vs. incongruent) and its interaction with temporal presentation (simultaneous vs. sequential). Note that our experimental design did not allow us to test the effect of spatio-temporal coincidence directly. Instead, we compared the effect of congruency during sequential and simultaneous stimulus presentations.

ANOVA 3: The Effect of Simultaneous Crossmodal Matching Relative to Fixation

The third ANOVA included the parameter estimates for each of the 4 crossmodal conditions relative to fixation in Experiment 2. This allowed us to illustrate the effect sizes in the pSTS for all the different verbal and nonverbal conditions.

Statistical Threshold

The t-images for each contrast at the second level were subsequently transformed into statistical parametric maps of the Z statistic. Unless stated otherwise, all significant effects are reported at P < 0.05, corrected for multiple comparisons either across the whole brain or within our pSTS ROI, which was centered on the peak co-ordinates reported by Calvert et al. (2000: −49, −50, 9) and Beauchamp, Lee, et al. (2004: −50, −55, 7) for verbal and nonverbal audiovisual integration, respectively. These co-ordinates were converted from Talairach and Tournoux stereotactic space into the nearest estimated co-ordinates in MNI space using the algorithm developed by Matthew Brett (http://www.mrc-cbu.cam.ac.uk/Imaging/Common/mnispace.shtml). Within these 2 ROIs (transformed to [±50, −52, 8] and [±50, −56, 4]), we searched a sphere (6-mm radius) for the nearest peaks in our own data set.
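The co-ordinate conversion and sphere search can be sketched as follows. The mni2tal coefficients are the commonly published values of Brett's piecewise-linear approximation (an assumption on our part; the web page above is the authoritative source), tal2mni inverts that mapping numerically, and the peak list is purely illustrative.

```python
import numpy as np

def mni2tal(xyz):
    # Brett's piecewise-linear MNI -> Talairach approximation
    # (coefficients as commonly published; treat as approximate).
    x, y, z = xyz
    if z >= 0:
        return np.array([0.9900 * x, 0.9688 * y + 0.0460 * z,
                         -0.0485 * y + 0.9189 * z])
    return np.array([0.9900 * x, 0.9688 * y + 0.0420 * z,
                     -0.0485 * y + 0.8390 * z])

def tal2mni(xyz, tol=1e-6, max_iter=100):
    # Invert mni2tal by fixed-point iteration (the mapping is near-identity).
    target = np.asarray(xyz, dtype=float)
    est = target.copy()
    for _ in range(max_iter):
        err = mni2tal(est) - target
        if np.abs(err).max() < tol:
            break
        est -= err
    return est

def peaks_in_sphere(peaks_mni, center_mni, radius=6.0):
    # Keep only the peaks lying within the search sphere around the ROI center.
    peaks = np.asarray(peaks_mni, dtype=float)
    keep = np.linalg.norm(peaks - np.asarray(center_mni), axis=1) <= radius
    return peaks[keep]

# Calvert et al. (2000) Talairach peak converted to MNI space, then searched
# against illustrative (made-up) peaks standing in for our own data set.
roi_center = tal2mni([-49.0, -50.0, 9.0])
print(np.round(roi_center))                    # close to [-50, -52, 8]
print(peaks_in_sphere([[-50, -50, 10], [-60, -52, 14]], roi_center))
```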

Results

Behavioral Analyses

Reaction times for both experiments were analyzed using 2 repeated measures ANOVAs, modeling presentation modality (crossmodal in Experiments 1 and 2, plus intramodal auditory and visual in Experiment 1) and congruency (congruent, incongruent). Means and standard deviations are shown in Table 2. For Experiment 1a, a 3 × 2 ANOVA identified a main effect of sensory modality (F(2,18) = 22.968, P < 0.0005). Pairwise comparisons across modality revealed that response latencies increased from visual–visual (VV) to audiovisual (AV) to auditory–auditory (AA) trials. This response pattern is not consistent with that observed in early sensory integration experiments, where a bimodal stimulus facilitates a task response relative to a unimodal stimulus. However, we note that Beauchamp, Lee, et al. (2004, Experiment 3) also reported increases in response time from V to AV to A conditions. The most likely explanation is the difference in duration of auditory relative to visual stimuli. Visual matching is fastest because subjects can make their decision at the onset of the stimulus. Auditory matching is slowest because stimulus recognition may not occur until stimulus presentation is complete (up to 1000 ms). During AV matching, reaction times are slower than visual but faster than auditory because half the second stimuli are auditory and the other half are visual.

Table 2

Behavioral data

Experiment Matching condition Mean (ms) SD 
Sequential AV con 1001 214 
 AV inc 1029 211 
 VV con 927 246 
 VV inc 927 250 
 AA con 1114 166 
 AA inc 1122 168 
Simultaneous VwAs con 945 191 
 VwAs inc 946 176 
 VpAw con 875 153 
 VpAw inc 870 146 

Note: Mean and standard deviation for reaction times in response to audiovisual, visual, and auditory matching tasks in Experiments 1a and 2. Data are for 10 subjects in Experiment 1 (due to technical difficulties with recording from the keypad) and for 18 subjects in Experiment 2. Sequential = Experiment 1, simultaneous = Experiment 2, con = congruent trials, inc = incongruent trials, AV = audiovisual matching, VV = visual matching, AA = auditory matching, Vw = visual words, Vp = visual pictures, Aw = auditory words, As = auditory sounds.

For Experiment 2, a 2 × 2 ANOVA identified a main effect of modality (F(1,17) = 32.564, P < 0.0005), with faster responses when pictures were paired with spoken words than when written words were paired with environmental sounds (see Table 2). There was no main effect of congruence in either experiment (Experiment 1: P = 0.423; Experiment 2: P = 0.806), and no interaction between presentation modality and congruency (Experiment 1: P = 0.562; Experiment 2: P = 0.534).

Functional Imaging

ANOVA 1: Crossmodal versus Intramodal

There was no significant difference between the crossmodal conditions (AV) and the mean response of the intramodal (VV and AA) conditions (AV > mean[VV + AA]) for congruent trials, incongruent trials, or the sum of both. Even when we reduced the threshold to P < 0.05 uncorrected in our left and right pSTS ROIs, no voxels were identified with increased activation for AV > mean[VV + AA]. This is because activation in our pSTS ROIs was part of a widely distributed system that was activated for intramodal as well as crossmodal matching. Figure 3 shows the activation pattern for each condition in Experiment 1a relative to fixation before (a) and after (b) removal of the sensorimotor areas activated in Experiment 1b. The remarkable consistency in the location of the activation peaks for crossmodal and intramodal matching relative to fixation is demonstrated in Supplementary Materials 2. The peak co-ordinates in our pSTS ROI, after sensorimotor activation had been removed, were identified at [−50, −50, 10/54, −54, 8] for AV and VV and [−50, −50, 10/54, −54, 6] for AA. These effects are within 2 mm of the center of our regions of interest [±50, −52, 8] and [±50, −56, 4] based on previous studies of audiovisual integration (see Methods). The Z scores associated with our pSTS effects were also highly significant and greater in the left than right hemisphere (Z = Inf/3.5 for AV; 5.8/3.2 for VV; and Inf/3.8 for AA).

Figure 3.

Common activation for crossmodal and intramodal matching. Increased activation for sequential matching relative to fixation before (a) and after (b) removing sensorimotor activation in Experiment 1b (P < 0.5). Across all 3 conditions, the peak co-ordinate in left pSTS was centered at [−50, −50, 10]. In the right, the peak co-ordinate was centered for AV and VV at [54, −54, 8] and for AA at [54, −54, 6]. To highlight pSTS, activation is shown at P < 0.05 corrected for multiple comparisons when the baseline was fixation (a) and P < 0.001 uncorrected on sagittal sections (x = −50) when the baseline removed sensorimotor processing (b). A white circle highlights pSTS activation. Details of all other activation peaks are provided in Supplementary Materials 2.


ANOVA 2: The Effect of Congruency during Simultaneous and Sequential Crossmodal Matching

Across experiments, there were no significant effects of congruent > incongruent. However, incongruent > congruent activated a distributed set of bilateral regions that included the left and right pSTS ROIs (see Table 3 and Figure 4). The peak pSTS effects lay lateral and slightly superior [−60, −52, 14/+64, −48, 12] to our regions of interest [±50, −52, 8] and were primarily driven by Experiment 2 (simultaneous presentation), with no significant effect of incongruency in Experiment 1 (sequential presentation); see Table 3 and Figure 5.

Table 3

Incongruent relative to congruent trials for simultaneous matching

Anatomical region Co-ordinates (simultaneous) Z scores: Simultaneous Sequential Interaction 
L superior temporal gyrus/sulcus *−60 −52 14 5.0 ns 3.5 
−56 −22 5.1 ns 4.0 
−62 −28 12 5.7 ns 4.1 
−62 −42 10 5.1 1.7 2.9 
R superior temporal gyrus/sulcus *64 −48 12 3.6 ns ns 
64 −12 6.6 ns 5.3 
50 −14 5.1 ns 3.3 
62 −16 −8 5.1 ns 3.8 
40 −18 −8 5.4 ns 4.0 
46 −24 16 5.2 ns 3.3 
46 −30 −2 5.7 ns 3.9 
 56 −34 5.3 ns 3.4 
R occipital 38 −80 5.4 ns 4.4 
38 −70 −18 5.3 ns 3.2 
L occipital −32 −84 −16 5.2 ns 3.8 
−50 −80 5.1 ns 3.8 
L cerebellum −38 −60 −24 5.3 ns 4.0 
−34 −58 −22 5.1 ns 3.5 

Note: Peak co-ordinates of significant clusters for matching incongruent relative to congruent trials with simultaneous audiovisual input (P < 0.05 corrected for multiple comparisons across the whole brain). The threshold was lowered to P < 0.05 uncorrected to search for corresponding effects during sequential audiovisual matching and the interaction between congruence and synchrony. Co-ordinates highlighted in bold and marked with an asterisk are those closest to our left and right pSTS ROI. ns, not significant.

Figure 4.

Incongruent > congruent matching of simultaneously presented audiovisual objects. To demonstrate the full extent of significant activations, the statistical threshold was set to P < 0.001 uncorrected, with a minimum cluster size of 35 voxels.


Figure 5.

Effect sizes in left and right pSTS for all conditions in each experiment. Effect sizes (and variance) in the left [−60, −52, 14] and right [64, −48, 12] pSTS show the greater activation for incongruent than congruent trials for each condition. The main effect of incongruency was significant in the simultaneous matching conditions (Experiment 2; Z = 5.0) but not in the sequential matching conditions (Experiment 1; Z < 1.96). See Table 3 for more statistical details.


ANOVA 3: The Effect of Simultaneous Crossmodal Matching Relative to Fixation

The activation pattern for simultaneous crossmodal matching relative to fixation (Experiment 2) was virtually identical to the activation pattern for sequential crossmodal matching relative to fixation (Experiment 1a), see Figure 6 and Supplementary Materials 3. Within Experiment 2, left and right pSTS was activated by all 4 types of crossmodal trial with no significant differences between verbal versus nonverbal conditions.

Figure 6.

Comparison of audiovisual matching in Experiments 1 and 2. Audiovisual matching > fixation for (a) simultaneous and (b) sequential audiovisual presentation. Rendered at P < 0.05, corrected for multiple comparisons. There was no significant difference in pSTS activation when (a) and (b) were directly compared.


Discussion

This study highlights 4 findings that have implications for understanding the role of our pSTS ROI in auditory and visual processing. First, Experiment 1 demonstrated that, when task and stimulus factors were controlled, the conceptual network activated by sequential audiovisual matching was equally activated by intramodal matching in the auditory or visual domains. This observation included the left pSTS ROI previously associated with audiovisual binding and its homolog in the right hemisphere. Therefore we did not replicate previous findings of enhanced pSTS activation for bimodal relative to unimodal stimuli at a conceptual level. Second, contrary to previous studies, Experiment 2 demonstrated increased activation for incongruent relative to congruent audiovisual inputs throughout a widely distributed network of regions that included both left and right pSTS. Third, we found that the network of brain regions activated for sequential audiovisual matching in Experiment 1a included all the areas that were activated by matching simultaneously presented audiovisual stimuli in Experiment 2. Finally, we found equivalent pSTS responses to verbal and nonverbal audiovisual stimuli. In short, there was no evidence that pSTS activation was higher for audiovisual than unimodal object matching. We therefore suggest that our pSTS ROI is involved in amodal processing that is independent of sensory modality or verbal content. Below, we discuss how and why our findings differ from those previously reported.

Bimodal versus Unimodal Object Matching

We found that no region, including our pSTS ROI, showed increased activation for bimodal relative to unimodal stimuli when the task, attention, and number of stimuli per trial were controlled. This conflicts with several previous studies that report higher pSTS activation for bimodal than unimodal inputs that are verbal (Calvert et al. 2000; Wright et al. 2003; van Atteveldt et al. 2004) or nonverbal (Beauchamp, Argall, et al. 2004; Beauchamp, Lee, et al. 2004). We suggest that these previous studies did not control for stimulus and attentional confounds. For example, Beauchamp, Lee, et al. (2004) compared activation for audiovisual trials with 2 stimuli to activation for unimodal trials with 1 stimulus. Therefore, although the same stimuli were presented in each condition, the number of stimuli in audiovisual trials was effectively double that for the unimodal trials. This has well-recognized consequences for the hemodynamic response (Fox 1989; Price et al. 1992, 1996; Binder et al. 1994), which are sufficient to explain why audiovisual activation is higher than unimodal activation when the number of stimuli per trial is not controlled.

The association of pSTS with multimodal binding could also result from task confounds during both active (Beauchamp, Lee, et al. 2004) and passive tasks (Wright et al. 2003; van Atteveldt et al. 2004). For example, if pSTS is involved in making conceptual associations between incoming stimuli, subjects are more likely to make associations between 2 stimuli that arrive in close temporal proximity (as in the audiovisual conditions) than when single stimuli are separated in time. To avoid these confounds, we used an object-matching task in all conditions thereby necessitating the comparison of 2 incoming stimuli. In this context activation in our pSTS ROIs was the same for crossmodal and intramodal matching. Our findings are therefore more consistent with the designation of pSTS as an area specialized for amodal processing, subsequent to audiovisual convergence rather than an area that actively binds auditory and visual inputs. Given that pSTS activation is observed for both conceptual and phonetic stimuli, it may play a role in matching stimulus inputs to internal representations based on prior experience (van Wassenhove et al. 2005; Skipper et al. 2007).

Congruent versus Incongruent Trials

Our results show that, when task and stimulus presentation are controlled, a network of regions, including pSTS, is activated more strongly for incongruent than congruent pairs. This suggests that, even when the number of stimuli is held constant, activation reflects processing demand, which is greater when 2 simultaneously presented stimuli refer to different concepts (incongruent condition) than when 2 stimuli refer to the same object (congruent condition). The effect of congruency therefore demonstrates that pSTS activation is more dependent on the task than the stimulus modality. Nevertheless, further investigation is required. For example, pSTS activation may depend on whether the stimuli involve continuous speech or static objects (Calvert and Lewis 2004). Alternatively, we hypothesize that if subjects are able to attend to 1 input modality while suppressing the other, then pSTS activation will be lower for incongruent bimodal trials. In contrast, if subjects are forced to attend to both modalities, then pSTS activation will be higher for incongruent bimodal trials, which effectively carry twice the conceptual and phonetic information content of congruent trials.

Simultaneous versus Sequential Presentation

It could also be argued that because we did not present moving visual stimuli with the spoken words and environmental sounds, and hence did not provide a truly synchronous event, we were unable to detect enhanced pSTS activation for bimodal relative to unimodal stimuli. However, as reviewed in the Introduction, the association of our pSTS ROIs with multimodal binding is not limited to moving stimuli. Moreover, the stimulus and task factors that we highlight in the context of nonmoving conceptual stimuli also apply to results from studies that did use moving speech stimuli. In short, our results are based on temporally brief verbal and nonverbal conceptual stimuli that could only be integrated at a late level of processing because there was no correspondence at a perceptual level. Nevertheless, they call into question previous conclusions based on both moving and nonmoving stimuli. Further studies are therefore required to determine whether our pSTS ROIs are activated more by bimodal than unimodal processing of continuous and synchronous audiovisual speech streams when attention and stimulus input are controlled. Such an experiment might involve the comparison of 1) synchronous versus asynchronous audiovisual speech when attention is controlled (e.g., subjects instructed to press a button when there is a mismatch between the auditory and visual inputs), and 2) audiovisual speech versus asynchronous intramodal matching (e.g., deciding whether mouth movements correspond to written text).

Patchy Organization within Human STS

In an elegant, high-resolution fMRI study, Beauchamp, Argall, et al. (2004) suggested that auditory and visual inputs arrive in STS in separate patches of cortex and are integrated in intervening cortex. This conclusion was based on observations that different patches of STS responded maximally to auditory and visual stimuli, with intervening patches showing an enhanced response to bimodal audiovisual stimuli relative to either auditory or visual stimuli alone. In our study, the voxels were 3 × 3 × 3 mm (as opposed to 1.6 × 1.6 × 1.6 mm in Beauchamp, Argall, et al. 2004) and we did not attempt to dissociate unimodal visual and auditory regions. However, our results still call into question the conclusion that these patches of amodal cortex actively integrate visual and auditory inputs. If these patches have enhanced responses to bimodal audiovisual stimuli, then this should be detected even when the voxel size is larger (because more patches respond to bimodal than to unimodal stimuli). Indeed, Beauchamp, Argall, et al. (2004) identified the ROI for their high-resolution study on the basis of multisensory activation in a study with low resolution (voxel size 3.75 × 3.75 × 5 mm; Beauchamp, Lee, et al. 2004). Our finding that intramodal auditory and visual matching activated these pSTS ROIs as much as crossmodal audiovisual matching therefore necessitates further investigation of patchy STS cortex using high-resolution fMRI. Specifically, the effect needs to be replicated when attentional factors and the number of stimuli per trial are controlled. Only then will we be able to exclude the possibility that enhanced activation for bimodal relative to unimodal stimuli in patchy STS results from stimulus or attentional confounds.

Conclusions

In summary, Experiment 1 investigated whether pSTS activation was higher for audiovisual stimuli than for unimodal conceptual processing when task, attention, and stimulus presentation parameters were controlled. Specifically, in both crossmodal and intramodal conditions, subjects were instructed to compare 2 perceptually different stimuli. In this context, we found equivalent activation for crossmodal and intramodal stimulus matching across a distributed set of regions, including our pSTS ROI in both the left and right hemispheres. The same bilateral network was also activated in Experiment 2 when audiovisual stimuli were simultaneously presented. Experiment 2 also allowed us to demonstrate that activation in our pSTS ROI was not affected by whether the stimuli were verbal or nonverbal but was greater for incongruent than congruent trials. This congruency effect demonstrates that pSTS activation depended on the number of object concepts that needed to be simultaneously attended to (2 different object concepts during incongruent audiovisual trials versus 1 object concept during congruent audiovisual trials). Taking all these results into account, we conclude that both the left and right pSTS are involved in the association of 2 stimuli, irrespective of stimulus modality. This process may have a top-down influence on audiovisual binding, although we found no evidence that this region is more activated by audiovisual than unimodal conceptual processing.

Supplementary Material

Supplementary material can be found at: http://www.cercor.oxfordjournals.org/.

Funding

This work was funded by the Wellcome Trust.

We are grateful to all the subjects who participated in these studies and to the FIL Functional Technologies Team for their support in acquiring the data. We would also like to thank Uta Noppeney for her help with programming the stimulus presentation and 2 anonymous reviewers for many constructive suggestions that improved the manuscript. Conflict of Interest: None declared.

References

Beauchamp MS. 2005. Statistical criteria in FMRI studies of multisensory integration. Neuroinformatics. 3(2):93-113.

Beauchamp MS, Argall BD, Bodurka J, Duyn JH, Martin A. 2004. Unraveling multisensory integration: patchy organization within human STS multisensory cortex. Nat Neurosci. 7(11):1190-1192.

Beauchamp MS, Lee KE, Argall BD, Martin A. 2004. Integration of auditory and visual information about objects in superior temporal sulcus. Neuron. 41(5):809-823.

Binder JR, Rao SM, Hammeke TA, Frost JA, Bandettini PA, Hyde JS. 1994. Effects of stimulus rate on signal response during functional magnetic resonance imaging of auditory cortex. Brain Res Cogn Brain Res. 2(1):31-38.

Bushara KO, Grafman J, Hallett M. 2001. Neural correlates of auditory-visual stimulus onset asynchrony detection. J Neurosci. 21(1):300-304.

Calvert GA. 2001. Crossmodal processing in the human brain: insights from functional neuroimaging studies. Cereb Cortex. 11(12):1110-1123.

Calvert GA, Brammer MJ, Bullmore ET, Campbell R, Iversen SD, David AS. 1999. Response amplification in sensory-specific cortices during cross-modal binding. Neuroreport. 10:2619-2623.

Calvert GA, Campbell R, Brammer MJ. 2000. Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Curr Biol. 10(11):649-657.

Calvert GA, Lewis JW. 2004. Hemodynamic studies of audiovisual interactions. In: Calvert GA, Spence C, Stein BE, editors. The handbook of multisensory processes. London: MIT Press. p. 483-502.

Degerman A, Rinne T, Pekkola J, Autti T, Jääskeläinen IP, Sams M, Alho K. 2007. Human brain activity associated with audiovisual perception and attention. Neuroimage. 34(4):1683-1691.

Deneve S, Pouget A. 2004. Bayesian multisensory integration and cross-modal spatial links. J Physiol Paris. 98:249-258.

Desimone R, Ungerleider LG. 1986. Multiple visual areas in the caudal superior temporal sulcus of the macaque. J Comp Neurol. 248(2):164-189.

Emmorey K, Grabowski T, McCullough S, Damasio H, Ponto L, Hichwa R, Bellugi U. 2004. Motor-iconicity of sign language does not alter the neural systems underlying tool and action naming. Brain Lang. 89:27-37.

Ernst MO, Bulthoff HH. 2004. Merging the senses into a robust percept. Trends Cogn Sci. 8:162-169.

Fox PT. 1989. Functional brain mapping with positron emission tomography. Semin Neurol. 9(4):323-329.

Friston KJ, Ashburner J, Frith CD, Poline JB, Heather JD, Frackowiak RSJ. 1995. Spatial registration and normalization of images. Hum Brain Mapp. 3(3):165-189.

Giard MH, Peronnet F. 1999. Auditory-visual integration during multimodal object recognition in humans: a behavioural and electrophysiological study. J Cogn Neurosci. 11(5):473-490.

Hein G, Doehrmann O, Muller NG, Kaiser J, Muckli L, Naumer MJ. 2007. Object familiarity and semantic congruency modulate responses in cortical audiovisual integration areas. J Neurosci. 27:7881-7887.

Jancke L, Shah NJ. 2002. Does dichotic listening probe temporal lobe functions? Neurology. 58(5):736-743.

Kellenbach ML, Brett M, Patterson K. 2003. Actions speak louder than functions: the importance of manipulability and action in tool representation. J Cogn Neurosci. 15:30-46.

Kreifelts B, Ethofer T, Grodd W, Erb M, Wildgruber D. 2007. Audiovisual integration of emotional signals in voice and face: an event-related fMRI study. Neuroimage. 37(4):1445-1456.

Leinonen L, Hyvarinen J, Sovijarvi AR. 1980. Functional properties of neurons in the temporo-parietal association cortex of awake monkey. Exp Brain Res. 39(2):203-215.

Lewis JW, Brefczynski JA, Phinney RE, Janik JJ, DeYoe EA. 2005. Distinct cortical pathways for processing tool versus animal sounds. J Neurosci. 25(21):5148-5158.

Lipschutz B, Kolinsky R, Damhaut P, Wikler D, Goldman S. 2002. Attention-dependent changes of activation and connectivity in dichotic listening. Neuroimage. 17(2):643-656.

Macaluso E, George N, Dolan R, Spence C, Driver J. 2004. Spatial and temporal factors during processing of audiovisual speech: a PET study. Neuroimage. 21(2):725-732.

Miller LM, D'Esposito M. 2005. Perceptual fusion and stimulus coincidence in the cross-modal integration of speech. J Neurosci. 25:5884-5893.

Narain C, Scott SK, Wise RJ, Rosen S, Leff A, Iversen SD, Matthews PM. 2003. Defining a left-lateralized response specific to intelligible speech using fMRI. Cereb Cortex. 13(12):1362-1368.

Noppeney U, Josephs O, Kiebel S, Friston KJ, Price CJ. 2005. Action selectivity in parietal and temporal cortex. Brain Res Cogn Brain Res. 25:641-649.

Noppeney U, Josephs O, Hocking J, Price CJ, Friston KJ. 2007. The effect of prior visual information on recognition of speech and sounds. Cereb Cortex. doi:10.1093/cercor/bhm091.

Ojanen V, Mottonen R, Pekkola J, Jaaskelainen IP, Joensuu R, Autti T, Sams M. 2005. Processing of audiovisual speech in Broca's area. Neuroimage. 25(2):233-238.

Olson IR, Gatenby JC, Gore JC. 2002. A comparison of bound and unbound audio-visual information processing in the human cerebral cortex. Brain Res Cogn Brain Res. 14(1):129-138.

Phillips JA, Noppeney U, Humphreys GW, Price CJ. 2002. Can segregation within the semantic system account for category-specific deficits? Brain. 125:2067-2080.

Price C, Wise R, Ramsay S, Friston K, Howard D, Patterson K, Frackowiak R. 1992. Regional response differences within the human auditory cortex when listening to words. Neurosci Lett. 146(2):179-182.

Price CJ, Moore CJ, Frackowiak RS. 1996. The effect of varying stimulus rate and duration on brain activity during reading. Neuroimage. 3(1):40-52.

Saito DN, Okada T, Morita Y, Yonekura Y, Sadato N. 2003. Tactile-visual cross-modal shape matching: a functional MRI study. Brain Res Cogn Brain Res. 17:14-25.

Sekiyama K, Kanno I, Miura S, Sugita Y. 2003. Auditory-visual speech perception examined by fMRI and PET. Neurosci Res. 47(3):277-287.

Seltzer B, Pandya DN. 1978. Afferent cortical connections and architectonics of the superior temporal sulcus and surrounding cortex in the rhesus monkey. Brain Res. 149(1):1-24.

Sestieri C, Di Matteo R, Ferretti A, Del Gratta C, Caulo M, Tartaro A, Olivetti Belardinelli M, Romani GL. 2006. "What" versus "where" in the audiovisual domain: an fMRI study. Neuroimage. 33(2):672-680.

Skipper JI, van Wassenhove V, Nusbaum HC, Small SL. 2007. Hearing lips and seeing voices: how cortical areas supporting speech production mediate audiovisual speech perception. Cereb Cortex. 17:2387-2399.

Stein BE, Meredith MA. 1993. Merging of the senses. Cambridge (MA): MIT Press.

Stein BE, Meredith MA, Wallace MT. 1993. The visually responsive neuron and beyond: multisensory integration in cat and monkey. Prog Brain Res. 95:79-90.

Talsma D, Doty TJ, Woldorff MG. 2007. Selective attention and audiovisual integration: is attending to both modalities a prerequisite for early integration? Cereb Cortex. 17(3):679-690.

Talsma D, Woldorff MG. 2005. Selective attention and multisensory integration: multiple phases of effects on the evoked brain activity. J Cogn Neurosci. 17(7):1098-1114.

Taylor KI, Moss HE, Stamatakis EA, Tyler LK. 2006. Binding crossmodal object features in perirhinal cortex. Proc Natl Acad Sci USA. 103(21):8239-8244.

van Atteveldt NM, Formisano E, Blomert L, Goebel R. 2007. The effect of temporal asynchrony on the multisensory integration of letters and speech sounds. Cereb Cortex. 17:962-974.

van Atteveldt N, Formisano E, Goebel R, Blomert L. 2004. Integration of letters and speech sounds in the human brain. Neuron. 43(2):271-282.

van Atteveldt NM, Formisano E, Goebel R, Blomert L. 2007. Top-down task effects overrule automatic multisensory responses to letter-sound pairs in auditory association cortex. Neuroimage. 36:1345-1360.

van Wassenhove V, Grant KW, Poeppel D. 2005. Visual speech speeds up the neural processing of auditory speech. Proc Natl Acad Sci USA. 102:1181-1186.

Vandenberghe R, Nobre AC, Price CJ. 2002. The response of left temporal cortex to sentences. J Cogn Neurosci. 14(4):550-560.

Wright TM, Pelphrey KA, Allison T, McKeown MJ, McCarthy G. 2003. Polysensory interactions along lateral temporal regions evoked by audiovisual speech. Cereb Cortex. 13(10):1034-1043.