Abstract

Multisensory object-recognition processes were investigated by examining the combined influence of visual and auditory inputs upon object identification — in this case, pictures and vocalizations of animals. Behaviorally, subjects were significantly faster and more accurate at identifying targets when the picture and vocalization were matched (i.e. from the same animal) than when the target was represented in only one sensory modality. This behavioral enhancement was accompanied by a modulation of the evoked potential in the latency range and general topographic region of the visual evoked N1 component, which is associated with early feature processing in the ventral visual stream. High-density topographic mapping and dipole modeling of this multisensory effect were consistent with generators in lateral occipito-temporal cortices, suggesting that auditory inputs were modulating processing in regions of the lateral occipital cortices. Both the timing and scalp topography of this modulation suggest that there are multisensory effects during what is considered to be a relatively early stage of visual object-recognition processes, and that this modulation occurs in regions of the visual system that have traditionally been held to be unisensory processing areas. Multisensory inputs also modulated the visual ‘selection-negativity’, an attention-dependent component of the evoked potential that is usually evoked when subjects selectively attend to a particular feature of a visual stimulus.

Introduction

The human nervous system has evolved an array of specialized sensory receptors to sample the various ‘energies’ that are reflected off or physically produced by objects in the environment (see Meredith, 2002). These energies include, for example, sound pressure waves, photons of light and vibrations in solid media. As information from multiple sensory systems is known to interact to affect perception and cognition (e.g. McGurk and MacDonald, 1976; Stein et al., 1996), there must be neural circuitry at the cortical and sub-cortical levels in which the inputs to these specialized and independent sensory receptor systems are ‘re-combined’. A good deal of work has been conducted in animals to identify these convergence areas and explicate the underlying neural mechanisms (e.g. Stein and Dixon, 1979; Duhamel et al., 1991; Andersen et al., 1997; Stein, 1998; Schroeder and Foxe, 2002), with recent work seeking to expand this understanding to humans (e.g. Foxe et al., 2000, 2002; Calvert, 2001; Berman and Colby, 2002; Molholm et al., 2002; Olson et al., 2002). These investigations have mostly focused on delineating the neural processes associated with the relatively basic functions of detection and localization, while the role of multisensory integration in higher-order processes such as object-recognition remains comparatively uncharted.

The various sensory ‘energies’ that originate from a single object usually provide complementary and/or redundant information about that object’s identity. Oftentimes, object information entering the nervous system through any single sense is degraded due to ‘noise’, such as the partial occlusion or camouflage of a visual object, or the hum of a vent masking speech. Clearly the re-combination of these potentially redundant and/or complementary inputs can benefit, and at times may even be essential to, accurate and timely object-recognition. Nevertheless, multisensory object-recognition has not been extensively examined outside the domain of speech perception. Tellingly, multisensory effects on speech perception are abundant and often profound. For instance, viewing articulatory gestures can dramatically affect the perception of a speech sound (e.g. McGurk and MacDonald, 1976), or enhance speech perception in a noisy environment (Sumby and Pollack, 1954; Campbell and Dodd, 1980; MacLeod and Summerfield, 1990; Thompson, 1995).

Functional imaging and event-related potential (ERP) studies of multisensory speech perception have detailed a network of cortical areas that plays a role in the integration of auditory–visual speech (e.g. Calvert et al., 1999, 2000). However, it is probable that this circuit for speech perception is a relatively specialized one, and that the cortical network involved in integrating information from the different sensory systems depends on the class of multisensory objects/events that are being considered (e.g. animals vs speech), as well as the specific demands of the tasks that are being performed.

To advance understanding of multisensory object-recognition, the present study examined the combined influence of visual and auditory inputs on the recognition of animals. High-density electrical mapping of the ERP was used to assess brain activity while participants performed an object-recognition task in which they were required to make a speeded button press to the occurrence of a target animal (one out of a possible eight) in the visual and/or auditory sensory modality. Both reaction times (RT) and error rates (accuracy) were computed as measures of performance. Stimuli were either unisensory (visual or auditory) or bisensory (visual plus auditory). The bisensory stimuli had visual and auditory elements that either belonged to the same animal or belonged to different animals.

To behaviorally assess the combined effect of visual and auditory information on object-recognition, we compared performance on the bisensory targets to performance on the unisensory targets. We expected that the neural interaction of visual and auditory information from the same object would facilitate performance compared to unisensory targets (visual unisensory targets and auditory unisensory targets). In contrast, we expected that the neural interaction of visual and auditory information that belonged to different objects might interfere with object-recognition processes, and result in poorer performance compared to unisensory targets.

Physiologically, we predicted that multisensory facilitation of object processing would be mediated in part by the multisensory modulation of sensory-specific object-recognition processes. In particular, we considered that object-recognition would be a visually dominated function for the class of objects used in the present design. Based on this proposal, we believed that the likeliest brain structures to mediate these multisensory object-recognition processes would be found in the ventral visual stream, which is well known for its role in object processing (e.g. Ungerleider and Mishkin, 1982; Allison et al., 1999; Doniger et al., 2000). These object processing functions have been extensively investigated through both functional imaging and ERP studies, and object processing in humans has been specifically associated with neuronal activation in a cluster of brain regions in the ventral visual stream known as the lateral occipital complex (LOC; e.g. Kohler et al., 1995; Malach et al., 1995; Puce et al., 1996, 1999; Kanwisher et al., 1997; Allison et al., 1999; Haxby et al., 1999; Ishai et al., 1999; Doniger et al., 2000, 2001, 2002; James et al., 2002a; Lerner et al., 2002; Murray et al., 2002). An ERP signature of the processing of the features of visual objects in the ventral visual stream may be found in the N1 component of the visual evoked potential (or one or more subcomponents of the N1 complex). Inverse source modeling has revealed that neuronal activity in ventral visual cortex, and in some cases specifically in LOC, substantially contributes to the generation of the visual N1 (Anllo-Vento et al., 1998; Di Russo et al., 2002, 2003; Murray et al., 2002; Schweinberger et al., 2002). We should note that source modeling also shows a contribution from the dorsal visual stream to the visual N1 (e.g. Di Russo et al., 2002, 2003). Further, this component has been implicated in ventral stream processing functions. 
It has, for example, been repeatedly shown to reflect visual processing of the structural features of objects (e.g. Bentin et al., 1999; Tanaka et al., 1999; Eimer, 2000; Rossion et al., 2000; Murray et al., 2002; but see Vogel and Luck, 2000).

We hypothesized that object-identity information provided in the auditory sensory modality would have its effect in visual object-recognition areas, with the presence of features that defined an auditory object affecting processing of visual features of the same object. Given the role of the visual N1 in the processing of visual features, and findings of auditory-based modulation of the visual N1 during bisensory detection and classification tasks (Giard and Peronnet, 1999; Molholm et al., 2002), it was predicted that the co-occurrence of object congruent visual and auditory elements would result in the multisensory modulation of the visual N1 component of the ERP. The visual N1 is considered to represent relatively early processing, showing a characteristic scalp topography over lateral occipital cortices that typically peaks between 140 and 200 ms.

Visual and auditory selective attention componentry were anticipated in our electrophysiological data, as we expected that in performing the object-recognition task, subjects would selectively attend to the visual and auditory features of the target object (e.g. for a block of trials on which the designated target was dog, selectively attending to physical attributes specific to the picture of the dog and the sound of the dog). As such, we expected to record the so-called ‘selection negativity’ (SN), a component of the visual evoked potential that is elicited under circumstances where subjects selectively attend to relevant visual features (e.g. Harter and Aine, 1984; Kenemans et al., 1993; Anllo-Vento and Hillyard, 1996; Smid et al., 1999) [in ERPs, non-spatial visual selective attention effects are readily discerned in occasionally occurring targets (Woods and Alain, 2001)], and, equivalently for auditory selective attention, the so-called ‘negative difference’ wave (Nd), an auditory evoked potential that is elicited under circumstances where subjects selectively attend to relevant auditory features (e.g. Hansen and Hillyard, 1980).

Multisensory modulation of the visual N1 was expected in turn to affect subsequent feature based processing in the ventral visual stream, resulting in modulation of the SN. In contrast, and in keeping with our notion that multisensory effects would be biased towards the visual system under the stimulus and task conditions of the present experiment, we predicted that equivalent auditory selective attention effects, as indexed by the Nd component, would not be substantially modulated by multisensory object-recognition processes.

Materials and Methods

Subjects

Fourteen neurologically normal, paid volunteers participated (mean age 23.6 ± 6.2 years; six female; all right-handed). All reported normal hearing and normal or corrected-to-normal vision. The Institutional Review Board of the Nathan Kline Institute for Psychiatric Research approved the experimental procedures, and each subject provided written informed consent. Data from two additional subjects were excluded, one for excessive blinking, and the other for failure to perform the task adequately.

Stimuli

There were four basic stimulus types, each presented equiprobably: (i) sounds alone; (ii) pictures alone; (iii) paired pictures and sounds belonging to the same object; (iv) paired pictures and sounds belonging to different objects. In all there were 80 stimuli: eight sounds, eight pictures, and 64 pairings of the sound and picture stimuli (8 × 8).

Pictures

There were eight line drawings of animals from Snodgrass and Vanderwart (1980), standardized on familiarity and complexity. These were of a dog, chimpanzee, cow, sheep, chicken, bird, cat and frog. They were presented on a 21 inch computer monitor located 143 cm in front of the subject, and were black on a gray background. The images subtended an average of 4.8° of visual angle in the vertical plane and 4.4° of visual angle in the horizontal plane. These were presented for a duration of 340 ms.
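The visual angles above follow from the standard relation between stimulus size and viewing distance, θ = 2·arctan(size / 2·distance). A minimal sketch; the ~12 cm image height is our back-calculation from the reported values, not a figure given in the text:

```python
import math

def visual_angle_deg(size_cm, distance_cm):
    """Visual angle (degrees) subtended by a stimulus of size_cm at distance_cm."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))

# At the 143 cm viewing distance, an image about 12 cm tall (back-calculated,
# not a reported value) subtends roughly the 4.8 deg of vertical visual angle.
print(round(visual_angle_deg(12.0, 143.0), 1))  # 4.8
```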

Sounds

There were eight complementary animal sounds, adapted from Fabiani et al. (1996). These sounds were uniquely identifiable vocalizations corresponding to the eight animal drawings. These were modified such that each had a duration of 340 ms, and were presented over two JBL speakers at a comfortable listening level of ∼75 dB SPL.

Procedure

Participants were seated in a comfortable chair in a dimly lit and electrically shielded (Braden Shielding Systems) room and asked to keep head and eye movements to a minimum, while maintaining central fixation. Eye position was monitored with horizontal and vertical electro-oculogram (EOG) recordings. Subjects were first presented with the stimuli and asked to identify them. The sounds were presented first, followed by the pictures and finally the sound–picture pairs. All subjects easily identified the sounds and pictures as the animals intended by the experimenter.

During the experiment, subjects performed an animal detection task (e.g. ‘During this block, press the button to the cow’). They were instructed to make a button press response with their right index finger to the occurrence of a target, whether in the visual sensory modality, the auditory sensory modality, or both; it was further clarified that they were to also respond to bisensory trials in which only the visual or only the auditory element was a target. Five target stimulus types and four non-target stimulus types were derived from the four basic stimulus types (see Stimuli). These are delineated here, and in Table 1 for quick reference. The five target stimulus types were as follows: visual target (V+; e.g. a picture of a cow), auditory target (A+; e.g. the ‘lowing’ sound of a cow), a picture and sound pair in which only the picture was a target (V+A–; e.g. a picture of a cow and a dog bark), a picture and sound pair in which only the sound was a target (V–A+; e.g. a picture of a chimpanzee and a cow lowing), and a picture and sound pair in which both were targets (V+A+; e.g. a picture of a cow and a cow lowing). The five target stimulus types were presented equiprobably, and targets occurred on 15.6% of trials. Each animal served as the target in two of 16 blocks, in randomized order both within and between subjects, with the exception that the full set of animals served as targets within the first eight blocks. The four non-target stimulus types were as follows: an animal picture (V–); an animal sound (A–); a paired picture and sound of the same animal (V–A– congruent); and a paired picture of one animal and sound of another animal (V–A– incongruent). Non-targets occurred on 84.4% of trials, with the V–A– incongruent stimulus type occurring at a slightly lower probability than the remaining three non-target stimulus types. 
This small decrease in probability was because two target types (V+A– and V–A+) instead of one were drawn from the same pool of basic stimuli as the V–A– incongruent stimuli. This, along with probabilities of the different stimulus types, is depicted in Figure 1. It should also be noted that the 16 elements that made up the stimuli, the eight animal pictures and the eight animal sounds, were presented equiprobably. Each block contained 56 instances of each basic stimulus type such that a full set of mismatching stimulus pairs was presented in each block (eight visual animals × seven mismatching auditory animal sounds). Stimulus onset asynchrony varied randomly between 750 and 3000 ms. A total of 16 blocks were presented. Breaks were encouraged between blocks to maintain high concentration and prevent fatigue. See Figure 2 for a schematic of the experimental paradigm.
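The reported target probability of 15.6% can be re-derived from the design described above. A quick arithmetic sketch; the per-block counts are our reconstruction from the text (56 trials of each basic stimulus type, eight equiprobable animals, one target animal per block), not values reported directly:

```python
# Per-block counts implied by the design; reconstructed, not reported directly.
per_type = 56
v_plus = per_type // 8   # 7 picture-alone trials show the target animal
a_plus = per_type // 8   # 7 sound-alone trials
vp_ap  = per_type // 8   # 7 congruent pairs are picture-target/sound-target
vp_am  = 7               # target picture paired with each of the 7 other sounds
vm_ap  = 7               # target sound paired with each of the 7 other pictures
targets = v_plus + a_plus + vp_ap + vp_am + vm_ap   # 35 target trials...
total = 4 * per_type                                # ...out of 224 trials per block
print(targets, total, 100 * targets / total)        # 35 224 15.625
```

The five target types each occur seven times per block, consistent with the statement that they were presented equiprobably, and 35/224 rounds to the reported 15.6%.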

Data Acquisition and Analysis

Continuous EEG was acquired from 128 scalp electrodes (impedances < 5 kΩ), referenced to the nose, band-pass filtered from 0.05 to 100 Hz, and digitized at 500 Hz. The continuous EEG was divided into epochs (–100 ms pre- to 500 ms post-stimulus onset). Trials with blinks and eye movements were automatically rejected off-line on the basis of the EOG. An artifact criterion of ±60 µV was used at all other scalp sites to reject trials with excessive EMG or other noise transients. The average number of accepted sweeps per non-target was 670, and per target was 95. EEG epochs were sorted according to stimulus type and averaged from each subject to compute the ERP. Baseline was then defined as the epoch from –100 ms to stimulus onset. Separate group-averaged ERPs for each of the stimulus types were calculated for display purposes and for identification of the visual N1, and the visual and auditory selective attention components, the SN and Nd. Button press responses to the five target stimuli were acquired during the recording of the EEG and processed offline. Responses falling between 250 and 950 ms post stimulus onset were considered valid. This window was used so that a response could only be associated with a single trial.
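The preprocessing steps above (epoch extraction, the ±60 µV rejection criterion, and pre-stimulus baseline correction) can be sketched as follows. This is an illustrative outline only, not the acquisition software actually used; array shapes and function names are assumptions:

```python
import numpy as np

FS = 500                  # sampling rate (Hz), as in the recording
PRE, POST = 0.100, 0.500  # epoch: 100 ms pre- to 500 ms post-stimulus onset
REJECT_UV = 60.0          # +/-60 microvolt artifact criterion

def epoch_and_average(eeg, onsets):
    """eeg: (n_channels, n_samples) array in microvolts; onsets: stimulus onset
    sample indices. Returns the baseline-corrected average of artifact-free epochs."""
    pre, post = int(PRE * FS), int(POST * FS)
    kept = []
    for s in onsets:
        ep = eeg[:, s - pre : s + post]                    # one (-100, +500) ms epoch
        if np.abs(ep).max() > REJECT_UV:                   # reject noisy trials
            continue
        ep = ep - ep[:, :pre].mean(axis=1, keepdims=True)  # baseline: -100 ms to onset
        kept.append(ep)
    return np.mean(kept, axis=0)                           # the ERP
```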

Statistical Approach

Behavioral Analyses

For individual subjects, the per cent hits and average reaction time (RT) were calculated, and RT distributions were recorded, for each of the five types of target stimuli. To test for an effect of Target on RT, a one-way repeated measures analysis of variance (ANOVA) with five levels (A+, V+, V–A+, V+A–, and V+A+) was performed. A significant effect was followed up with Tukey HSD tests. The same statistical analyses were performed to test for an effect of Target on per cent hits (i.e. accuracy).
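The one-way repeated measures ANOVA amounts to partitioning the total sum of squares into condition, subject, and residual terms. A minimal sketch (without the sphericity corrections applied in the reported analyses):

```python
import numpy as np

def rm_anova_oneway(data):
    """One-way repeated measures ANOVA (no sphericity correction).
    data: (n_subjects, k_conditions) array, e.g. each subject's mean RT for
    the five target types. Returns (F, df_effect, df_error)."""
    n, k = data.shape
    grand = data.mean()
    ss_cond = n * ((data.mean(axis=0) - grand) ** 2).sum()    # between conditions
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()    # between subjects
    ss_err = ((data - grand) ** 2).sum() - ss_cond - ss_subj  # residual
    df_cond, df_err = k - 1, (n - 1) * (k - 1)
    return (ss_cond / df_cond) / (ss_err / df_err), df_cond, df_err
```

With 14 subjects and five target types, this yields the F(4, 52) degrees of freedom reported in the Results.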

Enhanced object-recognition processes, due to the simultaneous presentation of visual and auditory elements that belonged to the same object, were expected to be indexed by a significant improvement in performance for the V+A+ targets when compared to the unisensory targets (V+ targets and A+ targets). Interference with object-recognition processes, due to the simultaneous presentation of visual and auditory elements that belonged to different objects, would be indexed by a significant decrement in performance for the V+A– targets and V–A+ targets when compared to, respectively, the V+ targets and the A+ targets.

RT facilitation for the V+A+ targets was followed by a test of the race model. This is because RT facilitation under the present conditions can be accounted for by one of two classes of models: race models or coactivation models (see Miller, 1982). In race models each constituent of a pair of redundant targets independently competes for response initiation, and the faster of the two mediates the response for any given trial, resulting in the so-called redundant target effect (RTE). According to this model, probability summation produces the RTE, since the likelihood of either of the two targets yielding a fast RT is higher than that from one alone. By this account, the RTE need not occur because of any nonlinear neural interactions (i.e. multisensory interactions in this case). In contrast, in coactivation models, it is the interaction of neural responses to the simultaneously presented targets that facilitates response initiation and produces the RTE. It was therefore tested whether the RTE exceeded the statistical facilitation predicted by the race model (Miller, 1982). When the RTE is shown to exceed that predicted by the race model, the coactivation model is invoked to account for the facilitation. Such violation of the race model, thus, would provide evidence that object information from the visual and auditory sensory modalities interacted to produce the RT facilitation.

In the test of the race model, the V+A+ targets were compared to the V+A– targets and V–A+ targets. The race model places an upper limit on the cumulative probability (CP) of RT at a given latency for stimulus pairs with redundant targets. For any latency, t, the race model holds when this CP value is less than or equal to the sum of the CPs from each of the single-target stimuli (the bisensory targets in which only the visual or only the auditory element corresponded to the target) minus an expression of their joint probability: CP(t)V+A+ ≤ [CP(t)V+A– + CP(t)V–A+] – [CP(t)V+A– × CP(t)V–A+]. For each subject the RT range within the valid RTs (250–950 ms) was calculated over the three target types (V+A+, V+A– and V–A+) and divided into quantiles from the fifth to the hundredth percentile in 5% increments (5%, 10%, …, 95%, 100%). t-tests comparing the actual facilitation (CP(t)V+A+) and the facilitation predicted by the race model [(CP(t)V+A– + CP(t)V–A+) – (CP(t)V+A– × CP(t)V–A+)] were performed on quantiles that exhibited violation of the race model, to assess the reliability of the violations across subjects. Violations were expected to occur for the quantiles representing the lower end of the RTs, because this is when it was most likely that interactions of the visual and auditory inputs would result in the fulfillment of a response criterion before either source alone satisfied the same criterion (Miller, 1982; for recent application of Miller’s test of the race model, see Molholm et al., 2002; Murray et al., 2002).
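Miller's race-model inequality test can be sketched as follows for a single subject. The quantile scheme approximates the 5% increments described above; function and variable names are illustrative assumptions:

```python
import numpy as np

def race_model_violation(rt_redundant, rt_single1, rt_single2, step=0.05):
    """Miller (1982) race-model inequality test for one subject.
    rt_redundant: valid RTs (ms) for the redundant-target condition (V+A+);
    rt_single1, rt_single2: RTs for the two single-target conditions (V+A-, V-A+).
    Returns the quantiles and, per quantile, the observed CP of the redundant
    condition minus the race-model bound CP1 + CP2 - CP1*CP2; positive values
    indicate violation of the race model."""
    pooled = np.concatenate([rt_redundant, rt_single1, rt_single2])
    qs = np.linspace(step, 1.0, round(1.0 / step))   # 5%, 10%, ..., 100%
    t = np.quantile(pooled, qs)                      # latencies spanning the RT range
    cp = lambda rts: np.searchsorted(np.sort(rts), t, side="right") / len(rts)
    bound = cp(rt_single1) + cp(rt_single2) - cp(rt_single1) * cp(rt_single2)
    return qs, cp(rt_redundant) - bound
```

As in the analysis above, reliable positive differences at the fast quantiles would be evidence for coactivation rather than statistical facilitation.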

Electrophysiological Analyses

The following approach was taken to constrain the analyses performed, without reference to the dependent variables. First, our primary hypothesis was that object information in the auditory sensory-modality would result in modulation of visual object-processing as represented by the N1 component. Thus, we used the response to unisensory visual stimulation (V+ and V–) to define the latency window and scalp-sites of maximal amplitude for this component a priori, before assessing whether multisensory effects were present or not. The latency window and scalp sites were consistent with our previous studies of the visual N1 evoked by these pictorial stimuli (see Doniger et al., 2000, 2001, 2002; Foxe et al., 2001).

A similar strategy was employed to identify latency windows and scalp sites for tests of the selection negativity (SN) and the negative difference wave (Nd). In the case of the SN, unisensory visual targets (V+) were compared with unisensory visual non-targets (V–), which allowed us to define the timecourse and topography of the SN, independent of putative multisensory effects during the bisensory conditions. Similarly, the Nd was defined by comparing unisensory auditory targets (A+) to unisensory non-targets (A–). Subsequently, these predefined windows and scalp-sites were used to make measures from all relevant multisensory conditions and these data were entered as the dependent measures into our ANOVAs.

It should be noted, however, that while the use of broadly defined component peaks is a good means of limiting the number of statistical tests that are conducted, these components often represent the activity of many simultaneously active brain generators at any given moment (e.g. Foxe and Simpson, 2002). As such, effects may not necessarily be coincident with a given component peak, especially in the scenario that only certain brain generators are affected by a given experimental condition. Thus, limiting the analysis to a set of discrete component peaks represents a very conservative approach to the analysis of high-density ERP data.

For the main ERP statistical analyses, only responses elicited by the bisensory stimuli were considered. This allowed us to examine ERP effects as a function of the object congruency of the visual and auditory elements. Recall that when the cow was the target, the possible bisensory combinations were: (i) V+A+ (e.g. a picture of a cow and the lowing of a cow); (ii) V+A– (e.g. a picture of a cow and the barking of a dog); (iii) V–A+ (e.g. a picture of a chimpanzee and the lowing of a cow); (iv) V–A– congruent (a picture of a dog and the barking of a dog); and (v) V–A– incongruent (a picture of a chicken and the croaking of a frog).

Multisensory Object-recognition Effects over Posterior Scalp in the Latency Range of the Visual N1. It was hypothesized that visual object-recognition processes in the ventral visual stream would be affected by the co-occurrence of visual and auditory elements that belonged to the same object, and that this would be initially reflected by modulation of the visual N1. Multisensory object-recognition effects on the mean amplitude of the ERPs over a 20 ms window, centered at the peak of the visual N1 (defined in the unisensory response), were tested with a three-way repeated measures ANOVA. The factors were Stimulus (five levels: V+A+, V+A–, V–A+, V–A– congruent and V–A– incongruent), Electrode (three electrodes that best represented the visual N1 distribution over each hemisphere), and Hemisphere (left and right). When main effects were uncovered, protected follow-up ANOVAs were conducted to decompose them. For these and all the following statistical tests, Geisser–Greenhouse corrections were used in reporting P values when appropriate and the alpha level for significance was set at less than 0.05.
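Extracting the dependent measure (the mean amplitude over a 20 ms window centered on the a priori N1 peak) reduces to simple array indexing over the epoch. A sketch assuming the –100 to 500 ms epochs sampled at 500 Hz described earlier; the 170 ms peak latency in the test is a hypothetical value, not the measured one:

```python
import numpy as np

FS = 500              # Hz; 2 ms per sample
EPOCH_START_MS = -100 # epochs begin 100 ms before stimulus onset

def mean_amplitude(erp, peak_ms, win_ms=20):
    """Mean ERP amplitude (per channel) in a win_ms window centered on peak_ms,
    the component peak latency defined a priori from the unisensory response.
    erp: (n_channels, n_samples) array."""
    ms_per_samp = 1000 / FS
    center = int((peak_ms - EPOCH_START_MS) / ms_per_samp)
    half = int(win_ms / 2 / ms_per_samp)
    return erp[:, center - half : center + half + 1].mean(axis=1)
```

These per-condition, per-electrode means are what would be entered into the repeated measures ANOVAs.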

Localizing the Underlying Neural Generators of the Multisensory Object-recognition Effect. Information about the intracranial generators contributing to the multisensory ‘N1-effect’ was obtained using two methods. The first was scalp current density (SCD) topographic mapping, as implemented in the Brain Electrical Source Analysis (BESA, Ver. 5.0) multimodal neuroimaging analysis software package (MEGIS Software GmbH, Munich, Germany). SCD analysis takes advantage of the relationship between local current density and field potential defined by Laplace’s equation; in SCD analysis the second spatial derivative of the recorded potential is calculated, which is directly proportional to the current density. This method eliminates the contribution of the reference electrode and reduces the effects of volume conduction on the surface recorded potential that is caused by tangential current flow of dispersed cortical generators. This allows for better visualization of the approximate locations of intracranial generators that contribute to a given scalp recorded ERP.
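SCD maps in packages such as BESA are typically computed with spherical spline interpolation over the full montage. Purely as a conceptual illustration of the second-spatial-derivative idea, the nearest-neighbor (Hjorth) approximation below estimates the SCD at each site as its potential minus the mean of its neighbors; the three-electrode montage in the test is invented for illustration:

```python
import numpy as np

def hjorth_laplacian(potentials, neighbors):
    """Nearest-neighbor (Hjorth) approximation to the scalp current density.
    potentials: dict mapping electrode label -> potential (microvolts);
    neighbors: dict mapping electrode label -> labels of adjacent electrodes
    (montage-dependent; assumed known). Each estimate is the site's potential
    minus the mean of its neighbors, a discrete stand-in for the second
    spatial derivative of the scalp field."""
    return {e: v - np.mean([potentials[n] for n in neighbors[e]])
            for e, v in potentials.items()}
```

Because each estimate is a difference against neighboring sites, the result is reference-free, which is the property exploited in the SCD analysis above.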

The second was the method of dipole source analysis, also implemented through BESA 5.0. BESA models the best-fit location and orientation of multiple intracranial dipole generator configurations to produce the waveform observed at the scalp, using iterative adjustments to minimize the residual variance between the solution and the observed data (see, for example, Scherg and Von Cramon, 1985; Simpson et al., 1995). For the purpose of the modeling, an idealized three-shell spherical head model with a radius of 85 mm and scalp and skull thickness of, respectively, 6 and 7 mm was assumed. The genetic algorithm module of BESA 5.0 was used to free fit a single dipole to the peak amplitude of the multisensory ‘N1-effect’. This was carried out first on the peak of the effect (within the tested latency window), and then across the whole of the tested latency window. This initial dipole was fixed and additional dipoles were successively free fit to assess if they improved the solution. Group averaged ERP data were used to maintain the highest possible signal-to-noise ratio as well as to generalize our results across individuals. The dipole fits were constrained to the gray matter of the cortex. We should point out that in dipole analysis, each of the modeled equivalent current dipoles represents an oversimplification of the activity in the areas, and therefore each should be considered as representative of ‘centers of gravity’ and not necessarily discrete neural locations (Murray et al., 2002; Dias et al., 2003; Foxe et al., 2003).

Multisensory Object-recognition Effects on the Visual Selective Attention Component, the SN. The multisensory object-recognition effect on the processing of objects at the feature level was expected to be passed on from ventral visual object-recognition processes underlying the N1 to the ventral visual object-recognition processes underlying the SN, a negative going potential over occipital scalp that is elicited by relevant visual stimuli when relevance is defined on the basis of a non-spatial feature(s) (Harter and Aine, 1984; Kenemans et al., 1993; Anllo-Vento and Hillyard, 1996; Smid et al., 1999). To establish the presence of the SN, a three-way repeated measures ANOVA on the mean ERP amplitudes over lateral occipital scalp, for a 30 ms window centered at the midpoint of the SN (as defined in the difference wave of the unisensory visual target (V+) and non-target (V–) responses), was performed. This ANOVA had factors of Stimulus (five: V+A+, V+A–, V–A+, V–A– congruent and V–A– incongruent), Hemisphere (right and left), and Electrode (three scalp sites over lateral occipital scalp). A significant effect of Stimulus was followed up with two planned three-way repeated measures ANOVAs with factors of Stimulus (two), Hemisphere (right and left), and Electrode (three). It was expected that the responses elicited by target stimuli that included a target visual element (V+A+ and V+A– responses) would be significantly more negative going than the response elicited by non-target stimuli (V–A– congruent and V–A– incongruent responses). (The two bisensory non-target waveforms were averaged for these tests; non-target classes that are equivalently different from the targets should not differ with respect to selective attention effects.) To test for a multisensory object-recognition effect on the SN we similarly compared the V+A+ response and the V+A– response.
Both these responses should include the unisensory visual selective attention effects, while only the V+A+ response should include any multisensory object-recognition effects on visual selective attention processes. In addition, this difference should include an Nd. SCD mapping was used to spatially dissociate the multisensory ‘SN-effect’ from the Nd.

Multisensory Object-recognition Effects on the Auditory Selective Attention Component, the Nd. Multisensory object-recognition effects were also tested on the Nd, a negative going potential over fronto-central/central scalp that is elicited by relevant auditory stimuli when relevance is defined along a physical dimension(s) (e.g. Hillyard et al., 1973; Näätänen et al., 1978; Hansen et al., 1983; Näätänen, 1992). To establish the presence of the Nd, a three-way repeated measures ANOVA on the mean ERP amplitudes over fronto-central/central scalp, for a 30 ms window centered at the midpoint of the Nd (as defined in the difference wave of the unisensory auditory target (A+) and non-target (A–) responses), was performed. This ANOVA had factors of Stimulus (five: V+A+, V+A–, V–A+, V–A– congruent and V–A– incongruent), Hemisphere (right and left) and Electrode (three scalp-sites). A significant effect of Stimulus was followed up with two three-way repeated measures ANOVAs, with factors of Stimulus (two), Hemisphere (right and left), and Electrode (three). It was expected that each of the responses elicited by stimuli that included a target auditory element (the V+A+ and V–A+ responses) would be significantly more negative going than the response elicited by the non-targets (V–A– congruent and V–A– incongruent responses; see footnote 3). To test for a multisensory object-recognition effect on the Nd, we similarly compared the V+A+ response and the V–A+ response. Both these responses should include the ‘unisensory’ auditory selective attention effects, while only the V+A+ response would include any multisensory object-recognition effects on auditory selective attention processes.

Results

Behavioral Results

In the object-recognition task, both reaction times and accuracy rates (per cent hits) were affected by Target Type (see Table 2). Consistent with the hypothesis that object-recognition was enhanced by the co-occurrence of visual and auditory elements that belonged to the same object, object-recognition for the V+A+ targets was superior to object-recognition for the other targets, with the fastest mean reaction time and highest mean hit rate. Visually based object-recognition was faster and more accurate than auditory based object-recognition, with faster mean reaction times and higher mean per cent hits for targets that included a target visual element. Overall, the false alarm rate was low at 2%; false alarms were more or less evenly distributed among the non-target stimulus types.

Reaction Times

Mean reaction times were substantially different among the target types (see Fig. 3 and Table 2), with the difference between the shortest and longest reaction time greater than 130 ms. A main effect of Target [F(4,52) = 90.16, P < 0.001] was followed up with Tukey HSD comparisons. These revealed significant RT differences between all the target stimuli except the V+ targets and the V+A– targets, and the A+ targets and the V–A+ targets (see Table 3). The mean RT to the V+A+ targets was significantly faster than the mean RTs to all the other targets, consistent with the hypothesis that targets were more rapidly recognized when the visual and auditory elements belonged to the same object. The comparisons also revealed that the RTs to the targets so defined by their visual element were significantly faster than the RTs to the target stimuli so defined by their auditory element.

Test of the Race Model

The race model was reliably violated for four successive quantiles in the early portion of the reaction-time distribution (the third through the sixth; see Table 4 and Fig. 3, middle and right panels). Thus the reaction-time facilitation for the V+A+ targets was not due to mere statistical facilitation, but rather to the neural interaction of visual and auditory object information.
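The logic of the race-model (Miller inequality) test can be sketched as follows; the RT samples below are simulated placeholders standing in for the empirical V+A+, V+ and A+ target distributions.

```python
import numpy as np

# Sketch of the race-model (Miller inequality) test; not the authors' code.
# Simulated RTs stand in for the empirical target distributions.
rng = np.random.default_rng(1)
rt_v = rng.normal(520, 60, 400)       # unisensory visual target RTs (ms)
rt_a = rng.normal(650, 70, 400)       # unisensory auditory target RTs (ms)
rt_va = rng.normal(470, 55, 400)      # bisensory V+A+ target RTs (ms)

quantiles = np.linspace(0.05, 0.85, 17)   # 5th through 85th percentiles
q_va = np.quantile(rt_va, quantiles)      # latencies of the V+A+ quantiles

def cdf_at(x, sample):
    """Empirical cumulative probability of `sample` at latencies `x`."""
    return np.searchsorted(np.sort(sample), x, side="right") / sample.size

# Miller inequality: P(RT_va <= t) - [P(RT_v <= t) + P(RT_a <= t)];
# positive values indicate violation of the race model at that latency
miller = quantiles - (cdf_at(q_va, rt_v) + cdf_at(q_va, rt_a))
violations = miller > 0
```

Positive values of the Miller inequality at the early quantiles correspond to the race-model violations reported above.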

Accuracy Rates

Per cent hits also differed considerably among the target types, with a difference of ∼15% between the highest and the lowest hit rates (see Table 2). The pattern of differences generally paralleled those of the reaction times. For example, by both measures, the best performance was for the V+A+ targets and the worst performance was for the V–A+ targets. There was a main effect of Target type [F(4,52) = 17.96, P < 0.001]. Tukey HSD comparisons revealed that the per cent hits to the targets so defined by their visual element were significantly higher than the per cent hits to the target stimuli so defined by their auditory element (see Table 3).

Interference effects for bisensory targets with visual and auditory elements that belonged to different objects were not apparent in our data, with a lack of significant differences between the V+A– targets and V+ targets, and the V–A+ targets and A+ targets, by both the RT and accuracy measures (see Table 3).

Electrophysiological Results

The group-averaged electrophysiological responses elicited by the five bisensory stimuli are displayed in Figure 4. The responses showed classic visual and auditory sensory componentry. These included typical auditory components P1 peaking at ∼60 ms and N1 peaking at ∼118 ms (see Picton et al., 1974; Vaughan and Ritter, 1970; Vaughan et al., 1980); and visual components P1 peaking at ∼78 ms and N1 peaking at ∼150 ms. As would be expected, the auditory P1 and N1 components appeared maximal over fronto-central scalp. The visual P1 appeared maximal over lateral occipital scalp and the visual N1 appeared maximal over posterior-temporal/temporo-occipital scalp (see Fig. 5c).

The responses to all the bisensory stimuli were essentially identical until ∼125 ms post-stimulus onset. After this point there was a complex series of effects in the waveforms that appeared to be related to both multisensory object-recognition processes and the enhanced processing of target compared with non-target inputs. We will treat each of these in turn in the following sections. All P-values reported in the following sections have been adjusted using the Greenhouse–Geisser correction for violations of sphericity.

The Multisensory Object-recognition Effect in the Latency Range of the Visual N1

In line with our primary hypothesis, that multisensory object-recognition would result in the modulation of visual object-recognition processes, starting at ∼145 ms the response elicited by the V+A+ targets appeared more negative-going than the responses elicited by the other bisensory conditions (see Fig. 4h and Fig. 5a). This net negative modulation fell within the latency range and general scalp region of the visual N1. Topographic mapping revealed a distribution over right lateral occipital scalp (Fig. 5a) that was posterior to the topography of the N1 proper, seen in the V+ response (compare Figs 5a and 5b). Since the peak latency of the unisensory visual N1 was 150 ms, our dependent measure for the tests evaluating the multisensory ‘N1-effect’ was the mean ERP amplitude in the latency window from 140 to 160 ms.

A repeated measures ANOVA with factors of Stimulus (five), Electrode (three), and Hemisphere (two) resulted in a Stimulus by Hemisphere interaction [F(4,52) = 3.75, P = 0.02]. Three planned comparison three-way ANOVAs with factors of Stimulus (two), Electrode (three) and Hemisphere (two) were conducted to better understand this effect. Each of the ANOVAs compared the responses evoked by two of the stimulus types, one in which the visual and auditory elements belonged to the same object and the other in which the visual and auditory elements belonged to different objects. The V+A+ versus V+A– comparison revealed a significant Stimulus by Hemisphere interaction [F(1,13) = 8.08, P = 0.01; see Fig. 5a]. This was because the response to the V+A+ target was significantly more negative-going during the N1 timeframe, over the right hemisphere. The V+A+ target versus V–A+ target comparison revealed a main effect of Stimulus [F(1,13) = 8.7, P = 0.01]. As with the previous test, the V+A+ response was significantly more negative-going during the N1 timeframe. The comparison between the V–A– congruent and V–A– incongruent responses revealed no significant differences (for the main effect of Stimulus, [F(1,13) = 1.56, P = 0.23]). These data indicate that when the visual and auditory elements belonged to the same object, and were both targets, there was a significantly more negative-going response over right lateral occipital scalp, in the latency range and general scalp region of the visual N1 component. This effect was not due to the summation of unisensory selective attention effects and/or target effects; this control test is provided in the results subsection, The Multisensory Object-recognition Effects Are Not Due to Summation Effects.
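The dependent measure used in these tests (the mean ERP amplitude from 140 to 160 ms) can be sketched as below; the two-condition contrast is shown as a paired t-test, the two-level special case of the repeated measures contrasts reported here. The waveform values, the 2 ms sampling step and the single-electrode simplification are placeholder assumptions.

```python
import numpy as np

# Sketch (not the authors' code) of the 140-160 ms mean-amplitude measure and
# a two-condition paired contrast; 14 subjects, matching the reported dfs.
t = np.arange(-50, 250) * 2.0              # epoch timebase in ms (assumed)
n_subj = 14
rng = np.random.default_rng(2)

erp_vava = rng.normal(0, 1, (n_subj, t.size))   # V+A+ ERPs, subjects x time
erp_vana = rng.normal(0, 1, (n_subj, t.size))   # V+A- ERPs, subjects x time
win = (t >= 140) & (t <= 160)
erp_vava[:, win] -= 1.0                    # schematic multisensory 'N1-effect'

amp_vava = erp_vava[:, win].mean(axis=1)   # per-subject mean amplitude (uV)
amp_vana = erp_vana[:, win].mean(axis=1)

# paired t-test: the two-level special case of a repeated measures contrast
d = amp_vava - amp_vana
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n_subj))
```

A more negative-going V+A+ window mean yields a negative t statistic, mirroring the direction of the reported effect.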

Mapping and Source Analysis of the Multisensory ‘N1-effect’

To obtain a more precise estimation of the location of the generators underlying the multisensory ‘N1-effect’, we examined SCD maps and performed dipole source modeling on the grand mean V+A+ minus V+A– difference wave.

SCD topographic mapping of the multisensory ‘N1-effect’ revealed a stable source-sink distribution over right lateral occipital scalp (Fig. 5a). This topography is consistent with neural activity in LOC largely accounting for the multisensory effect. Comparison with the SCD topography of the response to the unisensory visual targets (V+) revealed that the multisensory ‘N1-effect’ was in the same general scalp region as the visual N1 proper, but was clearly posterior in topography (compare Fig. 5a and b). These differences in topography may be due to (i) only some of the generators that contributed to the visual N1 contributing to the multisensory ‘N1-effect’ (or their differential contribution), (ii) the presence of additional generators contributing to the N1 effect, or (iii) an altogether different set of neural generators accounting for the effect.

To confirm that the ‘N1-effect’ was generated in LOC, we employed the genetic algorithm module of BESA to model the intracranial generators of the effect. A single dipole was allowed to freely fit to the peak of the ‘N1-effect’ (156 ms). This resulted in a dipole situated in the right temporo-occipital cortex, consistent with a generator in the general region of right LOC (see Fig. 5c). Talairach coordinates (49, –67.2, –14) for this dipole place it in or about the fusiform gyrus. This single dipole accounted for 73.3% of the variance in the data across all channels at this timepoint. Refitting the dipole at neighboring timepoints resulted in a similar localization and similar levels of explained variance suggesting a stable fit across this timeframe. Therefore, we fixed the dipole in the location found at 156 ms and opened up the fitting window across an epoch from 140 to 160 ms (spanning the bulk of the ‘N1-effect’) and allowed the orientation parameter to be freely fit. Explained variance across this larger time-epoch was 63.0%. Finally, having fixed the location and orientation of this right LOC dipole, we added a second freely fitting dipole and allowed it to fit across the 140–160 ms time-window. Multiple starting positions were given. This dipole did not find a stable fit and resulted in no more than a 3–4% increase in explained variance. Addition of a third dipole gave a similar result. That these later dipoles make only marginal contributions to the explained variance suggests that no other robust signals outside of the right LOC are being generated in this timeframe. It is important to point out at this juncture that manually shifting this single dipole within the general region of the right LOC did not substantially change the explained variance. 
As such, the exact coordinates of this dipole should not be considered as an exact localization to fusiform gyrus per se, but rather as a ‘center-of-gravity’ for activity that is generated in this general region of the right LOC.
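The explained-variance figures quoted for the dipole fits follow the standard residual-variance definition; a minimal sketch is given below, with placeholder data and model arrays rather than actual BESA output.

```python
import numpy as np

# Sketch of the standard explained-variance (goodness-of-fit) measure for a
# dipole model: 100 * (1 - residual variance). Arrays are placeholders.
def explained_variance(data, model):
    """data, model: channels x timepoints arrays of scalp potentials (uV)."""
    residual = data - model
    return 100.0 * (1.0 - (residual ** 2).sum() / (data ** 2).sum())

rng = np.random.default_rng(3)
data = rng.normal(0, 1, (64, 11))               # e.g. 64 channels, 140-160 ms
model = data + rng.normal(0, 0.5, data.shape)   # imperfect model prediction
ev = explained_variance(data, model)
```

By this measure, adding a dipole that leaves the residual essentially unchanged (as with the second and third dipoles above) raises the explained variance only marginally.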

In all, these results from SCD mapping and inverse source modeling are highly consistent with the interpretation that neural generators situated within the LOC of the ventral visual stream largely accounted for the ‘N1-effect’.

Multisensory Object-recognition Effects on Visual Selective Processing: the Selection Negativity (SN)

It was expected that selective processing of visual elements that were targets would be reflected in the so-called ‘selection negativity’ (SN) over lateral occipital scalp, consistent with previous reports. Figure 4f–h shows that the responses evoked by the visual targets were more negative-going over occipital scalp than the responses to non-targets, with a maximal difference at ∼280 ms. This pattern of activity is suggestive of the elicitation of the SN. [An alternative N2 interpretation of the selective attention effects can be ruled out on the basis of the stimulus probability (Näätänen et al., 1982; for a detailed discussion, see Näätänen, 1992, pp. 236–244): The N2 is elicited by infrequently occurring stimuli, irrespective of target status; in our experiment each of the visual and auditory elements occurred with equal probabilities, whether targets or non-targets, and thus a differential N2 response is not expected.] Specific to the current multisensory design, we predicted that when the visual and auditory elements belonged to the same object (i.e. V+A+ targets), the SN would be enhanced. Beginning at ∼210 ms the V+A+ response became clearly more negative-going than the V+A– response. This negative difference exhibited a bilateral SCD topographic distribution over lateral occipital scalp (see Fig. 6a) that was highly similar to that of the ‘N1-effect’ (see Fig. 5a), and extended to ∼300 ms. In line with our hypothesis, this effect appeared to be an enhancement of the SN (compare the SCD maps in Fig. 6a,b). SN effects were evaluated by repeated measures ANOVA, where the dependent measure was the mean ERP amplitude in the latency window from 265 to 295 ms (based on the SN assessed in the unisensory V+ vs V– response).

A repeated-measures ANOVA with factors of Stimulus (five), Electrode (three), and Hemisphere (two) resulted in a main effect of Stimulus [F(4,52) = 12.49, P < 0.001] and a Stimulus by Hemisphere interaction [F(4,52) = 13.67, P < 0.001]. Three planned comparison three-way ANOVAs with factors of Stimulus (two), Electrode (three) and Hemisphere (two) were conducted to better understand this effect. Two of these ANOVAs compared responses evoked by stimuli that included a visual target element to the non-target response to assess the presence of the SN. The V+A+ response was significantly more negative-going than the response to the non-targets, with a main effect of Stimulus [F(1,13) = 27.3, P < 0.001]. This effect interacted with Hemisphere [F(2,26) = 22.0, P < 0.001] due to a greater difference over the left scalp. The V+A– response was also significantly more negative-going than the response to non-targets [F(1,13) = 7.1, P < 0.02], and this effect interacted with Hemisphere [F(2,26) = 20.5, P = 0.001], again due to a greater difference over the left scalp. The third ANOVA assessed the presence of multisensory effects on the SN. A significant difference between the V+A+ response and the V+A– response [F(1,13) = 9.99, P = 0.008] supported the notion that visual and auditory feature information interacted to affect processing reflected by the SN (Fig. 6a).

Unexpectedly, the V–A+ response was also negative-going with respect to the response elicited by non-targets over lateral occipital scalp [F(1,13) = 7.2, P = 0.02], although to a somewhat lesser extent (see Fig. 4g). SCD topographic mapping showed that the most likely explanation for this negative-going response was volume conduction of the centrally focused Nd, which is associated with selective processing of auditory features (see next section for analysis of the Nd). Consistent with such an explanation, topographic mapping of the unisensory Nd (the A+ response minus the A– response) revealed similar volume conduction of the Nd to electrodes over posterior scalp. We would also like to point out that a control test, which is reported two sub-sections below, ensured that the multisensory ‘SN-effect’ that is described above was not due to volume conduction of the Nd.

Auditory Selective Processing over Central Scalp: the Negative Difference Wave (Nd)

Over central/fronto-central scalp, starting at ∼180 ms, the response elicited by stimuli that included a target auditory element (i.e. the V+A+ targets and V–A+ targets) became more negative-going than the responses elicited by the stimuli without a target auditory element (i.e. the V+A– targets and the non-targets; see Fig. 4d,e). This difference extended to ∼350 ms. This response pattern is consistent with elicitation of the auditory selective attention component, the Nd. Since we hypothesized that object-recognition processes for the objects used in the current design would be primarily mediated through the visual system, we did not expect multisensory effects on the auditory Nd component. Indeed, there was no evidence of a multisensory effect on the Nd, with the V+A+ and V–A+ waveforms overlapping in the timeframe of the Nd over central/fronto-central scalp (see Fig. 4d,e).

Nd effects were evaluated by repeated measures ANOVA, where the dependent measure was the mean ERP amplitude in the latency window from 225 to 255 ms (based on the Nd assessed in the unisensory A+ vs A– response). A repeated measures ANOVA with factors of Stimulus (five), Electrode (three), and Hemisphere (two) resulted in a main effect of Stimulus [F(4,52) = 8.35, P = 0.001]. Three planned comparison three-way ANOVAs with factors of Stimulus (two), Electrode (three) and Hemisphere (two) were conducted to better understand this effect. Two of these ANOVAs compared responses evoked by stimuli that included an auditory target element to non-targets to assess the presence of the Nd. The V+A+ and the V–A+ responses were significantly more negative-going than the non-target response, with both ANOVAs showing a main effect of Stimulus [respectively, F(1,13) = 15.07, P = 0.002, and F(1,13) = 15.73, P = 0.002], and Stimulus by Hemisphere interactions [respectively, F(1,13) = 5.7, P = 0.03, and F(1,13) = 4.68, P = 0.05] due to a larger effect over left scalp. There was no evidence that multisensory object-recognition processes interacted with this auditory selective attention effect, with no significant difference between the V+A+ response and the V–A+ response (F < 1).

The Multisensory Object-recognition Effects Are Not Due to Summation Effects

To rule out the possibility that the multisensory ‘N1-effect’ and ‘SN-effect’ were simply the consequence of the summation of the visual and auditory selective attention effects (e.g. the SN and the Nd) and/or target effects, we performed a control test in which the V+A+ response was compared with the sum of the responses elicited by each of the visual- and auditory-unisensory targets (hereafter ‘Summed’ response). The V+A+ response and Summed response should each include the basic visual and auditory sensory evoked componentry in addition to any visual and auditory selective attention effects (e.g. SN and Nd) or target effects: if the V+A+ response significantly differs from the Summed response, a simple summation explanation cannot account for the multisensory object-recognition effects.
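The construction of the Summed response and the windowed comparison can be sketched as follows; the waveforms are simulated, and the superadditive negativity is injected by hand purely for illustration.

```python
import numpy as np

# Sketch (not the authors' code) of the summation control: compare the
# bisensory V+A+ response against the sum of the unisensory target responses.
t = np.arange(-50, 250) * 2.0               # epoch timebase in ms (assumed)
rng = np.random.default_rng(4)

erp_v_target = rng.normal(0, 0.1, t.size)   # V+ response (uV), placeholder
erp_a_target = rng.normal(0, 0.1, t.size)   # A+ response (uV), placeholder
summed = erp_v_target + erp_a_target        # the 'Summed' response

erp_vava = summed + rng.normal(0, 0.1, t.size)
win = (t >= 140) & (t <= 160)
erp_vava[win] -= 0.8                        # schematic multisensory negativity

# a V+A+ response more negative than the Summed response in this window
# argues against a simple summation account
difference = erp_vava[win].mean() - summed[win].mean()
```

A reliably negative difference in the window of interest is the signature the control test looks for.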

In the latency range of the N1-effect, the V+A+ response was more negative-going than the Summed response (see Fig. 7a). Amplitude differences in this latency window (140–160 ms) were examined using a two-way ANOVA with factors of ERP (two: V+A+ target vs Summed) and Electrode (three: over right occipito-temporal scalp from the original test of the multisensory ‘N1-effect’). The V+A+ response was significantly more negative-going, with a main effect of ERP [F(1,13) = 5.33, P = 0.038]. This demonstrates that the multisensory ‘N1-effect’ cannot be attributed to the summation of selective attention effects or target effects from the respective unisensory systems.

In the latency range of the SN, the V+A+ response was more negative-going than the Summed response over occipital scalp (see Fig. 7b). Amplitude differences in this latency window (265–295 ms) were examined using a two-way ANOVA with factors of ERP (two: V+A+ vs Summed) and Electrode (three: the lateral occipital scalp sites from the original test of the multisensory ‘SN-effect’). The V+A+ response was significantly more negative-going, with a main effect of ERP [F(1,13) = 5.46, P = 0.036]. In this case the more negative-going V+A+ response appeared to be due to the earlier evolution and peak of the SN when compared to the Summed response (see Fig. 7b).

This analysis serves as an important control to rule out a summation explanation of the multisensory object-recognition effects, and further supports the conclusion that these effects were indeed due to the neural interaction of object-information presented in the visual and auditory sensory modalities.

Post Hoc Analysis

Visual inspection of the grand mean waveforms suggested a multisensory object-recognition effect in the ERPs to the non-targets, with the V–A– incongruent response more negative-going than the V–A– congruent response, starting at 390 ms (see Fig. 8). This difference had a relatively broad distribution over centro-parietal scalp. As we had no specific hypotheses regarding this later effect, the following analysis is exploratory and the findings should be treated with caution until a future study replicates the finding.

A two-way repeated measures ANOVA with factors of Stimulus (two: V–A– congruent and V–A– incongruent) and Electrode (three) was performed on the mean ERP amplitudes over a 100 ms window (400–500 ms) from three electrodes over centro-parietal scalp, where the difference was greatest. A main effect of Stimulus [F(1,13) = 9.68, P = 0.008] was due to the more negative-going V–A– incongruent response compared with the V–A– congruent response. This is consistent with a multisensory object-recognition effect on the bisensory non-targets.

Discussion

To our knowledge, this is the first study to establish the benefits of multisensory inputs on object-recognition processes in the visual domain. Multisensory object-recognition was reliably faster than unisensory object-recognition for clearly identifiable visual and auditory stimuli that belonged to the same object. Since this RT facilitation could not be accounted for by statistical facilitation (i.e. the race model was violated), we attribute it, at least in part, to the neural interaction of the visual and auditory inputs. Thus, even when an object is easily identifiable through a single sensory modality, inputs from multiple sensory modalities can interact to facilitate object-recognition.

The neurophysiological data from this study indicate that multiple sensory inputs interacted to affect object processing at a relatively early stage in the information-processing stream, modulating neural processes in what are generally considered to be unisensory cortices. Specifically, we found that visual and auditory inputs interacted to enhance processing in the ventral visual stream, which is known for its role in visual object-recognition (e.g. Ungerleider and Mishkin, 1982; Allison et al., 1999; Doniger et al., 2000). This initial multisensory object-recognition effect may reflect the multisensory modulation of a subset of the neural generators underlying the visual N1, given their similar timeframe and general topography. The visual N1 is thought to reflect, at least in part, the structural encoding of visual objects (e.g. Bentin et al., 1999; Eimer, 2000; Rossion et al., 2000; Murray et al., 2002). The right hemisphericity of this effect was in line with findings from Molholm et al. (2002) and Giard and Peronnet (1999) of apparent multisensory modulation of the visual N1 that was limited to, or greater over, the right hemisphere. Thus, it may be the case that auditory influences on ventral visual stream processes in this timeframe are largely a right hemisphere function. Consistent with multisensory inputs affecting successive stages of object processing in the ventral visual stream, the N1 effect appeared to be passed on to subsequent neural processes that dealt with visual feature level information, as indicated by the apparent multisensory modulation of the SN.

The multisensory effect on the visual N1 was present in responses to the target stimuli as we predicted, but to our surprise, it was not at all apparent in the responses to non-target stimuli. Based on our original hypothesis — that such an effect would occur as a consequence of the visual and auditory elements belonging to the same object — we fully expected the modulation to be present for both target and non-target stimuli. The restriction of the effect to targets, combined with the lack of evidence of an interference effect in the behavioral data for ‘incongruent’ target trials (trials where the visual and auditory elements were mismatched), suggested an alternative explanation of the data. That is, it is possible that the behavioral and neurophysiological effects that we found resulted from the co-occurrence of task-relevant features in each of the visual and auditory sensory modalities, as opposed to resulting from the co-occurrence of visual and auditory features that were strongly associated with one another through long-term experience. There is some precedence in the literature for such, as Czigler and Balázs (2001) found that the amplitude of the SN was larger when both simultaneously presented visual and auditory elements were task-relevant, compared with when only the visual element was task-relevant. Key here is that the visual and auditory elements were unrelated prior to the task. Further, although not specifically tested, inspection of the waveforms in Czigler and Balázs (2001) suggests that the same effect might also have been present over posterior scalp in the latency range of the visual N1 (see fig. 1 of Czigler and Balázs, 2001). The same argument could well be made for our previous study (Molholm et al., 2002) and the one by Giard and Peronnet (1999) where multisensory N1 effects were also found and again, where there was no ‘natural’ relationship between the visual and auditory elements. 
However, it is important to point out that both of these previous studies found a significant diminution of the ERP response during the timeframe of the visual N1. In contrast, in the present study, we found a response enhancement during the visual N1.

An alternative account, which preserves the original hypothesis that the effect is due to visual and auditory elements belonging to the same object, is that computationally costly operations such as the early integration of visual and auditory features of the same object were only performed on the elements that were relevant (i.e. targets). For example, subjects could have rejected non-target elements from relatively time-constrained processes, once the absence of a relevant visual or auditory feature (or the presence of a non-target feature) was detected. Thus, reduced processing of the irrelevant non-target stimuli would account for the target specificity of the multisensory object-recognition effect (as well as the lack of behavioral interference effects). In this case, presumably, the multisensory object-recognition effect would be observed in non-targets under conditions where selective attention processes could not so constrain processing. For example, in a task in which targets were ‘wild animals’, non-targets were ‘domestic animals’, and no animal was repeated; since selective attention could not be very effectively instantiated, by this explanation such effects would be seen in both the ‘wild animal’ and ‘domestic animal’ responses.

These data add to the expanding role that visual processes play in object-recognition outside the strictly visual domain. Recent functional imaging studies have shown that when an object is presented for identification in the tactile sensory modality, ‘visual’ object-recognition processes in the ventral-occipital stream are substantially modulated (Amedi et al., 2001; James et al., 2002b). Zangaladze et al. (1999) showed that tactilely based object orientation judgements were negatively affected by the application of transcranial magnetic stimulation (TMS) over occipital scalp (TMS is employed to momentarily disrupt cortical function in relatively localized cortical regions). Of particular note in this study was the latency of maximal interference, which at 180 ms coincided well with the general latency range of the visual N1 component.

Our working hypothesis, that object-recognition processes would be a visually dominated function for the class of stimuli and task that were employed, was supported by the finding of multisensory object-recognition effects on the selective processing of visual, but not of auditory, features. That is, multisensory object-recognition processes modulated the SN but not the Nd. This finding of multisensory modulation specific to visual object-recognition processes complements analogous findings in the domain of speech perception. Functional imaging has shown that the presentation of matching auditory–visual speech results in increased activation of auditory cortical areas involved in the processing of speech within the superior temporal sulcus (STS), when compared with mismatching auditory–visual speech (Calvert et al., 2000). Thus, it seems that not only do multiple sensory inputs interact to modulate neural processes in what are generally considered to be unisensory cortices for the purpose of object-recognition, but also that the cortical locus of such effects depends upon the class of information to be recognized (e.g. for visual–auditory speech, auditory cortical areas, and for visual–auditory animals, visual object-recognition areas). Such a model of the neuronal basis of multisensory effects on recognition processes generates specific testable predictions. For example, a multisensory effect referred to as the ‘parchment-skin illusion’, in which tactile judgements of surface texture are influenced by simultaneous auditory stimulation (Jousmäki and Hari, 1998), would be predicted to be mediated by the neural processes that underlie texture identification, presumably residing in somatosensory cortices. Findings from Foxe et al. (2000) showing early multisensory modulation of processes in somatosensory cortices by auditory stimulation (50–80 ms post stimulus onset) demonstrate the feasibility of such a prediction (see also Lütkenhöner et al., 2002).

In addition to the predicted electrophysiological effects, a post hoc analysis suggested that there was a relatively late occurring congruency effect in the responses elicited by the bisensory non-targets (tested in the 400–500 ms latency window), where non-targets with mismatching visual and auditory elements elicited a more negative going response than non-targets that had matching visual and auditory elements. This effect was not predicted and obviously needs to be replicated before any serious interpretation can be made. However, we suspect that the effect was related to the semantic congruency between the simultaneously presented visual and auditory elements, and propose that it belongs under the class of components encompassed by the N400. The N400 is elicited in a variety of situations by unexpected words or objects as compared with expected words or objects. It is hypothesized to reflect semantic access and integration into the semantic context (Kutas and Federmeier, 2000), and has been proposed to reflect both sensory specific and supramodal processes (e.g. Kutas and Federmeier, 2000). Similar to the timing and topography of the effect observed in the present data, the N400 consists of a negative going response in the latency region of 400 ms, which has a broad central/centro-parietal voltage topography (e.g. Kutas and Hillyard, 1980). Different from the typical N400, the present effect was elicited by simultaneously presented, as opposed to sequentially presented, semantic information (as in Ganis and Kutas, 2003, in the visual sensory modality).

Conclusions

Task relevant visual and auditory features interacted to affect object-recognition processes. The neurophysiological data suggest that the behavioral multisensory object-recognition effect was due to auditory influences on visual information processing, and that this likely occurred at the feature level of representation. This indicates for the first time that visual and auditory information can be integrated at the feature level of information processing in sensory specific cortices. The scalp topography and coincident timing with visual object-recognition processes suggest that the behavioral effect was mediated in sensory cortical areas of the ventral visual processing stream — results from SCD mapping and dipole modeling were consistent with generators in the lateral occipital cortices. These combined behavioral and neurophysiological findings have significant implications for the brain mechanisms that sub-serve object-recognition when there is relevant information from more than one sensory modality — demonstrating that object-recognition processes in unisensory cortex are influenced by information from other cortical areas relatively early in the information processing stream.

This work was supported in part by grants from the NIH — MH63434 (J.J.F.), NS30029 (W.R.), MH49334 (D.C.J.); and the Burroughs Wellcome Fund. Data from this study are from a thesis submitted in partial fulfillment of the requirements of the degree of Doctor of Philosophy in the Department of Psychology of The City University of New York (S.M.). We would like to thank the anonymous reviewers for their constructive comments. We would like to express our sincere appreciation to Beth Higgins and Deirdre Foxe for their excellent technical assistance, and to Dr Micah Murray and Pejman Sehatpour for all their help.

Address correspondence to John Foxe, Cognitive Neurophysiology Laboratory, Program in Cognitive Neuroscience and Schizophrenia, Nathan S. Kline Institute for Psychiatric Research, 140 Old Orangeburg Road, Orangeburg, NY 10962, USA. Email: Foxe@nki.rfmh.org.

Figure 1. Stimulus probabilities. This diagram depicts the four basic stimulus types, the targets and non-targets that were derived from each, and the probabilities of each stimulus type. Probabilities are rounded to the nearest integer.


Figure 2. Schematic of the experimental paradigm, for a block where the target was designated ‘cow’. A dash in the center of the box indicates the absence of visual stimulation and a dash placed in the right lower corner indicates the absence of auditory stimulation.


Figure 3. Behavioral data. The left column shows mean reaction-times for the five target types. The middle column shows the cumulative probability of RTs from the 1st to the 85th percentile, for V+A+ targets (solid curve), V+A– targets (curve with triangles), V–A+ targets (curve with squares), and the cumulative probability predicted by the race model (dashed curve). The probability of the RTs to the V+A+ targets exceeds that predicted by the race model when the red curve (V+A+ target) falls to the left of the brown dashed curve (‘predicted’). The right column shows the Miller Inequality: positive values indicate violation of the race model. That is, the cumulative probability of responses for the V+A+ targets was greater than the cumulative probability predicted by the race model, at the specified percentile.

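The race-model comparison depicted in Figure 3 (Miller, 1982) can be sketched in a few lines of code: compare the empirical cumulative probability of multisensory reaction times against the bound given by summing the two unisensory cumulative probabilities. This is a minimal illustration, not the study's analysis code; the function names and the reaction times in the test are invented for the example.

```python
def ecdf(rts, t):
    """Empirical cumulative probability P(RT <= t) for a list of reaction times."""
    return sum(1 for r in rts if r <= t) / len(rts)

def miller_inequality(rt_av, rt_v, rt_a, t_grid):
    """Miller (1982) race-model test. For each time point t, returns
    P(AV <= t) - min(P(V <= t) + P(A <= t), 1).
    Positive values violate the race model, i.e. multisensory responses are
    faster than probability summation of the two unisensory channels predicts."""
    out = []
    for t in t_grid:
        bound = min(ecdf(rt_v, t) + ecdf(rt_a, t), 1.0)  # race-model upper bound
        out.append(ecdf(rt_av, t) - bound)
    return out
```

At early time points, where few unisensory responses have occurred, a positive value of the returned difference is the "violation" plotted in the right column of Figure 3.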

Figure 4. Responses evoked by all bisensory conditions. Traces from the ERP responses elicited by each of the five bisensory stimuli are shown for electrodes over right fronto-central (a), fronto-central (b), left fronto-central (c), right-central (d), left-central (e), right occipital (f), mid-occipital (g), and left occipital sites (h). V+A+ responses are represented by a red trace, V+A– target responses by a green trace, V–A+ responses by a blue trace, V–A– congruent responses by a dashed black trace, and V–A– incongruent responses by a thin black trace. Microvolts (µV) are plotted on the ordinate (1 µV per tick interval) and milliseconds (ms) are plotted on the abscissa (100 ms per tick interval). Positive is plotted up in all waveform figures.


Figure 5. Multisensory object-recognition effects in the latency range of the visual N1 component, over lateral occipital scalp. (a) The waveforms to the V+A+ (red trace) and V+A– (green trace) responses, and their difference (black trace), are displayed, with a black line corresponding to the time-point of the multisensory ‘N1-effect’ difference maps. Below the waveforms, the SCD map and voltage map of the V+A+ minus V+A– difference wave are both shown at 155 ms. The scale per division is 0.01 µV/cm2 for the SCD map, and 0.10 µV for the voltage map. These and all subsequent SCD and voltage topography maps are scaled to optimize visualization of the topographies. The black dot on the voltage map indicates the placement of the electrode from which the depicted waveforms were recorded. (b) For comparison, the SCD map and voltage map of the visual N1 elicited by the unisensory V+ targets at the same latency of 155 ms. The scale per division is 0.03 µV/cm2 for the SCD map, and 0.14 µV for the voltage map. (c) Visualization of the placement of the dipole that accounted for the majority of the variance associated with the multisensory ‘N1-effect’.

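The SCD (scalp current density) maps referenced in these captions estimate the second spatial derivative of the scalp potential, which sharpens topographies and reduces volume-conduction blur. One common approximation is Hjorth's nearest-neighbor Laplacian: each electrode's potential minus the mean of its neighbors. The sketch below assumes roughly uniform electrode spacing; the electrode names and neighbor sets are illustrative, not the montage used in this study.

```python
def hjorth_scd(potentials, neighbors):
    """Hjorth-style local-Laplacian approximation to scalp current density:
    each electrode's potential minus the mean of its nearest neighbors.
    Positive values mark local current sources, negative values sinks."""
    return {
        ch: v - sum(potentials[n] for n in neighbors[ch]) / len(neighbors[ch])
        for ch, v in potentials.items()
    }

# Illustrative three-electrode patch over occipital scalp
pots = {"Oz": 1.0, "O1": 0.0, "O2": 0.0}
nbrs = {"Oz": ["O1", "O2"], "O1": ["Oz", "O2"], "O2": ["Oz", "O1"]}
scd = hjorth_scd(pots, nbrs)
```

Because the Laplacian is reference-free, SCD maps such as those in Figure 5 emphasize focal generators more than the accompanying voltage maps do.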

Figure 6. Multisensory object-recognition effects in the latency range of the SN component, over lateral occipital scalp. (a) The multisensory object-recognition effect in the latency range of the visual selective attention component, the SN. The SCD map and voltage map of the V+A+ minus V+A– difference wave are shown at 272 ms. The scale per division is 0.01 µV/cm2 for the SCD map, and 0.10 µV for the voltage map. The waveforms to the V+A+ (red trace) and V+A– (green trace) responses, and their difference (black trace), are displayed to the right. (b) The visual selective attention effect: the SCD map and voltage map of the difference between the unisensory V+ and V– responses at 272 ms. The scale per division is 0.02 µV/cm2 for the SCD map, and 0.28 µV for the voltage map. The V+ waveform (pink trace), the V– waveform (dark cyan trace), and their difference (black trace) are displayed to the right. For both (a) and (b), the black dot on the voltage map indicates the placement of the electrode from which the depicted waveforms were recorded, and the vertical black line through the waveforms corresponds to the time-point of the difference maps.


Figure 7. Waveforms of the V+A+ response (red trace), the summed response (cyan trace), and their difference (black trace), are shown from a right (top panel) and a left (bottom panel) lateral-occipital electrode site. The placement of the electrode is depicted to the left of each of the sets of waveforms. The vertical black line represents the midpoint of the latency window used for the control tests. These were performed on the multisensory ‘N1-effect’ (a) and the multisensory ‘SN-effect’ (b).


Figure 8. A late occurring congruency effect. (a) The V–A– incongruent response (thin black trace), the V–A– congruent response (dashed black trace), and their difference (thick black trace) are displayed to the right. (b) Voltage map of the difference between the response to the V–A– incongruent versus the V–A– congruent responses at 450 ms. The scale per division is 0.10 µV.


Table 1


 Nomenclature and definitions of the five target and four non-target stimulus types

Stimulus type Definition 
Targets  
 V+A+ A bisensory stimulus in which the picture and sound elements belong to the same animal, and are both targets. 
 V+A– A bisensory stimulus in which the picture and sound elements belong to different animals, and only the picture (visual element) is a target. 
 V–A+ A bisensory stimulus in which the picture and sound elements belong to different animals, and only the sound (auditory element) is a target. 
 V+ A unisensory visual stimulus, i.e. an animal picture, that is a target. 
 A+ A unisensory auditory stimulus, i.e. an animal sound, that is a target. 
Non-targets  
 V–A– congruent A bisensory stimulus in which the picture and the sound belong to the same animal. 
 V–A– incongruent A bisensory stimulus in which the picture and the sound belong to different animals. 
 V– A unisensory visual stimulus, i.e. an animal picture. 
 A– A unisensory auditory stimulus, i.e. an animal sound. 
Table 2


 Mean reaction times and per cent hits for the five target types (ordered from highest to lowest performance levels)

 V+A+  V+A– V+ A+ V–A+ 
RTs (ms) 492 (75) 531 (68) 530 (58) 610 (53) 624 (53) 
% hits  96 (4)  95 (5)  92 (17)  84 (12)  80 (17) 
Table 3


 Reaction time differences (upper right) and per cent hit differences (lower left)

 V+A+  V+A– V+ A+ V–A+ 
V+A+  38.20 37.60 117.57 131.86 
  ** ** ** ** 
V+A–  1.50   0.57  79.36  93.64 
  –  – ** ** 
V+  4.59  3.06   79.90  94.20 
 –  –  ** ** 
A+ 12.67 11.10  8.08   14.26 
  –  **   – 
V–A+ 16.60 15.13 12.07   3.97  
 ** ** **   –  

*Significant at P < 0.05.

**Significant at P < 0.01.

Table 4

t-tests comparing the cumulative probability (CP) for the V+A+ targets with the cumulative probability predicted by the Race Model (for V+A– and V–A+ targets), over the first six quantiles

Quantile  CP: V+A+ targets  CP: predicted  t(11)  P
1         0.006             0.006          0.53   0.410
2         0.002             0.004          1.39   0.060
3         0.019             0.011          1.90   0.038
4         0.050             0.044          2.17   0.024
5         0.117             0.092          4.33   <0.001
6         0.173             0.120          4.70   <0.001

References

Allison T, Puce A, Spencer D, McCarthy G (1999) Electrophysiological studies of human face perception I: potentials generated in occipitotemporal cortex by face and non-face stimuli. Cereb Cortex 9:415–430.

Amedi A, Malach R, Hendler T, Peled S, Zohary E (2001) Visuo-haptic object-related activation in the ventral visual pathway. Nat Neurosci 4:324–330.

Andersen RA, Snyder LH, Bradley DC, Xing J (1997) Multimodal representation of space in the posterior parietal cortex and its use in planning movements. Annu Rev Neurosci 20:303–330.

Anllo-Vento L, Hillyard SA (1996) Selective attention to the color and direction of moving stimuli: electrophysiological correlates of hierarchical feature selection. Percept Psychophys 58:191–206.

Anllo-Vento L, Luck SJ, Hillyard SA (1998) Spatio-temporal dynamics of attention to color: evidence from human electrophysiology. Hum Brain Mapp 6:216–238.

Bentin S, Mouchetant-Rostaing Y, Giard MH, Echallier JF, Pernier J (1999) ERP manifestations of processing printed words at different psycholinguistic levels: time course and scalp distribution. J Cogn Neurosci 11:235–260.

Berman RA, Colby CL (2002) Both auditory and visual attention modulate motion processing in area MT+. Cogn Brain Res 14:64–74.

Calvert GA (2001) Crossmodal processing in the human brain: insights from functional neuroimaging studies. Cereb Cortex 11:1110–1123.

Calvert GA, Brammer MJ, Bullmore ET, Campbell R, Iversen SD, David AS (1999) Response amplification in sensory-specific cortices during crossmodal binding. Neuroreport 10:2619–2623.

Calvert GA, Campbell R, Brammer MJ (2000) Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Curr Biol 10:649–657.

Campbell R, Dodd B (1980) Hearing by eye. Q J Exp Psychol 32:85–99.

Czigler I, Balázs L (2001) Event-related potentials and audiovisual stimuli: multimodal interactions. Neuroreport 12:223–226.

Dias EC, Foxe JJ, Javitt DC (2003) Changing plans: a high density electrical mapping study of cortical control. Cereb Cortex 13:701–715.

Di Russo F, Martinez A, Sereno MI, Pitzalis S, Hillyard SA (2002) Cortical sources of the early components of the visual evoked potential. Hum Brain Mapp 15:95–111.

Di Russo F, Martinez A, Hillyard SA (2003) Source analysis of event-related cortical activity during visuo-spatial attention. Cereb Cortex 13:486–499.

Doniger GM, Foxe JJ, Murray MM, Higgins BA, Snodgrass JG, Schroeder CE, Javitt DC (2000) Activation time-course of ventral visual stream object-recognition areas: high density electrical mapping of perceptual closure processes. J Cogn Neurosci 12:615–621.

Doniger GM, Foxe JJ, Schroeder CE, Murray MM, Higgins BA, Javitt DC (2001) Visual perceptual learning in human object based recognition areas: a repetition priming study using high-density electrical mapping. Neuroimage 13:305–313.

Doniger GM, Foxe JJ, Murray MM, Higgins BA, Javitt DC (2002) Perceptual closure deficits in schizophrenia: a high-density electrical mapping study. Arch Gen Psychiatry 59:1011–1020.

Duhamel JR, Colby CL, Goldberg ME (1991) Congruent representation of visual and somatosensory space in single neurons of monkey ventral intraparietal cortex (VIP). In: Brain and space (Paillard J, ed.), pp. 223–236. Oxford: Oxford University Press.

Eimer M (2000) Effects of face inversion on the structural encoding and recognition of faces. Evidence from event-related brain potentials. Brain Res Cogn Brain Res 10:145–158.

Fabiani M, Kazmerski VA, Cycowicz YM, Friedman D (1996) Naming norms for brief environmental sounds: effects of age and dementia. Psychophysiology 33:462–475.

Foxe JJ, Simpson GV (2002) Timecourse of activation flow from V1 to frontal cortex in humans: a framework for defining ‘early’ visual processing. Exp Brain Res 142:139–150.

Foxe JJ, Morocz IA, Higgins BA, Murray MM, Javitt DC, Schroeder CE (2000) Multisensory auditory–somatosensory interactions in early cortical processing revealed by high density electrical mapping. Cogn Brain Res 10:77–83.

Foxe JJ, Doniger GM, Javitt DC (2001) Visual processing deficits in schizophrenia: impaired P1 generation revealed by high-density electrical mapping. Neuroreport 12:3815–3820.

Foxe JJ, Wylie GR, Martinez A, Schroeder CE, Javitt DC, Guilfoyle D, Ritter W, Murray MM (2002) Auditory–somatosensory multisensory processing in auditory association cortex: an fMRI study. J Neurophysiol 88:540–543.

Foxe JJ, McCourt ME, Javitt DC (2003) Right hemisphere control of visuo-spatial attention: ‘line-bisection’ judgments evaluated with high-density electrical mapping and source-analysis. Neuroimage 19:710–726.

Ganis G, Kutas M (2003) An electrophysiological study of scene effects on object identification. Brain Res Cogn Brain Res 16:123–144.

Giard MH, Peronnet F (1999) Auditory–visual integration during multimodal object recognition in humans: a behavioral and electrophysiological study. J Cogn Neurosci 11:473–490.

Hansen JC, Hillyard SA (1980) Endogenous brain potentials associated with selective auditory attention. Electroencephalogr Clin Neurophysiol 49:277–290.

Hansen JC, Dickstein PW, Berka C, Hillyard SA (1983) Event-related potentials during selective attention to speech sounds. Biol Psychol 16:211–224.

Harter MR, Aine C (1984) Brain mechanisms of visual selective attention. In: Varieties of attention (Parasuraman R, Davies DR, eds), pp. 293–321. New York: Academic Press.

Haxby JV, Ungerleider LG, Clark VP, Schouten JL, Hoffman EA, Martin A (1999) The effect of face inversion on activity in human neural systems for face and object perception. Neuron 22:189–199.

Hillyard SA, Hink RF, Schwent VL, Picton TW (1973) Evoked potential correlates of auditory signal detection. Science 182:177–180.

Ishai A, Ungerleider LG, Martin A, Schouten JL, Haxby JV (1999) Distributed representation of objects in the human ventral visual pathway. Proc Natl Acad Sci USA 96:9379–9384.

James T, Humphrey G, Gati J, Menon R, Goodale M (2002) Differential effects of viewpoint on object-driven activation in dorsal and ventral streams. Neuron 35:793–801.

James TW, Humphrey GK, Gati JS, Servos P, Menon RS, Goodale MA (2002) Haptic study of three-dimensional objects activates extrastriate visual areas. Neuropsychologia 40:1706–1714.

Jousmäki V, Hari R (1998) Parchment-skin illusion: sound-biased touch. Curr Biol 8:R190.

Kanwisher N, McDermott J, Chun MM (1997) The fusiform face area: a module in human extrastriate cortex specialized for face perception. J Neurosci 17:4302–4311.

Kenemans JL, Kok A, Smulders FTY (1993) Event-related potentials to conjunctions of spatial frequency and orientation as a function of stimulus parameters and response requirements. Electroencephalogr Clin Neurophysiol 88:51–63.

Kohler S, Kapur S, Moscovitch M, Winocur G, Houle S (1995) Dissociation of pathways for object and spatial vision: a PET study in humans. Neuroreport 6:1865–1868.

Kutas M, Federmeier KD (2000) Electrophysiology reveals semantic memory use in language comprehension. Trends Cogn Sci 4:463–470.

Kutas M, Hillyard SA (1980) Reading senseless sentences: brain potentials reflect semantic incongruity. Science 207:203–205.

Lerner Y, Hendler T, Malach R (2002) Object-completion effects in the human lateral occipital complex. Cereb Cortex 12:163–177.

Lutkenhoner B, Lammertmann C, Simoes C, Hari R (2002) Magnetoencephalographic correlates of audiotactile interaction. Neuroimage 15:509–522.

MacLeod A, Summerfield Q (1990) A procedure for measuring auditory and audio-visual speech-reception thresholds for sentences in noise: rationale, evaluation, and recommendations for use. Br J Audiol 24:29–43.

Malach R, Reppas JB, Benson RR, Kwong KK, Jiang H, Kennedy WA, Ledden PJ, Brady TJ, Rosen BR, Tootell RB (1995) Object-related activity revealed by functional magnetic resonance imaging in human occipital cortex. Proc Natl Acad Sci USA 92:8135–8139.

McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748.

Meredith MA (2002) On the neuronal basis for multisensory convergence: a brief overview. Brain Res Cogn Brain Res 14:31–40.

Miller J (1982) Divided attention: evidence for coactivation with redundant signals. Cogn Psychol 14:247–279.

Molholm S, Ritter W, Murray MM, Javitt DC, Schroeder CE, Foxe JJ (2002) Multisensory auditory–visual interactions during early sensory processing in humans: a high-density electrical mapping study. Brain Res Cogn Brain Res 14:115–128.

Murray MM, Wylie GR, Higgins BA, Javitt DC, Schroeder CE, Foxe JJ (2002) The spatiotemporal dynamics of illusory contour processing: combined high-density electrical mapping, source analysis, and functional magnetic resonance imaging. J Neurosci 22:5055–5073.

Näätänen R (1992) Attention and brain function. Hillsdale, NJ: Lawrence Erlbaum Associates.

Näätänen R, Gaillard AWK, Mantysalo S (1978) Early selective attention effect on evoked potential reinterpreted. Acta Psychol (Amst) 42:313–329.

Näätänen R, Simpson M, Loveless NE (1982) Stimulus deviance and evoked potentials. Biol Psychol 14:53–98.

Olson IR, Gatenby JC, Gore JC (2002) A comparison of bound and unbound audio-visual information processing in the human cerebral cortex. Brain Res Cogn Brain Res 14:129–138.

Picton TW, Hillyard SA, Krausz HI, Galambos R (1974) Human auditory evoked potentials. I. Evaluation of components. Electroencephalogr Clin Neurophysiol 36:179–190.

Puce A, Allison T, Asgari M, Gore JC, McCarthy G (1996) Differential sensitivity of human visual cortex to faces, letterstrings, and textures: a functional magnetic resonance imaging study. J Neurosci 16:5205–5215.

Puce A, Allison T, McCarthy G (1999) Electrophysiological studies of human face perception. III: Effects of top-down processing on face-specific potentials. Cereb Cortex 9:445–458.

Rossion B, Gauthier I, Tarr MJ, Despland P, Bruyer R, Linotte S, Crommelinck M (2000) The N170 occipito-temporal component is delayed and enhanced to inverted faces but not to inverted objects: an electrophysiological account of face-specific processes in the human brain. Neuroreport 11:69–74.

Scherg M, Von Cramon D (1985) Two bilateral sources of the late AEP as identified by a spatio-temporal dipole model. Electroencephalogr Clin Neurophysiol 62:32–44.

Schroeder CE, Foxe JJ (2002) The timing and laminar profile of converging inputs to multisensory areas of the macaque neocortex. Brain Res Cogn Brain Res 14:187–198.

Schweinberger SR, Pickering EC, Jentzsch I, Burton AM, Kaufmann JM (2002) Event-related brain potential evidence for a response of inferior temporal cortex to familiar face repetitions. Brain Res Cogn Brain Res 14:398–409.

Simpson GV, Pflieger ME, Foxe JJ, Ahlfors SP, Vaughan HG Jr, Hrabe J, Ilmoniemi RJ, Lantos G (1995) Dynamic neuroimaging of brain function. J Clin Neurophysiol 12:432–449.

Smid HGOM, Jakob A, Heinze H-J (1999) An event-related brain potential study of visual selective attention to conjunctions of color and shape. Psychophysiology 36:264–279.

Snodgrass JG, Vanderwart M (1980) A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. J Exp Psychol Hum Learn 6:174–215.

Stein BE (1998) Neural mechanisms for synthesizing sensory information and producing adaptive behaviors. Exp Brain Res 123:124–135.

Stein BE, Dixon JP (1979) Properties of superior colliculus neurons in the golden hamster. J Comp Neurol 183:269–284.

Stein BE, London N, Wilkinson LK, Price DD (1996) Enhancement of perceived visual intensity by auditory stimuli: a psychophysical analysis. J Cogn Neurosci 8:497–506.

Sumby WH, Pollack I (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26:212–215.

Tanaka J, Luu P, Weisbrod M, Kiefer M (1999) Tracking the time course of object categorization using event-related potentials. Neuroreport 10:829–835.

Thompson LA (1995) Encoding and memory for visible speech and gestures: a comparison between young and older adults. Psychol Aging 10:215–228.

Ungerleider LG, Mishkin M (1982) Two cortical visual systems. In: Analysis of visual behavior (Ingle DJ, Goodale MA, Mansfield RJW, eds), pp. 549–586. Cambridge, MA: MIT Press.

Vaughan HG, Ritter W (1970) The sources of auditory evoked responses recorded from the human scalp. Electroencephalogr Clin Neurophysiol 28:360–367.

Vaughan HG, Ritter W, Simson R (1980) Topographic analysis of auditory event-related potentials. Prog Brain Res 54:279–285.

Vogel EK, Luck SJ (2000) The visual N1 component as an index of a discrimination process. Psychophysiology 37:190–203.

Woods DL, Alain C (2001) Conjoining three auditory features: an event-related brain potential study. J Cogn Neurosci 13:492–509.

Zangaladze A, Epstein CM, Grafton ST, Sathian K (1999) Involvement of visual cortex in tactile discrimination of orientation. Nature 401:587–590.