Abstract

We studied eight normal subjects in an fMRI experiment where they listened to natural speech sentences and to matched simple or complex speech envelope noises. Neither of the noises (simple or complex) was understood initially, but after the corresponding natural speech sentences had been heard, comprehension was close to perfect for the complex but still absent for the simple speech envelope noises. This setting thus involved identical stimuli that were understood or not and permitted us to identify (i) a neural substrate of speech comprehension unconfounded by stimulus acoustic properties (common to natural speech and complex noises), (ii) putative correlates of auditory search for phonetic cues in noisy stimuli (common to simple and complex noises once the matching natural speech had been heard) and (iii) the cortical regions where speech comprehension and auditory search interact. We found correlates of speech comprehension in bilateral medial (BA21) and inferior (BA38 and BA38/21) temporal regions, whereas acoustic feature processing occurred in more dorsal temporal regions. The left posterior superior temporal cortex (Wernicke’s area) responded to the acoustic complexity of the stimuli but was additionally sensitive to auditory search and speech comprehension. Attention was associated with recruitment of the dorsal part of Broca’s area (BA44) and interaction of auditory attention and comprehension occurred in bilateral insulae, the anterior cingulate and the right medial frontal cortex. In combination, these results delineate a neuroanatomical framework for the functional components at work during natural speech processing, i.e. when comprehension results from concurrent acoustic processing and effortful auditory search.

Introduction

In a noisy natural environment, comprehension of a given acoustic stimulus may either fail or succeed. Success depends not only on the signal-to-noise ratio of the acoustic stimulus, but also on top-down factors such as high-level cognitive assessment of the context, e.g. expectancy of voices in a cocktail party situation, and low-level rapid perceptual learning, e.g. tuning to a given voice (Grossberg, 1999; Hines, 1999; Alain et al., 2001; Samuel, 2001).

In the present study, we sought to assess the relative contribution of synergetic components (acoustic processing, auditory search, comprehension) to the distributed cortical system underlying natural speech processing. Previous attempts to target speech intelligibility relied on comparisons of speech with closely matched non-speech noises (Scott et al., 2000; Vouloumanos et al., 2001) but never of identical stimuli. Here, we designed a functional neuroimaging experiment aimed at specifically identifying processing of speech comprehension independent of differences in the acoustic properties of the stimuli. Our setting additionally permitted the identification of cerebral regions engaged in the acoustic analysis of the stimuli and regions where auditory search mechanisms and speech comprehension interact.

Temporally structured noises, namely speech envelope noises (Shannon et al., 1995), sound like noises to a naïve subject and therefore generate no specific expectation or meaningful semantic percept. Yet, after a short perceptual learning period during which noises are presented in alternation with the matching speech sounds, they can be fully understood as speech, i.e. a meaning can be extracted. The effect is stable and noises can be understood thereafter even without the corresponding clear speech stimulus. Comparing the cerebral responses to speech envelope noises before and after perceptual learning could permit analysis of the neural components of speech comprehension, independent of any difference in the acoustic properties of the stimulus. However, comprehension of speech envelope noises remains effortful, i.e. it requires greater search for phonetic cues than listening to natural speech (Liebenthal et al., 2003). Therefore, comparing cerebral responses to the same noise stimulus, before and after it is understood, will pool and thus confound neural correlates of successful speech comprehension (perception) and of specific attention mechanisms that are directed at detecting phonetic indices in noisy stimuli.

This confound can be addressed by studying two other conditions: (i) normal speech that is understood without effort and (ii) speech envelope noises with a less complex temporal structure that are not understood even after pairing with the corresponding natural speech sounds, but can be expected to be processed with attentional effort because they resemble the complex sounds. Our setting thus involved a sequential presentation of the following conditions: simple noises, complex noises, normal speech, a learning phase where normal speech and complex noises were paired and, finally, a repetition of complex and simple noises. This setting permitted us to dissociate speech comprehension from auditory search using a factorial design with comprehension (success) and search (effort) as main factors. Comprehension was achieved for normal speech and, after learning, for complex noises, while search for phonetic cues was assumed to be active in both noise conditions after the learning period. When probing comprehension, independence from acoustic processing was ensured by constraining the results to those regions that were more active after learning than before for the same complex noises.

Materials and Methods

Subjects

Eight right-handed male subjects without audiological or neurological pathology (age range between 24 and 38 years, mean = 28.4 years) volunteered for an fMRI experiment. Written consent was obtained from all participants.

Stimuli

The stimuli were derived from exemplars of sentences from a German audiometric test (Oldenburger Satztest) read by a male speaker in quiet conditions. Each signal was then digitized via a 16-bit A/D converter at a 44.1 kHz sampling rate. In the ‘speech-envelope noise’ condition (for details of the procedure, see Shannon et al., 1995; Lorenzi et al., 1999, 2000; Apoux et al., 2001), each digitized signal was low-pass filtered below 5 kHz (1st order Butterworth filter, –6 dB per octave roll-off). The signal was then split into four broad frequency bands (3rd order elliptical IIR filters): (A) 20–800 Hz, (B) 800–1500 Hz, (C) 1500–2500 Hz and (D) 2500–5000 Hz. Adjacent filters overlapped at the point where the output from each filter was 15 dB below the level in the pass-band. The scheme of the digital processing for each frequency band is illustrated in Figure 1. The temporal envelope E(t) was extracted by half-wave rectification and low-pass filtering (1st order Butterworth filter, –6 dB per octave roll-off) of the band-pass filtered signal F(t) at a given cut-off frequency. The cut-off frequency was 2 Hz for stimuli we refer to as narrow-band speech envelope noises (NBSEN), and 256 Hz for stimuli we refer to as broadband speech envelope noises (BBSEN). The resulting envelope E(t) was then used to modulate a white noise. The modulated noise was frequency-limited by filtering with the same band-pass filter used for the original analysis band. The modulated noises from each band were combined and low-pass filtered at 5 kHz. Example waveforms and spectra of stimuli are shown in Figure 2. BBSEN have a complex temporal envelope and therefore can be understood after having listened to the speech stimuli they are derived from. Without such perceptual learning, they remain non-intelligible. Conversely, NBSEN remain incomprehensible even after they have been paired with the corresponding speech stimuli.
All acoustic stimuli were presented binaurally to the subjects through headphones via a custom-made air-conduction system. Signals were calibrated to produce an average output level of 105 dB SPL. Passive attenuation by the headphones was estimated to be ∼10 dB. Acoustic signals were delivered with a signal to noise ratio of ∼+5 dB.
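
The envelope-extraction and noise-modulation steps described above can be sketched for a single frequency band as follows. This is a minimal illustration under stated assumptions, not the authors' processing chain: the elliptic band-pass stage and the final recombination of the four bands are omitted, the input is assumed to be already band-pass filtered, and all function names and the noise generator are hypothetical. Only the half-wave rectification, the first-order Butterworth low-pass and the two cut-off frequencies (2 Hz for NBSEN, 256 Hz for BBSEN) are taken from the text.

```python
import math
import random

FS_HZ = 44100.0          # sampling rate stated in the text
NBSEN_CUTOFF_HZ = 2.0    # envelope cut-off for narrow-band noises (from the text)
BBSEN_CUTOFF_HZ = 256.0  # envelope cut-off for broadband noises (from the text)


def half_wave_rectify(signal):
    """Keep positive samples and set negative samples to zero."""
    return [s if s > 0.0 else 0.0 for s in signal]


def butterworth_lowpass_1st_order(signal, cutoff_hz, fs_hz):
    """First-order Butterworth low-pass designed with a pre-warped bilinear transform."""
    k = math.tan(math.pi * cutoff_hz / fs_hz)
    b = k / (1.0 + k)          # gain applied to x[n] + x[n-1]
    a = (k - 1.0) / (k + 1.0)  # feedback coefficient on y[n-1]
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in signal:
        y = b * (x + prev_x) - a * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out


def extract_envelope(band_signal, cutoff_hz, fs_hz=FS_HZ):
    """E(t): half-wave rectification followed by low-pass filtering of F(t)."""
    return butterworth_lowpass_1st_order(half_wave_rectify(band_signal), cutoff_hz, fs_hz)


def envelope_modulated_noise(band_signal, cutoff_hz, fs_hz=FS_HZ, seed=0):
    """Multiply white noise by E(t). The seeded Gaussian generator is an assumption."""
    rng = random.Random(seed)
    envelope = extract_envelope(band_signal, cutoff_hz, fs_hz)
    return [e * rng.gauss(0.0, 1.0) for e in envelope]


# Hypothetical test signal: a 1 kHz carrier, amplitude-modulated at 4 Hz, 50 ms long.
n_samples = int(0.05 * FS_HZ)
demo_band = [
    (1.0 + 0.5 * math.sin(2.0 * math.pi * 4.0 * n / FS_HZ))
    * math.sin(2.0 * math.pi * 1000.0 * n / FS_HZ)
    for n in range(n_samples)
]
demo_bbsen_band = envelope_modulated_noise(demo_band, BBSEN_CUTOFF_HZ)
demo_nbsen_band = envelope_modulated_noise(demo_band, NBSEN_CUTOFF_HZ)
```

In a fuller version, each of the four modulated noises would be passed again through its analysis band-pass filter before the bands are summed and low-pass filtered at 5 kHz, as the text describes.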

Protocol and Data Acquisition

Functional MRI was performed with a 1.5T Siemens Vision scanner (gradient booster, standard head coil), using an echoplanar imaging sequence (26 slices, no gap, voxel size = 3.6 × 3.6 × 5 mm, TR = 3 s). There were 13 blocks of sentences (or matched SEN noises) per session and two sessions per subject with different speech material. The acoustic items had a duration of 3–3.5 s and were presented with an ISI of 4 s. Five hundred and forty volumes were acquired per subject.

Prior to scanning, subjects were told that auditory stimulation would include noises and speech. They were instructed to report after each item whether it corresponded to meaningful speech or not, by pressing a button with one finger or another of the left hand, respectively. The instruction was minimal in order to let the subjects learn spontaneously throughout the experiment that some noises corresponded to meaningful speech.

Each session was divided into three non-interchangeable periods, two experimental phases with a learning period in between. Phase 1 was meant to recruit minimal auditory search and yield no comprehension. It included BBSEN and NBSEN presented in alternation in small blocks of nine items (sentences or corresponding noises), followed by natural speech and NBSEN using the same block presentation mode (see Fig. 3). During Phase 1, the subjects heard noises without suspecting that they concealed sentences. They made minimal effort to search for a meaning as they were unaware that there could be one. Natural speech alone yielded easy recognition and therefore also required minimal auditory search.

Phase 1 was followed by a learning period during which the noises (BBSEN and NBSEN) and the corresponding natural speech stimuli were presented pairwise. During this step, subjects spontaneously perceived and thus learned that some of the noises concealed a sentence.

The next period of the experiment (Phase 2) consisted of small blocks of nine items (either NBSEN or BBSEN) with acoustic material identical to that of Phase 1. In contrast to Phase 1, and owing to the perceptual experience gained during the learning period, we expected a search for phonetic cues to occur implicitly throughout both noise conditions in Phase 2 and hence to increase the attentional load, although only sentences from BBSEN could yield a meaningful percept. In constructing the stimuli we took care that recognizing BBSEN sentences would remain difficult even after the learning period and that they would not be too easily distinguishable from NBSEN stimuli. We thereby sought to ensure (i) an equivalent attentional effort for BBSEN and NBSEN during Phase 2 and (ii) that subjects would not reject NBSEN as unrecognizable simply based on auditory features that differ from BBSEN. The organization of the sequences was identical in all subjects. This experiment was not suitable for a randomized presentation order since comprehension of speech envelope noises required prior perceptual learning using normal speech.

Data Analysis

We used SPM99 (http://www.fil.ion.ucl.ac.uk/spm) for standard spatial pre-processing (realignment, normalization to MNI template and smoothing with a 10 mm Gaussian kernel for group analysis) and statistical block-design analyses comparing the conditions modelled.

To assess the influence of physical features of speech sounds on regional brain activity we computed two simple contrasts between stimuli with different physical properties: (i) natural speech versus complex noises in Phase 2 (BBSEN2), i.e. when both were understood and (ii) noises with complex temporal structure (BBSEN1) versus noises with simpler temporal structure (NBSEN1) during Phase 1, i.e. when neither was understood.

We then addressed comprehension and search for phonetic cues, respectively, as well as their interaction. Accordingly, the conditions BBSEN in Phase 1 (BBSEN1), natural speech, BBSEN2 and NBSEN in Phase 2 (NBSEN2) were analysed in a 2 × 2 factorial design with auditory search effort and comprehension as factors. Comprehension occurred in natural speech and BBSEN2, auditory search in BBSEN2 and NBSEN2, and BBSEN1 was associated with neither comprehension nor search, as subjects did not yet know that the noises concealed sentences. We determined main effects of each factor and the interaction term. Comprehension was assessed by (natural speech + BBSEN2) – (BBSEN1 + NBSEN2), search effort by (NBSEN2 + BBSEN2) – (natural speech + BBSEN1) and the interaction by (BBSEN1 + BBSEN2) – (natural speech + NBSEN2). Such a factorial analysis allowed us to derive each effect from an identical set of experimental conditions and to generate contrasts of balanced statistical power. However, this analysis was performed with acoustically different stimuli and hence is partly confounded. To ensure independence from stimulus physical properties, we restricted the main effect of comprehension to those regions that were more active during BBSEN2 than BBSEN1 (by inclusive masking at P < 0.001, uncorrected). We also excluded regions that were sensitive to acoustic complexity in BBSEN1 > NBSEN1 (by exclusive masking at P < 0.001, uncorrected).
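
The three contrasts of the 2 × 2 design can be written as weight vectors over the four conditions. The sketch below is illustrative only (the variable names and the ±1 coding are assumptions, not the SPM99 batch used in the study); it derives the weights from the factor assignments given in the text and shows that the interaction contrast is the element-wise product of the two main-effect contrasts.

```python
# Condition order used for all weight vectors in this sketch.
CONDITIONS = ("natural_speech", "BBSEN1", "BBSEN2", "NBSEN2")

# Factor levels as described in the text: +1 = present, -1 = absent.
COMPREHENSION = {"natural_speech": 1, "BBSEN1": -1, "BBSEN2": 1, "NBSEN2": -1}
SEARCH = {"natural_speech": -1, "BBSEN1": -1, "BBSEN2": 1, "NBSEN2": 1}


def factorial_contrasts():
    """Return main-effect and interaction weights for the 2 x 2 design."""
    comprehension = [COMPREHENSION[c] for c in CONDITIONS]
    search = [SEARCH[c] for c in CONDITIONS]
    # The interaction is the element-wise product of the two main effects.
    interaction = [a * b for a, b in zip(comprehension, search)]
    return {
        "comprehension": comprehension,
        "search": search,
        "interaction": interaction,
    }


def dot(u, v):
    """Inner product of two weight vectors."""
    return sum(a * b for a, b in zip(u, v))


weights = factorial_contrasts()
for name, w in weights.items():
    print(name, dict(zip(CONDITIONS, w)))
```

With this coding the interaction weights are positive for BBSEN1 and BBSEN2 and negative for natural speech and NBSEN2, which matches the contrast (BBSEN1 + BBSEN2) – (natural speech + NBSEN2) stated above. Each vector sums to zero and the three vectors are mutually orthogonal, which is what allows each effect to be derived from the same set of conditions with balanced power.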

Similarly, the effect of a search for phonetic cues during Phase 2 of the experiment was confounded by stimulus properties, since the factorial design did not include NBSEN in Phase 1 (NBSEN1). This confound was eliminated by masking our result for the effect of auditory search with the contrast of NBSEN2 with NBSEN1 (at P = 0.001, uncorrected). The finding was thus constrained to an effect that could be observed under identical physical stimulus conditions.
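
The logic of inclusive and exclusive masking described in the two preceding paragraphs can be sketched as a voxel-wise selection. This is a schematic illustration only: the function name, the separate `effect_threshold` parameter and the four p-values are invented for the example and are not data or settings from the study; only the mask threshold of 0.001 (uncorrected) comes from the text.

```python
def masked_effect(effect_p, include_p=None, exclude_p=None,
                  effect_threshold=0.001, mask_threshold=0.001):
    """Return indices of voxels that survive an effect and its masks.

    effect_p  : per-voxel p-values of the contrast of interest
    include_p : per-voxel p-values of the inclusive mask contrast (optional)
    exclude_p : per-voxel p-values of the exclusive mask contrast (optional)
    """
    surviving = []
    for i, p in enumerate(effect_p):
        if p >= effect_threshold:
            continue  # the effect itself is not significant here
        if include_p is not None and include_p[i] >= mask_threshold:
            continue  # inclusive mask: the voxel must also show the mask contrast
        if exclude_p is not None and exclude_p[i] < mask_threshold:
            continue  # exclusive mask: the voxel must not show the mask contrast
        surviving.append(i)
    return surviving


# Invented p-values for four hypothetical voxels (illustration only).
comprehension_p = [0.0001, 0.0002, 0.0004, 0.2]
bbsen2_gt_bbsen1_p = [0.0005, 0.03, 0.0001, 0.0001]   # inclusive mask
bbsen1_gt_nbsen1_p = [0.5, 0.4, 0.0002, 0.6]          # exclusive mask

comprehension_voxels = masked_effect(
    comprehension_p,
    include_p=bbsen2_gt_bbsen1_p,
    exclude_p=bbsen1_gt_nbsen1_p,
)
```

In this toy case only the first voxel survives: the second fails the inclusive mask, the third is removed by the exclusive mask, and the fourth does not show the main effect. The auditory search effect would use the same function with NBSEN2 versus NBSEN1 as the inclusive mask and no exclusive mask.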

Results

Behavioral Observations

In the presence of the scanner noise all conditions, including the task with natural speech, become ‘speech-in-noise intelligibility’ tasks. We assumed sufficient dynamics and linearity in the cerebral processes underlying auditory search and comprehension as a function of the noise level. Results in accuracy and reaction time (RT) indicate that the noise produced by the scanner was not detrimental to performance. Our behavioral findings are summarized in Figure 4. There was some very scarce speech perception in single subjects during BBSEN1 (mean intelligibility = 2.75%, SD = 9%) and NBSEN2 (mean = 3.66%, SD = 14%), while intelligibility was entirely absent for NBSEN1 (mean = 0%). We found no significant behavioral differences between these three conditions. Comprehension was very good during natural speech and BBSEN2, with mean accuracies of 99% (SD = 3.3) and 95% (SD = 11), respectively, and no significant difference between them. Reaction times showed no significant difference among the five conditions, in particular not between Phase 1 and Phase 2. These behavioral results do not indicate an effect of greater search for phonetic cues in Phase 2 of the experiment. However, as the dominant factor determining the behavioral report was a perceptual judgment, our findings can be taken as evidence of a similar degree of certainty across all conditions despite differences in both attentional challenge and perceptual result.

Cerebral Activations

An overview of our findings is provided in Table 1.

Regions Involved in the Processing of Physical Stimulus Properties

When contrasted with speech envelope noises, natural speech activated cortical areas extending from the mid/anterior part of the right superior temporal sulcus (STS) to more ventral temporal regions (Fig. 5). Natural speech also recruited the right temporo-occipital region (BA21/20/37) and the left STS. Stimulus-locked MRI signal changes averaged over all subjects (Fig. 5, lower panel) sampled from homologous STS regions in each hemisphere revealed a strong natural speech effect in the right-sided focus, whereas left STS showed only a small advantage for natural speech. Interestingly, both magnitude and time-course of the responses in both STS were very similar for BBSEN1 and BBSEN2, i.e. for physically identical stimuli of which only BBSEN2 resulted in comprehension. The magnitude of the response to BBSEN stimuli was larger in the left than in the right STS.

To assess areas that were sensitive to the temporal complexity of the stimuli we contrasted broad-band with narrow-band SEN during Phase 1 and found activation in both superior temporal cortices (including STS), both insulae, both middle frontal cortices (BA9) and the anterior cingulate (Fig. 6). Activation in the left hemisphere was larger than in the right and also included a response focus in posterior temporal regions (border of BA21/22/39: Wernicke’s area). Stimulus-locked MRI signal time courses (Fig. 6, lower panel) indicate that Wernicke’s area responded to all sounds tested in the experiment, but that response magnitude was significantly higher for complex stimuli. Wernicke’s area did not respond in a specific way to whether the stimuli were meaningful or not, but for identical stimuli there was a larger (non-significant) response when the stimuli were understood. An equivalent observation can be made with respect to a possible auditory search effect. For conditions with physically identical stimuli, Wernicke’s area responded more after the learning period, i.e. during the effortful conditions, even to non-intelligible stimuli.

Regions Involved in Successful Comprehension

The novelty of our experimental approach lies in comparing two presentations of identical stimuli with only the second presentation yielding comprehension. This setting permits the identification of areas that are associated with comprehension of the meaning of speech without confounding this response with physical differences between the stimuli. As already pointed out, an inevitable drawback of this approach (Dolan et al., 1997) is a temporal order confound, as BBSEN2 (fully understood) always occurred after BBSEN1 (perceived as meaningless). We controlled for temporal order effects by retaining only those areas where (i) natural speech activated more than NBSEN2, since natural speech was always presented prior to the NBSEN2 condition and (ii) BBSEN2 activated more than BBSEN1. Comprehension was thus found to activate the right anterior STS and a set of regions of the bilateral middle and inferior temporal gyri (BA21/20/38). We did not observe a clear left hemispheric predominance. One of these regions overlapped with a region also responding in the comparisons natural speech > BBSEN2 and BBSEN1 > NBSEN1 (Table 1).

Stimulus-locked MRI signal time courses (Fig. 7, Plot 1) revealed responses of higher magnitude for natural speech than BBSEN2. All other conditions did not significantly activate middle/inferior temporal regions.

Regions Involved in Auditory Search

Our setting permitted us to dissociate the effects of comprehension from the effects of auditory attention and search for phonetic cues. The latter was associated with activation in Broca’s area. Stimulus-locked MRI signal time courses from Broca’s area revealed that it responded strongly during natural speech and BBSEN2. However, the response to natural speech dropped rapidly, whereas the response to BBSEN remained sustained up to the end of each epoch with nine sentences. The stimulus-locked responses also show that the response in Broca’s area did not depend on physical stimulus properties. We assessed independently the responses to NBSEN and BBSEN before and after the phase of perceptual learning, i.e. before and after the subjects realized that noises can conceal meaning. For both categories of noises the response in Broca’s area was enhanced after the perceptual learning session (see arrows in Fig. 7, Plot 2) when subjects were searching for a meaning in noises.

Regions Involved in the Interaction of Auditory Search and Comprehension

Finally, we probed which regions might express an effect of auditory search on comprehension and vice versa. The interaction of speech comprehension and auditory search for phonetic cues in noises revealed a frontal network comprising both anterior insulae, the right inferior prefrontal cortex and the anterior cingulate. Stimulus-locked responses (Fig. 7, Plot 3) sampled from the left insula indicated a strong initial response to all stimuli, which collapsed for natural speech and non-intelligible noises but remained sustained for BBSEN stimuli irrespective of whether they were understood or not. Similar response patterns were observed in the other regions detected by this contrast.

Discussion

Because of their temporal structure, SEN constitute high-level auditory baselines. When their temporal envelope is sufficiently complex, as in the BBSEN stimuli, a brief period of perceptual learning permits stable subsequent comprehension. Our design, which included natural speech and SEN with simple and complex temporal envelopes, permitted us to test for distinct components of speech processing.

Extraction of Temporal Envelope

Both superior temporal regions (including STS) were sensitive to the temporal complexity of the acoustic stimuli, since they distinguished between stimuli that have equal long-term spectra (BBSEN and NBSEN, Fig. 2). The left superior temporal region responded more linearly and with a better sensitivity to an increase in temporal complexity than the right (3% versus 1% signal change). Previous studies have also shown that the left superior temporal cortex is sensitive to the temporal structure of sounds (Ahissar et al., 2001; Zatorre and Belin, 2001) and extracts those with a temporal envelope that is relevant for speech (between 4 and 10 Hz; Giraud et al., 2000). Consistent with a functional role in sensory feature analysis, the left superior temporal responses were strictly equivalent for stimuli with identical acoustic properties, irrespective of attention or perceptual experience with this material.

Natural Speech Processing

Our setting permitted us to contrast different sounds yielding equivalent comprehension, natural speech on the one hand and noises with complex temporal structure on the other. Although equally intelligible, these sounds differ by their spectral complexity (presence of harmonics in natural speech, Fig. 2). This difference was reflected in STS activity, with better sensitivity on the right compared to the left side. These findings are in accordance with the observation by Belin et al. (2000) that frequency structure probably plays a more prominent role than temporal modulation in voice-sensitive activation.

Our results are also consistent with prior studies comparing speech with speech-like stimuli (Scott et al., 2000; Vouloumanos et al., 2001). Scott et al. (2000) for instance, contrasted speech and vocoded speech (an equivalent to SEN) with rotated speech and rotated vocoded speech. In that study, intelligible stimuli were compared with stimuli that became non-intelligible by a physical manipulation (rotation). Therefore, the correlates of intelligibility could not be distinguished from those of a physical change, although most of the spectro-temporal structure of the sounds was preserved. Conversely, we compared sounds that differ considerably in their spectral structure but both contain sufficient phonetic cues to yield comprehension. The STS activations that were common to Scott et al. (2000) and both our contrasts ‘speech > BBSEN2’ and ‘BBSEN1 > NBSEN1’ may therefore correspond to the analysis of the spectral structure of the speech sounds or to the analysis of phonetic cues that can be extracted from the detailed temporal structure of sounds, but not to the semantic analysis of speech.

Similar considerations apply to the findings by Vouloumanos et al. (2001). They compared speech with complex non-speech sounds and found related effects in regions of the middle temporal region, but failed to identify the more ventral regions that we observed here in relation to semantic processing of speech. These concerns underline that defining the correlates of intelligibility requires comparison of conditions with identical physical stimuli.

Response Properties in Wernicke’s Area

Another region that responded to the temporal complexity of sounds was the posterior region of the planum temporale. This region corresponds to the more lateral and ventral part of Wernicke’s area, the posterior temporal language area that is critical for speech perception and has recently been re-defined by Wise et al. (2001). Wernicke’s area responded primarily to the temporal structure of the acoustic stimuli. Over and above activity related to the simple extraction of temporal features, the stimulus-related activity time courses in Wernicke’s area (and only there) showed distinct effects of natural speech, successful speech comprehension and auditory search, although none of these effects achieved statistical significance. This suggests that Wernicke’s area is a non-specialized but highly integrative area that is sensitive to several different factors. We recently showed with PET (Giraud and Price, 2001) that Wernicke’s area responds more to speech stimuli (words and syllables) than to other categories of complex sounds (environmental sounds, including animal cries). Moreover, this region retains a high degree of functional flexibility since in patients with cochlear implants it also responds to environmental sounds but not to meaningless noises (Giraud et al., 2001). In conclusion, Wernicke’s area appears to be potentially sensitive to all categories of meaningful complex sounds. This is compatible with the putative role of this region as an interface between perception and stored representations of familiar words (Wise et al., 2001). Our results also fit with the idea that this region could generate a transient representation of temporally ordered sound sequences, without distinguishing between words and other sounds, so as to allow for processing of novel words (Wise et al., 2001).

Speech Comprehension

The findings discussed so far illustrate the sensitivity of auditory regions to even subtle physical stimulus differences. This observation underlines the limits of a bottom-up approach that compares brain activity during speech perception with the response to the most speech-like stimulus that does not yield comprehension. Instead, we isolated neural correlates of speech comprehension while avoiding the fallacies related to differences in the acoustic properties of the stimuli. This was achieved by identifying regions that were significantly more active when BBSEN were understood compared to a condition where the same stimuli were not understood. With this approach it was particularly important to control for auditory attention effects. The temporal regions we identified here in association with intelligibility are consistent with the results from prior studies that targeted the neural correlates of language comprehension rather than of speech perception (Vandenberghe et al., 1996; Gorno-Tempini et al., 1998; Mummery et al., 1998; Giraud and Price, 2001). They are also congruent with the regions identified by Binder et al. (2000) when they compared words and pseudo-words, although the results in that latter study remained sub-threshold.

The stimulus-locked activity plots in areas related to speech comprehension (as in Fig. 7, plot 1) indicated greater activity during processing of natural speech. This suggests that activation in these regions is also modulated by the physical properties of the stimuli. It is conceivable that a larger effort to achieve speech comprehension during BBSEN2 compared to natural speech reduced the amount of subsequent semantic associations. The observation of a stimulus-dependent modulation of activity in semantic regions is compatible with the idea that successful comprehension is the end-point of bottom-up (stimulus-driven) processing. By assessing comprehension, ventral semantic regions would be strategically well qualified as a starting point for top-down effects and auditory search (Mesulam, 1998).

Auditory Search

To identify the brain regions involved in comprehension without confounding effects of physical stimulus properties we applied a design where identical stimuli are presented before and after a learning period. This design inevitably introduces a temporal order effect. It is quite conceivable that once the subjects have learned that random noises might contain verbal information they will, in Phase 2 when the same stimulus material is repeated, specifically search for phonetic cues in the acoustic signals. Our behavioral data showed no significant differences in reaction time and accuracy of responses to the same stimuli before and after learning. Hence, on purely behavioral grounds, there is no firm evidence for enhanced attentional mechanisms and search after training. Yet, we did note a trend towards an increased rate of false positive responses during the second presentation of NBSEN (from 0% in Phase 1 to 3.66% in Phase 2) that might be due to more search effort for phonetic cues in noises and also suggests that there was no automatic rejection of NBSEN as non-speech. We addressed this possible confound by dissociating the effects specific to the second phase of the experiment from the effect of comprehension.

We observed a significant temporal order effect in the dorsal part of Broca’s area. As several studies have already pointed to the role of the left inferior frontal cortex (Broca’s area) in monitoring auditory input and auditory attention (Zatorre et al., 1996; Jäncke et al., 2001; Wise et al., 2001), we assume this effect to reflect specific attentional mechanisms that became active once learning had occurred. This finding suggests that an additional brain mechanism is engaged when the processing of noisy acoustic stimuli (SEN correspond to speech temporal envelope filled with noise) is accompanied by tracking specific phonetic cues to detect a verbal content. This effect is equivalent to ‘visual search’ effects occurring after detection of objects hidden in a complex background, such as the well-known Dalmatian dog (Tallon-Baudry et al., 1997).

Recent computational theories have assumed that speech recognition does not require feedback loops (Norris et al., 2000). In contrast to this view, our findings are consistent with psychophysical observations that activation of abstract representations (words) affects the perception of phonetic units, e.g. phonemes, syllables (Samuel, 2001). The complex response pattern we observed in Broca’s region points to a role in auditory monitoring and search. A transient response to all acoustic stimuli suggests that Broca’s area interacted with auditory regions. This interaction could be relayed via the anatomical connections between posterior (superior temporal plane) and anterior language areas (inferior prefrontal) (Geschwind, 1965; Romanski et al., 1999a,b; Hackett et al., 1999). Response attenuation for natural speech and NBSEN (Phase 1) could reflect that the demand for on-line search for phonetic cues was minimal when stimuli were either clearly intelligible or clearly meaningless. Conversely, sustained and strong responses to BBSEN2 would signal the continuous demand for auditory search to decode noisy stimuli. This functional behavior of Broca’s area suggests that it receives direct or indirect input from semantic regions. Along the lines of our interpretation, the sustained but weaker responses to BBSEN1 and NBSEN2 would reflect the complexity of the stimulus (BBSEN1) and the expectancy of potentially meaningful stimuli after the perceptual learning phase (NBSEN2), respectively. In sum, this region of Broca’s area appeared to integrate input from memory, auditory and semantic regions to provide on-line feedback onto auditory processing.

Interaction of Auditory Search and Successful Perception

The anterior cingulate, the right middle frontal cortex and both anterior insulae were the regions that showed a significant interaction of auditory search with successful speech perception. As these regions were also found when contrasting BBSEN with NBSEN, their activity is related to the acoustic complexity of the stimuli, but their response patterns revealed modulations over time that suggest an on-line interaction of comprehension and auditory search rather than mere acoustic analysis. These regions adapted to both task difficulty and stimulus acoustic complexity. Accordingly, the response level dropped once the stimuli were correctly perceived or clearly meaningless and it remained sustained if the stimuli required further auditory search. Interaction of auditory search and speech perception is compatible with the integrative functional role of the insula, which receives input from auditory cortices and projects to the frontal lobe (Augustine, 1996). It is also compatible with the role of the anterior cingulate and right middle frontal gyrus in adaptation of intentional behavior (Paus, 2001), which in our study consisted in enhancing the auditory search for sounds that were possibly meaningful.

Conclusion

We disentangled separate functional aspects that, under natural conditions, operate in synergy to ensure speech comprehension. Our experimental design allowed us to distinguish regions that respond to the physical properties of the stimuli from regions that combine information from multiple sources and are subject to high-order modulation. Superior to middle temporal regions, including the superior temporal sulci, responded to the acoustic complexity of auditory stimuli and therefore appear to perform essentially bottom-up processing (analysis of acoustic features). Wernicke’s area was sensitive to several different parameters, and its functional role could therefore be to assess the potential comprehensibility of acoustic stimuli. Ventral temporal regions did not primarily respond to the acoustic properties of the stimuli, but their response to meaningful stimuli was modulated by acoustic features. These regions possibly form a specific substrate for the semantic processing of speech. The functional behavior of a set of frontal regions reflects integration of information from auditory and semantic regions, while the response pattern in Broca’s area suggests that it additionally integrates input from regions implicated in perceptual learning and could play a role in the modulation of auditory input.

The authors thank Christian Lorenzi for help in generating the acoustic stimuli. This study was supported by the Alexander von Humboldt Foundation (A.L.G.) and the Volkswagen Foundation (P.S., A.K.).

Figure 1. Procedure for generation of speech envelope noises.

Figure 2. Waveform (left) and spectrum (right) of the stimuli employed. Top panels: narrow-band speech envelope noise (NBSEN), corresponding to white noise convolved with the temporal envelope of the original sentence up to 2 Hz. Middle panels: broad-band speech envelope noise (BBSEN), corresponding to white noise convolved with the temporal envelope of the above sentence up to 256 Hz. Lower panels: natural digitized sentence. Vertical lines indicate the presence of formants. In the first phase of the experiment, subjects could understand neither NBSEN nor BBSEN. After a perceptual learning phase, they were able to understand BBSEN but not NBSEN.
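
The two noise types described in this caption can be illustrated with a minimal sketch. It is not the authors' procedure (that is given in Figure 1) and rests on several assumptions: the temporal envelope is estimated by full-wave rectification followed by a simple one-pole low-pass filter, the noise is uniform white noise, and the envelope is imposed on the noise by sample-wise multiplication. The function names, the toy input signal and the sampling rate are hypothetical.

```python
import math
import random


def lowpass_envelope(signal, fs, cutoff_hz):
    """Estimate a temporal envelope (assumed method, not the paper's):
    full-wave rectification followed by a one-pole low-pass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    envelope, state = [], 0.0
    for sample in signal:
        state += alpha * (abs(sample) - state)
        envelope.append(state)
    return envelope


def speech_envelope_noise(signal, fs, cutoff_hz, seed=0):
    """Impose the low-passed envelope on uniform white noise.
    Sample-wise multiplication is an assumption of this sketch."""
    rng = random.Random(seed)
    envelope = lowpass_envelope(signal, fs, cutoff_hz)
    return [e * rng.uniform(-1.0, 1.0) for e in envelope]


# Hypothetical stand-in for a sentence: 200 Hz carrier with slow modulation.
fs = 8000
toy = [
    math.sin(2 * math.pi * 4 * t / fs) ** 2 * math.sin(2 * math.pi * 200 * t / fs)
    for t in range(fs)
]

nbsen = speech_envelope_noise(toy, fs, cutoff_hz=2.0)    # envelope up to 2 Hz
bbsen = speech_envelope_noise(toy, fs, cutoff_hz=256.0)  # envelope up to 256 Hz
```

With the 2 Hz cutoff only the slowest amplitude fluctuations survive, whereas the 256 Hz cutoff preserves much faster envelope detail, which is the contrast between NBSEN and BBSEN that the caption describes.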

Figure 3. Experimental protocol. Each functional imaging session covered three distinct periods. During Phase 1, blocks of speech envelope noises with broad-band temporal envelope (<256 Hz, BBSEN) were presented in alternation with blocks of speech envelope noises with narrow-band temporal envelope (<2 Hz, NBSEN), followed by alternation of blocks of natural speech and blocks of NBSEN. Each block comprised nine items of a given condition. The training period was a period of perceptual learning and consisted of alternating presentation of natural speech and the corresponding BBSEN and NBSEN items. In Phase 2, the same BBSEN and NBSEN as in Phase 1 were presented again, but this time BBSEN could be clearly understood whereas NBSEN could not.

Figure 4. Behavioral data with accuracy and reaction time of the subjects for the different conditions tested.

Figure 5. Surface rendering and slice display (averaged T1 image) of the brain regions that are significantly more active (P < 0.001, uncorrected) when processing natural speech than when processing broad-band speech envelope noises (BBSEN), even when the latter are understood (Phase 2). These findings overlap with the voice-sensitive regions previously described by Belin et al. (2000). Stimulus-locked MRI signal time courses averaged over all subjects are shown for the conditions indicated. The black bar indicates the stimulus duration (block of nine sentences).

Figure 6. Brain regions activated significantly more (P < 0.001, uncorrected) by stimuli with complex temporal structure (BBSEN1) than by stimuli with simpler structure (NBSEN1). The regions other than Wernicke’s area are not sampled here as they overlap with regions probed by more specific contrasts (see Table 1, Figs 5 and 7). Stimulus-locked MRI signal time courses averaged over all subjects are shown for the different conditions. The black bar indicates the stimulus duration (block of nine sentences).

Figure 7. Activation in response to speech stimuli (P < 0.001, uncorrected) that is independent of the physical properties of the stimuli. Using a 2 × 2 factorial design with auditory attention and comprehension as factors, regions responding to auditory attention (green), to comprehension (blue) and to their interaction (red) were dissociated. In the plot labelled 2 (auditory attention), both baselines (NBSEN) are represented, although only one contributed to the factorial design (baseline in Phase 2), to allow comparison of effortful and non-effortful tasks with identical acoustic stimuli. The arrows indicate the direction of the change when the processing of the same stimuli is accompanied by auditory attention (search for phonetic cues in noises). Stimulus-locked responses are averaged over all subjects. The black bar indicates the stimulus duration (block of nine sentences).
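
The logic of a 2 × 2 factorial design such as the one named in this caption can be sketched with standard contrast weights. This is a generic illustration, not the authors' analysis: the cell order, the function name and the response values are hypothetical, and the assignment of the experimental conditions to the four cells is deliberately left open.

```python
# Assumed cell order (hypothetical labels, not taken from the paper):
#   0: attention+, comprehension+    1: attention+, comprehension-
#   2: attention-, comprehension+    3: attention-, comprehension-
CONTRASTS = {
    "attention": [1, 1, -1, -1],      # main effect of auditory attention
    "comprehension": [1, -1, 1, -1],  # main effect of comprehension
    "interaction": [1, -1, -1, 1],    # attention x comprehension
}


def contrast_value(weights, cell_means):
    """Weighted sum of the four cell means for one contrast."""
    return sum(w * m for w, m in zip(weights, cell_means))


# Toy responses in which the two factors add up without interacting.
additive = [2.0, 1.0, 1.5, 0.5]
# Toy responses in which the joint cell exceeds the additive prediction.
interacting = [3.0, 1.0, 1.5, 0.5]

additive_effects = {k: contrast_value(w, additive) for k, w in CONTRASTS.items()}
interacting_effects = {k: contrast_value(w, interacting) for k, w in CONTRASTS.items()}
```

In the additive toy case the interaction contrast is zero, while in the second case it is positive; only the latter pattern would count as an interaction of the two factors.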

Table 1

Cerebral activations

Entries in the contrast columns are coordinates (Z, corr. P).

| Anat. region | BA | Nat. speech > BBSEN2 | BBSEN1 > NBSEN1 | Auditory search (a) | Speech comprehension (b) | Int. of search and comprehension |
| --- | --- | --- | --- | --- | --- | --- |
| Frontal | | | | | | |
| Broca’s area | 44 | | | –56 10 30 (3.51, n.s.) | | |
| R. Mid. | | | 46 28 30 (5.08, <0.001) | | | 34 30 34 (4.15, n.s.) |
| L. Mid. | | | –46 28 38 (4.99, <0.05) | | | |
| Ant. Cing. | 32 | | 6 24 40 (4.90, <0.05) | | | 6 24 36 (4.74, <0.05) |
| R. Ant. Ins. | | | 40 26 –16 (7.13, <0.000) | | | 41 24 –16 (5.54, <0.001) |
| L. Ant. Ins. | | | –40 20 –10 (6.23, <0.000) | | | –40 20 –10 (5.00, <0.05) |
| Temporal | | | | | | |
| Wernicke’s area | 22 | | –60 –42 2 (4.85, <0.01) | | | |
| R. STS (ant.) | 21 | 60 2 –18 (4.77, <0.05) | 58 4 –14 (6.16, <0.000) | | 56 4 –16 (6.59, <0.000) | |
| L. STS (middle) | 21 | –62 –10 –8 (3.59, n.s.) | –62 –12 –14 (6.64, <0.000) | | | |
| L. Mid. (ant.) | 21 | | | | –58 –6 –14 (4.59, <0.05) | |
| R. Mid. (mid.) | 20 | | | | 48 –28 –14 (3.70, n.s.) | |
| R. Mid. (post.) | 20/21/37 | 62 –50 –6 (3.64, n.s.) | | | 56 –42 –14 (3.66, n.s.) | |
| L. Inf. (mid.) | 21 | | | | –56 4 –26 (3.5, n.s.) | |
| L. Inf. | 38/20 | | | | –34 10 –40 (3.51, n.s.) | |
| R. Inf. | 38/20 | 52 0 –36 (3.61, n.s.) | | | | |
| | | 40 14 –40 (3.93, n.s.) | | | | |

(a) Masked with (NBSEN2 – NBSEN1) at P = 0.001, uncorrected.

(b) Masked inclusively with (BBSEN2 – BBSEN1) and exclusively with (BBSEN1 – NBSEN1) at P = 0.001, uncorrected.
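
The inclusive and exclusive masking mentioned in these footnotes can be sketched as a voxel-wise selection rule. This is an illustration under assumptions, not the authors' pipeline: the threshold is taken as one-tailed, the main contrast is thresholded at the same level as the masks, and the function name and Z values are hypothetical.

```python
from statistics import NormalDist

# Z value for P = 0.001, assuming a one-tailed test (about 3.09).
Z_THRESHOLD = NormalDist().inv_cdf(1 - 0.001)


def masked_voxels(z_main, z_inclusive, z_exclusive, z_thr=Z_THRESHOLD):
    """Return indices of voxels that exceed the threshold in the main
    contrast and in the inclusive mask, but not in the exclusive mask."""
    kept = []
    for index, (zm, zi, ze) in enumerate(zip(z_main, z_inclusive, z_exclusive)):
        if zm > z_thr and zi > z_thr and not ze > z_thr:
            kept.append(index)
    return kept


# Hypothetical Z values for four voxels.
z_main = [4.6, 3.7, 5.0, 2.0]
z_incl = [3.5, 3.2, 1.0, 4.0]
z_excl = [0.5, 4.1, 0.2, 0.3]

print(masked_voxels(z_main, z_incl, z_excl))  # prints [0]
```

Only the first toy voxel survives: the second is removed by the exclusive mask, the third fails the inclusive mask and the fourth does not reach threshold in the main contrast.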

References

Ahissar E, Nagarajan S, Ahissar M, Protopapas A, Mahncke H, Merzenich MM (2001) Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proc Natl Acad Sci USA 98:13367–13372.
Alain C, Arnott SR, Picton TW (2001) Bottom-up and top-down influences on auditory scene analysis: evidence from event-related brain potentials. J Exp Psychol Hum Percept Perform 27:1072–1089.
Apoux F, Crouzet O, Lorenzi C (2001) Temporal envelope expansion of speech in noise for normal-hearing and hearing-impaired listeners: effects on identification performance and response times. Hear Res 153:123–131.
Augustine J (1996) Circuitry and functional aspects of the insular lobe in primates including humans. Brain Res Rev 22:229–244.
Belin P, Zatorre RJ, Lafaille P, Ahad P, Pike B (2000) Voice-selective areas in human auditory cortex. Nature 403:309–312.
Binder JR, Frost JA, Hammeke TA, Bellgowan PS, Springer JA, Kaufman JN, Possing ET (2000) Human temporal lobe activation by speech and nonspeech sounds. Cereb Cortex 10:512–528.
Dolan RJ, Fink GR, Rolls E, Booth M, Holmes A, Frackowiak RS, Friston KJ (1997) How the brain learns to see objects and faces in an impoverished context. Nature 389:596–599.
Geschwind N (1965) Disconnexion syndromes in animals and man. Part II. Brain 88:585–644.
Giraud AL, Price CJ (2001) The constraints functional neuroimaging places on classical models of auditory word processing. J Cogn Neurosci 13:754–765.
Giraud AL, Lorenzi C, Wable J, Kleinschmidt A, Frackowiak RSJ (2000) Representation of temporal envelope in the human auditory cortex. J Neurophysiol 84:1588–1597.
Giraud AL, Price CJ, Graham JM, Frackowiak RSJ (2001) Functional plasticity in language-related brain areas. Brain 124:1304–1316.
Gorno-Tempini ML, Price CJ, Josephs O, Vandenberghe R, Cappa SF, Kapur N, Frackowiak RSJ (1998) The neural systems sustaining face and proper-name processing. Brain 121:2103–2118.
Grossberg S (1999) The link between brain learning, attention, and consciousness. Conscious Cogn 8:1–44.
Hackett TA, Stepniewska I, Kaas JH (1999) Prefrontal connections of the parabelt auditory cortex in macaque monkeys. Brain Res 817:45–58.
Hines T (1999) A demonstration of auditory top-down processing. Behav Res Meth Instrum Comput 31:55–56.
Jäncke L, Buchanan TW, Lutz K, Shah NJ (2001) Focused and nonfocused attention in verbal and emotional dichotic listening: an fMRI study. Brain Lang 78:349–363.
Liebenthal E, Binder JR, Piorkowski RL, Remez RE (2003) Short-term reorganization of auditory analysis induced by phonetic experience. J Cogn Neurosci 15:549–558.
Lorenzi C, Berthommier F, Apoux F, Bacri N (1999) Effects of envelope expansion on speech recognition. Hear Res 136:131–138.
Lorenzi C, Dumont A, Fullgrabe C (2000) Use of temporal envelope cues by children with developmental dyslexia. J Speech Lang Hear Res 43:1367–1379.
Mesulam MM (1998) From sensation to cognition. Brain 121:1013–1052.
Mummery CJ, Patterson K, Hodges JR, Price CJ (1998) Functional neuroanatomy of the semantic system: divisible by what? J Cogn Neurosci 10:766–777.
Norris D, McQueen JM, Cutler A (2000) Merging information in speech recognition: feedback is never necessary. Behav Brain Sci 23:299–325; discussion 325–370.
Paus T (2001) Primate anterior cingulate cortex: where motor control, drive and cognition interface. Nat Rev Neurosci 2:417–424.
Romanski LM, Tian B, Fritz J, Mishkin M, Goldman-Rakic PS, Rauschecker JP (1999a) Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nat Neurosci 2:1131–1136.
Romanski LM, Bates JF, Goldman-Rakic PS (1999b) Auditory belt and parabelt projections to the prefrontal cortex in the rhesus monkey. J Comp Neurol 403:141–157.
Samuel AG (2001) Knowing a word affects the fundamental perception of the sounds within it. Psychol Sci 12:348–351.
Scott SK, Blank CC, Rosen S, Wise RJ (2000) Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123:2400–2406.
Shannon RV, Zeng FG, Kamath V, Wygonski J, Ekelid M (1995) Speech recognition with primarily temporal cues. Science 270:303–304.
Tallon-Baudry C, Bertrand O, Delpuech C, Pernier J (1997) Oscillatory gamma-band (30–70 Hz) activity induced by a visual search task in humans. J Neurosci 17:722–734.
Vandenberghe R, Price C, Wise R, Josephs O, Frackowiak RS (1996) Functional anatomy of a common semantic system for words and pictures. Nature 383:254–256.
Vouloumanos A, Kiehl KA, Werker JF, Liddle P (2001) Detection of sounds in the auditory stream: event-related fMRI evidence for differential activation to speech and nonspeech. J Cogn Neurosci 13:994–1005.
Wise RJ, Scott SK, Blank SC, Mummery CJ, Murphy K, Warburton EA (2001) Separate neural subsystems within ‘Wernicke’s area’. Brain 124:83–95.
Zatorre RJ, Belin P (2001) Spectral and temporal processing in human auditory cortex. Cereb Cortex 11:946–953.
Zatorre RJ, Meyer E, Gjedde A, Evans AC (1996) PET studies of phonetic processing of speech: review, replication, and reanalysis. Cereb Cortex 6:21–30.