Abstract

Speech processing in auditory cortex and beyond is a remarkable yet poorly understood faculty of the listening brain. Here we show that stop consonants, as the most transient constituents of speech, are sufficient to involve speech perception circuits in the human superior temporal cortex. Left anterolateral superior temporal cortex showed a stronger response in blood oxygenation level–dependent functional magnetic resonance imaging (fMRI) to intelligible consonantal bursts compared with incomprehensible control sounds matched for spectrotemporal complexity. Simultaneously, the left posterior superior temporal plane (including planum temporale [PT]) exhibited a noncategorical responsivity to complex stimulus acoustics across all trials, showing no preference for intelligible speech sounds. Multistage hierarchical processing of speech sounds is thus revealed with fMRI, providing evidence for a role of the PT in the fundamental stages of the acoustic analysis of complex sounds, including speech.

Introduction

Spoken stop consonants such as /d/ or /k/ are fleeting auditory percepts, with the actual onset (consonantal burst) lasting less than 40 ms. They are temporally much more transient and acoustically more variable than vowels. Nevertheless, consonant identification works surprisingly well in running speech. In fact, their robust decoding in everyday speech is one of the most admirable accomplishments of the human auditory system, relying on the system's fine spectrotemporal resolution. Consonantal bursts thus lie along a virtual demarcation line between nonspeech acoustic perception and language-related speech perception (Dehaene-Lambertz et al. 2005; Obleser, Scott, et al. 2006). Previous neuroimaging work has shown auditory brain areas in anterolateral superior temporal gyrus (STG) and superior temporal sulcus (STS) to be activated vigorously by listening not only to whole utterances, such as sentences (Scott et al. 2000; Narain et al. 2003) or words (Binder et al. 2000), but also to syllables (Liebenthal et al. 2005) or vowels (Obleser, Boecker, et al. 2006). Here, we asked whether even stop consonants in isolation would tap these speech-processing areas.

Using event-related functional magnetic resonance imaging (fMRI), we compared a set of natural, acoustically diverse consonant stimuli (speech) with acoustically well-controlled manipulated analogues thereof (nonspeech) and studied the corresponding brain activations in a temporally sparse-sampling paradigm. Two questions were of main interest. First, would the distinction between native stop consonants and their unintelligible analogues also be most prominent in the anterolateral STG/STS, and would this region be activated by any categorical distinction among native consonants, such as voiced versus voiceless? Second, if multiple representations of a single stop consonant category were evident along the central auditory processing hierarchy, could we track them and associate them with different structures of the auditory cortex? More specifically, is there a role for the posterior STG in the processing of such speech sounds?

Materials and Methods

Subjects

Thirteen monolingual speakers of American English (6 males; mean age 23.5 ± 7 years, mean ± standard deviation) were recruited for this study. The total duration of the experiment was 55 min, and subjects were reimbursed with 20 USD.

Stimulus Material

Stimuli were edited recordings of the following 4 naturally uttered stop consonants: [d] (front voiced), [t] (front voiceless), [g] (back voiced), [k] (back voiceless). We used 16 acoustically diverse exemplars for each category, namely, recordings of 4 different speakers of American English (2 males) in 4 different vowel contexts (i.e., consonants were digitally extracted from consonant–vowel combinations with ensuing vowels [i], [e], [u], or [o]). All edited audio files were sampled at 20 kHz and were 100 ms in duration (including voice onset time and the prevoicing common in American English). A linear fade-out of 10 ms and file edits at zero crossings were applied, and all audio files were equalized in terms of root-mean-square energy. An additional set of spectrally rotated analogues (Blesser 1972) was created from the consonant stimuli. This method (which has been used previously as a control condition in imaging studies: Scott et al. 2000; Obleser, Scott, et al. 2006) leaves the temporal envelope unaltered and preserves the spectrotemporal complexity (Fig. 1). However, the signal is rendered unintelligible by delivering the spectral information in a rotated/inverted format. For the control condition trials (a sixth of the total number of trials), a rotated analogue was randomly drawn from the full set of 64 exemplars.
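
For illustration only, the following Python sketch shows one way the waveform-level steps described above (preemphasis, spectral inversion, fade-out, root-mean-square equalization) could be implemented. The preemphasis coefficient, the ring-modulation implementation of spectral rotation, and the target RMS value are assumptions, not the authors' exact stimulus-preparation procedure.

```python
import numpy as np

def preemphasis(x, coef=0.97):
    """First-order preemphasis filter; the coefficient is an assumed, typical value."""
    return np.append(x[0], x[1:] - coef * x[:-1])

def spectrally_rotate(x):
    """Ring-modulate with an alternating +1/-1 sequence: each component at
    frequency f is reflected to fs/2 - f, i.e., the spectrum below Nyquist is
    inverted while the temporal envelope remains largely intact."""
    return x * ((-1.0) ** np.arange(x.size))

def linear_fadeout(x, fs, dur=0.010):
    """Apply a 10-ms linear fade-out at the end of the signal."""
    n = int(round(fs * dur))
    y = x.astype(float)
    y[-n:] *= np.linspace(1.0, 0.0, n)
    return y

def rms_equalize(x, target_rms=0.05):
    """Scale the signal to a common root-mean-square energy (target value is arbitrary)."""
    return x * (target_rms / np.sqrt(np.mean(x ** 2)))

# Example: build an unintelligible control sound from a consonant recording x (fs = 20 kHz)
# rotated = rms_equalize(linear_fadeout(spectrally_rotate(preemphasis(x)), fs=20000))
```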

Figure 1.

Speech sound stimuli and effects of spectral rotation. Spectral rotation renders a stop consonant [d] (upper panels) into an unintelligible version [d]' (lower panels), while preserving the spectrotemporal complexity (cf., spectrograms, left) and the temporal envelope structure (cf., magnitude of the signal, middle panels). Resulting differences in power spectra (rightmost panel) are alleviated by applying a preemphasis filter prior to spectral inversion.

Unlike vowels, which are characterized by an array of multiple formant frequencies, the acoustics of a consonantal burst can reasonably be reduced to a single value, the spectral peak (Stevens and Blumstein 1978; Lahiri et al. 1984). The spectral peak reflects the frequency of the spectral center of gravity at the very onset of the syllable. In the present study, the spectral peak for each stimulus (consonants as well as controls) was calculated as follows: a linear predictive coding analysis was performed, whereby a 160-point Hamming window was shifted over the first 30 ms of the stimulus to measure the spectral content of the initial burst. It is important to note that the focus of the current study was on the features of the actual stop consonant (burst and release); therefore, the acoustic analysis was restricted to the first 30 ms of each stimulus and does not reflect the influence of the formant frequency pattern of the ensuing steady-state vowel (the periodicity of which becomes evident only after 30 ms or so and is seen in the rightmost parts of the spectrograms shown in Fig. 1). The outcome of this acoustic analysis is a single value for each individual stimulus, reflecting its spectral center of mass. Notably, this spectral peak of the consonantal burst was independent of stimulus category, as we used a wide variety of acoustically different consonant recordings (i.e., spectral peak was not a function of consonant category, P > 0.50). Highly relevant to the actual results, the spectral peak was also not a function of whether the stimulus was a speech or nonspeech exemplar (P > 0.20). It therefore serves as an acoustic measure for each stimulus exemplar that does not convey the stimulus category.
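
A minimal Python sketch of how such a spectral-peak estimate could be derived from the first 30 ms of a stimulus is given below. The LPC order, the frame hop, and the rule for collapsing per-frame envelope maxima into one value are illustrative assumptions; only the 160-point Hamming window and the 30-ms analysis span are taken from the description above.

```python
import numpy as np
from scipy.signal import freqz

def lpc_autocorr(frame, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin recursion)."""
    r = np.correlate(frame, frame, mode="full")[frame.size - 1:frame.size + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a

def burst_spectral_peak(x, fs=20000, win_len=160, order=14, analysis_ms=30):
    """Slide a 160-point Hamming window over the first 30 ms, fit an LPC
    envelope per frame, and average the frequencies (Hz) of the envelope
    maxima (averaging rule and order=14 are assumptions)."""
    seg = x[: int(fs * analysis_ms / 1000)].astype(float)
    win = np.hamming(win_len)
    peaks = []
    for start in range(0, seg.size - win_len + 1, win_len // 2):
        frame = seg[start:start + win_len] * win
        a = lpc_autocorr(frame, order)
        freqs, h = freqz([1.0], a, worN=2048, fs=fs)
        peaks.append(freqs[np.argmax(np.abs(h))])
    return float(np.mean(peaks))
```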

Experimental Procedure

Subjects were positioned in a 3-T Siemens Trio scanner equipped with a standard volume head coil for radio frequency pulse transmission, fitted with a customized auditory stimulus delivery system (air conduction; sound pressure level at the subjects' ear approximately +65 dB). After a short familiarization period, all subjects were able to tell human speech (i.e., original stop consonants) from nonspeech (i.e., spectrally rotated consonants) instantaneously (for behavioral testing results on this ability, see Obleser, Scott, et al. 2006).

A series of 252 magnetic resonance (MR) volume scans (echo-planar imaging) was performed, during which trials of all 6 conditions (4 consonant categories plus spectrally rotated analogues and off-trials) were presented in a pseudorandomized and interleaved fashion, with 42 volumes per condition. A volume consisted of 25 axial slices, obliquely rotated to cover all of temporal cortex and inferior frontal cortex (128 × 128 matrix, voxel size 1.5 × 1.5 × 1.9 mm³, no gap). Volume scans were acquired using temporal sparse sampling (Hall et al. 1999) with a repetition time of 10 s and an acquisition time of 2.48 s, with a single stimulus presented in silence 5 s prior to the next volume scan. The exact time of stimulus presentation was jittered randomly by ±500 ms in order to sample the blood oxygenation level–dependent (BOLD) response more robustly (Fig. 2).
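
For concreteness, the sketch below lays out such a pseudorandomized sparse-sampling schedule in Python (numpy only). The random seed, the absence of ordering constraints, and the lead-in volume added so that the first stimulus does not fall at a negative time are illustrative assumptions rather than the scheme actually used.

```python
import numpy as np

TR, TA = 10.0, 2.48                      # repetition and acquisition time (s)
CONDITIONS = ["d", "t", "g", "k", "rotated", "silence"]
N_PER_COND = 42                          # 6 x 42 = 252 volumes in total

rng = np.random.default_rng(0)           # arbitrary seed

# Pseudorandomized, interleaved trial sequence (no further constraints imposed here).
sequence = np.repeat(np.arange(len(CONDITIONS)), N_PER_COND)
rng.shuffle(sequence)

# Each stimulus is presented in silence ~5 s before its volume scan,
# jittered by +/-500 ms to sample the BOLD response more robustly.
n_trials = sequence.size
volume_starts = (np.arange(n_trials) + 1) * TR          # one lead-in TR (illustrative)
jitter = rng.uniform(-0.5, 0.5, size=n_trials)
stimulus_onsets = volume_starts - 5.0 + jitter
stimulus_onsets[sequence == CONDITIONS.index("silence")] = np.nan  # off-trials: no sound
```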

Figure 2.

Schematic display of event-related sparse-sampling fMRI paradigm. Timing of auditory stimuli, scanner pulses, and expected BOLD responses in auditory brain regions are shown. Trials were pseudorandomized to allow for an intermixed presentation of all experimental conditions (4 stop consonant categories, spectrally rotated analogues, silence) and all acoustically variant exemplars thereof.

Data Analysis

Images were realigned, coregistered, normalized, and smoothed (4 × 4 × 6 mm³ kernel) off-line using SPM2 (Wellcome Department of Imaging Neuroscience, London, UK). Two SPM models were set up for each subject. First, a standard SPM model with 4 regressors for the consonant categories, a fifth for the spectrally rotated condition, and a sixth for baseline (scanner noise only) trials was estimated and then used to assess categorical contrasts such as "speech greater than nonspeech" or "voiced greater than voiceless." Additionally, in order to look for brain areas sensitive to the acoustic parameter of spectral peak in our stimuli irrespective of stimulus category, a second SPM model with each stimulus exemplar's spectral peak as a single regressor was estimated for every subject individually. The randomized sequence of stimulus exemplars used for each subject across the 252 MR volumes allowed us to construct a regressor from the corresponding spectral peak values of each trial's stimulus exemplar (silent trials in the acoustic regressor time series were replaced with the mean of all nonsilent trials, thereby avoiding the considerable leverage of off-trials on the regression slope). For illustration purposes (Fig. 4A), we additionally extracted fMRI time courses from some subjects' individual peak clusters of both the categorical and the correlation analyses using the MarsBaR toolbox for SPM and correlated those time series with the acoustic regressor using Matlab.
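
The first-level models themselves were estimated in SPM2; purely as an illustration of the regressor construction described above, a numpy sketch could look as follows. The z-scoring step and all names are assumptions and not part of the SPM pipeline.

```python
import numpy as np

def spectral_peak_regressor(peaks_per_trial, is_silent):
    """Per-volume regressor built from each trial's spectral peak. Silent
    (off) trials are replaced by the mean of all nonsilent trials so that
    they exert no leverage on the regression slope; the regressor is then
    z-scored (normalization choice is an assumption)."""
    reg = np.asarray(peaks_per_trial, dtype=float)
    is_silent = np.asarray(is_silent, dtype=bool)
    reg[is_silent] = reg[~is_silent].mean()
    return (reg - reg.mean()) / reg.std()
```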

For both the categorical and the regression analysis, the resulting contrast images from each subject were submitted to second-level (random-effects) one-tailed t-tests to assess group statistics. All statistical inference reported is based on random-effects analyses thresholded at P < 0.005 (uncorrected) and a cluster extent of k > 25.
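
A minimal sketch of such a second-level (random-effects) test, written with scipy for illustration only (the actual analysis was run in SPM2), is:

```python
import numpy as np
from scipy import stats

def random_effects_ttest(contrast_maps):
    """One-sample t-test across subjects at every voxel.
    contrast_maps: (n_subjects, n_voxels) array of first-level contrast estimates."""
    t, p_two_sided = stats.ttest_1samp(contrast_maps, popmean=0.0, axis=0)
    # Convert to one-tailed p values for positive effects.
    p_one_sided = np.where(t > 0, p_two_sided / 2.0, 1.0 - p_two_sided / 2.0)
    return t, p_one_sided

# Voxels with p < 0.005 (uncorrected) would then additionally be screened for
# a cluster extent of k > 25 contiguous voxels, as in the analyses reported here.
```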

Results

Passive listening to single isolated consonantal bursts evoked bilateral activation of superior temporal cortical structures (STG), extending anteriorly and laterally into the STS as well as posterolaterally into the middle temporal gyrus (Fig. 3). Interestingly, the direct comparison of activations yielded a region at the transition of the left lateral STG into the anterolateral STS (Brodmann area 21) that was activated significantly more strongly by intelligible, natural speech sounds, even ones as short and transient as these stop consonantal bursts (Table 1). As shown in Figure 3, a slight left preponderance was seen.

Figure 3.

fMRI responses in superior temporal cortex to stop consonants and their unintelligible analogues. Comparisons of natural speech (vs. scanner noise only; intelligible, shown in white) and nonspeech sounds (vs. scanner noise only; unintelligible, shown in black) are shown. fMRI responses are overlaid onto mean structural brain of all 13 subjects.

Table 1

Overview of significant clusters in the random-effects analysis (P < 0.005, corresponding to Z > 3.09; cluster extent >25 voxels)

Site | Montreal Neurological Institute (MNI) coordinates (x, y, z) | Z | Extent (mm³)
Intelligible consonants > spectrally rotated analogues (N = 13)
    L anterolateral STS | −78, −18, 0 | 3.15 | 312
    R putamen | 26, 2, 2 | 3.51 | 280
    Anterior cingulate | 0, 50, −2 | 3.10 | 2624
Voiced consonants [d, g] > voiceless consonants [t, k] (N = 13)
    L lateral STG | −68, −16, 4 | 4.05 | 3584
    R anterior STS | 78, −4, −4 | 3.12 | 272
Spectrally rotated analogues > intelligible consonants (N = 13)
    R STG (BA 42) | 64, −26, 8 | 3.36 | 656
Correlation with spectral peak of consonantal burst (silent trials replaced by the mean, N = 12)
    L inferior supramarginal gyrus (BA 40) | −66, −34, 22 | 4.19 | 272
    L STG (BA 22) | −66, −50, 16 | 3.51 | 440
    R STG (BA 22) | 60, 14, −4 | 3.39 | 304

Note: Specifications refer to peak voxels. L = left, R = right, BA = Brodmann area.

Further analysis showed that the left anterolateral STS activation was mostly accounted for by trials with voiced sounds ([d] or [g]) as compared with voiceless sounds ([t] or [k]; Table 1). Direct comparisons among single consonant categories did not yield any significant activations.

An important added value of these event-related data arises from testing additional acoustic parameters as explanatory variables. Because we employed a wide variety of acoustically different exemplars (edited from recordings of different speakers and different vowel contexts), it is possible to test for stimulus properties that influence brain activation beyond the a posteriori factorial comparisons, such as the speech versus nonspeech or voiced versus voiceless categories reported above. Although these categorizations clearly relate to the percepts of these sounds, they do not exclude the possibility that a more unspecific, noncategorical processing of the consonant acoustics occurs at some stage along the auditory cortical processing hierarchy. A variety of recent models of auditory processing suggest that the posterior superior temporal plane (including the planum temporale [PT]) might be a candidate structure for such more basic (or upstream) processing of spectrotemporally complex sounds (Rauschecker and Tian 2000; Zatorre et al. 2002; Poeppel and Hickok 2004; Warren, Wise, et al. 2005). We therefore performed an additional regression analysis in all subjects, in which a purely acoustic parameter (i.e., the spectral peak of the consonantal burst) for each trial was used as a predictor variable (see Materials and Methods). Interestingly, in 8 of 12 subjects this analysis yielded the strongest activation in posterior STG (left, N = 5; right, N = 3), with a mean correlation coefficient of r = 0.34. Figure 4 shows exemplary data from a single subject to illustrate the differential responses seen in left posterior STG (exhibiting a significant correlation with the spectral peak of the stimulus exemplars) and left anterolateral STS (showing no such correlation).
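
As an illustration of the single-subject correlations underlying Figure 4A, the sketch below extracts a cluster's first eigenvariate and correlates it with the spectral-peak regressor. The original analysis used MarsBaR and Matlab, so this numpy version is an assumed equivalent, not the authors' code.

```python
import numpy as np

def first_eigenvariate(cluster_data):
    """First eigenvariate (principal time course) of a time-by-voxel matrix
    extracted from a significant cluster."""
    centered = cluster_data - cluster_data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]

# Pearson correlation between the ROI time course and the spectral-peak regressor;
# across subjects, posterior STG clusters showed a mean r of about 0.34.
# r = np.corrcoef(first_eigenvariate(posterior_stg_voxels), spectral_peak_reg)[0, 1]
```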

Figure 4.

Differential activation in anterior and posterior superior temporal cortex. (A) Data from a single subject illustrating the regression analysis using each single trial's spectral peak as a regressor. Correlation of spectral peak of the consonantal burst (for off-trials this value was replaced by the mean; normalized) with the fMRI signal (first eigenvariate extracted from significant cluster in the indicated region; normalized) is highly significant in bilateral posterior STG, whereas no relationship with this basic acoustic parameter is found in bilateral anterolateral STS. (B) Random-effects group data showing the differential responsiveness to stimulus categories in posterior (blue) and anterior (red) superior temporal cortex. Clusters shown are the peak clusters for the random-effects estimates of speech versus nonspeech (red) and the correlation with the spectral peak (blue). Note the differentiations of speech versus nonspeech as well as voiced versus voiceless in the anterior cluster (red bars), which are absent in the posterior cluster (blue bars; confidence limits include zero, thus, no clear-cut category differences are evident).

We therefore performed a random-effects group analysis using every subject's regression contrast image as input (one subject was omitted because no suprathreshold activation was seen in the regression analysis). This confirmed that, overall, the strongest responsivity to the acoustic regressor was evident in the left posterior STG, extending into the supramarginal gyrus (Z = 4.19, N = 12; Montreal Neurological Institute coordinates −66, −34, 22). A somewhat weaker corresponding peak in right posterior STG was also found (Z = 3.52; Table 1).

Discussion

At least 2 tentative conclusions can be drawn from studying the cortical responses to a variety of stop consonants and comparing them with unintelligible analogues in our event-related fMRI design. First, isolated consonants that are recognizable as speech activate brain structures of the anterolateral processing stream in STG and STS, as shown previously for larger chunks of speech (Binder et al. 2000; Scott et al. 2000; Narain et al. 2003; Liebenthal et al. 2005). The fact that the activation is less widespread may be due to the use of a fully event-related design, which, in turn, enabled the revealing parametric regression analysis of the present data. Second, and most intriguingly, the categorical and parametric analyses taken together provide insight into the multistage processing of speech sounds in the human brain: 1) spectral peaks of stimulus exemplars appear to be tracked in posterior STG irrespective of intelligibility, and 2) more speech-specific mapping takes place in anterolateral STG/STS.

It was also this latter region that showed enhanced BOLD contrast changes when different classes of consonantal bursts were compared directly. This differential response to consonant categories distinguished by the voicing feature is intriguing. Voiced consonants [d] and [g] with a short voice onset time (especially if edited to a fixed length, as in the current study) are characterized by a smooth onset of vocal fold vibration, so that spectral envelope or formant-like spectral patterns are readily available to the auditory system; voiceless consonants [t] and [k], by contrast, exhibit an abrupt and later onset of vocal fold vibration, so that information about the spectral envelope and formant-like spectral patterns is less readily available. Our results are therefore in line with previous research on voiced consonant–vowel syllables (Liebenthal et al. 2005; voiceless consonants were not studied there), and they are also concordant with previous findings of the anterolateral STS decoding highly familiar complex spectrotemporal patterns, such as those in human speech (Warren, Jennings, et al. 2005).

With respect to the speech/nonspeech comparison, the middle and anterior STS, especially in the left hemisphere, are known to be activated more vigorously by speech signals than by close acoustic control stimuli. However, this has mainly been shown using larger chunks of speech that also convey semantic and/or syntactic information (e.g., Binder et al. 2000; Scott et al. 2000; Davis and Johnsrude 2003). It is intriguing that this activation seems to depend merely on recognition of a familiar (native) speech sound category: no semantic content whatsoever is necessary. This region of lateral STG/STS should therefore not be tied too closely to higher-order language-related processes. It should also be acknowledged that, on the basis of this experiment, no strict line can be drawn between speech versus nonspeech or voiced versus voiceless distinctions in STS and distinctions based on the presence or absence of voice perception (Belin et al. 2000) or spectral envelope perception (Warren, Jennings, et al. 2005).

Remarkably, the speech versus nonspeech comparison also activated 2 extratemporal clusters, in the basal ganglia and in the anterior cingulate cortex. Although we can only speculate on their specific contributions, both regions are known players in active language processing. Their involvement might in part reflect the cascade of processing steps that is inevitably set in motion once speech is recognized as such (here, on the mere basis of a short consonantal burst) and that may be rather independent of any explicit task imposed on the listener. For example, Liebenthal et al. (2005), although using syllables and an active discrimination task, also observed substantial activation of the anterior cingulate cortex in a highly comparable contrast.

The interesting picture emerging from the categorical analyses on the one hand and the regression analyses on the other is that distinct anatomical structures are revealed, involved at different levels of speech sound analysis. Tracking of basic acoustic characteristics is found in the left posterior STG. By contrast, anterolateral regions in STG/STS prove more responsive to natural speech sounds than to well-controlled nonspeech analogues and prefer one class of stop consonants (voiced [d, g]) over another (voiceless [t, k]), most likely for the reasons discussed above. This characterizes posterior STG as an "earlier" (more upstream) and anterolateral STG/STS as a "later" (more downstream) structure. Strictly speaking, fMRI of course precludes any conclusion about the temporal order of processing stages; the terms earlier and later are used here in the sense of a processing hierarchy.

This evidence for multiple stages of acoustic or perceptual analysis along the central auditory pathway also shows concisely that speech sounds are not treated entirely differently from nonspeech sounds from the outset of processing (Price et al. 2005). Figure 4A illustrates this in single-subject data, where correlations with the (normalized) acoustic information are plotted for both sites: the PT, most responsive to acoustic detail across all categories, and the STS, distinguishing most strongly between voiced and voiceless intelligible speech sounds. A synoptic look at the random-effects group data (Fig. 4B) complements this dichotomy of noncategorical response behavior in PT and posteromedial STG on the one hand and categorical selectivity in anterolateral STS, at least in the left hemisphere, on the other. Categorical responses are seen in the anterolateral STS peak (red bars; differentiation of speech vs. nonspeech as well as voiced vs. voiceless) that are not evident in the posterior peak identified in the regression analysis (blue bars; all confidence limits include zero, thus no clear-cut category differences are evident).

Our findings are highly consistent with current ideas of auditory processing streams (Rauschecker and Tian 2000; Zatorre et al. 2002; Poeppel and Hickok 2004; Warren, Wise, et al. 2005) and auditory object analysis (Griffiths and Warren 2004; Zatorre et al. 2004). Most of these hierarchical models assume that characteristic features of complex sounds are extracted at a relatively early stage (corresponding to the lateral belt areas in nonhuman primates; Rauschecker et al. 1995; Rauschecker 1998; Kaas and Hackett 2000), whereas later stages (equivalent to parabelt and beyond) become more and more specialized for the processing of particular categories of complex sound, including communication sounds and speech. The current analysis clearly identifies the PT as an early, lateral belt–like structure (most likely equivalent to the middle lateral area; Rauschecker et al. 1995; Tian et al. 2001) with as yet nonspecialized functions. This matches well with data from recent human imaging studies, which have found an involvement of the PT in all kinds of complex-sound processing, including processing of music, speech, and spectral motion, and have therefore termed it a "computational hub" (Griffiths and Warren 2002). At the same time, these conclusions argue against a particular role of the PT in speech-specific processing high along a language-related hierarchy, as postulated in older theories (Geschwind and Levitsky 1968).

We use the term "noncategorical" to indicate that the BOLD response seen in posterior STG does not reflect the broad distinctions involuntarily made by the listener between known speech sound categories, or between speech and nonspeech. This is not to say, of course, that the posterior STG is not involved in critical steps of analyzing such sounds. We know from temporally much more precise magnetoencephalography that the relatively early N100m response to consonants arises from lateral Heschl's gyrus and the PT (Obleser, Scott, et al. 2006). It has also been shown that electrocortical disruption of posterior STG leads to deficits in consonant perception (Boatman and Miglioretti 2005), a role implied not least by the well-known findings of word deafness induced by lesions to posterior STG (e.g., Pinard et al. 2002; interestingly, the latter study accords with our results in that the patient's deficits also affected nonspeech perception). However, these results do not allow us to conclude which processing steps are associated with posterior STG activity. Moreover, assessment across studies is compromised by differences in task and design. It may well be that active categorization tasks imposed on the subject shape the responsivity of the PT (or even the hemispheric balance) to certain aspects of the auditory stimuli, as has been found in certain categorical perception tasks (Jacquemot et al. 2003; Brechmann and Scheich 2005; but see also Binder et al. 2004 and Liebenthal et al. 2005, both of which report effects of categorical decision processes in anterolateral STG). Our study instead avoided a direct task and used information inherent in the stimulus material, teasing apart the responsivity of the posterior and the anterolateral STG to this information with the regression and categorical analyses reported above.

Although compatible with most recent neuroimaging-based models of auditory speech processing as well as nonhuman primate work, the present findings contrast with the assumption of an auditory processing device for speech that is specialized or segregated from the outset (Whalen et al. 2006). Our findings are also not entirely compatible with suggestions of quantitatively stronger BOLD responses in PT or supramarginal gyrus when processing or perceiving native speech sounds compared with nonspeech or nonnative phonemic sounds (e.g., Jacquemot et al. 2003; Dehaene-Lambertz et al. 2005; Meyer et al. 2005). It should be reiterated, however, that the present data were collected under passive listening conditions without any additional, attention-demanding task and with speech and nonspeech segments randomly interwoven trial by trial. Therefore, any task- or set-dependent upregulation in auditory cortex taking place with slower time constants might not be reflected in our data.

In conclusion, our results show that the automatic perception of an isolated stop consonant is sufficient to drive cortical structures in the anterolateral STG/STS, particularly in the left hemisphere, which have previously been shown to be involved in the processing of complex spectrotemporal patterns, including those related to speech and language. They also demonstrate that multiple stages of acoustic–phonetic analysis are detectable in the same set of BOLD responses and that the superior temporal region immediately posterior to Heschl's gyrus, including the PT, is instead involved in a noncategorical, prelinguistic step of analysis for speech sounds. By contrast, the anterolateral STG/STS operates on a categorical basis, preferring overlearned spectrotemporal structures, such as speech sounds, over unknown, meaningless noise. Our findings add to a refined understanding of the relationship between processing areas within the superior temporal cortex, and they yield compelling evidence for a functional hierarchy governing the processing of even simple stop consonants as elements of speech.

This research was supported by grants from the Cognitive Neuroscience Initiative of the National Science Foundation (BCS 0350041; JPR), the German Science Foundation (DFG, SFB 471; JO), and a postdoctoral elite grant from the Landesstiftung Baden-Württemberg Germany (JO). Juma Mbwana helped acquire the data, and we are also grateful to 2 anonymous reviewers for their helpful comments. Conflict of Interest: None declared.

References

Belin P, Zatorre RJ, Lafaille P, Ahad P, Pike B. 2000. Voice-selective areas in human auditory cortex. Nature. 403:309-312.

Binder JR, Frost JA, Hammeke TA, Bellgowan PS, Springer JA, Kaufman JN, Possing ET. 2000. Human temporal lobe activation by speech and nonspeech sounds. Cereb Cortex. 10:512-528.

Binder JR, Liebenthal E, Possing ET, Medler DA, Ward BD. 2004. Neural correlates of sensory and decision processes in auditory object identification. Nat Neurosci. 7:295-301.

Blesser B. 1972. Speech perception under conditions of spectral transformation. I. Phonetic characteristics. J Speech Hear Res. 15:5-41.

Boatman DF, Miglioretti DL. 2005. Cortical sites critical for speech discrimination in normal and impaired listeners. J Neurosci. 25:5475-5480.

Brechmann A, Scheich H. 2005. Hemispheric shifts of sound representation in auditory cortex with conceptual listening. Cereb Cortex. 15:578-587.

Davis MH, Johnsrude IS. 2003. Hierarchical processing in spoken language comprehension. J Neurosci. 23:3423-3431.

Dehaene-Lambertz G, Pallier C, Serniclaes W, Sprenger-Charolles L, Jobert A, Dehaene S. 2005. Neural correlates of switching from auditory to speech perception. Neuroimage. 24:21-33.

Geschwind N, Levitsky W. 1968. Human brain: left-right asymmetries in temporal speech region. Science. 161:186-187.

Griffiths TD, Warren JD. 2002. The planum temporale as a computational hub. Trends Neurosci. 25:348-353.

Griffiths TD, Warren JD. 2004. What is an auditory object? Nat Rev Neurosci. 5:887-892.

Hall DA, Haggard MP, Akeroyd MA, Palmer AR, Summerfield AQ, Elliott MR, Gurney EM, Bowtell RW. 1999. "Sparse" temporal sampling in auditory fMRI. Hum Brain Mapp. 7:213-223.

Jacquemot C, Pallier C, LeBihan D, Dehaene S, Dupoux E. 2003. Phonological grammar shapes the auditory cortex: a functional magnetic resonance imaging study. J Neurosci. 23:9541-9546.

Kaas JH, Hackett TA. 2000. Subdivisions of auditory cortex and processing streams in primates. Proc Natl Acad Sci USA. 97:11793-11799.

Lahiri A, Gewirth L, Blumstein SE. 1984. A reconsideration of acoustic invariance for place of articulation in diffuse stop consonants: evidence from a cross-language study. J Acoust Soc Am. 76:391-404.

Liebenthal E, Binder JR, Spitzer SM, Possing ET, Medler DA. 2005. Neural substrates of phonemic perception. Cereb Cortex. 15:1621-1631.

Meyer M, Zaehle T, Gountouna VE, Barron A, Jancke L, Turk A. 2005. Spectro-temporal processing during speech perception involves left posterior auditory cortex. Neuroreport. 16:1985-1989.

Narain C, Scott SK, Wise RJ, Rosen S, Leff A, Iversen SD, Matthews PM. 2003. Defining a left-lateralized response specific to intelligible speech using fMRI. Cereb Cortex. 13:1362-1368.

Obleser J, Boecker H, Drzezga A, Haslinger B, Hennenlotter A, Roettinger M, Eulitz C, Rauschecker JP. 2006. Vowel sound extraction in anterior superior temporal cortex. Hum Brain Mapp. 27:562-571.

Obleser J, Scott SK, Eulitz C. 2006. Now you hear it, now you don't: transient traces of consonants and their nonspeech analogues in the human brain. Cereb Cortex. 16:1069-1076.

Pinard M, Chertkow H, Black S, Peretz I. 2002. A case study of pure word deafness: modularity in auditory processing? Neurocase. 8:40-55.

Poeppel D, Hickok G. 2004. Towards a new functional anatomy of language. Cognition. 92:1-12.

Price CJ, Thierry G, Griffiths TD. 2005. Speech-specific auditory processing: where is it? Trends Cogn Sci. 9:271-276.

Rauschecker JP. 1998. Cortical processing of complex sounds. Curr Opin Neurobiol. 8:516-521.

Rauschecker JP, Tian B. 2000. Mechanisms and streams for processing of "what" and "where" in auditory cortex. Proc Natl Acad Sci USA. 97:11800-11806.

Rauschecker JP, Tian B, Hauser M. 1995. Processing of complex sounds in the macaque nonprimary auditory cortex. Science. 268:111-114.

Scott SK, Blank CC, Rosen S, Wise RJ. 2000. Identification of a pathway for intelligible speech in the left temporal lobe. Brain. 123:2400-2406.

Stevens KN, Blumstein SE. 1978. Invariant cues for place of articulation in stop consonants. J Acoust Soc Am. 64:1358-1368.

Tian B, Reser D, Durham A, Kustov A, Rauschecker JP. 2001. Functional specialization in rhesus monkey auditory cortex. Science. 292:290-293.

Warren JD, Jennings AR, Griffiths TD. 2005. Analysis of the spectral envelope of sounds by the human brain. Neuroimage. 24:1052-1057.

Warren JE, Wise RJ, Warren JD. 2005. Sounds do-able: auditory-motor transformations and the posterior temporal plane. Trends Neurosci. 28:636-643.

Whalen DH, Benson RR, Richardson M, Swainson B, Clark VP, Lai S, Mencl WE, Fulbright RK, Constable RT, Liberman AM. 2006. Differentiation of speech and nonspeech processing within primary auditory cortex. J Acoust Soc Am. 119:575-581.

Zatorre RJ, Belin P, Penhune VB. 2002. Structure and function of auditory cortex: music and speech. Trends Cogn Sci. 6:37-46.

Zatorre RJ, Bouffard M, Belin P. 2004. Sensitivity to auditory object features in human temporal neocortex. J Neurosci. 24:3637-3642.