Functional imaging studies of language have shown bilateral superior temporal activations in response to ‘passive’ perception of speech when the baseline condition did not control for the acoustic complexity of speech. Controlling for this complexity demonstrates speech-specific processing lateralized to the left temporal lobe, and our recent positron emission tomography study has emphasized a role for left anterolateral temporal cortex in speech comprehension. This contrasts with the more usual view that relates speech comprehension to left temporal-parietal cortex, the ill-defined area of Wernicke. This study attempted to reconcile these differences, using a more sensitive 3 T functional magnetic resonance imaging system and a sparse sampling paradigm. We found left-lateralized activations for intelligible speech with two distinct foci, one in the anterior superior temporal sulcus and the other in the posterior temporal lobe. The results therefore demonstrate that there are neural responses to intelligible speech along the length of the left lateral temporal neocortex, although the precise processing roles of the anterior and posterior regions cannot be determined from this study.
Historically, the posterior left superior temporal gyrus (STG) has been identified as crucial for understanding speech (Wernicke, 1874). From a psychoacoustic perspective, mapping the spectral and temporal properties of the speech signal on to long-term mental representations of meaning involves a range of processes which may be organized in series, in parallel or both (Scott and Johnsrude, 2003; Scott and Wise, 2003b) and which may occur prior to semantic and syntactic analysis. The study of the acoustic basis of speech perception is commonly framed in terms of speech intelligibility or speech reception (Miller, 1951). Studies using this methodology typically use the number of key words repeated from heard simple sentences as an index of the accuracy with which the acoustic speech signal has been decoded (Blamey et al., 2001; Keidser and Grant, 2001). Thus speech intelligibility has been used to characterize the effects of the number of channels in noise-vocoded speech (Shannon et al., 1995), the effects of speech presented in different noise contexts (Brungart et al., 2001; van Wijngaarden et al., 2002), and the impact of age-related and pathological hearing loss (Peters et al., 1998). The term thus describes a signal in which there is sufficient acoustic detail for a listener to decode whole words. It makes no assumptions about the specific phonetic information needed to do this, not least because there is no simple linear mapping between acoustic cues and phonetic identity (Bailey and Summerfield, 1980). The term intelligibility is not aimed at dissecting the relative contributions of semantic and syntactic information to the decoding of the speech signal and is construed within an acoustic rather than linguistic framework.
Speech perception is also intimately linked with speech production, and the ability to repeat is to some degree dissociable from comprehension. Thus non-words can be repeated but are not understood. There is evidence that the implicit rehearsal of heard speech occurs during speech perception, generally conceived as a sound-to-articulation pathway that operates in parallel with a sound-to-meaning pathway (Hickok and Poeppel, 2000). This has also been described in neural models of speech perception (Hickok and Poeppel, 2000; Wise et al., 2001; Scott and Johnsrude, 2003; Scott and Wise, 2003b).
Functional imaging of passive speech perception (where no overt response is required) will demonstrate all the implicit processes associated with comprehension and rehearsal (repetition), but the power of any particular study may emphasize one route of speech processing and not the other. This is particularly so when statistical thresholding of the images excludes physiologically relevant signal, or the methodology used to obtain the images is not equally sensitive across the whole region of the brain under study (Devlin et al., 2000). Functional magnetic resonance imaging (fMRI) provides a non-invasive method of studying language processing, with the potential for greater sensitivity than positron emission tomography (PET) (due to more possible scans per condition) but with the potential loss of signal in anterior temporal lobe structures (Devlin et al., 2000).
Previous functional imaging studies of the acoustic basis of speech perception have typically shown bilateral responses to speech in the temporal lobes (Wise et al., 1991; Zatorre et al., 1992; Mummery et al., 1999; Benson et al., 2001; Vouloumanos et al., 2001; Wong et al., 2002) for passive listening. This is in contrast to the indication, from the clinical literature, that left temporal lobe lesions result in sensory aphasic problems (Wernicke, 1874; Turner et al., 1996; Kuest and Karbe, 2002). One possible reason for this discrepancy may be the choice of control conditions, which have often not appropriately accounted for the acoustic complexity of speech. When addressing the acoustic processing of speech, this is a crucial aspect of the study design, since speech is an immensely complex acoustic signal. Speech contains quasi-periodic and aperiodic sections (due to the presence or absence of voicing), amplitude modulation, frequency modulation and considerable spectral structure (formants) due to the movements of the articulators, as well as periods of silence. Previous studies have therefore frequently shown bilateral activation that is probably due to a lack of control for the acoustic structure in the baseline condition. No one cue determines the intelligibility of speech (Miller, 1951), but a certain degree of spectral-temporal modulation is essential (Drullman, 1995; Shannon et al., 1995; Faulkner et al., 2001). In order to delineate areas involved only in processing the correlates of speech intelligibility, separately from more general acoustic processing, the control stimulus should ideally have all the acoustic properties of speech without being intelligible. However, it is difficult to design a stimulus as acoustically complex as speech which is not intelligible (Blesser, 1972). Skilled listeners are able to understand speech with very degraded spectral and temporal input (Drullman, 1995; Shannon et al., 1995).
A second limitation of many previous imaging studies is that they have used a simple subtraction paradigm, assuming that differences between two cognitive states have been well controlled for all but the variable of interest. However, this assumption may not be valid (Sartori and Umilta, 2000). An alternative approach, which gets around many of the problems of the subtraction method, is to use more than one contrast isolating the same function, and to identify overlapping regions of activation as those important for a common processing function (Price and Friston, 1997).
Two recent publications have used evidence from PET to explicitly emphasize two streams of speech processing in the left temporal lobe: a route directed towards anterolateral temporal cortex associated with the intelligibility of the stimulus (Scott et al., 2000) and another directed posteriorly that, it was proposed, was associated with repetition (Wise et al., 2001). The present study has used fMRI to investigate further the anterior–posterior extent of activation of left lateral temporal neocortex to speech that can be both implicitly understood and temporally sequenced for repetition. We used a technique called sparse sampling that limited the influence of scanner noise on the physiological response (Hall et al., 1999). We also used two forms of intelligible speech: normal speech and noise-vocoded speech (Shannon et al., 1995), and baseline stimuli that controlled for spectral and temporal structure in the acoustic signals. These two sets of stimuli were contrasted against each other using a conjunction analysis, in order to avoid the shortcomings of a simple cognitive subtraction paradigm.
Materials and Methods
The study was conducted in accordance with the guidelines of the Central Oxford Regional Ethics Committee, and written consent was obtained from all subjects in accordance with the Declaration of Helsinki. Data were collected from 11 right-handed subjects (9 male, 2 female), all of whom had English as their first language. The mean age was 27 years (age range 20–50).
Four different stimuli were used: normal speech (Sp), six-channel noise-vocoded speech (VCo), spectrally rotated normal speech (RSp) and spectrally rotated noise-vocoded speech (RVCo). In each case, the stimuli were sentences of ∼2 s duration, which had been appropriately transformed. Sentences were taken from the Bamford–Kowal–Bench (BKB) standard sentence list (Bench et al., 1979), and were simple, unconnected statements (e.g. ‘The clown has a funny face’), with imageable, concrete words and very simple syntax. Short sentences like these are commonly used to determine speech intelligibility thresholds clinically and experimentally, and this also has the advantage that more signal is present than in word lists, increasing the power of the study. The use of short sentences also enhances the intelligibility of the noise-vocoded speech. All stimuli were presented using MR-compatible electrostatic headphones built by the Institute of Hearing Research in Nottingham (http://www.ihr.mrc.ac.uk/) and designed specifically for use in an MRI system.
Noise-vocoded speech reduces the speech signal to the amplitude modulation within a small number of frequency bands. The more frequency channels that are used, the more intelligible the speech is (Shannon et al., 1995), and below eight channels the relationship between number of channels and intelligibility is logarithmic (Faulkner et al., 2001). With training, subjects can easily learn to understand speech with only four channels. This is because the speech has not been distorted; instead, the original signal is presented with a reduced amount of information, and subjects quickly learn what sort of information has been preserved.
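As an illustration of this transformation, the channel-vocoding procedure can be sketched as follows. This is a minimal sketch after Shannon et al. (1995), not the stimulus-generation code used in this study; the band edges, logarithmic spacing and filter order are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(signal, fs, n_channels=6, lo=100.0, hi=5000.0):
    """Noise-vocode a signal: in each frequency band, replace the fine
    structure with amplitude-modulated noise, keeping only the band's
    envelope (after Shannon et al., 1995). Band edges and filter order
    here are illustrative choices, not those of the original study."""
    edges = np.geomspace(lo, hi, n_channels + 1)  # log-spaced band edges
    noise = np.random.randn(len(signal))
    out = np.zeros(len(signal))
    for f1, f2 in zip(edges[:-1], edges[1:]):
        sos = butter(4, [f1, f2], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, signal)
        env = np.abs(hilbert(band))        # amplitude envelope of the band
        carrier = sosfiltfilt(sos, noise)  # band-limited noise carrier
        out += env * carrier               # modulate the noise with the envelope
    return out
```

With more channels, the summed envelopes trace the spectral shape of the original more finely, which is why intelligibility grows with channel number.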
The noise-vocoded stimuli (VCo and RVCo) thus differ from the speech stimuli in three ways. First, spectral variation in these stimuli is conveyed by band-pass filtered noise, rather than the quasi-periodic vibrations of the human vocal folds, and therefore subjectively these stimuli sound like a harsh whisper. Second, the temporal and spectral profile of noise-vocoded speech is smeared, so that these stimuli can be said to be acoustically less complex than natural speech. However, VCo stimuli are readily intelligible after a short training session (on the order of 15 min). Third, the sense of pitch is much reduced with six-channel noise-vocoded speech, so the intonation of these stimuli is very attenuated.
The rotated speech (RSp) stimuli can be thought of as mirror images of untransformed stimuli, as high and low frequencies are inverted around a single chosen frequency (here, 2 kHz). These stimuli are distinct from the reversed speech commonly used as an unintelligible stimulus in functional imaging studies of speech perception. For the details of the transformations involved see Scott et al. (2000). Unlike in reversed speech, the temporal and spectral structure is largely preserved. Reversed speech differs from normal speech in a number of ways, overall intelligibility being just one of them. For example, whereas much of normal speech has fast onsets (e.g. plosives) and long decays, reversed speech has slow onsets and rapid decays. The phonotactic structure is affected, and sequences are generated that could not be articulated. Thus, a comparison between speech and reversed speech would pick up differences related to processing the temporal structure as well as intelligibility.
RSp stimuli, on the other hand, contain phonetic features (Blesser, 1972): thus manner of articulation is often preserved (e.g. frication is identifiable although the fricatives themselves are changed, and silence before obstruents is unaltered). All the original acoustic information is still available, though it is now in the wrong frequency channels, for example the low frequency amplitude modulations are at higher frequencies. Blesser (1972) described spectrally rotated speech as sounding like an alien speaking your language with a completely different set of articulators (for examples of stimuli, see http://www.phon.ucl.ac.uk/home/brain/). Rotated speech thus has the potential to become intelligible, although this requires extensive training (on the order of weeks and months) (Blesser, 1972; Rosen et al., 2002). The RVCo stimuli sound like intermittent static, and are not at all speech-like, although they contain the same amount of acoustic information as the noise-vocoded speech. To date, there is no evidence that such stimuli can ever be understood.
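The rotation itself amounts to ring modulation: band-limiting the signal to twice the rotation frequency and multiplying by a sinusoid at that rate mirrors every spectral component about the centre frequency. The following is a minimal sketch of this principle, assuming simple Butterworth band-limiting rather than whatever filters Blesser (1972) or Scott et al. (2000) actually used:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def rotate_spectrum(signal, fs, centre=2000.0):
    """Spectrally rotate a signal about `centre` Hz: low-pass to 2*centre,
    multiply by a sinusoid at 2*centre (which maps a component at f to
    2*centre - f), then low-pass again to remove the upper sideband.
    A sketch of the ring-modulation principle; filter order is illustrative."""
    sos = butter(6, 2 * centre, btype="lowpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, signal)                      # keep 0..2*centre Hz
    t = np.arange(len(band)) / fs
    rotated = band * np.cos(2 * np.pi * 2 * centre * t)  # mirror about centre
    return sosfiltfilt(sos, rotated)                     # discard upper image
```

For example, with the 2 kHz centre used here, energy at 1 kHz moves to 3 kHz and vice versa, so the low-frequency amplitude modulations of speech end up in the high-frequency channels, as described above.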
Subjects in this study were pretrained on the six-channel noise-vocoded speech, to a level where they were performing at ceiling on their repetition of noise-vocoded sentences (see procedure). Therefore both the Sp and VCo stimuli are fully intelligible (i.e. they could be understood and repeated) and both sets of rotated stimuli are fully unintelligible (they could be neither understood nor repeated) (Rosen et al., 2002). The Sp stimuli are readily comprehensible on first presentation, while the VCo stimuli are comprehensible after a brief training session.
The experiment consisted of ten blocks of each of the four stimulus types. Each block consisted of presentations of five consecutive sentences, each of 2 s duration, giving a total block length of 10 s. Blocks of sentences were used in preference to single sentences, as pilot work indicated that blocks of sentences of the same type produce a greater BOLD signal, which reaches a maximum at ∼10 s after the start of the stimulus (unpublished data). Block presentation was randomized, so that no two consecutive blocks contained the same stimulus type.
Using a ‘sparse’ sampling MRI acquisition design (Hall et al., 1999), a multislice volume covering the temporal lobes was collected at the end of each block. No scans were collected during the presentation of the stimuli themselves, so that at no time did the stimuli overlap with the scanner noise (see Fig. 1). Each block was followed by a silent period of 10 s to allow recovery of the haemodynamic response function (HRF) (Hall et al., 1999). As the HRF was sampled at its peak, there was a sufficient signal-to-noise ratio even with the reduced number of HRF acquisitions per stimulus type.
Subjects were pretrained on VCo stimuli prior to scanning. This was performed in an interactive manner: subjects were played a single VCo sentence and asked to repeat it. If subjects could not understand the sentence, or gave an incorrect or partial response (an incorrect response was defined as any word being repeated incorrectly), the sentence was played again. The subject was then asked to repeat the sentence, and if the response was still not correct, the experimenter repeated the sentence in normal speech, and played the VCo speech sentence again, at which point all subjects were able to repeat the sentence correctly. This procedure was repeated until subjects were able to understand VCo sentences clearly on first presentation, for 15 successive sentences. This ensured that the subjects were fully able to understand the noise-vocoded speech. Unlike sinewave speech (Remez et al., 1981), subjects trained on VCo speech cannot reverse their perception and hear the signal as noise (Shannon et al., 1995). This measure of intelligibility is thus identical to that used in the research mentioned in the introduction, and avoids the confounds introduced when subjects are required to transcribe sequences (without using phonetic transcription). The transcription method has been used to determine the ‘phonetic’ information in reversed speech (Binder et al., 2000), but suffers from the problem that the perception and written expression of a sequence, even with phonetic transcription, is not a unitary cognitive process, and distortions are introduced, not least the regularization of sounds into expressible symbols (Scott and Wise, 2003a). Using accurate repetition as an index of intelligibility has the benefit of being ecologically valid, though it may not be sensitive to other factors influencing speech perception, e.g. the newly acquired skill of decoding noise-vocoded speech might be more easily disrupted by a concurrent task.
Examples of Sp, RSp and RVCo speech were also presented, to ensure that subjects were (i) familiar with the stimuli, and (ii) unable to understand the rotated speech and the rotated noise-vocoded speech. This training session took ∼20 min, and none of the material used during training was repeated during the experiment.
Before the start of the experiment, subjects were told simply to lie in the scanner, listen to the auditory stimuli that would be presented to them, and try to understand their meaning (Mummery et al., 1999; Scott et al., 2000). Subjects were told that it was important to pay attention to everything they heard, but that they should make no explicit attempt to remember any of the sentences.
The study was performed on a Varian Innova 3 T MRI/MRS system with a purpose-built birdcage radiofrequency (RF) coil. Functional scans were collected using an echo-planar imaging (EPI) sequence. A ‘sparse’ sampling design (Hall et al., 1999) was employed, with a single multislice volume acquired over 3 s, every 20 s. Acquisition began immediately after the offset of each auditory stimulus presentation (see Fig. 1). Each volume consisted of 21 axial slices, including the entire temporal lobes, with a notional single voxel resolution of 4 × 4 × 5 mm.
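The sparse acquisition cycle described above (a 10 s stimulus block, a 3 s volume acquisition beginning at stimulus offset, 20 s between volume onsets) can be laid out as a simple schedule. A sketch only; the number of cycles shown is illustrative:

```python
def sparse_schedule(n_blocks=4, cycle=20.0, stim_dur=10.0, acq_dur=3.0):
    """Return (stim_onset, acq_onset, acq_offset) tuples, in seconds, for a
    sparse-sampling run: each cycle holds a stimulus block, then a volume
    acquisition starting at stimulus offset, then silence until the next
    cycle, so scanner noise never overlaps stimulus presentation."""
    events = []
    for i in range(n_blocks):
        t0 = i * cycle  # cycle onset
        events.append((t0, t0 + stim_dur, t0 + stim_dur + acq_dur))
    return events
```

The point of the design is visible in the schedule: every acquisition falls entirely within the silent part of its cycle, well before the next stimulus block begins.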
Data analysis was carried out using statistical parametric mapping (SPM99, Wellcome Department of Cognitive Neurology, London, UK), implemented in Matlab (Mathworks Inc., Sherborn, MA, USA). Data were regrouped so that all scans belonging to one stimulus type were treated as a single epoch. All volumes were realigned to the first collected volume, and then resliced using a sinc interpolation. They were then smoothed using a Gaussian filter of 6 mm, normalized to the MNI template in SPM99, and a group analysis was carried out using a fixed-effects model. We used the conjunction analysis option in SPM to show areas activated in common when two contrasts were designed to show the same underlying process. This is a statistically conservative test, and has been shown to have several advantages over a simple cognitive subtraction approach (Price and Friston, 1997).
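The logic of the conjunction can be illustrated with a toy minimum-statistic computation over two voxelwise t-maps. This is a simplified stand-in for the SPM99 implementation, and the threshold argument is a placeholder for SPM's corrected threshold:

```python
import numpy as np

def conjunction(t_map_a, t_map_b, threshold):
    """Minimum-statistic conjunction over two contrast t-maps: a voxel is
    retained only where BOTH contrasts exceed the threshold, i.e. where
    the voxelwise minimum t exceeds it. A simplified illustration of the
    logic of conjunction analysis (Price and Friston, 1997)."""
    min_t = np.minimum(t_map_a, t_map_b)  # weakest evidence at each voxel
    return min_t, min_t > threshold       # conjunction map and binary mask
```

In the present design, the two inputs would be the t-maps for (Sp – RSp) and (VCo – RVCo), so a voxel survives only if it responds to intelligibility under both stimulus types, which is what protects the result against idiosyncrasies of either single subtraction.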
Condition and subject effects were estimated using the general linear model. To test hypotheses about regionally specific effects of different stimuli, these estimates were compared using linear contrasts. The resulting set of voxel values for each contrast is an SPM of the t-statistic. Effects are reported as significant above a threshold of P < 0.05 corrected. As the results were assessed using a statistically conservative conjunction analysis, the risk of type I errors was reduced.
After the experiment, the subjects were asked how many types of stimuli they had heard, and of these, which ones they understood. Finally, they were asked to rank the stimulus types according to how intelligible they were perceived to be. This was an informal confirmation of the intelligibility seen in the pretraining session. All 11 subjects reported that there were four different types of stimuli, and that they could understand two of the different types (typically referred to as ‘the normal speech’ and ‘the one we trained with’). They rated the Sp and VCo conditions as equally intelligible, and the RSp and RVCo conditions as equally unintelligible. This is consistent with the pretraining and with the behavioural literature (Blesser, 1972).
Two linear contrasts were performed to determine the neural correlates of intelligible speech, while controlling for stimulus complexity. Thus Sp was contrasted with RSp and VCo was contrasted with RVCo. In both of these contrasts, an intelligible stimulus was contrasted with an unintelligible stimulus of equal acoustic complexity. A conjunction analysis was then carried out using the conjunction option in SPM, and only voxels that were significant at a threshold of P < 0.05 corrected for the conjunction contrast were accepted as regions responding specifically to intelligible speech across both Sp and VCo conditions.
Three significant clusters (>10 voxels) were observed, all in the left temporal lobe. Areas activated included the dorsal posterior margin of the left temporal lobe (Wernicke’s area) (Wernicke, 1874; Bogen and Bogen, 1976). Anteriorly, we also found activation centred on the mid and anterior superior temporal sulcus. There was no significant activation in the right hemisphere at the threshold we used (P < 0.05 corrected) (see Fig. 2 and Table 1).
Our study aimed to reconcile our previous PET study, which indicated that the anterior STS was the most significant area involved in processing intelligibility, with the large body of research which assigns this role to the posterior temporal-parietal area. Most previous PET and fMRI studies of language have shown bilateral activation with passive listening tasks, despite clinical data that clearly show that the left but not the right hemisphere is crucial in processing language. While brain imaging could identify regions modulated by but not critical to speech processing, other explanations may be more tenable. Our earlier report (Scott et al., 2000) provided direct evidence supporting the hypothesis that a principal reason for the relative lateralization discrepancy was the failure to use baseline conditions that adequately controlled for the complex physical properties of speech in contrasts aimed at defining processing of intelligibility. While this study demonstrated clear left lateralization for processing intelligible speech, the activity was seen in the anterior STS, not in the posterior regions emphasized by clinical lesion studies (Turner et al., 1996). This result is consistent with the changes seen in semantic dementia (characterized by a progressive deterioration in the comprehension of words), which is associated with grey matter loss in the left temporal lobe (Chan et al., 2001), and with previous functional imaging studies which have implicated left anterior temporal lobe regions in the processing of intelligible speech. Thus Mazoyer et al. (1993) and Schlosser et al. (1998) showed anterior temporal lobe involvement in the perception of connected speech. The baseline conditions used were foreign languages, leaving open the possibility that these results could be driven by acoustic (including phonotactic) differences between the intelligible and unintelligible speech conditions.
However, the smaller number of scans used (four per condition) in the Scott et al. (2000) PET study means that this study may have missed subthreshold activation in the posterior regions which are known to have a role in speech comprehension, and which have been identified as speech specific regions of asymmetry in previous studies (Mummery et al., 1999). Our current study therefore attempted to exploit the complementary sensitivity (Devlin et al., 2000) and higher spatial resolution of fMRI to further explore this problem.
We confirmed the results from our PET study (Scott et al., 2000) showing a strongly left-lateralized activation for intelligibility in a passive language-listening task. We also found a region within the posterior superior temporal lobe (Wernicke’s area) (Bogen and Bogen, 1976) responding to intelligible speech. This study thus reconciles imaging data with the large body of evidence that aphasia arises most typically from left but not right hemisphere damage, and that Wernicke’s area has a crucial role in language comprehension.
We found two distinct regions in the left temporal lobe showing greater activity associated with intelligible speech relative to a complex acoustic stimulus with speech-like characteristics: posteriorly on the superior temporal gyrus, and anteriorly on the STS. Previous studies (Zahn et al., 2000) have also found activation in the STG and STS to be specific to meaningful speech, but only when subjects were required to carry out a higher-order conceptual semantic task. These activations were also much more anterior than those seen in the current study. In a recent study, Friederici et al. (2003) demonstrated responses to semantic and syntactic violations in the left anterior and posterior STG. However, direct contrasts of the two violation conditions showed no significant differences in these regions.
Binder et al.’s (2000) very comprehensive study aimed to isolate brain regions activating specifically to intelligibility from those involved in lower-level sound processing. Our study is comparable in terms of aims and methodology, but we found a more left-lateralized network. We attribute these differences to several factors. An important reason for the difference in results is the control stimuli used. Reversed speech, as used by Binder et al. (2000), represents a substantial improvement over the use of simple stimuli such as tones, but this may still not be an ideal control condition. Reversed speech differs from normal speech on a number of acoustic features other than intelligibility: it has comparatively slower onsets and more rapid acoustic decays, leading to a distortion of the temporal code, which can be hard to characterize. We suggest that using a baseline condition which does not adequately control for these acoustic features may result in increased right hemisphere activation. Spectrally rotated speech is an improved control stimulus, as it retains the temporal code and spectral complexity inherent in normal speech, but is at the same time meaningless to an untrained listener (in the absence of extensive training on the order of weeks) (Rosen et al., 2002). The use of sentences rather than words also contributes to the power of the present study. Further, we present a conjunction of two different forms of speech, a statistically conservative procedure which also increases the power to detect significant activations. These factors together explain why, unlike the Binder et al. (2000) study, we find clear differences in the processing of intelligible versus unintelligible speech above the dorsal STS, a difference not uncovered by their comparison of speech and reversed speech.
Finally, auditory studies using fMRI suffer from the constant background noise due to scanner gradient switching. The potential problems due to this background noise are acknowledged by Binder et al. (2000), who worked to reduce this confound by minimizing slice coverage and using a relatively long interscan interval. We used an alternative data acquisition strategy called ‘sparse sampling’ (Hall et al., 1999), which integrates these features and at the same time ensures that the stimuli are presented in silence, so that the obtained activations are not contaminated by activation caused by the scanner noise. Even though the number of multi-slice volumes collected per unit time is reduced using such a paradigm, we believe that the advantages of this method may make it more suitable for auditory experiments using fMRI.
Our results showing clear left lateralization for processing intelligibility extend the Scott et al. (2000) PET study, on which this study was based. However, our results differ from the earlier work in that the posterior activation is stronger than the anterior (STS) activation, in direct contrast to the PET study. This is potentially due to greater sensitivity in the current study: PET is limited in the number of volumes acquired, and thus subthreshold activations may be lost. fMRI, in contrast, allows many more data points to be collected for each condition (even with sparse sampling), which can improve the power of the analysis. To test this hypothesis, we reanalysed our previous PET data and found that there was subthreshold activation [z = 4.33, P(corrected) = 0.29] in the posterior STG (MNI coordinates: –52 –58 8) (S.K. Scott and R.J.S. Wise, unpublished data).
The argument in the Scott et al. (2000) study was based around the strong activation seen in the anterior STS, especially since this ran forward from other STG regions that were activated by stimuli with the acoustic correlates of phonetic cues and features. This result is seen only weakly in the present study, and a number of other studies using fMRI have tended to emphasize a temporoparietal (Wernicke’s) area (Binder et al., 1997, 2000; Calvert et al., 1997; Zahn et al., 2000) as being important in semantic processing. However, the results of fMRI studies of language may be influenced by the fact that the anterior STS is relatively more prone to dramatic signal loss from magnetic susceptibility artefacts (Devlin et al., 2000). The anterior STS activation described in this paper may have been affected by such artefacts.
In terms of functional significance, a reanalysis of several PET studies (Wise et al., 2001) has identified activity in posterior STS associated with a variety of language-based tasks, not solely speech comprehension. A PET study (Crinion et al., 2003) suggests that Wernicke’s area activates more significantly in an intelligibility task related to processing stories rather than isolated sentences. An ERP study by Abdullaev and Posner (1998) found that activation in Wernicke’s area occurred later than the behavioural response in a semantic decision task. Therefore, we suggest that the temporal-parietal junction may form part of a short-term memory network specialized for language. We could therefore expect it to be more activated by processing connected sentences (as in stories) than by single sentences, and more by sentences than by single words. This would also explain its relatively late activation after stimulus presentation, and its reduced involvement in studies using single word stimuli (though see Mummery et al., 1999). Our hypothesis is supported by suggestions that language may have evolved from such a working memory network, and that Wernicke’s area (posterior STG) may be the focus of a multi-modal network associated with language comprehension (Aboitiz and Garcia, 1997).
Based on the converging evidence from our previous PET study and the present study, we therefore suggest that the areas we have uncovered (the anterior STS and the posterior STG) are part of a distributed system of regions associated with the comprehension of speech. Further, the anterior STS and Wernicke’s area may be serving different functions in such a system.
This work was conducted at the Oxford Centre for the Functional Magnetic Resonance Imaging of the Brain (FMRIB). P.M.M. thanks the MRC for personal support and support for the Centre. R.J.W., S.K.S. and C.N. gratefully acknowledge support from the Wellcome Trust. C.N. would also like to thank all the subjects who took part in the study.
|No.|Contrast|x|y|z|t-statistic|No. of voxels|Anatomical area|
|1|Conjunction (see text): (Sp + VCo) – (RSp + RVCo)|–54|–18|–18|4.20|49|Mid STS|
The x, y and z coordinates are in Talairach space, and refer to the peak voxel activated in each contrast. Peak voxels are reported for all major clusters (>10 voxels). All contrasts are thresholded at P = 0.05 (corrected). STS, superior temporal sulcus.