To identify the brain regions preferentially involved in environmental sound recognition (comprising portions of a putative auditory ‘what’ pathway), we collected functional imaging data while listeners attended to a wide range of sounds, including those produced by tools, animals, liquids and dropped objects. These recognizable sounds, in contrast to unrecognizable, temporally reversed control sounds, evoked activity in a distributed network of brain regions previously associated with semantic processing, located predominantly in the left hemisphere but also including strong bilateral activity in posterior portions of the middle temporal gyri (pMTG). Comparisons with earlier studies suggest that these bilateral pMTG foci partially overlap cortex implicated in high-level visual processing of complex biological motion and recognition of tools and other artifacts. We propose that the pMTG foci process multimodal (or supramodal) information about objects and object-associated motion, and that this may represent ‘action’ knowledge that can be recruited for the recognition of familiar environmental sound-sources. These data also provide a functional and anatomical explanation for the symptoms of pure auditory agnosia for environmental sounds reported in human lesion studies.
Different aspects of the sounds we hear are thought to be processed in different regions of the brain. As in the visual system, a growing body of evidence suggests that the human and non-human primate auditory systems are at least roughly organized along two major cortical streams or networks (Mishkin et al., 1983; Rauschecker, 1998; Romanski et al., 1999; Rauschecker and Tian, 2000; Clarke et al., 2002). One includes a dorsally directed network in both hemispheres that is involved in processing spatial information about sound (a ‘where is it?’ pathway), such as sound-source localization, motion perception and spatial attention (Griffiths et al., 1996; Baumgart et al., 1999; Bushara et al., 1999; Weeks et al., 1999; Lewis et al., 2000; Maeder et al., 2001; Zatorre and Penhune, 2001; Warren et al., 2002). The other involves a relatively more ventrally located network that is involved in aspects of sound recognition (a ‘what is it?’ pathway), which generally includes the processing of speech sounds or species-specific vocalizations, and natural or environmental sounds. In humans, spoken language recognition involves a widespread cortical system, much of which is lateralized to the left hemisphere (Binder et al., 1997; Belin et al., 2000; Price, 2000). As for the processing of environmental (non-verbal) sounds, candidate regions have been reported in both hemispheres (Engelien et al., 1995; Giraud and Price, 2001; Humphries et al., 2001; Maeder et al., 2001). However, we still have only a fragmentary understanding of the precise brain regions and processing pathways that constitute a system for non-verbal sound recognition.
Although rare, a ‘pure’ auditory agnosia for environmental sounds can follow lesions to portions of temporal or temporo-parietal cortex; it is defined as an impaired capacity to recognize auditory information (such as a doorbell ring or typewriter sounds) despite adequate hearing and speech comprehension (Spreen et al., 1965; Albert et al., 1972; Vignolo, 1982; Fujii et al., 1990; Schnider et al., 1994; Engelien et al., 1995; Clarke et al., 2000; Clarke et al., 2002; Saygin and Moineau, 2002). Such auditory agnosia has been observed after right, left and bilateral lesions, though left hemisphere (and bilateral) lesions tend to produce additional deficits in verbal comprehension. Thus far, no precise anatomical locus has been consistently linked to auditory agnosia. Nonetheless, these lesion studies do suggest that the cortical pathways for processing environmental sounds and spoken language are at least partially separable at some level, though closely linked with one another, especially in the language-dominant hemisphere (Vignolo, 1982; Schnider et al., 1994; Saygin et al., 2003).
Current neurological and cognitive models for how spoken language is processed include input, intermediate and output processing stages (Grabowski and Damasio, 2000; Price, 2000; Binder and Price, 2001; Wise et al., 2001; Binder, 2002). This staged framework serves as a starting point for assessing how non-verbal, environmental sounds might be processed and subsequently recognized. Primary auditory cortex and some of the surrounding cortex represent input stages, which are thought to be involved in processing physical features of the spoken word sounds. Intermediate processing stages include lexical-semantic and other associative processes, which involve a wide range of cortices predominantly in the left hemisphere (of right-handed subjects), including the classically defined Wernicke’s area among other cortical regions. Output stages involve phonological access and articulatory planning (whether vocalizations are produced or not), for which the left inferior frontal cortex is widely implicated, including the cortex of and surrounding the classically defined Broca’s area.
In contrast to spoken words, whose sounds tend to have an arbitrary relationship with the concepts they represent, most environmental sounds bear a natural and physical correspondence to the visible and sometimes tangible object movements that produce the sound. When environmental sounds are first experienced and learned, information regarding the identity of the sound-source is typically obtained in the context of other sensory information (e.g. hearing the sound of a basketball bouncing while viewing the bouncing motion and/or making arm and hand movements to dribble the ball). Presumably, these separate streams of relevant sensory information (sound, sight, touch) can merge and integrate in the cortex to help provide a unified percept of an object and its functional dynamics. An important step in understanding the complexities of such multimodal integration would be to identify which brain regions participate preferentially in environmental sound-source recognition. Such regions, in contrast to some of the lateralized structures that appear to be specialized for spoken language processing in humans, may represent part of a more general or rudimentary system for sound recognition.
We used functional magnetic resonance imaging (fMRI) to reveal brain regions involved in the recognition of common environmental sounds. In an effort to maximally activate a sound recognition system, we included a diverse range of non-verbal, environmental sound categories (see Appendix). To reveal which regions were more sensitive to the process of ‘recognition’ per se, we also presented temporally reversed (backwards) versions of the same sounds, which served as control stimuli that were comparably complex and matched on many acoustical features, but were generally judged as unrecognizable. Some of these sound pairs can be heard in the online Supplementary Material. The results indicate that the posterior portions of the middle temporal gyrus (pMTG) in both hemispheres are primary loci involved with the retrieval of knowledge (‘recognition’) associated with a variety of environmental sounds. Some of these fMRI data can be viewed at http://brainmap.wustl.edu/vanessen.html, which contains a database of surface-related data from various brain mapping studies. Portions of these data have been reported previously (Lewis et al., 2001).
Materials and Methods
Subjects and Task
Twenty-four right-handed adults (aged 21–47 years, 13 women) with no history of neurological, psychiatric or auditory symptoms participated in the imaging study. Informed consent was obtained following guidelines approved by the Medical College of Wisconsin Human Research Review Committee.
Environmental (non-verbal) sound samples were compiled from a CD collection (General 6000 series, Sound Ideas) and from various websites. Samples (44.1 kHz, 16-bit, monophonic) were trimmed to ∼2 s duration (1.1–2.5 s range) and temporally reversed for the backward presentations (Cool Edit Pro, Syntrillium Software Co.). The temporally reversed sounds were chosen as a baseline control because they were typically judged to be unrecognizable, yet matched the physical features of the original sounds in five important respects: overall intensity, duration, spectral shape or content, spectral variation or motion (Thivard et al., 2000), and acoustic complexity.
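Temporal reversal is a trivial operation on the sampled waveform, which is precisely why it preserves intensity, duration and long-term spectral content while scrambling temporal structure. A minimal NumPy sketch (the clip values here are hypothetical; the study used Cool Edit Pro rather than code):

```python
import numpy as np

def reverse_waveform(samples: np.ndarray) -> np.ndarray:
    """Temporally reverse a mono waveform.

    Reversal leaves overall energy, duration and the long-term
    magnitude spectrum unchanged, while altering the temporal
    ordering of events -- the property that made the backward
    sounds acoustically matched yet unrecognizable.
    """
    return samples[::-1].copy()

# Tiny illustrative clip (hypothetical sample values, not real stimuli)
clip = np.array([0.0, 0.5, 1.0, 0.25, -0.5], dtype=np.float32)
backward = reverse_waveform(clip)

# Energy (a proxy for overall intensity) is preserved by reversal
assert np.isclose(np.sum(clip**2), np.sum(backward**2))
```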
Six subjects, not included in the fMRI experiments, screened numerous pairs of backward- and then forward-played sound samples, retaining 105 sounds that could generally be verbally identified when played forward but typically not identified when played backward (see Appendix for the complete list; some of these stimuli can be heard in the online Supplementary Material). Across seven separate fMRI scans, subjects (with eyes closed) were presented with 350 sound trials (105 forward, 105 backward, 140 silent) in a pseudo-random order: since recognition of a backward sound would be facilitated by previous experience with the corresponding forward sound, a given backward sound always preceded its forward presentation by at least two non-silent trials.
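The paper does not describe how the pseudo-random order was generated; one simple approach consistent with the stated constraint is rejection sampling, sketched below at a reduced scale. The trial labels are hypothetical, and interpreting "preceded by at least two non-silent trials" as a separation of at least two positions among the non-silent trials is an assumption.

```python
import random

def satisfies_constraint(trials, min_sep=2):
    """True if, counting non-silent trials only, each backward sound
    ('b', i) occurs at least min_sep positions before its forward
    counterpart ('f', i)."""
    order = [t for t in trials if t != 'silent']
    pos = {t: k for k, t in enumerate(order)}
    n_pairs = len(order) // 2
    return all(pos[('f', i)] - pos[('b', i)] >= min_sep
               for i in range(n_pairs))

def make_sequence(n_pairs, n_silent, rng, min_sep=2):
    """Shuffle until the ordering constraint holds (rejection sampling).
    Workable at this illustrative scale; scheduling all 105 pairs of
    the actual study would need a constructive or repair-based method."""
    trials = ([('b', i) for i in range(n_pairs)]
              + [('f', i) for i in range(n_pairs)]
              + ['silent'] * n_silent)
    while True:
        rng.shuffle(trials)
        if satisfies_constraint(trials, min_sep):
            return trials

# Scaled-down example: 4 sound pairs plus 5 silent trials
seq = make_sequence(4, 5, random.Random(0))
```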
During each fMRI scan, subjects indicated by right hand button press whether they (1) could recognize or identify the sound (i.e. verbalize, describe, visualize or have a general sense of familiarity about the likely sound-source); (2) were uncertain; or (3) could not recognize the sound. Here we define ‘recognition’ as a sense of familiarity or implicit knowledge about a sound, whereas ‘identification’ additionally involves a verbal or semantic labeling of the sound-source. Button responses (and reaction times relative to sound onset) were collected during scanning both to keep the listener engaged with the sounds and to allow the resulting fMRI data to be modeled according to each individual’s judgment of whether a given sound was recognizable, which varied across subjects. Twelve subjects used their right index finger to button press for recognizable sounds, middle for uncertain, and ring finger for unrecognizable sounds. For the other 12 subjects, the fingering order was reversed in order to control for possible differences in response output.
Imaging and Data Analysis
Scanning was conducted at 1.5 Tesla on a General Electric (GE Medical Systems, Milwaukee, WI) Signa scanner, equipped with a commercial head coil (Medical Advances Inc., Milwaukee, WI) suited for whole-head, echo-planar imaging of blood-oxygenation level dependent (BOLD) signals (Bandettini et al., 1993; Ogawa et al., 1993). Subjects wore earplugs and were presented with binaural sound stimuli, which could easily be heard via custom electrostatic headphones (Koss Inc., Milwaukee, WI). To compensate for frequency specific attenuations by the earplugs, subjects listened to several cycles of six sine wave tones (128, 256, 1024, 2048, 4096 and 8192 Hz) just prior to scanning. The sound intensity for each tone was adjusted using a nine-band equalizer (CFX-12, Mackie Co.) until all tones were perceived to be at roughly the same loudness (typically 70–85 dB, L-weighted through ear plugs).
We used a ‘silent’ clustered-acquisition fMRI design that allowed stimulus events to be presented during scanner silence. The scanning cycle, schematized in Figure 1, was repeated every 10 s, and consisted of presentation of a sound or silent event (∼2 s), followed by silence during which time the subjects responded. The collection of BOLD signals (brain images, 1.8 s slice package) started 7.5 s after onset of each sound stimulus (or silence), and was the only time that the scanner made noise. In each scanning run, we acquired 52 gradient-recalled image volumes (TE = 40 ms, TR = 10 s), which included 16 axial slices of 6 mm thickness, with in-plane voxel dimensions of 3.75 × 3.75 mm. For most subjects, this volume covered nearly the entire brain, originating at the temporal pole (Talairach coordinate z ≈ –41) (Talairach and Tournoux, 1988), and extending up to or within ∼1 cm of the dorsal-most portions of the brain (range z = +55 to +65). T1-weighted anatomical MR images were collected using a spoiled GRASS pulse sequence (1.1 mm slices, with 0.9375 × 0.9375 mm in-plane resolution).
Data were viewed and analyzed using the AFNI software package (Cox, 1996) and related plug-in software (available at http://afni.nimh.nih.gov/afni/index.shtml). For each subject, the seven scans were concatenated into one time series, with the exception that for two subjects we retained six of seven scans and for one subject we retained five of seven scans (due to technical difficulties or excessive head motion during data acquisition). The first acquired brain volume in each scan (always a response to silence) was discarded, and the remaining 51 brain volume images were motion corrected by re-registering them to the 20th brain volume of the last scan (closest to the time of anatomical image acquisition). This 3D motion registration accounted for global head translations (x, y, z) and rotations (yaw, pitch, roll) using a least-squares fit algorithm (AFNI plug-in). We then performed multiple linear regression analyses based on the button responses modeling the sound stimuli relative to silent events. With the clustered acquisition design, the whole-brain BOLD response to each sound stimulus could be treated as an independent event. Consequently, we could disregard or censor particular events from the regression model in accordance with each subject’s button responses (e.g. including only those sound stimulus pairs for which the forward-played version was judged as recognizable and the corresponding backward-played version was not recognizable).
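Because each clustered acquisition yields one independent whole-brain sample per trial, censoring amounts to dropping rows from the regression design before fitting. A toy single-voxel sketch of this idea (the labels and signal values are hypothetical, and AFNI's actual regression machinery is considerably more elaborate):

```python
import numpy as np

# Hypothetical per-trial labels and single-voxel BOLD values; with a
# clustered acquisition, each volume is one independent sample.
# 'RF' = recognized forward, 'UB' = unrecognized backward,
# 'sil' = silent baseline, 'other' = trials to censor.
labels = np.array(['RF', 'sil', 'UB', 'RF', 'sil', 'UB', 'other', 'sil'])
bold = np.array([2.1, 0.1, 1.9, 2.3, -0.1, 2.0, 1.5, 0.0])

keep = labels != 'other'              # censor events per button responses
X = np.column_stack([
    np.ones(int(keep.sum())),         # intercept: silent baseline
    labels[keep] == 'RF',             # recognized forward sounds
    labels[keep] == 'UB',             # unrecognized backward sounds
]).astype(float)

# Ordinary least squares fit of the censored design
beta, *_ = np.linalg.lstsq(X, bold[keep], rcond=None)
contrast = beta[1] - beta[2]          # RF minus UB: the 'recognition' effect
```

With disjoint indicator regressors plus an intercept, the fitted coefficients reduce to condition means relative to the silent baseline, which is why the whole-brain contrast in the study can be read as a subtraction of recognized from unrecognized responses.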
For the group results, individual anatomical and functional brain maps were transformed, using AFNI, into standardized Talairach coordinate space (Talairach and Tournoux, 1988). Functional data (multiple regression coefficients) were spatially low-pass filtered (4 mm rms Gaussian filter), then merged by combining coefficient values for each interpolated voxel across all subjects. For the main data in Figure 2, the combination of individual voxel probability threshold (t-test, P < 0.001) and the cluster size threshold (3 voxel minimum) yielded the equivalent of a whole-brain corrected significance level of α < 0.05. A split-half (or ‘cross-validation’) correlation test (Binder et al., 1997) was used to estimate how well the environmental sound recognition pattern of activation would generalize to other subject samples. Using the same threshold setting as in the full data set, a voxel-by-voxel correlation between two subgroups (roughly matched for age, gender and button response fingering order) was 0.60, yielding a Spearman–Brown estimated reliability coefficient of ρXY = 0.75 for the entire sample of 24 subjects. This indicates the level of correlation that would be expected between the activation pattern of our sample of 24 subjects and activation patterns from other random samples of 24 subjects matched for age and gender. Public domain software packages SureFit and Caret (http://brainmap.wustl.edu:8081/sums) were used to project data onto the Colin Brain atlas model in Talairach coordinate space, and were used to display the data (Van Essen et al., 2001; Van Essen, 2004).
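The reported reliability coefficient follows directly from the split-half correlation via the Spearman–Brown prophecy formula, which can be verified in a few lines:

```python
def spearman_brown(r_half: float, k: float = 2.0) -> float:
    """Spearman-Brown prophecy formula: predicted reliability when the
    'test length' (here, the subject sample) is scaled by factor k.
    For a split-half correlation, k = 2 projects the half-sample
    correlation onto the full sample."""
    return k * r_half / (1.0 + (k - 1.0) * r_half)

# Split-half correlation of 0.60 between the two subgroups of 12
rho_xy = spearman_brown(0.60)  # estimated reliability for all 24 subjects
```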
Results

For each subject, we initially analyzed only those sound stimuli that were judged as both recognizable when played forwards and unrecognizable when played backwards (on average 56 of 105 sound pairs were retained, see Appendix numbers in parentheses). Our first analysis, illustrated in Figure 2a, examined the group-averaged (n = 24) pattern of activation (yellow hues) evoked by the recognizable, forward-played sound stimuli relative to silence. Similarly, Figure 2b shows the pattern for the corresponding backward-played, unrecognizable sound stimuli relative to silence. As expected, in both comparisons the strongest and most extensive activation (75% or greater of maximum intensity MR response) included primary auditory cortex (PAC) and the immediately surrounding cortex on the superior temporal plane (collectively termed PAC+) (Engelien et al., 1995; Giraud and Price, 2001; Maeder et al., 2001; Wessinger et al., 2001), located bilaterally within the lateral sulcus (LaS) and superior temporal gyrus (STG). Moderate activation (∼25–75% of maximum intensity) was present bilaterally in inferior frontal cortex, and in anterior cingulate cortex (not visible in lateral views). Additionally, there was activity within a large swath of cortex including the left central sulcus (CeS) and left post-central sulcus (PoCeS), which was most likely related to planning and/or production of the right hand button presses (Burton et al., 1999; Burton, 2002). Finally, in both conditions, a few regions (dark blue) showed either a depression below baseline in response to sound presentation or relatively greater activation during the silent periods. This included bilateral portions of extrastriate visual cortex, the right precentral cortex, and left orbital frontal cortex (not visible in lateral views).
To reveal brain regions preferentially involved in the process of recognizing environmental sounds, we effectively subtracted (via multiple linear regression analysis) the activation pattern for the unrecognized, backward-played sounds (Fig. 2b) from that for the recognized, forward-played (Fig. 2a) sounds (prior to thresholding). The resulting group-averaged pattern of cortical activity is illustrated on a three-dimensional model in Figure 2c, on the corresponding flat map representation in Figure 2d, and on select axial slices from the brain of one subject in Figure 2e. The Talairach coordinates of specific cortical foci are indicated in Table 1. Since the forward- and backward-played sounds were matched on many physical attributes (refer to Materials and Methods), they comparably activated auditory input stages, including the PAC+ and the STG bilaterally. A predominant difference between the forward and backward sounds in this analysis was that only the forward-played sounds were judged as recognizable. Thus, the activation foci revealed after the subtraction (in Fig. 2c–e, yellow) should largely reflect high-level processing associated with the recognition (and/or identification) of environmental sounds.
The main novel finding of this study was that environmental sound recognition evoked activity bilaterally in and surrounding posterior portions of the middle temporal gyri (pMTG) and superior temporal sulci (pSTS), including a single robust but isolated site in the right hemisphere, and a larger, more ventrally directed focus in the left hemisphere. Several of the other cortical foci revealed by this contrast were strongly left lateralized (inferior frontal cortex, anterior fusiform gyrus, angular gyrus, and to some extent the posterior cingulate cortex), and in locations consistent with their involvement in retrieving and selecting semantic and verbal information (Binder et al., 1999; Price, 2000; Martin and Chao, 2001; Binder et al., 2003). These activation sites are specifically addressed further in the Discussion.
Regardless of the fingering scheme used for button responses, the group-averaged reaction time to the sounds judged as recognizable was 2.6 s (after sound onset), compared to 3.0 s for the corresponding backward-played, unrecognizable sounds. The significantly longer reaction times to the backward sounds [F(1, 22) =16.1, P < 0.0006, cross-nested three-factor ANOVA] may imply that there were greater processing demands required to judge a backward-played sound as unrecognizable. Consequently, this would suggest that those regions preferentially engaged by the recognizable environmental sounds (yellow in Fig. 2c–e) were not simply modulated by overall greater task difficulty or task demands (Barch et al., 1997).
Regional Analyses Involving ‘Miss-trials’
The backward-played sounds were chosen as a baseline condition primarily to address the issue of ‘recognition’ while controlling for numerous physical features of sound. Although the two sound datasets were ideally matched for overall duration, intensity, spectral variation, spectral content and acoustic complexity, they did differ in that the backward sounds were distorted in the temporal domain, thereby altering the temporal phase and onset of events. Thus, the activation in Figure 2 could conceivably reflect differences in the temporal acoustic properties of the forward- versus backward-played sounds, rather than differences in recognition. To investigate this issue, we focused specifically on the process of recognition itself by comparing the ‘miss-trials’ (i.e. the forward-played sounds judged as unrecognizable).
Figure 3 illustrates a comparison of BOLD signal differences within seven different regions of interest (ROIs). These ROIs were derived from the data shown in Figure 2c–e (yellow), and column 1 depicts the relative MR signals for the Recognized Forward (RF) versus Unrecognized Backward (UB) sounds, all being significantly positive in sign. Column 2 shows a similar Recognized versus Unrecognized comparison, but only includes sounds that were Forward-played (RF versus UF). If any of the ROIs were not sensitive to recognition per se, then the sign of the differential MR activity in column 2 should ideally drop to zero or be negative. However, in this comparison all of the cortical regions of interest were significantly positive in sign, suggesting that responses in these ROIs did indeed reflect recognition rather than stimulus differences. This was further supported by another analysis whose results are shown in column 3. This analysis directly compared responses among only the Unrecognized sounds: Forward versus Backward (UF versus UB). For such trials, no judgment of recognition was reported, so any potential differences should be due to differences in the stimulus properties of forward versus backward sounds. In this case, five of the seven ROIs showed no significant difference from zero (t-test, P < 0.05). In the remaining two ROIs (posterior cingulate and left angular gyrus) the small differences were negative in sign and are likely to reflect task-unrelated factors (Binder et al., 1999, 2003). Together, these data indicate that the acoustical differences between forward and backward sounds did not account for our main results. [We also compared the ‘false positive’ trials, examining the Backward-played sounds judged as Recognizable (RB). The results from a full complement of miss-trial cross-comparisons did qualitatively support the findings illustrated in Figure 3, in that the ROIs were primarily involved in the process of ‘perceived recognition’. However, establishing on what grounds a subject ‘recognized’ a backward-played sound was more difficult to interpret than a failure to recognize a forward-played sound. Thus, we opted to focus on the Unrecognized Forward (UF) type of miss-trial.]
For the main paradigm, we compared the whole-brain activation patterns of the men with those of the women. Overall, both group-averaged data sets showed activation patterns comparable to that shown in Figure 2, with the notable exception of a focus in the right inferior frontal cortex (IFC), indicated by the dotted yellow outline in Figure 2d. In response to recognizable sounds, this right IFC focus showed a significantly increased BOLD signal in males and a decreased signal in females (peak activation at Talairach x = 44, y = 18, z = 11; α < 0.05 in a two-sample t-test for means). However, this region was no longer statistically significant after merging data across genders.
Discussion

We used fMRI to measure brain responses to a wide range of recognizable environmental (non-verbal) sounds in contrast to comparably complex, yet unrecognizable, backward versions of the same sounds. The results revealed a network of brain regions that were preferentially involved in the process of recognizing environmental sounds. In contrast to earlier studies that examined environmental sounds, the present study revealed strong bilateral activation in posterior portions of the middle temporal gyri (pMTG, left > right). Moreover, the pMTG foci, in addition to the other regions of this network, were shown to represent a hierarchical stage beyond the early processing of the physical features of spectrally and temporally complex sounds, since earlier sound processing stages (e.g. primary and nearby surrounding auditory cortex, and the STG), were comparably activated by both the recognizable sounds and unrecognizable control sounds.
To interpret the environmental sound recognition data further, Figure 4 shows a direct comparison of the present findings with three other datasets illustrating cortical processing networks described in previous publications from our institution (colored cortex), all superimposed onto one brain model. These studies used the same scanning equipment and similar processing techniques, thereby providing a relatively accurate and direct comparison. Together, these data support distinctions between cortical regions engaged in (i) input stages of acoustic processing (blue hues); (ii) phonetic and semantic aspects of spoken language (red); and (iii) recognizing non-verbal, environmental sounds (yellow, from Fig. 2c,d). A system involved in visual motion processing is also illustrated (green), which may share some later stage processing mechanisms in common with environmental sound recognition. Furthermore, we have included the Talairach coordinate foci reported in several earlier studies germane to present findings (Fig. 4a symbols). Together, these data are discussed in the context of (i) current sound processing models; (ii) semantic knowledge; (iii) visual object recognition and multimodal pathways; and (iv) human lesion studies.
A Cortical Model for Environmental Sound and Spoken Language Processing
Several of the cortical regions activated by recognizable environmental sounds in the present study are largely consistent with, and appear to parallel, current neurological and cognitive models for how spoken language is processed. This includes input, intermediate and output processing stages (Grabowski and Damasio, 2000; Price, 2000; Binder and Price, 2001), which largely apply to words depicting a wide range of categories (Démonet et al., 1992; Vandenberghe et al., 1996; Price et al., 1997; Bookheimer et al., 1998; Lebrun et al., 1998; Mummery et al., 1998; Binder et al., 1999; Pulvermuller, 2001; Roskies et al., 2001; Grossman et al., 2002; Noppeney and Price, 2002).
The blue hues in Figure 4 illustrate some of the input stages of acoustical processing reported in a study by Binder et al. (2000). Light blue shows cortex more responsive to tones than white noise, and dark blue represents cortex more responsive to passively heard words than tones. These data (n = 28) show a dorsal to ventral progression along the bilateral STG related to the processing of increasingly complex acoustic structure. Much of this bilateral cortex was also activated by both the recognized and unrecognized environmental sounds of the present study (compare Fig. 2a,b with Fig. 4a blue).
Previous imaging studies involving environmental sounds, which were contrasted with a variety of different baseline conditions, have also reported activity along what might be considered input cortical sites in or near the STG. For instance, Engelien et al. (1995) compared passive listening to environmental sounds to a silent rest condition and similarly revealed strong bilateral PAC+ and STG activity. Giraud and Price (2001) found that middle portions of the left and right STG were more responsive to environmental sounds produced by living and nonliving things (including speech sounds) relative to white noise. Similarly, Maeder et al. (2001) observed middle portions of the left and right STG regions, among other sites, that responded more to environmental sound recognition than to localization of white noise stimuli. Consistent with these earlier studies, the present study showed that large extents of the bilateral PAC+ and STG were strongly activated in response to hearing and attending to a wide range of recognizable environmental sounds. Our data suggest that these various STG foci were activated in the previously mentioned studies because of differences in the acoustic complexity of the sounds (or differences in the degree of attention paid to the stimuli) relative to the control sounds (e.g. white noise). However, these input cortical STG sites appear to be relatively insensitive to the recognition of the sounds we presented, as they were comparably activated by the corresponding unrecognizable backward control sounds.
Intermediate stages of speech processing include lexical-semantic and other associative processes. Red regions in Figure 4, from a study by Binder et al. (1997), depict cortex at intermediate and output processing stages, being more responsive to the comprehension of spoken words (recalling knowledge pertaining to animal names) than to processing simple tone patterns (n = 30). Many of the brain regions involved in recognizing the environmental sounds of the present study, outside the bilateral pMTG and left SMG, showed a large degree of overlap with those involved in comprehending spoken words (Fig. 4, orange). This also included portions of subcortical structures such as the medial thalamus and caudate nuclei (not shown). Common to both studies, subjects were required to recognize the sounds and, to varying degrees, their meaning, thereby placing demands on semantic knowledge retrieval. Additionally, subjects in the present study indicated that they would ‘internally’ name (subvocalize in their head) many of the environmental sounds, perhaps giving them a better sense that they had accurately recognized the sounds. Together, these common task demands are likely to account for much of the overlap in terms of activation of lexical retrieval and phonological planning or short-term verbal memory processes (Price et al., 1994; Paus et al., 1996; Hickok et al., 2000; Wise et al., 2001; Binder, 2002).
Curiously, we did not observe significant differential activity for recognized environmental sounds along the middle portions of the MTG (mMTG) in either hemisphere (near or overlapping the progression of blue to purple to red cortex in Fig. 4a). Previous auditory studies of both human and non-human primates suggest that the cortex in this vicinity constitutes a major part of the ventrally directed ‘what’ stream (Binder et al., 2000; Rauschecker and Tian, 2000; Maeder et al., 2001). One possibility for this lack of differential activation was that subjects were not required to explicitly ‘identify’ each sound, such that there were insufficient task demands to modulate these regions. Alternatively, the lack of mMTG differential activity may reflect the stimulus properties or the category of sounds we ultimately retained in our analysis, which were biased by sound-sources that depicted manipulated objects and objects that typically have strong visual motion associations (see Semantic knowledge section below). Preliminary data pertaining to the processing of tool versus animal sounds support this latter hypothesis (Lewis and DeYoe, 2003), indicating that animal vocalizations (which were only sparsely represented in the present study) do preferentially activate the bilateral mMTG foci. Thus, animal sounds and speech sounds may be more effective stimuli for evoking the ventral temporal processing stream.
The present data appear to support the placement of the pMTG foci at intermediate, as opposed to output, processing stages. This is based in part on the location of the left pMTG, being situated between other intermediate regions for spoken language processing (Grabowski and Damasio, 2000; Price, 2000; Binder and Price, 2001). Additionally, the pMTG foci overlap cortex previously implicated in other aspects of semantic processing and in complex visual motion processing, both supporting an associative role with visual information, which is discussed in greater detail in the sections below.
In contrast to earlier studies involving environmental sounds, the bilateral activation of the pMTG foci appears to be much more pronounced in, if not unique to, the present study. Engelien et al. (1995) observed greater activity in middle portions of the left MTG region when subjects categorized, as opposed to passively listened to, a variety of environmental sounds. However, their left MTG focus was located well anterior to the left pMTG focus of the present study. Their focus may relate more to the semantic processes of explicitly categorizing the sounds, which was not required of the subjects in the present study. The sound recognition study by Maeder et al. (2001), wherein subjects attended to animal cries amidst complex auditory scenes, revealed foci in the left and right angular gyri that appear to have included extensions of weaker activation into the vicinity of the left and right pMTG. The wider range of environmental sounds and actions that were specifically ‘attended to’ in the present study (e.g. manipulated tools, fluid movement and dropped objects) may explain this greater degree and extent of activation in the bilateral pMTG regions.
The left supramarginal gyrus (SMG) together with neighboring parietal cortex was also prominently activated in the environmental sound recognition paradigm (Fig. 4a), and may also represent an intermediate processing stage. Ventral portions of the left SMG have been implicated in tasks that require maintenance and manipulation of phonological information (Paulesu et al., 1993; Mummery et al., 1998; Binder and Price, 2001). However, in the present study the left SMG activation was situated more dorsally than in the above studies. Rather, this focus was contiguous with the large swath of activation along the left pre- and post-central cortex (cf. Fig. 2a versus 2c), which is perhaps more closely associated with the production of button responses (Burton et al., 1999; Burton, 2002). Alternatively, preliminary data suggest that the left SMG activation (Brodmann area 40, or a possible homologue to monkey area 7b) may be related to audio-tactile or audio-motor associations (Lewis and DeYoe, 2003), being evoked by sounds produced by objects that are typically manipulated with the dominant (right) hand. Thus, it remains unclear whether the dorsal SMG activity of the present study was related to (i) covert phonological processing; (ii) slight differences in tactile attention, planning, or production of right hand button presses; or (iii) audio-tactile and audio-motor associations with some of the recognizable environmental sounds.
The large expanse of activity in the left IFC evoked by recognizable environmental sounds (including the pars opercularis and triangularis of the inferior frontal gyrus) was consistent with representing output stages of processing. These stages involve phonological access and articulatory planning (whether or not vocalizations are subsequently produced), which can engage the left frontal operculum, left anterior insula and parietal operculum (Thompson-Schill et al., 1997; Grabowski and Damasio, 2000; Price, 2000; Binder and Price, 2001). Portions of the left IFC were also preferentially activated in earlier environmental sound studies that explicitly or implicitly involved sound recognition. This includes studies in which subjects actively categorized, as opposed to passively listened to, environmental sounds (Engelien et al., 1995), or attended to environmental sounds in contrast to white noise (Giraud and Price, 2001; Maeder et al., 2001). Additionally, portions of the left IFC have also been activated with successful recognition of visual objects (Bar et al., 2001). This process of identifying or naming environmental sounds or visual objects was potentially common to all the above recognition studies, and is consistent with the language processing ‘output’ role for the left IFC.
The right IFC may also have a role in sound recognition. In the present study, a small activation focus was present in the ventral-most portion of the right IFC (Fig. 2d; along the orbital sulcus), though its function remains unclear. In a study by Humphries et al. (2001), a larger expanse of the right IFC (and portions of the left dorsal prefrontal cortex) showed a greater response to sequences of environmental sound stimuli (e.g. a gunshot and then the sound of footsteps quickly fading into the distance) in contrast to hearing spoken sentences that described the same event. Their relatively strong activation in the right IFC may be explained by the nature of the ‘nonverbal versus verbal’ contrast, which is qualitatively different from the ‘recognized versus unrecognized’ contrast we performed. Interestingly, a separate analysis of the present data by gender did reveal significant activity along a larger extent of dorsal portions of the right IFC region, but only for males (Fig. 2d, yellow dotted outline). This finding appears to be at odds with earlier gender studies that suggest more bilateral processing in females than males (Shaywitz et al., 1995; Jaeger et al., 1998). Although issues of laterality and gender remain to be resolved, the present data do support a role for the left, and possibly right, IFC regions in environmental sound recognition and/or identification.
The posterior cingulate was activated in both the environmental sound recognition and spoken word semantics paradigm (Fig. 4b,c, orange). This region has been proposed to function in the retrieval of information from long-term memory (Valenstein et al., 1987; Vogt et al., 1992; Binder et al., 1999), which may be part of a mechanism for judging whether or not a sound is recognizable or ‘familiar’. Others have proposed a role for the posterior cingulate cortex in the spatial distribution of attention (Shulman et al., 1997; Raichle et al., 2001; Corbetta et al., 2002) and processing of emotional state (Maddock, 1999). Presently, the actual role(s) and placement of the posterior cingulate within a cognitive model of sound recognition remains unclear.
Portions of the left pMTG focus of the present study overlapped cortex implicated in storing knowledge associated with processing and identifying different object categories, notably including tools or artifacts. Spoken words, in addition to written words, photographs, drawings, and videos depicting different object categories have been used extensively to address issues pertaining to categorical knowledge and whether or not they are preferentially processed along different cortical pathways (Warrington and Shallice, 1984; Hillis and Caramazza, 1991; Perani et al., 1995; Damasio et al., 1996; Martin et al., 1996; Tranel et al., 1997; Mummery et al., 1998; Spitzer et al., 1998; Chao et al., 1999; Moore and Price, 1999; Perani et al., 1999; Martin, 2001; Martin and Chao, 2001; Beauchamp et al., 2002; Devlin et al., 2002; Grossman et al., 2002). Several of these studies suggest that depictions of tools and artifacts preferentially activate a network including cortex near the left pMTG, left premotor (inferior precentral), and fusiform gyrus (for review, see Martin, 2001). The open triangles in Figure 4a show the Talairach coordinates for several of the reported foci (projected laterally to the outer surface for visibility) implicated in tool-related knowledge (Martin et al., 1996; Mummery et al., 1998; Chao et al., 1999; Moore and Price, 1999; Perani et al., 1999; Beauchamp et al., 2002; Grèzes and Decety, 2002; Grossman et al., 2002). Most of the environmental sound stimuli retained in our analysis depicted objects and events that were associated with or produced by implements and tools. Though our data did not fully address whether different categories of sound-sources (e.g. tools versus animals) were processed by different networks or sub-networks, they do demonstrate that at least portions of the ‘tool-related network’ defined in visual- and word-related studies can also be activated by recognizable environmental sounds.
Portions of the left pMTG have also been implicated in the retrieval of ‘action’ knowledge. In a study by Phillips et al. (2002), subjects were instructed to indicate if a visually presented object (such as a tool) could, for instance, be manipulated by a twisting motion. This was in contrast to making a judgment as to the relative size of the same stimulus (‘perceptual’ knowledge). They reported activity in the left pMTG region (the ‘X’ in Fig. 4a) that was specific to action knowledge retrieval. Thus, activity in the left pMTG may not be related so much to the category or categories of objects being depicted per se, but rather to the type of semantic knowledge (the task) that is being engaged. This finding is consistent with the present results, in that most of the environmental sounds that we presented were associated with complex movements or manipulations (ostensibly including tools and implements in action). This may account for why the left pMTG did not show much overlap with the spoken word paradigm in Figure 4 (red). The process of recognizing the wide variety of environmental sounds, in contrast to the spoken animal names, may have placed greater demands on knowledge pertaining to how the sound was likely to have been produced (e.g. the visual or motor actions associated with the sound production).
Relation to the Visual System
In contrast to many of the semantic and language processing studies mentioned above, the present study revealed significant activity evoked by environmental sound recognition in both the left and right pMTG regions. The bilateral representation of these activation foci together with their functional characteristics show some parallels to the visual recognition system. Additionally, their physical location, situated between auditory and visual cortex proper, was highly suggestive of a possible role in audio-visual or multimodal processing.
Parallels between Sound and Visual Recognition Pathways
The sound recognition network revealed in the present study appears to parallel an object recognition or ‘what is it?’ pathway in the visual system. In humans, visual object processing is thought to follow a hierarchical progression bilaterally, from low-level cues in early visual areas (i.e. V1, V2, hMT, etc.), to general object shape processing in the lateral occipital complex (near the LOS) and occipito-temporal sulcus (OTS) region, to more category-specific processing in the ventral temporal cortex, such as for faces or common objects (Malach et al., 1995; Bar et al., 2001; Grill-Spector et al., 2001; Haxby et al., 2001), and possibly to other structures along the temporal lobe (Murray and Richmond, 2001). In a similar manner, and largely consistent with earlier auditory studies (Engelien et al., 1995; Clarke et al., 2000; Rauschecker and Tian, 2000; Giraud and Price, 2001; Maeder et al., 2001), a pathway for environmental sound recognition appears to follow a progression starting from low-level input cues in early auditory cortex bilaterally (PAC, PAC+ and STG). This presumably leads to the higher-level sound recognition processing in the bilateral pMTG plus a variety of mostly left-lateralized structures implicated in semantic and/or linguistic processing. However, the actual processing hierarchy of the pMTG foci relative to these other semantic-related structures remains to be established.
The pMTG foci and OTS/LO complex may be at comparable processing levels in the auditory and visual systems, respectively, both being involved in the process of recognition. Analogous to the pMTG foci in sound recognition, the LO complex (e.g. Fig. 4a, asterisks) is preferentially activated by a wide range of identifiable visual objects and palpated objects as opposed to unidentifiable scrambled objects or palpated textures (Malach et al., 1995; Amedi et al., 2001; Grill-Spector et al., 2001). Additionally, portions of the OTS are known to be modulated more by visual recognition success than by simple stimulus parameters (Bar et al., 2001). Based on their close cortical proximity, the pMTG and OTS/LO complex in both hemispheres may be among the first cortical sites where the auditory and visual (and possibly tactile) recognition pathways can interact, a possibility addressed below. However, specifying the true extent of overlap and verifying multimodal response properties awaits further study.
The pMTG Foci and Multimodal Processing
The location and activation characteristics of the bilateral pMTG foci are consistent with their involvement in processing audio-visual or multimodal (or possibly supramodal) motion information for purposes of sound-source (‘object’) recognition. For instance, upon hearing and seeing an audio-visual event (such as a ping pong ball bouncing to rest) the temporal dynamics of the sound and motion attributes of the sight covary in time between the two sensory pathways. Based on non-human primate studies, this sensory information would, at least initially, be processed in the respective primary sensory cortices in both hemispheres, and then propagate along higher-level modality-specific regions (Van Essen et al., 1990; Rauschecker, 1998; Kaas et al., 1999). In the macaque monkey, the information from both modalities may then converge in ‘association’ cortices, such as cortex near the left and right posterior STS (Leinonen et al., 1980; Bruce et al., 1981). Homologous cortical regions in humans, possibly including the bilateral pMTG foci, may be involved in processing such stimulus-driven (‘bottom up’) covariance, and these cortical representations might a priori be expected to be present in both hemispheres.
Further illustrating a possible audio-visual link, the green regions in Figure 4 show cortex involved in visual motion discrimination during a task in which subjects assessed the speed of coherently moving dot arrays in contrast to randomly moving dots (n = 7) (Lewis et al., 2000). This paradigm activated the hMT+ (or V5+) visual motion complex, which in both hemispheres showed only a small degree of overlap with the pMTG sound recognition foci (Fig. 4c, chartreuse). However, cortical regions activated by more complex visual motion stimuli are known to be located anterior to the hMT+ complex, and thus partially overlap the pMTG foci in both hemispheres. For instance, the open circles in Figure 4a depict the Talairach coordinates reported in several studies for cortex involved in processing biological motion. This included the viewing of point-light displays defining movements of the human body or hand (Bonda et al., 1996; Grossman et al., 2000; Grèzes et al., 2001; Grossman and Blake, 2002), and video clips showing eye, mouth, or whole body movements (Puce et al., 1998; Beauchamp et al., 2002). Furthermore, portions of the left and right pMTG foci overlapped cortex activated by visual lip-reading (Fig. 4a, open squares) that was greatly enhanced when the corresponding speech sounds were present (Calvert et al., 2000), suggesting that at least portions of the bilateral pMTG are involved in audio-visual integration.
With regard to multimodal (or supramodal) integration, normally sighted and hearing people typically become familiar with environmental sounds while simultaneously viewing and/or manipulating the sound-source (object) itself, such as pounding with a hammer or removing the cork from a bottle of wine. When subjects were asked to identify some of the forward-played sounds (replayed to them after the scan session), in some instances they would physically gesture and/or indicate that they could visualize how the sound was produced before they could provide an accurate verbal label (akin to the tip-of-the-tongue phenomenon) (Brown, 2000). One possibility is that the pMTG foci are involved in learning or mediating the stimulus-driven multimodal correlations between the sight and sound (and possibly touch and motor actions) associated with object movements, consistent with a ‘sensory-functional’ hypothesis for how feature information is encoded (Ettlinger and Wilson, 1990; Grossman et al., 2002). Consequently, upon hearing environmental sounds in isolation the pMTG foci may be recruited, evoking a sense of recognition of the sound-source either together with, or by way of, its multimodal associations.
A related possibility is that the activity in the bilateral pMTG reflects processing pertaining more to multimodal associations or visual imagery subsequent to the ‘initial recognition’ of the sound. A comparison of the incoming sounds with stored representations may be taking place at earlier processing stages, such as in the bilateral STG regions. Any lack of differential BOLD activation (i.e. in response to the recognized versus unrecognized sounds) within such early recognition stages could have been due to processes beyond the resolution of our fMRI paradigm. For instance, differential processing may be taking place at a local neural circuit level (e.g. interspersed subpopulations of neurons responding differentially to sounds). Nonetheless, our results do suggest that the bilateral pMTG foci are involved in associating an auditory percept with stored knowledge pertaining to the likely sound-source. Thus, by this broader definition, the pMTG foci should be considered as having a role in the process of sound recognition.
Human Lesions and Sound Recognition
Lesions to the right, left, or bilateral temporal or temporal-parietal cortex can lead to severe impairments in the recognition of non-verbal, environmental sounds, while largely sparing verbal comprehension (Spreen et al., 1965; Albert et al., 1972; Vignolo, 1982; Fujii et al., 1990; Schnider et al., 1994; Engelien et al., 1995; Clarke et al., 2000; Saygin and Moineau, 2002). The presence of distinct bilateral pMTG foci involved in environmental sound recognition may, in part, provide a specific anatomical and functional explanation for the reported symptoms: namely, that right hemisphere damage more commonly leads to a ‘pure’ auditory agnosia associated with discriminative or acoustic errors, while left hemisphere damage tends to produce more semantic-associative errors and is more likely to produce additional deficits in spoken language comprehension (Vignolo, 1982; Schnider et al., 1994). The right pMTG focus for sound recognition (Fig. 4, yellow) is sufficiently well isolated from most spoken language systems (red) that a lesion of widely varying size could disrupt environmental sound processing without also disrupting language functions. In the left hemisphere, however, only a focal lesion to portions of the pMTG cortex might selectively disrupt environmental sound processing without also disrupting the greater expanse of immediately surrounding language-related structures.
Environmental sounds listed in the order presented. Parentheses indicate the number of subjects for which that particular sound (forward plus backward presentation) was retained for analysis.
ice dropped into glass (14)
glass breaking 1 (19)
rotary phone dialing (16)
wood dropped on floor 1 (17)
wood dropped on floor 2 (20)
hammer hitting anvil (21)
creaking door 1 (8)
audience clapping (18)
bubbles in water (13)
hammer hitting nail 1 (20)
door opening (19)
boxing bell (12)
glass breaking 2 (21)
Polaroid camera (6)
ping pong rally (20)
card shuffle (14)
striking match (4)
air horn (17)
scissors cutting hair (3)
puppy barking (10)
cannon fire (14)
sonar ping (8)
keys jingle & toss (7)
machine gun (19)
drum accent (17)
pulling tape from dispenser (6)
golf putt into cup 1 (16)
tennis rally (19)
typing on computer (16)
cracking an egg (15)
liquid filling vessel (9)
bowling strike 1 (19)
chopping vegetables (14)
toasting with glass (22)
beating eggs (8)
car crash (0)
slot machine payout (8)
forest fire (4)
gasoline sloshing in can (19)
door open & close (23)
pouring beer (8)
billiards break shot (22)
revolver gun shot (19)
draining bathtub (16)
removing a cork (21)
manual typewriter 1 (10)
whisky pouring into glass (19)
hammering nail 2 (23)
creaking door 2 (9)
horse galloping (15)
applying handcuffs (5)
opening can of soda (8)
knocking on door (23)
locking door with key (11)
grandfather clock (6)
small gun fire (18)
ping pong rally 2 (23)
pool hall shot (21)
manual typewriter 2 (13)
glass breaking 3 (12)
ascending chimes (21)
bongo drums (22)
coins falling into drawer (11)
water dripping in tub (22)
chopping carrot (8)
flipping through magazine (4)
stapling paper (10)
ping pong bounce to rest (21)
push in coin slot (12)
manual typewriter 3 (13)
dribbling basketball (20)
water dripping in metal dish (21)
toilet flushing (5)
paper cutter cutting (7)
bird chirping (17)
covering garbage can (16)
bowling strike 2 (22)
woman coughing (7)
child laughter (12)
ricocheting bullet (17)
billiards shot into pocket (21)
dive into swimming pool (15)
golf putt into cup 2 (21)
typewriter carriage return (7)
filling bathtub (13)
bite & chew chips (10)
rooster call (3)
tennis serve & return (14)
shaking dice (1)
tiger growl (3)
basketball thru net & bounce (14)
drink and swallow (13)
metal file cabinet closing (14)
racquetball rally (21)
pig oinking (7)
unscrew & remove jar lid (12)
parking meter (6)
Supplementary material can be found at: http://www.cercor.oupjournals.org/.
We thank Doug Ward for assistance with paradigm design and statistical analyses, Jennifer Junion-Dienger for assistance with acquiring sound samples, Jon Wieser, Dr David Van Essen, Donna Hanlon, and John Harwell for assistance with cortical flattening and data presentation, and Wendy Huddleston for comments on the manuscript. This work was supported by grant R03 DC04642 to J.W.L., and grants EY10244 and MH51358 to E.A.D., and grant RR00058 to MCW.
| Anatomical location | Talairach x | Talairach y | Talairach z | Volume (mm³) |
| --- | --- | --- | --- | --- |
| pMTG (and pSTS) | 50 | –49 | 13 | 1892 |
| Inferior frontal g. (orbital s.) | 31 | 32 | –3 | 533 |
| pMTG (and pSTS) | –55 | –52 | 5 | 9925 |
| Inferior frontal cortex (IFC) | | | | |
| Inferior precentral sulcus | –41 | 8 | 28 | 5235 |
| ‘Inferior frontal gyrus, dorsal’ | –48 | 32 | 13 | 6217 |
| ‘Inferior frontal gyrus, ventral’ | –42 | 30 | –2 | 4908 |
| Supramarginal gyrus (SMG) | –57 | –38 | 38 | 549 |
The inferior frontal cortex comprised three major foci, evident at higher threshold settings (P < 0.00001, uncorrected). The reported volumes, however, are all based on the data shown in Figure 2 (P < 0.001, corrected to α < 0.05). IT, inferotemporal cortex; pMTG, posterior middle temporal gyrus; pSTS, posterior portions of superior temporal sulcus.