The human brain is thought to process auditory objects along a hierarchical temporal “what” stream that progressively abstracts object information from the low-level structure (e.g., loudness) as processing proceeds along the middle-to-anterior direction. Empirical demonstrations of abstract object encoding, independent of low-level structure, have relied on speech stimuli, and non-speech studies of object-category encoding (e.g., human vocalizations) often lack a systematic assessment of low-level information (e.g., vocalizations are highly harmonic). It is currently unknown whether abstract encoding constitutes a general functional principle that operates for auditory objects other than speech. We combined multivariate analyses of functional imaging data with an accurate analysis of the low-level acoustical information to examine the abstract encoding of non-speech categories. We observed abstract encoding of the living and human-action sound categories in the fine-grained spatial distribution of activity in the middle-to-posterior temporal cortex (e.g., planum temporale). Abstract encoding of auditory objects appears to extend to non-speech biological sounds and to operate in regions other than the anterior temporal lobe. Neural processes for the abstract encoding of auditory objects might have facilitated the emergence of speech categories in our ancestors.
Two questions are at the heart of theories concerning the cortical processing of naturally occurring auditory objects: 1) Which low-level features drive neural processing and 2) how do computations lead to abstract semantic categories robust to large variations in the low-level features? These questions continue to stir a lively debate within the domain of auditory neuroscience. For example, there is a lack of consensus concerning which low-level features are represented in the cortex (Schönwiesner and Zatorre 2009; Recanzone and Cohen 2010), and processing models based on the concept of spectrotemporal receptive fields have had only partial success in accounting for the cortical responses to naturalistic sounds (Machens et al. 2004; Bar-Yosef and Nelken 2007). Further, the empirical evidence for abstract cortical encoding independent of low-level structure is limited to a very specific class of stimuli, harmonic sounds such as human vocalizations or musical tones (Hasson et al. 2007; Leaver and Rauschecker 2010; Kilian-Hütten et al. 2011), whereas studies of highly diverse naturalistic sounds (e.g., animal vocalizations vs. human-action sounds such as sawing wood; Lewis et al. 2005) did not test for abstract encoding. As a consequence, it is still unknown whether the abstraction of cortical representations is a general functional principle that operates for all classes of naturalistic auditory objects. We addressed these questions by analyzing the extent to which the fine-grained spatial functional magnetic resonance imaging (fMRI) patterns measured for highly heterogeneous environmental sounds selectively encoded information about the low-level and category-membership features (Fig. 1).
There is a wide consensus on the hierarchical nature of the auditory cortex, which relies on modules that progressively abstract from the low-level structure to optimize the analysis of auditory objects (e.g., “what/where model,” Romanski et al. 1999). However, 1) There is currently a large disagreement on the location of such abstract sound-processing modules and 2) the majority of the empirical evidence on the existence of such modules has been collected by focusing on a restricted number of the auditory objects that our brain analyzes during our daily life, that is, human vocalizations and, more specifically, speech. Focusing on the location of abstract-processing modules, it has, for instance, been observed that the primary auditory cortex (A1) retains less information about the spectrotemporal structure but more information about abstract properties such as stimulus identity, than does inferior colliculus (Chechik et al. 2006). Consistently, a recent study of perceptually ambiguous speech argues for abstract object-like representations in the early auditory cortex (Kilian-Hütten et al. 2011). Abstract-processing aspects have also been attributed to the planum temporale (PT), an area hypothesized to be involved in matching stored spectrotemporal templates to the incoming sound information (Griffiths and Warren 2002) and to implement a process that abstracts from the fine-grained spectral shape of the incoming signal (Warren et al. 2005; Kumar et al. 2007). Other influential studies locate abstract modules in the anterior temporal cortex, part of a ventral pathway involved in the recognition of auditory objects (Romanski et al. 1999; Rauschecker and Tian 2000; Scott et al. 2000; Davis and Johnsrude 2003; Hasson et al. 2007; Rauschecker and Scott 2009; Goll et al. 2010; Leaver and Rauschecker 2010, for studies involving speech stimuli). 
Other empirical investigations observe instead abstract categorical processes in the posterior superior temporal sulcus (pSTS; e.g., Desai et al. 2008; Okada et al. 2010, for speech stimuli and Warren et al. 2006, for voices), or even argue for abstract representation in the entire sound-sensitive cortex, including areas traditionally assumed to be involved in low-level processing (Formisano et al. 2008; Staeren et al. 2009). An important aspect of previous studies further obscures our understanding of abstract sound encoding in the cortex: They have almost exclusively focused on a relatively homogeneous class of stimuli, that is, highly harmonic vocalizations (e.g., speech or animal vocalizations) and musical instrument tones (Leaver and Rauschecker 2010). This aspect of past studies not only reduces the general validity of current stances on abstract sound processing in the cortex (i.e., it is unknown whether the cortex represents abstractly many non-speech naturalistic auditory objects), but it is also associated with potential methodological problems. Indeed, given the high plasticity and context-dependence of auditory cortical computations (Ulanovsky et al. 2003; Jääskeläinen et al. 2007; Asari and Zador 2009), it is likely that the presence of a large number of speech-like stimuli within the experimental context triggers the activation of language-related abstract-processing modules. More importantly, reverse hierarchy theories of perceptual processing (Ahissar et al. 2009) predict a stronger emphasis on abstract representation for sets of sounds that are relatively similar in the low-level structure (e.g., speech and crying baby, both of which are highly harmonic) than for sets of sounds that are highly heterogeneous in the low-level structure (e.g., inharmonic crackling fire and crying baby). 
For these reasons, it is currently unknown whether abstract processing constitutes a general functional principle implicated in the cortical processing of all classes of naturalistic sounds. To address these issues, we used a condition-rich design with a stimulus set that is heterogeneous in both low-level and category structures. The stimulus set included distinctions between living and non-living, human and non-human and vocal and non-vocal sounds, very few human vocalizations, and no speech stimuli (Supplementary Table S1).
The cortical encoding of highly diverse categories of naturalistic non-speech auditory objects has been investigated by several fMRI and electroencephalography studies (Belin et al. 2000, 2002; Fecteau et al. 2004, 2005; Lewis et al. 2005, 2006, 2010; Pizzamiglio et al. 2005; Kraut et al. 2006; Murray et al. 2006; Altmann et al. 2007; Kaplan and Iacoboni 2007; Doehrmann et al. 2008; Galati et al. 2008; Engel et al. 2009; De Lucia et al. 2010; Leaver and Rauschecker 2010). Despite showing a cortical sensitivity for object categories, this literature is unable to prove the existence of abstract cortical encoding modules because it has not assessed the extent to which category sensitivities can be accounted for by systematic between-category differences in the low-level structure (e.g., a neural network that is selectively activated by vocalizations and not by tool-action sounds might simply be processing a reliable low-level difference between these categories, namely, harmonicity [HNR]). For this reason, it is largely unknown whether the previous category-sensitivity results were the product of abstract cortical encoding based on high-level features that optimize the categorization of sound stimuli even in the absence of systematic between-category differences in the low-level structure. A notable exception to this trend is the recent study by Leaver and Rauschecker (2010), which considered 6 different low-level alternative hypotheses for the cortical encoding of auditory-object categories. Importantly, however, this study investigated only highly harmonic sounds (human and animal vocalizations and musical instrument tones), and is, for this reason, characterized by the same general validity and methodological caveats that we noted for speech-based investigations of abstract cortical encoding. 
In the present study, we extended the approach of Leaver and Rauschecker by considering a larger number of low-level features describing both the long-term and the time-varying sound structure (e.g., Giordano and McAdams 2006; Giordano, Rocchesso, et al. 2010 for the psychophysical relevance of the temporal structure of naturalistic sounds and, e.g., Zatorre and Belin 2001; Poeppel 2003; Boemio et al. 2005; Schönwiesner et al. 2005; Zatorre and Gandour 2008, for their cortical processing). The characterization of the low-level features adopted in this study is, to date, the most extensive among those carried out in previous brain imaging studies of naturalistic auditory objects. Our category-encoding tests thus take into account a comparatively large number of low-level alternative hypotheses about the nature of the cortical sensitivity to categories (Fig. 4 for the cortical encoding of the low-level features considered in this study).
We measured the encoding of object-category and low-level features in the stimulus-specific fine-grained spatial fMRI patterns. For this purpose, we adopted the multivariate method of representational similarity analysis (RSA; Kriegeskorte, Mur, Bandettini 2008), previously applied only in the study of visual object processing (Kriegeskorte, Mur, Ruff, et al. 2008; Fig. 1 for our analysis pipeline). The RSA method assesses the encoding of stimulus-feature information in representational dissimilarity matrices (RDMs), measuring the dissimilarity of the spatial fMRI patterns for different stimuli. The RSA method combines the high statistical sensitivity of multivariate classification methods (Kriegeskorte et al. 2006; Kriegeskorte and Bandettini 2007; Staeren et al. 2009), with a greater flexibility in testing the encoding of feature structures whose representation cannot be easily verified with classical massively univariate analyses. For example, within a univariate framework, category encoding is assumed when the blood oxygenation level dependent (BOLD) response for the exemplars of 1 category differs significantly from that for a different reference category. Importantly, this “activation”-based method can measure between-category differences, but cannot detect within-category effects (e.g., the neural responses for different face pictures are more similar to each other than to the neural responses for pictures of cats; Haxby et al. 2001). The “information-based” RSA approach made it possible to assess a larger number of abstract category-membership effects: 1) Cortical representation of the distinction between categories (between-category differentiation); 2) cortical enhancement of the diversity of the stimuli within the same category (within-category differentiation); and 3) cortical suppression of the differences of stimuli within the same category (within-category compression; Figs 2 and 5).
During scanning, participants were presented with highly identifiable environmental sounds (Supplementary Table S1) and carried out a 1-back repetition-detection task. Analyses relied on the measurement of the association between RDMs and stimulus-feature dissimilarity matrices (SDMs; Figs 1 and 2, and see Material and Methods). Twelve low-level SDMs were derived from the time-varying: 1) Loudness; 2) spectral centroid, a measure of the perceived brightness of a sound; 3) pitch; and 4) harmonic-to-noise ratio (HNR) or, in short, harmonicity, a measure of the amount of periodicity. For each of the 4 time-varying features, we computed 3 SDMs by considering: 1) The median value across time; 2) the amount of temporal change (interquartile range dissimilarity, IQR); and 3) the overall pattern of temporal variation (cross-correlation dissimilarity). Twelve object-category SDMs were derived from the following distinctions: 1) Living versus non-living; 2) human versus non-human; 3) vocal versus non-vocal. One group of object-category SDMs assessed between-category differences in spatial fMRI patterns (Fig. 2A); another group assessed within-category effects such as a large similarity of the fMRI patterns for same-category stimuli (e.g., highly similar fMRI patterns for living sounds; Fig. 2B,C). Variance-decomposition methods made it possible to measure the selective encoding of each of the stimulus features independently of their covariation with non-target features (both low-level and object category). Thus tested, cortical selectivities for object-category features were taken as evidence of abstract cerebral encoding of auditory objects. Based on this methodology, we were able to measure the cortical encoding of several low-level features (Fig. 4) and, most importantly, we observed the abstract cortical encoding of sound-object categories (Fig. 5).
Materials and Methods
Sound stimuli were selected from those investigated by Giordano, McDonnell, et al. (2010). Following standard practices, sounds were equalized in root mean square (RMS) level. Note, however, that cortical activation does not appear to follow the physical intensity of a sound but rather its loudness (Langers et al. 2007), and that RMS equalization does not guarantee constant loudness because it does not take into account the changes in sensitivity across spectral frequencies (Moore 2003). Sounds were 3 s in duration. Sounds from Giordano, McDonnell, et al. (2010) shorter than 3 s were replaced by an alternate excerpt generated by a similar event selected from a variety of commercial databases of sound effects (e.g., Sound Ideas 2004). We selected 32 stimuli: 16 living sounds, generated by the vibration of an object that is part of the body of a living being, such as hands in “clapping hands” and 16 non-living sounds; 16 human-action sounds, generated as a consequence of the motor activity of a human being (such as in “hammering nail”) and 16 non-human sounds; 8 vocal sounds, generated as the consequence of the vibrations in the larynx or syrinx (e.g., “croaking frogs”) or which included such vibrations (e.g., “panting man”) and 24 non-vocal sounds. The human–non-human categorical distinction was perfectly orthogonal to the living–non-living distinction. The vocal–non-vocal distinction was orthogonal to the human–non-human distinction within the category of living sounds (by definition, all vocal sounds are living sounds). Given these stimulus selection constraints, we randomly extracted 20 million stimulus sets from the available samples. Among these random selections, we chose as the final set the selection that: Maximized the average identification performance and had a minimum identification performance score of 50% correct, as measured by Giordano, McDonnell, et al. (2010); minimized the across-sound standard deviation (SD) of the peak of the time-varying level in dB SPL; did not include significant between-category differences (e.g., living versus non-living) in peak dB SPL and in identifiability (P ≥ 0.1).
The measures of identification performance considered during the sound selection process were collected by Giordano, McDonnell, et al. (2010). Identification performance was also measured with the participants in this experiment, subsequent to the scanning session. Based on these measures, all sounds were very accurately identified (average correct = 94%; SD = 6%; minimum correct = 79%), and no differences in identification performance emerged between living and non-living, human and non-human, and vocal and non-vocal sounds (t ≤ 1.74; P ≥ 0.09). Supplementary Table S1 reports the properties of the selected stimuli.
Stimulus-Feature Dissimilarity Matrices
We computed 12 matrices quantifying the pairwise dissimilarity of the sound stimuli relative to various category attributes. The strategy followed to compute category-feature dissimilarity matrices is exemplified in Fig. 2. The first 6 category dissimilarities focused on each of the following categorical dimensions in turn: Living versus non-living; human versus non-human; vocal versus non-vocal. The living–non-living dissimilar matrix equaled 1 if 2 sounds did not belong to the same category (i.e., one sound living, the other non-living) and 0 if the 2 sounds belonged to the same category (i.e., both sounds living or both sounds non-living). The living-dissimilar matrix equaled 1 if both sounds were living and 0 otherwise. A third, non-living-dissimilar matrix, equal to 1 if both sounds were non-living and 0 otherwise, was not considered, to avoid problems with the partial-correlation analyses (see below). Indeed, the sum of this third matrix with the living–non-living dissimilar and living-dissimilar matrices is a constant. As such, the correlation of any of these 3 matrices with a fourth dependent variable would equal 0 after the other 2 matrices are partialed out of the correlation. We adopted the same approach to compute the following matrices: Human–non-human dissimilar; human dissimilar; vocal–non-vocal dissimilar; vocal dissimilar. The final 6 dissimilarities considered the intersection of the 3 main categorical distinctions. Because all vocal sounds are, by definition, living sounds, the intersection of the 3 main categorical distinctions defined 6 independent classes of sound stimuli (e.g., living–human–vocal sounds). The dissimilarity corresponding to each of these 6 intersection classes equaled 1 if both sounds were members of the same intersection class (e.g., both were living–human–vocal sounds) and 0 otherwise. 
It should be noted that among these matrices, only the living–non-living, human–non-human, and vocal–non-vocal matrices were capable of assessing the differentiation between 2 sound categories. All the other matrices could instead model effects specific to a single category, specifically either a comparatively higher differentiation or a comparatively higher similarity of the sounds within the category of interest.
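As an illustration of the coding scheme described above, the following Python sketch builds the living–non-living dissimilar and living-dissimilar matrices from a label vector and checks the linear dependence that motivated dropping the third (non-living) matrix. The function names and the ordering of the labels are assumptions made for this example and are not taken from the study materials.

```python
import numpy as np

def between_category_sdm(labels):
    """1 where the two sounds fall in different categories, 0 otherwise."""
    labels = np.asarray(labels, dtype=bool)
    return (labels[:, None] != labels[None, :]).astype(float)

def within_category_sdm(labels):
    """1 where both sounds belong to the target category, 0 otherwise."""
    labels = np.asarray(labels, dtype=bool)
    return (labels[:, None] & labels[None, :]).astype(float)

# Hypothetical label ordering (True = living): 16 living then 16 non-living sounds.
living = np.array([True] * 16 + [False] * 16)

living_nonliving = between_category_sdm(living)   # between-category matrix
living_only = within_category_sdm(living)         # living-dissimilar matrix
nonliving_only = within_category_sdm(~living)     # the matrix that was left out

# The three matrices sum to a constant, so they are linearly dependent.
total = living_nonliving + living_only + nonliving_only
```

In an RSA setting only the off-diagonal entries for distinct sound pairs would enter the correlations; the full square matrices are kept here only to make the constant-sum property easy to inspect.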
We computed 12 matrices quantifying the dissimilarity of different low-level properties of the sound stimuli. We initially quantified the time-varying profile of 4 different low-level features: Loudness in sones, defined for each frame of analysis as the sum of the specific loudness for the different cochlear filters; spectral centroid in Equivalent Rectangular Bandwidth-rate units (ERB-rate; Moore and Glasberg 1983), defined as the specific-loudness–weighted average of the spectral frequency; HNR in dB, defined as the ratio of the periodic to the non-periodic energy in the sound signal; pitch in ERB-rate units. Time-varying loudness and spectral centroid were derived from the time-varying specific loudness of the sound signals, as computed according to the model of Glasberg and Moore (2002). Time-varying HNR and pitch were computed using the Praat software (Boersma and Weenink 2009). The temporal resolution of each of the time-varying features was 1 ms. We derived 3 dissimilarity matrices for each of the 4 time-varying sound features by using 1 of 3 different mathematical operators. 1) The first 4 dissimilarity matrices measured the absolute value of the difference in the median of the time-varying feature between each pair of sounds. Median dissimilarities focus on the time-independent scale of the sound features. 2) The next 4 matrices measured the absolute value of the difference in the interquartile range of the time-varying feature between each pair of sounds. The interquartile range dissimilarities focus on the amount of temporal change of the sound features. 3) The last 4 dissimilarity matrices measured the between-sounds diversity in the entire pattern of temporal variation, independently of scale (e.g., high vs. low median pitch). For this purpose, dissimilarity was defined as 1 minus the maximum cross-correlation between the time-varying feature measured on sounds A and B (e.g., time-varying loudness for both sounds). 
The cross-correlation was normalized so as to yield a value of −1 for the cross-correlation between 1 signal and its negative at lag 0, and a value of 1 for the cross-correlation between 1 signal and its replica (i.e., autocorrelation) at a temporal lag of 0. In order to yield a scale-independent measure of the dissimilarity between the time-varying profiles, time-varying features were range normalized between 0 and 1 before being analyzed with the cross-correlation algorithm. Importantly, the cross-correlation measures of dissimilarity are independent of onset-time differences between 2 sounds. Finally note that the number of possible low-level measures of sound dissimilarity is in principle infinite because multiple basic representations of the acoustical structure and multiple mathematical or statistical operators can be adopted to quantify the differences between 2 sounds with 1 single number (Peeters et al. 2011, for an extensive list of acoustical features). In this study: 1) We equated the number of low-level and category-membership dissimilarities to avoid skewing the likelihood of observing significant encoding of features belonging to 1 of these 2 groups (e.g., consider a study with 100 low-level and 1 category-membership dissimilarity); 2) we considered plausible models of how the auditory system computes the temporal variation of 4 basic sensory attributes; 3) we applied the same set of (simple) statistical operators to each of the 4 features.
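The three operators described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the cross-correlation is normalized by the square root of the product of the two signal energies, which is one normalization that yields 1 for a signal against its own replica at lag 0 and −1 for a signal against its negative.

```python
import numpy as np

def median_dissimilarity(a, b):
    """Absolute difference between the medians of two time-varying features."""
    return abs(np.median(a) - np.median(b))

def iqr_dissimilarity(a, b):
    """Absolute difference between the interquartile ranges of two features."""
    def iqr(x):
        return np.percentile(x, 75) - np.percentile(x, 25)
    return abs(iqr(a) - iqr(b))

def range_normalize(x):
    """Rescale a profile to the 0-1 range (all zeros if the profile is flat)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def xcorr_dissimilarity(a, b):
    """1 minus the maximum normalized cross-correlation over all lags."""
    a, b = range_normalize(a), range_normalize(b)
    xc = np.correlate(a, b, mode="full")              # all possible lags
    norm = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))   # assumed normalization
    return 1.0 - xc.max() / norm if norm > 0 else 1.0

def build_sdm(features, dissimilarity):
    """Symmetric matrix of pairwise dissimilarities between feature profiles."""
    n = len(features)
    sdm = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sdm[i, j] = sdm[j, i] = dissimilarity(features[i], features[j])
    return sdm
```

Because the maximum is taken over all lags, two profiles that differ only by an onset delay receive a cross-correlation dissimilarity of 0, which is the onset-independence property noted in the text.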
Twenty subjects took part in this study (10 females, 10 males; age = 23.8 years, SD = 4.8 years; average years of experience with the English language = 20.6 years, SD = 5.6 years; number of native English speakers = 11). All participants had limited musical training (years of music performance experience = 2.6, SD = 4.8 years), had normal hearing as assessed with a standard audiometric procedure (Martin and Champlin 2000; ISO 2004), and were right handed (average laterality quotient = 74.3, SD = 17.6) as assessed with the Edinburgh Handedness Inventory (Oldfield 1971). Informed consent was obtained from all individuals, and the protocol was approved by the Ethics Committee of the University of Glasgow.
Participants performed a 1-back repetition-detection task, that is, they were requested to press a key when they heard 2 consecutive presentations of the same stimulus. On each block of trials, participants were presented with the 32 stimuli in random order and with 1 repetition of 2 of the 32 stimuli, for a total of 34 stimulus presentations per block. At the end of each block, participants were presented with 6 consecutive silent stimuli of 3 s duration each. Each participant carried out 16 blocks of trials. Across the experiment, there were thus 32 immediate repetitions, 1 for each of the stimuli. The entire scanning session lasted approximately 60 min.
fMRI Data Acquisition
Participants were scanned with a Siemens 3 Tesla Tim Trio scanner (Siemens, Erlangen, Germany), using a 12-channel head coil. Sound stimuli were presented through electrostatic headphones (Nordic Neuro Lab, Bergen, Norway) at a level of 68 dB SPL. The time to repetition (TR) was 5 s, composed of a 2-s acquisition time and a 3-s silent period during which sound stimuli were played on a silent background. No stimulus-onset jittering was used, and the silent period of 3 s between acquisitions was occupied in its entirety by the auditory stimulation (i.e., inter-stimulus interval = TR and stimulus duration = TR − acquisition time). Each brain volume contained 31 slices of 2.2-mm thickness with an inter-slice distance of 2.75 mm in an axial orientation along the direction of the temporal lobe, providing near full-brain coverage (part of the superior prefrontal cortex and the posterior part of the occipital cortex were not acquired in several subjects and were thus excluded from analysis). The in-plane voxel size was 2 × 2 mm² (64 × 64 matrix). A whole-brain, high-resolution, structural T1-weighted MP-RAGE image (192 sagittal slices, 256 × 256 matrix size, 1 × 1 × 1 mm³ voxel size) was also acquired to characterize the subjects’ anatomy.
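The sparse-sampling timing given above can be combined with the block structure of the task for a rough consistency check on the functional run length. The short Python sketch below does this arithmetic; it assumes that every trial, sound or silent, occupies exactly one TR, which the text implies but does not state outright, and the variable names are chosen for this example only.

```python
# Values stated in the text
acquisition_s = 2                 # acquisition time per volume
silent_gap_s = 3                  # silent period filled by the stimulus
tr_s = acquisition_s + silent_gap_s

sound_trials_per_block = 32 + 2   # 32 stimuli plus 2 repetitions
silent_trials_per_block = 6
blocks = 16

# Assumption: each trial (sound or silent) takes exactly one TR.
volumes = (sound_trials_per_block + silent_trials_per_block) * blocks
functional_minutes = volumes * tr_s / 60
```

Under this assumption the functional acquisition amounts to 640 volumes, or a little over 53 min, which is compatible with the reported session length of approximately 60 min; the source does not itemize the remaining time.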
fMRI Data Analysis
All analyses were carried out using SPM8 and custom Matlab code. Functional images were slice-time corrected to the onset of the first slice and spatially realigned using a 6-parameter affine transformation. High-resolution T1 images for each of the participants were coregistered to the average functional image and segmented into gray matter, white matter, and cerebrospinal fluid.
The first step of the analysis pipeline involved fitting a first-level native-space generalized linear model (GLM) to the unsmoothed functional images for each participant (Kriegeskorte, Mur, Bandettini 2008; Kriegeskorte, Mur, Ruff, et al. 2008). The GLM focused on gray matter voxels, as identified on the basis of the segmented T1 scan, and included 33 conditions, 1 for each of the 32 sound stimuli and 1 for sound repetitions and key presses. The GLM also included head-motion parameters estimated during the spatial realignment step, and an intercept term modeling activation during the implicit silent baseline condition. Stimulus-specific BOLD effects were estimated by convolving the sound-stimulus onsets with the canonical hemodynamic response function.
The second step aimed at extracting RDMs (Kriegeskorte, Mur, Bandettini 2008; Kriegeskorte, Mur, Ruff, et al. 2008), measuring the (scale-independent) dissimilarity between the fine-grained spatial distribution of the BOLD effect for each pair of stimuli. Given a target center voxel, we extracted the stimulus-specific BOLD estimates from the contrast images of each of the 32 sounds (sound–silence) inside a spherical volume or searchlight (Kriegeskorte et al. 2006). We chose a searchlight radius of 6.25 mm (89 voxels), because previous simulation studies showed that searchlight radii containing a similar number of voxels optimize the discrimination of experimental conditions (Kriegeskorte et al. 2006). The dissimilarity between the spatial pattern of activation for 2 sounds was thus computed as 1 minus the Pearson correlation between the voxel-specific BOLD estimates for the 2 sounds within the searchlight (Kriegeskorte, Mur, Bandettini 2008; Kriegeskorte, Mur, Ruff, et al. 2008). RDMs were extracted for each gray matter center voxel, provided that at least 50% of the spherical volume included gray matter voxels. Participant-specific RDMs were normalized to MNI space using the normalization parameters obtained from the segmentation procedure and were smoothed using a Gaussian kernel (6-mm full-width at half-maximum, FWHM). In practice, for each of the participants, the normalization and smoothing algorithms were applied independently to each of the 496 maps measuring the dissimilarity between each of the 496 pairs of the 32 sound stimuli. The normalization and smoothing steps were necessary for carrying out random-effects analyses. After normalization, voxels were 2 mm³ in size.
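The RDM computation for a single searchlight can be sketched in Python as follows. This is an illustrative sketch rather than the authors' Matlab code: the function names are hypothetical, and random numbers stand in for the 32 × 89 matrix of stimulus-specific BOLD estimates.

```python
import numpy as np

def pattern_rdm(patterns):
    """RDM for one searchlight.

    patterns: (n_stimuli, n_voxels) array of BOLD estimates.
    Returns 1 minus the Pearson correlation between each pair of rows.
    """
    return 1.0 - np.corrcoef(patterns)

def upper_triangle(rdm):
    """Vector of dissimilarities for the distinct stimulus pairs."""
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]

# Hypothetical data: 32 stimuli, 89 voxels in the searchlight.
rng = np.random.default_rng(0)
patterns = rng.standard_normal((32, 89))

rdm = pattern_rdm(patterns)
pairs = upper_triangle(rdm)   # 32 * 31 / 2 = 496 pairwise dissimilarities
```

Because the Pearson correlation is unaffected by a common rescaling or offset of the patterns, this dissimilarity is scale-independent in the sense described in the text, and the 496 entries of `pairs` correspond to the 496 dissimilarity maps that were normalized and smoothed.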
The third and final step of the analysis pipeline adopted a random-effects approach to test the association between RDMs and SDMs (Fig. 1), independently of variance common to the different SDMs (both low-level and category-feature dissimilarities). At the first level, we thus computed the Spearman rank correlation between the RDMs on the one hand and each of the SDMs on the other. For each SDM, this procedure yielded 1 rank-correlation map for each participant. Correlation maps were transformed into Z-maps by applying the variance-stabilizing Fisher Z-transform. For each SDM, we then computed 1 random-effects t-test to assess whether the correlation between the RDMs and the SDMs was significantly different from 0 at the group level [degrees of freedom (df) = 19; P < 0.0001, extent threshold = 20 voxels]. This initial correlational test ignored the variance common to the different SDMs. We thus repeated the same random-effects analysis approach by considering the partial Spearman correlation between RDMs and each of the SDMs (df = 19 for t-test; P < 0.0001, extent threshold = 20 voxels). The partial correlation analysis discarded all sources of variance shared between the particular SDM and all the other SDMs. Note that the partial-correlation analysis can potentially reveal significant effects even in the absence of a significant correlation, and that the sign of correlations and partial correlations can potentially differ. Both of these potential results have an unclear functional interpretation. We thus finally considered for each of the SDMs the conjunction of the correlation and partial-correlation group-level t-tests (P < 0.0001, and extent threshold = 20 voxels for both correlation and partial correlation, Nichols et al. 2005), where the correlation and partial correlation had the same sign. 
The conjunction analysis thus revealed those areas where both the correlation and the partial correlation between the RDMs and the SDM were significantly different from zero and had the same sign (Figs 4 and 5). Note that an alternative analysis approach based on the smoothed and normalized correlation maps computed in native space yielded highly similar results to those presented in this study, based on smoothed and normalized RDMs.
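The statistical steps of this third stage can be sketched for a single voxel in Python as follows. This is a minimal sketch under stated assumptions and not the authors' implementation: all function names are hypothetical, the partial Spearman correlation is computed by regressing the ranked non-target SDMs out of the ranked RDM and the ranked target SDM, and the significance decisions are passed in as booleans rather than derived from the P < 0.0001 and cluster-extent thresholds.

```python
import numpy as np

def average_ranks(x):
    """Ranks of x, with tied values assigned their average rank."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    highest = np.cumsum(counts)               # highest rank in each tie group
    return (highest - (counts - 1) / 2.0)[inverse]

def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    return np.corrcoef(average_ranks(x), average_ranks(y))[0, 1]

def partial_spearman(x, y, covariates):
    """Spearman correlation of x and y with the ranked covariates partialed out."""
    rx, ry = average_ranks(x), average_ranks(y)
    design = np.column_stack([np.ones(len(rx))] +
                             [average_ranks(c) for c in covariates])
    res_x = rx - design @ np.linalg.lstsq(design, rx, rcond=None)[0]
    res_y = ry - design @ np.linalg.lstsq(design, ry, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

def fisher_z(r):
    """Variance-stabilizing Fisher Z-transform of a correlation."""
    return np.arctanh(r)

def one_sample_t(z_values):
    """Group-level t statistic against 0 (df = number of participants - 1)."""
    z = np.asarray(z_values, dtype=float)
    return z.mean() / (z.std(ddof=1) / np.sqrt(len(z)))

def conjunction(sig_corr, sig_partial, mean_corr, mean_partial):
    """True only if both tests are significant and the effects share a sign."""
    return bool(sig_corr and sig_partial and
                np.sign(mean_corr) == np.sign(mean_partial))
```

In use, `x` would be the 496 RDM entries for one voxel and participant, `y` the corresponding entries of the target SDM, and `covariates` the entries of the 23 remaining SDMs; the Fisher-transformed values from the 20 participants would then enter `one_sample_t`.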
An initial analysis measured the anatomical overlap in the sensitivity to category and low-level features. This analysis did not discard the variance that was common among all the stimulus features and was meant to illustrate some problematic aspects of studies of the encoding of object categories that do not consider low-level explanations (Fig. 3). The second and third analyses assessed the cortical selectivity for low-level and object-category features, respectively (Figs 4 and 5). For each of the features, significant selectivities were measured by a simultaneous correlation and partial correlation between the RDMs and the SDM (both correlation and partial correlation with same sign). The partial correlation discarded the variance common between the SDM for the target feature and non-target category and low-level SDMs. The third analysis thus assessed the presence of cortical modules that encode categories of auditory objects abstractly. Table 1 summarizes the results of analyses 2–3.
Table 1. Summary of the results of analyses 2–3, reported separately for the left hemisphere and the right hemisphere. [Table body not recoverable from the source; the only surviving row label is “Spectral centroid (interquartile range)”.]
Note: The ρ columns report the Spearman correlation between group-average RDMs on the one hand and 1 specific SDM on the other.
BA = Brodmann area; Z = Z-score; Vox = number of voxels in cluster; pSTG/pSTS = posterior superior temporal gyrus/sulcus; mSTG = middle STG; HG = Heschl's gyrus; lHG = lateral HG; HS = Heschl's sulcus; medHG = medial HG; ACC = anterior cingulate cortex; medFG = medial frontal gyrus; SPL = superior parietal lobule; PT = planum temporale; aPT = anterior planum temporale.
Large Extents of the Temporal Cortex are Sensitive to Both Object-Category and Low-Level Features
Figure 3 shows part of the cortical regions in which we observed a significant correlation between the RDMs and at least 1 of the object-category or low-level features (P < 0.0001, uncorrected). In particular, areas marked in yellow are sensitive to both low-level and object-category structure (significant RDM correlation with at least 1 category SDM and at least 1 low-level SDM). Large portions of the bilateral temporal cortex are characterized by dual category/low-level sensitivity, including Heschl's gyrus (HG), whose medial two-thirds are classically assumed to correspond to the core primary auditory fields (Rademacher et al. 1993; Morosan et al. 2001), and the superior temporal plane both anterior and posterior to HG, that is, the PT and the aSTG. The functional meaning of these results is uncertain, however: It could, for example, be the simple product of a statistical association between low-level and object-category features.
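The labeling logic behind such an overlap map can be sketched as follows. This is only an illustration: the function name `sensitivity_label`, the per-SDM lists of P values, and the label strings are assumptions introduced here, and only the threshold (P < 0.0001, uncorrected) is taken from the text.

```python
def sensitivity_label(category_p, lowlevel_p, alpha=1e-4):
    """Label one cortical location from the P values of its plain
    RDM-SDM correlations (one P value per SDM, no partialing out)."""
    category = any(p < alpha for p in category_p)   # at least 1 category SDM
    lowlevel = any(p < alpha for p in lowlevel_p)   # at least 1 low-level SDM
    if category and lowlevel:
        return "both"        # dual sensitivity (yellow in Fig. 3)
    if category:
        return "category"
    if lowlevel:
        return "low-level"
    return "none"
```

Because shared variance is not removed at this stage, a "both" label in this sketch cannot distinguish genuine dual encoding from a statistical association between the two kinds of features.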
Fine-Grained Spatial fMRI Patterns in Both the Temporal and Extratemporal Cortex Selectively Encode Several Low-Level Features
Figure 4 shows those regions where we observed a simultaneous correlation and partial correlation between the RDMs and the low-level feature SDMs (P < 0.0001, uncorrected for both; partial correlation discards the variance common between target low-level and non-target low-level features and object-category features). These regions meet the stringent statistical criteria of feature selectivity (e.g., Hall and Plack 2009). We observed encoding of: 1) The median value of the time-varying pitch in a large temporal cluster in both hemispheres, comprising the lateral aspects of HG and including the most anterior portions of the PT; 2) the median value of the time-varying loudness in a large patch of the left auditory cortex extending laterally from the middle portion of HG to the anterior aspects of the left PT; 3) the amount of temporal change of the spectral centroid (spectral centroid IQR SDM) in the most medial aspect of the right HG and the right PT; 4) the median value of the time-varying HNR in the right-lateralized temporal cortex (posterior superior temporal gyrus [pSTG]/STS) and in a bilateral frontal cluster comprising the medial frontal gyrus (medFG) and the anterior cingulate cortex (ACC), with a peak effect in the right hemisphere; 5) the overall pattern of temporal variation of loudness (loudness cross-correlation SDM) in the anterior aspect of the superior parietal lobule (SPL).
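As an illustration of how a summary statistic of a time-varying feature can be turned into an SDM, the sketch below takes one time series per sound (e.g., a pitch track), reduces each to its median, and computes the absolute difference between medians for every pair of sounds. The use of the absolute difference as the distance, and the name `feature_sdm`, are assumptions made for this example; the distance measure actually adopted in the study is not restated in this section.

```python
from itertools import combinations
from statistics import median


def feature_sdm(time_series_per_sound):
    """Return the upper triangle of a stimulus dissimilarity matrix:
    |median_i - median_j| for every pair of sounds (i < j)."""
    medians = [median(ts) for ts in time_series_per_sound]
    return [abs(medians[i] - medians[j])
            for i, j in combinations(range(len(medians)), 2)]


# Hypothetical pitch tracks (Hz) for three sounds.
pitch_tracks = [[100, 110, 120], [200, 210, 190], [100, 100, 400]]
sdm = feature_sdm(pitch_tracks)  # pairs ordered (0,1), (0,2), (1,2)
```

The median makes the summary robust to brief excursions, such as the 400-Hz sample in the third hypothetical track.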
The Right Planum Temporale and Posterior Superior Temporal Gyrus Represent Abstract Categories of Auditory Objects
Figure 5 shows those regions where both the correlation and partial correlation between RDMs and category-feature SDMs were significant (P < 0.0001, uncorrected for both). These regions are potentially involved in the abstract representation of categories of auditory objects because none of the low-level features we considered explains their encoding in the cortex. We observed encoding of: 1) The category of living sounds, comprising both vocal and non-vocal sounds, as well as human and non-human sounds, in the medial right PT, bordering the medial HG; 2) the category of human sounds, comprising both vocal and non-vocal sounds, as well as both living and non-living sounds, in the right pPT/pSTG, and in the most medial aspect of the left HG. In both cases, we detected a significant within-category effect measuring a greater similarity of the spatial fMRI patterns within the living and human categories (see Fig. 2 for more details on the strategy adopted to assess the encoding of object-category features).
Our study aimed to assess the abstract encoding of categories of non-speech naturalistic auditory objects independently of systematic fingerprints of their low-level structure. The main goal of this study also led us to assess the cortical encoding of a large number of low-level features of naturalistic sounds. We adopted a condition-rich design coupled with information-based analyses of the spatial fMRI patterns. The selectivity for both object-category and low-level features was assessed after partialing out their shared variance. Selective encoding was observed for several low-level features: The brain imaging of naturalistic sounds can represent a powerful instrument for characterizing the signal-processing architecture of the cortex. In both hemispheres, posterior temporal regions, including the PT, appeared to encode abstractly the categories of living and human sounds. These results 1) reveal domain-general processes for the abstract encoding of auditory objects; 2) motivate a revised hierarchical processing model of increasing information abstraction with the distance from A1 both in the anterior and posterior directions (Rauschecker and Tian 2000; Peelle et al. 2010); 3) suggest that part of the by-product of the template-matching process that takes place in the PT (Griffiths and Warren 2002) is abstract in nature.
Accurate Models of the Cortical Encoding of Sound Categories Should Consider Their Low-Level Structure
Our current understanding of the processing of naturalistic auditory objects in the human cortex focuses on the encoding of categorical structure and largely disregards low-level features (see, e.g., Rauschecker et al. 1995 and Leaver and Rauschecker 2010, for exceptions in the animal and human literature, respectively). The presence of large cortical patches sensitive to both low-level and object-category features (Fig. 3) exemplifies the ambiguity of this approach. The observation of a dual low-level/object-category sensitivity can indeed have 2 interpretations. First, it might be the simple product of a statistical association between categories of objects and low-level features. Secondly, it might be the product of multifunctionality, that is, of the simultaneous encoding of low-level and object-category features in the same neural population (see Bizley et al. 2009, for encoding of multiple low-level features in the same neural populations in A1 in the ferret). Variance decomposition methods, like those adopted in the current study, are necessary to decide between these alternative hypotheses and to assess abstract object encoding independently of the rich low-level structure of naturalistic sounds.
Abstract Representation of Biological Sounds
We observed cortical encoding of 2 categories of auditory objects: Living (right medHG/PT) and human-action auditory objects (left HG and right pPT/pSTG; Fig. 5). In both cases, objects belonging to the same category emerged as evoking similar spatial fMRI patterns. This particular within-category effect could not be detected with conventional univariate analysis approaches, and reveals that the cortical encoding of object categories does not necessarily rely on the ability to tell apart different category exemplars. The comparatively large set of low-level features considered in this study cannot explain these results because the measurement of category encoding ignored the variance common between object-category and low-level features. It is possible that additional low-level features, not considered in this study, account for these results. Another interpretation, however, is that these areas encode high-level abstract features optimized for the processing of object-category information. Note that various factors might determine whether abstract cortical encoding of sound categories occurs (e.g., salience-related attentional processes, Kayser et al. 2005; identification-related processes, Kilian-Hütten et al. 2011; or in general, the perceptual set that governs how a listener approaches the heard sounds, Liebenthal et al. 2003). It is thus significant that in this study abstract category encoding emerged in the absence of experimentally induced biases toward focusing on a particular source of information: Participants were free to carry out the 1-back repetition-detection task by focusing on, for example, low-level or category-related information. As such, the observed abstract category-encoding effects might potentially be indicative of cortical processing strategies active outside the laboratory. 
Consistent with this interpretation, previous psychophysical investigations demonstrated a cognitive bias toward processing living sounds by focusing on high-level semantic information (Giordano, McDonnell, et al. 2010). Interestingly, both the human-action and living categories comprise events of a biological origin. We thus argue that abstract processing is a general functional principle of the auditory brain that operates for both speech and non-speech ecologically relevant biological auditory objects. Notably, the location of category-selective modules in our study is consistent with previous observations of abstract processing of speech and human vocalizations in the posterior temporal cortex (e.g., Warren et al. 2006; Desai et al. 2008; Okada et al. 2010). Our results thus complement the notion of abstract object encoding in the anterior temporal lobe (e.g., Belin et al. 2000; Davis and Johnsrude 2003; Hasson et al. 2007; Goll et al. 2010; Leaver and Rauschecker 2010) and suggest a very simple hierarchical model according to which information abstraction in the temporal lobe grows with the distance from A1 both in the posterior and anterior directions (see Fig. 1B in Peelle et al. 2010, for an earlier proposal of this hypothesis).
Two aspects of our study represent a departure from previous empirical investigations on the cortical encoding of naturalistic auditory objects. First, previous studies rarely considered low-level alternative hypotheses. Secondly, where previous studies relied on univariate analyses of the voxel-specific BOLD response, our study focused on the multivariate analysis of fine-grained spatial fMRI patterns. In the face of these differences, it is thus significant that the location of abstract modules revealed in this study is consistent with the results from previous category-encoding studies. For example, the representation of the living category in the right medHG/PT is evocative of previous observations of sensitivity to vocalizations, particularly animal vocalizations, in the middle temporal gyrus (e.g., Lewis et al. 2006; Altmann et al. 2007; Doehrmann et al. 2008). More consistently, the representation of the human-action category in the left HG agrees with the results of Kaplan and Iacoboni (2007) and Doehrmann et al. (2008) and with the observations of Leaver and Rauschecker (2010) of the abstract encoding of musical instrument sounds (generated as a consequence of human actions) in the same left-HG area. Finally, the representation of human-action events in the posterior temporal cortex is consistent with the previous reports by Lewis et al. (2006), Murray et al. (2006), Kaplan and Iacoboni (2007), and Doehrmann et al. (2008). Surprisingly, the results of our study did not confirm the hypothesis of a middle anterior superior temporal sulcus center that selectively processes human vocal sounds (Belin et al. 2000, 2002; Gervais et al. 2004; Grandjean et al. 2005; Ethofer et al. 2009; Leaver and Rauschecker 2010). One of the potential sources for our null result is the difference in analysis strategies. Future studies will thus be necessary to disentangle this issue. 
Alternatively, our null result could arise from the low number of human vocalizations in our stimulus set (12.5% of the total), which was not large enough to promote the abstract encoding of this category (see, e.g., Ulanovsky et al. 2003; Asari and Zador 2009; King and Nelken 2009, for short-term plasticity and context-dependence in the auditory system), or, more simply, to make a reliable measurement of abstract human-vocal representations possible. Similar explanations might account for the absence in this study of category-encoding effects related to the vocal versus non-vocal distinction at large.
Cortical Labeling of Sound Categories
We assessed 3 different effects of category-membership information on the dissimilarity of the spatial fMRI patterns, independent of low-level information: 1) Between-category differentiation; 2) within-category differentiation; and 3) within-category compression. Each of these effects can have a different functional interpretation: 1) The cortical patch represents abstractly the distinction between different categories; 2) the cortical patch is involved in the fine processing of stimuli within the category of interest, leading to an enhancement of their diversity that goes beyond that afforded by low-level information; and 3) the cortical patch codes whether stimuli are exemplars of the category of interest, leading to a compression of the diversity of same-category exemplars beyond that afforded by the low-level structure. Our analyses did not provide evidence for the first 2 effects. It is interesting to note that previous studies of naturalistic visual objects did reveal encoding of between-category distinctions in spatial patterns of activation (Kriegeskorte, Mur, Ruff, et al. 2008). As such, it remains to be seen whether our failure to observe between-category effects is indicative of a specific property of pattern-based encoding in the auditory brain or, instead, stems from the specifics of our experimental methodology (e.g., choice of stimuli). More importantly, we measured within-category compression for both the living and human-action categories, and in several mid-to-posterior temporal areas, including the PT (Fig. 5). Interestingly, the PT is considered to implement acoustics-dependent processes that match stored spectrotemporal templates to the incoming sensory information (Griffiths and Warren 2002; Warren et al. 2005; Kumar et al. 2007). Our results thus suggest that although the matching process implemented by the PT relies on the analysis of sensory information, part of the end product of this process is abstract in nature.
We thus argue that by matching incoming low-level patterns to stored templates, the PT facilitates a process of labeling auditory objects as members of specific categories. In post-PT stations of information relay, this labeling information could, for example, facilitate the discrimination of different categories, and promote similar processing pipelines for same category exemplars.
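The 3 category effects discussed in this section can be illustrated with simple model dissimilarity vectors. The coding below is an assumption made for this example and is not necessarily the strategy adopted in the study (described with Fig. 2): a between-category model marks pairs that straddle 2 categories, a within-category model marks pairs that both belong to the target category, and the sign of the correlation between the within-category model and the RDM is read as differentiation (positive) or compression (negative). All function names are hypothetical.

```python
from itertools import combinations


def between_category_sdm(labels):
    """1 when a pair of sounds straddles two categories, 0 otherwise."""
    return [int(a != b) for a, b in combinations(labels, 2)]


def within_category_sdm(labels, target):
    """1 when both sounds of a pair belong to `target`, 0 otherwise."""
    return [int(a == target and b == target)
            for a, b in combinations(labels, 2)]


def interpret_within(rho):
    """Read the sign of the RDM correlation with a within-category model."""
    if rho > 0:
        return "differentiation"  # same-category patterns more dissimilar
    if rho < 0:
        return "compression"      # same-category patterns more similar
    return "none"


# Hypothetical labels for four sounds; pairs are ordered
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
labels = ["living", "living", "non-living", "living"]
between = between_category_sdm(labels)
within_living = within_category_sdm(labels, "living")
```

Under this coding, the within-category compression reported for the living and human-action categories would correspond to a negative correlation between the RDM and the respective within-category model.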
Brain Imaging of Naturalistic Sounds Makes it Possible to Assess the Encoding of Multiple Low-Level Features
The majority of studies on the cortical encoding of low-level features investigate synthetic stimuli that differ along a restricted number of acoustical dimensions, often one. From the methodological point of view, single-feature studies are not capable of measuring cortical selectivity because they do not consider the effects of variations in extraneous features (Hall and Plack 2009; Bizley and Walker 2010). In general, brain mapping of naturalistic sounds is a powerful instrument for the study of low-level feature encoding because: 1) It exploits the rich low-level structure of naturally occurring stimuli that likely shapes the neural processing starting from the auditory nerve (Lewicki 2002); 2) it makes it possible to measure encoding of multiple sound features within a single experiment; 3) it makes it possible to test for cortical selectivity. In this study, we considered 12 different low-level features and observed selectivity for 5 of them. Overall, these analyses confirm previous hypotheses of a pitch-encoding center in the lateral HG (e.g., Zatorre 1988) and of a right-lateral bias for the processing of spectral features (spectral centroid, but also HNR; e.g., Zatorre and Belin 2001). We also observed left-lateralized encoding of loudness in the temporal cortex, a rarely observed functional hemispheric asymmetry that could originate from the exclusive allocation of right-hemisphere resources to the processing of the rich spectral structure of the stimuli in the current study. Finally, the left SPL appeared to encode selectively the pattern of temporal variation of loudness. This result might be indicative of a role of this low-level feature in the cortical analysis of auditory and multimodal tool-action events (e.g., a series of loudness impulses for hammering a nail versus less abrupt temporal variations for sawing wood; see Lewis 2006, for a review). In the following, we discuss the results for each of the low-level features in more detail.
The median of the time-varying pitch was encoded bilaterally in an area of the temporal cortex that includes the lateral HG. A significant body of evidence supports the hypothesis of a general pitch-encoding center in the lateral HG (Zatorre 1988; Johnsrude et al. 2000; Gutschalk et al. 2002; Patterson et al. 2002; Bendor and Wang 2006; Hyde et al. 2008; Foster and Zatorre 2010). The general validity of this position has recently been criticized on the grounds that some of the studies consistent with this hypothesis were carried out with synthetic stimuli from the same class (iterated ripple noises, Hall and Plack 2009). Our results thus provide strong support for the hypothesis of a general pitch-encoding center in the lateral HG, because the stimuli investigated in this study are highly diverse in their low-level structure.
The right medHG and PT encoded the amount of temporal variation of the spectral centroid. In general, these results agree with the hypothesis of finer spectral processing abilities in the right temporal cortex (Zatorre and Belin 2001; Schönwiesner et al. 2005; Warren et al. 2005; Jamison et al. 2006; Kumar et al. 2007; Obleser et al. 2008; see Zatorre and Gandour 2008, for a review; Altmann et al. 2010) and with that of a specialization of the PT in the analysis of time-varying spectral patterns (Griffiths and Warren 2002; Zatorre and Belin 2005). The right-lateralized encoding of this feature is, at least apparently, at odds with the frequent observation of a left-hemisphere specialization for the analysis of the temporal variation of spectral information (Zatorre and Belin 2001; Poeppel 2003; Boemio et al. 2005; Schönwiesner et al. 2005; Zatorre and Gandour 2008). However, we note that several studies did reveal encoding of the spectrotemporal variation in both the right and left temporal cortices. They also show that hemispheric asymmetries generally emerge as a function of the rate of spectrotemporal variation rather than of overall temporal variation (left-hemispheric specialization for faster rates, e.g., Belin et al. 1998; Boemio et al. 2005; see also Obleser et al. 2008), whereas our spectral-centroid IQR measure captures the amount of temporal variation across slower and faster rates.
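To make the notion of an "amount of temporal variation" concrete, the sketch below computes a frame-by-frame spectral centroid and summarizes it with the interquartile range. It is an illustration under simplifying assumptions rather than the feature-extraction procedure of the study: the centroid is computed from the magnitudes of a plain discrete Fourier transform on non-overlapping rectangular frames, and the function names, frame length, and sampling rate are hypothetical.

```python
import cmath
import math
from statistics import quantiles


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of one frame (plain DFT)."""
    n = len(frame)
    num = den = 0.0
    for k in range(n // 2 + 1):  # non-negative frequencies only
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mag = abs(coeff)
        num += (k * sample_rate / n) * mag
        den += mag
    return num / den


def centroid_iqr(signal, sample_rate, frame_len):
    """IQR of the spectral centroid across non-overlapping frames."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    centroids = [spectral_centroid(f, sample_rate) for f in frames]
    q1, _, q3 = quantiles(centroids, n=4, method="inclusive")
    return q3 - q1


def tone(freq, n_samples, sample_rate):
    """Hypothetical test signal: a pure sinusoid."""
    return [math.sin(2 * math.pi * freq * t / sample_rate)
            for t in range(n_samples)]


steady = tone(8, 128, 64)                   # centroid constant over time
stepped = tone(8, 64, 64) + tone(16, 64, 64)  # centroid jumps halfway through
```

In this sketch, the steady tone yields an IQR near zero, whereas the stepped signal yields a larger IQR regardless of how quickly the change occurs, which mirrors the point that such a measure pools variation across slower and faster rates.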
The median HNR was encoded in the right pSTG/STS and in the bilateral ACC/medFG (peak effect in the right hemisphere). The right lateralization of the pSTG/STS encoding of this feature is perhaps suggestive of cortical computations based on spectrum-matching processes rather than on an analysis of the temporal structure of the incoming waveform (cf., right-hemispheric advantage for spectral processing; see above). From the psychophysical point of view, HNR accounted for the dissimilarity ratings of environmental sounds in Gygi et al. (2007), and for the tool versus animal categorization in Lewis et al. (2005). Within the brain-imaging literature, ACC has been reported to differentiate between highly harmonic voiced speech and less harmonic whispered speech (Schulz et al. 2005). Notably, only 2 studies investigated systematically the cortical encoding of HNR (Lewis et al. 2009; Leaver and Rauschecker 2010). Consistent with our results, both of these studies observed right-temporal sensitivity to HNR, although in more anterior regions. It should be emphasized, however, that the cortical representation of HNR appears to be largely dependent on the investigated sound set (cf. variability of HNR-sensitive centers for animal vocalizations and iterated ripple noises in Lewis et al. 2009). Given the paucity of studies on the cortical processing of HNR, statements about a general processing center are premature. Given the high relevance of HNR for the behavioral evaluation of heterogeneous sets of environmental sounds (e.g., Gygi et al. 2007), it is plausible that the participants in this experiment focused on this same low-level feature when carrying out the 1-back repetition-detection task inside the scanner (e.g., answer “repetition” if 2 subsequent stimuli have highly similar HNR values; note that the task did not explicitly impose constraints on the response strategies).
As such, the encoding of the median HNR in the ACC might be the product of task-related processes: This cortical area is indeed part of a “salience network” involved in decisional processes based, for instance, on sensory information (Seeley et al. 2007) and in the processing of errors and conflicts (Menon et al. 2001; Ridderinkhof et al. 2004). Furthermore, it is hypothesized to be part of a network that supports focal auditory activity (Hunter et al. 2006).
Two loudness features were encoded in the left hemisphere: The median of the time-varying loudness in the primary auditory cortex, extending also to the anterior planum temporale, and the overall pattern of time-varying loudness in the aSPL. Among various studies carried out with synthetic sounds, only Brechmann et al. (2002) observed a clear left-lateralized bias for the processing of the overall loudness of a sound signal, whereas partial agreement emerges concerning the role of the PT in the processing of this property (see Ernst et al. 2008, for a review). Among the various factors that might explain these divergences, it might be speculated that the higher complexity of the spectral structure of the sounds in the current experiment strengthened the right-lateral bias for the processing of spectral properties at the expense of the processing of energetic features such as loudness in the same hemisphere. Focusing on the role of the left SPL in differentiating between temporal patterns of loudness variation, it is interesting to note that this area appears to be involved in the processing of tool-action events in the motor, visual, and auditory domains (Lewis et al. 2005, see Lewis 2006, for a review; and Giusti et al. 2010, for cortical processing of action sounds in the left SPL; see Griffiths 2008; Rauschecker and Scott 2009; Recanzone and Cohen 2010, for the role of the SPL in the dorsal pathway), and in the online updating of actions (Tunik et al. 2008). Notably, psychophysical investigations of naturalistic sounds suggest that the identification of the actions carried out on an object relies primarily on the temporal patterning of the sound signals (e.g., bouncing versus breaking of a glass bottle, Warren and Verbrugge 1984). As such, the role of the SPL in differentiating between temporal loudness patterns might potentially subserve processes of sound-based motor control and of sensory-motor transformation.
Making Sense of a Variable Environment
In this study, we revealed that the spatial patterns of activation in various regions of the temporal cortex label auditory objects as exemplars of 2 ecologically relevant categories: Sounds generated by vibrating living objects, and action sounds involving a human agent. The exact nature of the neural processes at the basis of this result remains to be detailed. For example, category encoding might be the product of a non-linear feature-combination analysis that merges information from multiple low-level features (Sadagopan and Wang 2009). Independently of the exact nature of the neural code, it is important to emphasize that it appears to be independent of between-sound differences along various fundamental dimensions of auditory sensation such as loudness, pitch, and timbre-related dimensions such as spectral centroid and HNR (note the investigation in this study of different measures of dissimilarity along each of these dimensions). As such, the categorization code appears to be optimized for carrying out a job that is very important for an adaptive organism: Recognizing basic properties of the objects that populate the environment in the face of variations along several attributes of the input sensory information (King and Nelken 2009). In our ancestors, general-purpose abstract encoding mechanisms that serve to extract biologically relevant auditory-object information might thus have spurred the development of increasingly sophisticated strategies for the robust categorical processing of calls, ultimately resulting in the emergence of phonetic analysis processes at the basis of the speech ability.
This work was supported by the Marie Curie Intra-European Fellowships program (FP7 PEOPLE-2011-IEF, project BrainInNaturalSound to B.L.G. and P.B.), by the Biotechnology and Biological Sciences Research Council (grant BB/E003958/1 to P.B.), by the Economic and Social Research Council-MC (grant RES-060-25-0010 to P.B.), by the Canada Research Chair in Music Perception and Cognition (S.Mc.A.), and by the Natural Sciences and Engineering Research Council of Canada (RGPIN 312774-2010 to S.Mc.A.).
Conflict of Interest: None declared