To better define the underlying brain network for the decoding of emotional prosody, we recorded high-resolution brain scans during an implicit and explicit decoding task of angry and neutral prosody. Several subregions in the right superior temporal gyrus (STG) and bilaterally in the inferior frontal gyrus (IFG) were sensitive to emotional prosody. Implicit processing of emotional prosody engaged regions in the posterior superior temporal gyrus (pSTG) and bilateral IFG subregions, whereas explicit processing relied more on mid STG, left IFG, amygdala, and subgenual anterior cingulate cortex. Furthermore, whereas some bilateral pSTG regions and the amygdala showed general sensitivity to prosody-specific acoustical features during explicit processing, activity in inferior frontal brain regions was insensitive to these features. Together, the data suggest a differentiated STG, IFG, and subcortical network of brain regions, which varies with the levels of processing and shows a higher specificity during explicit decoding of emotional prosody.
The human brain incorporates a frontotemporal network of regions that decode affective cues and infer emotional states from suprasegmental vocal modulations, referred to as emotional prosody (Banse and Scherer 1996; Grandjean et al. 2006). Specifically, regions along the superior temporal gyrus (STG) and the superior temporal sulcus (STS) as well as in the orbitofrontal cortex (OFC) and the ventrolateral prefrontal cortex have been found to be involved in the sensory, emotional, and evaluative decoding of emotional prosody (Schirmer and Kotz 2006; Wildgruber et al. 2009). In addition to this frontotemporal network, the amygdala (Grandjean et al. 2005; Sander et al. 2005; Fecteau et al. 2007; Bach et al. 2008) and subcortical regions, such as the thalamus (Wildgruber et al. 2004) and the basal ganglia (Kotz et al. 2003; Bach et al. 2008; Grandjean et al. 2008), have been found to be sensitive to emotional prosody. However, evidence for sensitivity of the latter regions to emotional prosody is inconsistent (Calder et al. 2004; Mitchell and Boucas 2009), especially for the amygdala. Some studies report activations of the amygdala (Scott et al. 1997; Sprengelmeyer et al. 1999; Wiethoff et al. 2009; Leitman et al. 2010), whereas the results from other studies do not support its involvement in the processing of emotional prosody (Adolphs and Tranel 1999; Anderson and Phelps 2002; Belin et al. 2008).
Although these recent studies provided evidence for the decoding of emotional information from vocal cues in the frontotemporal and subcortical regions, precise spatial information, especially about the involvement of different temporal and frontal subregions, is still lacking. The location of peak activation can vary substantially and usually extends broadly across different regions. Within the temporal cortex, studies report peak activations for emotional compared with neutral prosody that range from the bilateral posterior superior temporal gyrus (pSTG; Mitchell et al. 2003; Sander et al. 2005; Ethofer, Anders, Erb, et al. 2006; Beaucousin et al. 2007; Ethofer, Kreifelts, et al. 2009) and the bilateral midsuperior temporal gyrus (mSTG; Grandjean et al. 2005; Sander et al. 2005; Bach et al. 2008; Ethofer, Kreifelts, et al. 2009; Leitman et al. 2010) to the right anterior superior temporal gyrus (aSTG; Bach et al. 2008). Although this variation in peak location follows the distribution of human voice-sensitive areas in the superior temporal cortex (Belin et al. 2000), it might suggest different functional roles of these subregions in decoding affective cues from voices (Schirmer and Kotz 2006). For example, adjacent regions in the right pSTG and mSTG are sensitive to different levels of processing. That means, some studies found a general sensitivity in the pSTG or mSTG independent of the attentional focus (directed to or away from emotional prosody), while explicitly focusing attention towards emotional prosody revealed activations in adjacent regions of the STG (Grandjean et al. 2005; Sander et al. 2005). Apart from this variation in peak location, these activations usually extend broadly in the STG or STS and sometimes extend to or comprise regions in the primary and secondary auditory cortex (Wildgruber et al. 2004; Sander et al. 2005; Leitman et al. 2010). 
The latter are more involved in decoding basic sensory acoustic features from affective utterances rather than in decoding higher level affective cues.
Similar to the variation and extent of activations in the superior temporal cortex, activations in the bilateral inferior frontal gyrus (IFG) that cover different frontal subregions have been frequently reported. Activations were found in the bilateral pars opercularis of the IFG (Brodmann area [BA] 44/45; Buchanan et al. 2000; Mitchell et al. 2003; Schirmer et al. 2004), but these activations often extend to the pars orbitalis of the IFG (BA 47; Fecteau et al. 2005; Bach et al. 2008; Ethofer, Kreifelts, et al. 2009) and to the middle frontal gyrus (BA 9/46; Wildgruber et al. 2004; Ethofer, Anders, Erb, et al. 2006; Leitman et al. 2010). These different inferior frontal subregions are again assumed to serve different functional roles. Left BA 44 is generally involved in linguistic and semantic processes (Kotz et al. 2003; Schirmer et al. 2004) and is functionally distinct from BA 47, which is specifically sensitive to the emotional tone of a voice (Fecteau et al. 2005). Furthermore, right BA 45 has been proposed to be involved in the cognitive evaluation of emotional stimuli, whereas BA 47 decodes the reward value of emotional stimuli when activations also comprise regions in the OFC (Schirmer and Kotz 2006). Right BA 47 might also assess and mirror the emotional quality of auditory cues as was shown for emotional faces (Nakamura et al. 1999; Lee et al. 2006).
Therefore, given the variation of functional activations and the different functional properties of STG and IFG subregions, a more precise anatomical description of these activations and their functional roles is needed. In the present study, we used high-resolution functional magnetic resonance imaging (fMRI) scans to better separate and determine functional activations during the decoding of emotional prosody. Furthermore, we were interested in the dependency of these functional activations on the level of processing of emotional utterances. Finally, we were also interested in the dependency of functional activations on those acoustical stimulus features that support the formation of emotional representations from acoustic cues, such as the fundamental frequency (F0) perceived as pitch and the energy/intensity (I) perceived as loudness of acoustic stimuli. Both features show a specific pattern, depending on the emotional valence (Banse and Scherer 1996; Grandjean et al. 2006).
Concerning the level of processing, this level can range from explicit judgments on emotional prosody to passive listening to stimuli and finally to rather implicit levels of processing when attention is directed toward other stimulus features (Bach et al. 2008) or spatial locations (Grandjean et al. 2005; Sander et al. 2005). This level of processing refers to what has formerly been described as the appraisal level during processing of emotional prosody (Bach et al. 2008) and concerns the attentional focus, which can be directed towards (explicit decoding) or away from emotional prosody (implicit decoding). As mentioned earlier, activation is shifted to adjacent regions of the right midposterior superior temporal gyrus (m-pSTG) when the task requires an explicit rather than a task-independent decoding of emotional prosody, which induces activity in m-pSTG (Grandjean et al. 2005; Sander et al. 2005). This might also suggest task-dependent differences in functional activations in the wider frontotemporal network. Explicit decoding, compared with implicit decoding, induces activity in the bilateral pSTG (Mitchell et al. 2003; Sander et al. 2005; Wildgruber et al. 2005; Ethofer, Kreifelts, et al. 2009), bilateral IFG (BA 45/47) (Buchanan et al. 2000; Wildgruber et al. 2005; Bach et al. 2008; Ethofer, Kreifelts, et al. 2009), and OFC (Wildgruber et al. 2004; Sander et al. 2005), whereas implicit processing is accompanied by activity in the bilateral pSTG (Sander et al. 2005; Beaucousin et al. 2007; Bach et al. 2008), right aSTG (Mitchell et al. 2003), bilateral amygdala (Sander et al. 2005; Bach et al. 2008), and left IFG (Buchanan et al. 2000; Wildgruber et al. 2004; Bach et al. 2008; Ethofer, Kreifelts, et al. 2009), the latter probably due to the focus on linguistic and semantic stimulus features while ignoring the prosodic stimulus feature.
Activity in the frontotemporal regions, in addition to task-dependent levels of processing, might also depend on emotion-specific acoustic stimulus features. Although some studies did not find a dependency on functional activations for basic stimulus features (Wildgruber et al. 2002), others report a covariation of bilateral mSTG activity with the perceived emotional intensity of prosodic stimuli based on F0 variability (Ethofer, Anders, Wiethoff, et al. 2006) or with mean F0 in the left mSTG and with mean energy in the right mSTG (Wiethoff et al. 2008). Increasing emotion-specific acoustic cues, such as high-frequency cues of angry stimuli or F0 variability for fearful and happy stimuli, increases activity in the bilateral pSTG with a concurrent signal decrease in the bilateral IFG (Leitman et al. 2010). These data might suggest a general sensitivity of STG regions to emotion-specific acoustic cues as the basis of auditory emotional representations. However, assuming that more pSTG regions decode sensory cues and more aSTG regions decode emotional cues (Schirmer and Kotz 2006), these data might suggest a distinction in STG regions in their sensitivity to acoustic stimulus features. Furthermore, the amygdala and subcortical structures might additionally be sensitive to emotion-specific acoustic features. Among other brain regions, the amygdala might decode the emotional value from acoustic stimuli (Sander et al. 2005; Fecteau et al. 2007; Bach et al. 2008; Ethofer, Kreifelts, et al. 2009; Wiethoff et al. 2009), which might be based on emotion-specific acoustic cues (Leitman et al. 2010). Similarly, the basal ganglia are sometimes involved in emotional prosody decoding (Kotz et al. 2003; Bach et al. 2008; Grandjean et al. 2008) on the basis of analysis and integration of temporal patterns (Kotz and Schwartze 2010). However, this sensitivity of different STG regions, of the amygdala or of the basal ganglia to emotion-specific acoustic cues, is still unknown.
Therefore, the present study had 3 general aims. First, we used high-resolution fMRI scans to describe the frontotemporal and subcortical network involved in emotional prosody decoding on a finer spatial scale. We predicted that the cluster of regions in the STG and IFG would consist of several subregions that might subserve different functional roles. Second, if decoding of emotional prosody in the STG and IFG is accomplished by different subregions, we hypothesized that activity in these regions might depend on the level of processing (Grandjean et al. 2005; Sander et al. 2005). We used an explicit and an implicit task during the processing of the same prosodic stimuli (Fig. 1A). The explicit task involved a discrimination of prosodic stimuli (anger vs. neutral) and should elicit activity in more aSTG and especially in frontal brain regions (Sander et al. 2005; Bach et al. 2008; Ethofer, Kreifelts, et al. 2009). The implicit task involved a gender discrimination of the speaker’s voice. Gender processing of voices, or identity processing more generally, usually elicits activity in more aSTG regions (Belin and Zatorre 2003; Formisano et al. 2008), which is independent of the emotional tone of a voice. However, beyond these simple task effects, we expected that both levels of processing would differentially influence the decoding of emotional compared with neutral prosody. In particular, we expected that a more explicit decoding of emotional compared with neutral prosody might reveal activation more anteriorly in STG regions compared with implicit processing (Grandjean et al. 2005).
Since different emotional prosodies comprise a different pattern of basic acoustic features, we finally explored the influence of the mean and variation of F0 and the energy of prosodic stimuli on functional activations during both levels of processing, and we predicted a stronger influence on brain regions that decode sensory stimulus features compared with higher level processing regions.
Materials and Methods
Seventeen healthy participants recruited from Geneva University took part in the experiment (3 males; mean age 25.52 years, standard deviation [SD] = 5.08, age range 20–38 years). All participants were native French speakers, were right handed, had normal or corrected-to-normal vision, and had normal hearing abilities. No participant presented a neurologic or psychiatric history. All participants gave informed and written consent for their participation in accordance with ethical and data security guidelines of the University of Geneva. The study was approved by the local ethics committee.
Stimulus Material and Trial Sequence
The stimulus material consisted of 4 speech-like but semantically meaningless words (“molen,” “belam,” “nikalibam,” and “kudsemina”) extracted from the Geneva Multimodal Emotion Portrayal database (Bänziger and Scherer 2010). These words were spoken in either a neutral or an angry tone by 2 male and 2 female speakers, resulting in 32 different stimuli (see Fig. 1A). Auditory stimuli had a mean duration of 690 ms and were equated for mean sound pressure level. A preevaluation of the stimuli by 16 participants (5 males; mean age 25 years, SD = 6.15, age range 20–39 years) revealed that both neutral and angry stimuli were significantly rated as neutral (F1,15 = 188.464, P = 8.835 × 10−32) and angry (F4,56 = 163.692, P = 3.380 × 10−30), respectively. Angry voices were rated as more arousing compared with neutral stimuli (F1,14 = 88.371, P = 1.998 × 10−7).
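The stimulus count follows from fully crossing word, prosody, and speaker. The following minimal sketch enumerates that crossing; the speaker labels are hypothetical placeholders, since the text only states that there were 2 male and 2 female speakers.

```python
from itertools import product

WORDS = ["molen", "belam", "nikalibam", "kudsemina"]
PROSODIES = ["neutral", "angry"]
# Hypothetical speaker labels: only the number and gender of speakers is reported.
SPEAKERS = [("m1", "male"), ("m2", "male"), ("f1", "female"), ("f2", "female")]


def build_stimulus_set():
    """Return one record per stimulus for the word x prosody x speaker crossing."""
    return [
        {"word": word, "prosody": prosody, "speaker": speaker, "gender": gender}
        for word, prosody, (speaker, gender) in product(WORDS, PROSODIES, SPEAKERS)
    ]


stimuli = build_stimulus_set()
n_angry = sum(1 for s in stimuli if s["prosody"] == "angry")
print(len(stimuli), n_angry)
```

Because the crossing is complete, speaker gender is balanced within each prosody category, which is what later allows the angry versus neutral contrast to be computed with gender counterbalanced.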
During scanning, auditory stimuli were presented binaurally with magnetic resonance imaging-compatible headphones (MR Confon) at a sound pressure level of approximately 70 dB. Auditory stimuli were preceded by a visual fixation cross (1 × 1°) for 1 ± 0.5 s. The fixation cross remained on the screen for as long as 2 s after the auditory stimulus to indicate a time window during which participants were urged to respond. Auditory stimuli were presented between functional volume acquisitions and had an onset of 4 ± 0.75 s prior to the onset of the volume acquisition that followed (see below).
The same stimuli were presented during 2 blocks of explicit prosody discriminations on the stimuli (neutral or angry; right index and middle finger) and during another 2 blocks, where participants made gender discriminations (male or female) on the voices. For the latter, we assumed that though participants focus on the gender of the voice, the emotional tone of the voice is still decoded on an implicit level of processing. This task might involve functional activations, which are related to the explicit processing of gender information of the voice. However, as for the explicit task, we also computed contrasts between angry and neutral voices for this implicit task, which keeps the gender of the voice counterbalanced for this comparison. This should eliminate any activation primarily related to processing of the gender of a voice. Each of the experimental blocks contained 38 trials, including 6 silent events with no auditory stimulation. Task blocks alternated across the experiment, and block order and response buttons were counterbalanced across participants.
To localize human voice-sensitive regions in the bilateral superior temporal cortex, we used 8-s sound clips taken from an existing database (see http://vnl.psy.gla.ac.uk/ and Belin et al. 2000). These sound clips contained 20 sequences of human voices and 20 sequences of animal or environmental sounds. Each sound clip was presented once with a fixation cross on the screen and a 4-s gap between each clip. The scanning sequence also contained twenty 8-s silent events, and participants had to passively listen to the stimuli (see Supplementary Material for a full description of the voice localizer scan).
For the main experiment, we obtained high-resolution imaging data on a 3 T Siemens Trio System (Siemens, Erlangen, Germany) by using a T2*-weighted gradient echo planar imaging sequence. Twenty-five axial slices were aligned to the superior temporal sulcus along the anterior–posterior orientation (thickness/gap = 2/0.4 mm, field of view (FoV) = 192 mm, in-plane 1.5 × 1.5 mm). We used a sparse temporal acquisition protocol with repetition time (TR) = 10 s, which consisted of 1.75 s for volume acquisition and a silent gap of 8.25 s. For the voice localizer, we used a continuous whole-head acquisition of 36 slices (thickness/gap = 3.2/0.64 mm, FoV = 205 mm, in-plane 3.2 × 3.2 mm) aligned to the anterior to posterior commissure plane with TR/echo time (TE) = 2.1/0.03 s. Finally, a high-resolution magnetization prepared rapid acquisition gradient echo T1-weighted sequence (192 contiguous 1 mm slices, TR/TE/inversion time = 1.9 s/2.27 ms/900 ms, FoV = 296 mm, in-plane 1 × 1 mm) was acquired in sagittal orientation to obtain structural brain images from each participant.
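The timing of the sparse protocol can be checked with a few lines of arithmetic. The sketch below places each stimulus 4 ± 0.75 s before the next volume onset and confirms that it falls inside the silent gap. It assumes that a run starts with a volume acquisition at t = 0, that there is one trial per TR, and that the jitter is uniformly distributed; none of these details is stated in the text.

```python
import random

TR = 10.0            # s, repetition time of the sparse protocol
ACQ = 1.75           # s, volume acquisition at the start of each TR
GAP = TR - ACQ       # s, silent gap in which stimuli are presented
LEAD_MEAN = 4.0      # s, mean stimulus onset before the next volume
LEAD_JITTER = 0.75   # s, half-width of the onset jitter
MEAN_DURATION = 0.69 # s, mean stimulus duration


def stimulus_onset(trial_index, rng):
    """Onset (s from run start) of the stimulus preceding volume trial_index + 1.

    The jitter is drawn uniformly, which is an assumption of this sketch.
    """
    next_volume_onset = (trial_index + 1) * TR
    lead = LEAD_MEAN + rng.uniform(-LEAD_JITTER, LEAD_JITTER)
    return next_volume_onset - lead


rng = random.Random(0)
onsets = [stimulus_onset(i, rng) for i in range(38)]
print(GAP, round(onsets[0], 2))
```

With these values every stimulus starts at least 3.5 s after the end of the preceding acquisition and ends well before the next one, so scanner noise does not overlap with the prosodic stimulus.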
We used the statistical parametric mapping software SPM (version 8; Wellcome Department of Cognitive Neurology, London, UK) for preprocessing and statistical analysis of functional images. Functional images were realigned and coregistered to the anatomical image. A segmentation of the anatomical image revealed warping parameters that were used to normalize the functional images to the Montreal Neurological Institute (MNI) stereotactic template brain. During normalization, functional images from the voice localizer scan were resampled to 1.5 × 1.5 × 2 mm voxel size. Normalized images were spatially smoothed with a nonisotropic Gaussian kernel of full-width at half-maximum 3 × 3 × 4 mm.
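A smoothing kernel is specified by its full-width at half-maximum (FWHM), whereas a Gaussian is usually parameterized by its standard deviation; the two are related by sigma = FWHM / (2 sqrt(2 ln 2)). The sketch below applies this conversion to the reported kernel. It assumes that smoothing operates on a 1.5 × 1.5 × 2 mm voxel grid, which is the resampling size given above for the localizer data; under that assumption the nonisotropic kernel in millimeters corresponds to the same width in voxels along each axis.

```python
import math

FWHM_MM = (3.0, 3.0, 4.0)   # reported kernel, in mm (x, y, z)
VOXEL_MM = (1.5, 1.5, 2.0)  # assumed voxel grid for the smoothed images


def fwhm_to_sigma(fwhm):
    """Convert a Gaussian FWHM to its standard deviation (same units)."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


sigma_mm = tuple(fwhm_to_sigma(f) for f in FWHM_MM)
fwhm_voxels = tuple(f / v for f, v in zip(FWHM_MM, VOXEL_MM))
print([round(s, 3) for s in sigma_mm], fwhm_voxels)
```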
We used a general linear model for the first-level statistical analyses, including boxcar functions defined by the onset and duration of the auditory stimuli. These boxcar functions were convolved with a canonical hemodynamic response function. Separate regressors were created for each experimental condition, and the general linear model for the main experiment also included one additional regressor containing all erroneous and missed trials. Six motion correction parameters were finally included as regressors of no interest to minimize false-positive activations that were due to task-correlated motion. Linear contrasts for the experimental conditions for each participant were taken to a second-level random effects group analysis of variance.
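To illustrate how a condition regressor of this kind is built, the sketch below convolves a stimulus boxcar with a double-gamma response function and samples the result at the volume onsets. The response function uses commonly cited default shape parameters (peak shape 6, undershoot shape 16, ratio 1/6) as an assumption; it is an approximation for illustration and is not taken from the SPM8 code. The function names and the single example trial are hypothetical.

```python
import math


def gamma_pdf(t, shape):
    """Gamma density with unit scale; zero for non-positive t."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * math.exp(-t) / math.gamma(shape)


def double_gamma_hrf(t, peak=6.0, undershoot=16.0, ratio=1.0 / 6.0):
    """Double-gamma response: early peak minus a scaled late undershoot."""
    return gamma_pdf(t, peak) - ratio * gamma_pdf(t, undershoot)


def build_regressor(onsets, durations, scan_times, dt=0.1, hrf_length=32.0):
    """Convolve a stimulus boxcar with the HRF and sample it at scan_times."""
    n = int(round((max(scan_times) + hrf_length) / dt)) + 1
    boxcar = [0.0] * n
    for onset, duration in zip(onsets, durations):
        start = int(round(onset / dt))
        stop = int(round((onset + duration) / dt))
        for i in range(start, stop):
            boxcar[i] = 1.0
    kernel = [double_gamma_hrf(k * dt) * dt for k in range(int(hrf_length / dt))]
    regressor = []
    for scan in scan_times:
        idx = int(round(scan / dt))
        total = 0.0
        for k, h in enumerate(kernel):
            j = idx - k
            if j < 0:
                break
            total += boxcar[j] * h
        regressor.append(total)
    return regressor


# One hypothetical trial: stimulus 4 s before the volume at t = 10 s.
reg = build_regressor(onsets=[6.0], durations=[0.69], scan_times=[0.0, 10.0, 20.0, 30.0])
print([round(v, 4) for v in reg])
```

In a sparse design each volume samples the response roughly 4 s after stimulus onset, close to the peak of the modeled response, while the following volume falls into the undershoot.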
For the main experiment, we set up a 2 × 2 factorial design including the factors “task” and “emotion.” For the task factor, we explored what brain regions revealed higher functional activation during explicit compared with implicit processing and vice versa. For the factor emotion, we were interested in increased brain activations elicited by angry compared with neutral stimuli. We first compared the general brain activity of angry with neutral stimuli across both tasks. Moreover, the same contrast was also computed separately for the explicit and implicit task to find out how the general brain network for the processing of angry compared with neutral prosody emerges during both tasks. Finally, to identify brain activity that is exclusive to the processing of angry prosody during a specific task, we also performed 2 different interaction analyses including the combination of angry prosody with the explicit and the implicit task, respectively. All contrasts were thresholded at P < 0.001 (uncorrected). The cluster extent threshold corrected for multiple comparisons was computed based on the estimated smoothness of the data according to Forman et al. (1995) as implemented in the Brain Voyager QX software (version 188.8.131.520; Brain Innovation, Maastricht, The Netherlands). An iterative Monte Carlo simulation with 1000 repetitions for each single contrast yielded a minimum cluster extent of k = 6 voxels corresponding to a cluster-level false-positive rate of P < 0.001.
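The contrasts described for the 2 × 2 design can be written as weight vectors over the 4 condition regressors. The following sketch lists one plausible set of weights; the condition order, the contrast labels, and the exact form of the interaction contrasts are assumptions, and the beta values are made-up numbers used only to show how a contrast is evaluated.

```python
# Assumed order of the four condition regressors.
CONDITIONS = ["explicit_angry", "explicit_neutral", "implicit_angry", "implicit_neutral"]

CONTRASTS = {
    "explicit > implicit": [1, 1, -1, -1],
    "implicit > explicit": [-1, -1, 1, 1],
    "angry > neutral (both tasks)": [1, -1, 1, -1],
    "angry > neutral (explicit)": [1, -1, 0, 0],
    "angry > neutral (implicit)": [0, 0, 1, -1],
    "interaction: angry x explicit": [1, -1, -1, 1],
    "interaction: angry x implicit": [-1, 1, 1, -1],
}


def contrast_value(weights, betas):
    """Weighted sum of condition estimates for one contrast."""
    return sum(w * betas[name] for w, name in zip(weights, CONDITIONS))


# Made-up parameter estimates for a single voxel, for illustration only.
example_betas = {
    "explicit_angry": 3.0,
    "explicit_neutral": 1.0,
    "implicit_angry": 2.0,
    "implicit_neutral": 1.0,
}

for label, weights in CONTRASTS.items():
    print(label, contrast_value(weights, example_betas))
```

Each weight vector sums to zero, so a contrast is unaffected by a constant offset shared by all conditions.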
We were additionally interested in the influence of basic auditory stimulus features on functional brain activations. We therefore included the log-transformed mean and variation (SD) of the F0 and the energy (I) of each auditory stimulus in 2 additional analyses. In a first analysis, we included the mean and SD of the F0 as a covariate on a trial-by-trial basis in the first-level analysis. This analysis should reveal functional activations that are insensitive to differences in F0 stimulus features. The same analysis was repeated by including the mean and SD of I as a covariate. All contrasts were thresholded at P < 0.001 (uncorrected) with a cluster extent of k = 6 voxels.
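As an illustration of how such trial-wise covariates can be prepared, the sketch below takes the mean and SD of an F0 contour, log-transforms both, and centers the values across trials. The natural logarithm, the use of the sample SD, and the mean-centering step are assumptions of this sketch, and the F0 contours are made-up values in Hz.

```python
import math
import statistics


def log_f0_features(f0_contour_hz):
    """Return (log mean, log SD) of one stimulus' F0 contour.

    Natural log and sample SD are assumptions of this sketch.
    """
    mean_f0 = statistics.mean(f0_contour_hz)
    sd_f0 = statistics.stdev(f0_contour_hz)
    return math.log(mean_f0), math.log(sd_f0)


def center(values):
    """Subtract the across-trial mean so the covariate has zero mean."""
    mu = statistics.mean(values)
    return [v - mu for v in values]


# Made-up F0 contours (Hz) for three hypothetical trials.
trials = [
    [300.0, 340.0, 380.0],
    [140.0, 150.0, 160.0],
    [250.0, 300.0, 350.0],
]
features = [log_f0_features(contour) for contour in trials]
log_mean_covariate = center([f[0] for f in features])
log_sd_covariate = center([f[1] for f in features])
print([round(v, 3) for v in log_mean_covariate])
```

The same steps would apply to the energy values; each covariate then enters the first-level model with one value per trial.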
For the voice localizer, we finally contrasted vocal against nonvocal animal and environmental stimuli at a threshold of P < 0.001 (uncorrected) and a cluster extent of k = 6 voxels. We determined voice-sensitive regions along the STG and STS in both hemispheres for each participant as well as for the entire sample.
Regions in the superior temporal cortex are not only sensitive to emotional prosody but more generally to human voices (Belin et al. 2000). To define voice-sensitive areas in superior temporal cortex, we first ran a voice localizer fMRI scan where participants listened to either 8-s auditory stimuli of human nonspeech voices or to auditory stimuli of environmental sounds or animal vocalizations (see http://vnl.psy.gla.ac.uk/ and Belin et al. 2000). We compared functional activations for human voices with activations for environmental and animal sounds and revealed extended bilateral activations in the STG and STS for the entire group of participants (see Supplementary Fig. 1 and Supplementary Table 1). These group activations served to define voice-sensitive areas in the bilateral temporal cortex (black outline in Figs 2, 3, 5, and 6) and were confirmed by single-subject analysis on the voice-sensitive regions (see Supplementary Fig. 1B). Our results of voice-sensitive areas are consistent with a recent study, which has shown a differential decoding of emotional prosody within these voice-sensitive regions (Ethofer, Van De Ville, et al. 2009).
Frontotemporal and Subcortical Subregions for the Decoding of Emotional Prosody
To reveal voice-sensitive subregions in superior temporal cortex as well as subregions in inferior frontal and subcortical brain regions that are involved in the decoding of emotional prosody, we asked 17 participants to perform a prosody (explicit) or a gender discrimination task (implicit) on angry and neutral voices, while we collected another set of fMRI scans (Fig. 1A). Functional scans were aligned to the posterior–anterior orientation of STS and were spatially restricted to superior temporal cortex and inferior frontal cortex, also covering subcortical regions such as the amygdala, to obtain high-resolution images from these brain regions. We used a sparse temporal acquisition protocol where auditory stimuli were presented in an 8.25-s silent gap between image acquisitions (see Materials and Methods).
In a first analysis, we explored the effects of task by comparing functional activations during the prosody discrimination task (explicit processing) with activations during the gender discrimination task (implicit processing) irrespective of the emotional value of the voice. While explicit compared with implicit processing of prosody revealed no functional activation in our primary regions of interest in lateral superior temporal and inferior frontal lobe (see Supplementary Table 2A for a full list of activations), implicit compared with explicit processing elicited activity in right middle temporal gyrus (MTG, MNI x, y, z; 66, −7, −16; Fig. 2).
Besides the main effects of task, we were furthermore interested in the general effects of angry compared with neutral voices across both tasks. This analysis should reveal brain activity that is independent of the level of processing. We found that several subregions in superior temporal and inferior frontal cortex are sensitive to emotional prosody. In the right hemisphere in particular, we found temporal subregions in the fundus of the posterior superior temporal sulcus (f-pSTS; 45, −34, 4), pSTG (69, −22, 4), mSTG (66, −3, 2), and planum polare (PP; 53, −4, −4) as well as frontal subregions in the frontal operculum (fOP; 48, 13, −2) and IFG (51, 32, −2). The fOP activation was located in the right hemispheric homologue of left BA 44, whereas the other IFG activation was located in BA 45/47. In the left hemisphere, we found activity in PP (−50, −10, 4), pSTG (−68, −27, 6), and IFG (−44, 29, 0) as well as in the amygdala (−18, −4, −16). All subregions in the superior temporal cortex were located in voice-sensitive areas, except for the activity in PP (Fig. 3A; see Supplementary Table 2C for a full list of activations).
Since, however, activity in this frontotemporal and subcortical set of brain regions might also depend on the level of processing, we additionally compared activity for angry compared with neutral voices separately for the explicit and the implicit discrimination task. We found the left IFG (−50, 26, 6) and amygdala (−18, −4, −18), as well as the right PP (53, −4, −4), mSTG (60, 1, 0), and pSTG (65, −24, 6), to be active during the explicit discrimination task (Fig. 3B; see Supplementary Table 2D), whereas the left IFG (−42, 29, 0) and PP (−51, −10, −4), as well as the right IFG (54, 31, 2), fOP (47, 13, 0), PP (53, −4, −4), pSTG (68, −22, 4), and globus pallidus (23, −13, 4), were active during the implicit discrimination task (Fig. 3C; see Supplementary Table 2E). These results suggest that some regions in STG and IFG are generally sensitive to emotional prosody independent of the level of processing (right f-pSTS, left IFG), while other regions are active only during the explicit (amygdala) or implicit decoding of emotional prosody (right IFG, left PP, left pSTG). To further characterize brain activity that is specific to the explicit or implicit discrimination task, we performed an interaction analysis to find brain activity that is unique to the decoding of angry prosody during a specific level of processing. This analysis revealed specific activations for angry voices in the explicit discrimination task in the subgenual anterior cingulate cortex (sgACC; −5, 31, −10) and in the left striatum (putamen; Fig. 4A) and bilateral striatal activity during the implicit discrimination task (Fig. 4B; see Supplementary Tables 2F,G).
Taken together, we found a general frontotemporal and subcortical network of brain regions for the decoding of angry prosody that consists of several local brain regions, especially in the right hemisphere. Activity in this general set of brain regions, however, revealed a dependency on the levels of processing. Another factor that is assumed to influence activity in this frontotemporal network for the sensory and evaluative decoding of emotional prosody is the set of emotion-specific basic acoustical features, such as the F0 and the energy of stimuli, which are known to vary across different vocal expressions of emotions (Banse and Scherer 1996; Grandjean et al. 2006).
Sensitivity to Emotion-Specific Acoustic Features in Posterior STG and Amygdala
We performed 2 additional analyses that were similar to the one described in the former section but taking into account F0 and energy (I) differences between angry and neutral stimuli. Specifically, we scored the mean and SD of the F0 and the energy for each of the 16 angry and 16 neutral auditory stimuli. These stimulus features were log transformed and entered into the statistical analysis for each participant on a trial-by-trial basis (angry voices: logF0mean = 5.83 [standard error of the mean—SEM 0.04], logF0SD = 3.74 [SEM 0.15], logImean = 4.29 [SEM 0.01], logISD = 1.93 [SEM 0.07]; neutral voices: logF0mean = 5.00 [SEM 0.10], logF0SD = 2.66 [SEM 0.20], logImean = 4.31 [SEM 0.01], logISD = 1.66 [SEM 0.06]). We performed one analysis by taking into account the mean and SD of the F0 and a separate analysis with the mean and SD of the energy. Group analyses and contrasts were performed in the same way as described in the former section. Functional activations, which are modulated by this analysis compared with our former analysis, should indicate a strong sensitivity to acoustical stimulus features.
When including the mean and SD of the F0 as covariate in the analysis, the right MTG activations that we found during the gender compared with the prosody discrimination task remained active, indicating insensitivity to F0 stimulus features (see Supplementary Table 3B). We also found more activity in the right f-pSTS (47, −33, 0) as well as in left aSTG (−56, 11, −10) and IFG (−45, 41, −10) for the prosody compared with the gender discrimination task (Fig. 5; see Supplementary Table 3A). For the contrasts related to the experimental factor emotion, most of the activations for angry compared with neutral voices in subcortical regions and in posterior regions of STG disappeared, except for the activity in right mSTG (59, −1, −2) and PP (51, −4, 0), which also seems insensitive to F0 stimulus features (Fig. 6; see Supplementary Table 3C–E). Furthermore, activity in bilateral IFG also showed an insensitivity to F0 stimulus features, where left IFG (−51, 26, 4) again was generally active during both tasks, whereas right IFG (51, 32, −2) was only active during implicit processing. We again explored specific activations using an interaction analysis to find unique activation for angry prosody during explicit or implicit processing. Compared with our first analysis that did not take F0 stimulus features into account, we found right globus pallidus (23, −15, 0) activity during implicit processing (Fig. 7D; see Supplementary Table 3G) and widespread activations in bilateral mSTG (left; −62, −1, −4 and right; 60, 1, 0) and pSTG (left; −65, −33, 4 and right; 65, −22, 6) as well as in the right fOP (56, 8, 14), Heschl gyrus (HG; 65, −16, 12), and planum temporale (PT; 50, −22, 4) for angry voices during the explicit task (Fig. 7A–C; see Supplementary Table 3F). Especially during explicit processing, activity was additionally found in a region between right mSTG and pSTG (termed m-pSTG; −60, −18, 2) located within voice-sensitive superior temporal cortex.
We performed an identical analysis with the mean and SD of the energy as covariate and found almost identical activations compared with the analysis with the mean and SD of the F0 as covariate (see Supplementary Table 4 and Supplementary Figs 2–4). The only exception was additional activation in the sgACC (−5, 31, −10) for the interaction analysis including angry voices in the explicit task (see Supplementary Table 4F), suggesting that sgACC is less sensitive to energy differences than to F0 differences. However, the general similarity for the analysis with F0 and the energy as covariate indicates a comparable influence of F0 and energy stimulus features on frontotemporal and subcortical brain activity since F0 and energy features seem to be highly associated during angry prosody (Banse and Scherer 1996; Leitman et al. 2010). This influence of F0 and energy stimulus features was especially pronounced in pSTG and m-pSTG regions and especially when attention was explicitly focused on the emotional prosody and most probably on prosody-specific acoustical stimulus features.
The results of the present study suggest that the frontal, temporal, and subcortical network that is commonly involved in the decoding of emotional prosody consists of several subregions, especially at the cortical level. We found that this differentiation in superior temporal and inferior frontal brain regions was strongly driven by the comparison between angry and neutral prosody, while the difference between the prosody and the gender discrimination task did not reveal strong differences except for activity in right MTG during the gender discrimination task. This might be indicative of increased gender encoding as part of a general speaker’s identity processing in more anterior temporal brain regions (Belin and Zatorre 2003; Formisano et al. 2008). The fact that we did not find strong effects for the explicit prosody discrimination task might be due to our restricted scanning space, which mainly included superior temporal and inferior frontal brain regions, while recent studies mainly found activity in superior parietal and superior frontal regions during the explicit decoding of emotional prosody (Bach et al. 2008; Ethofer, Kreifelts, et al. 2009). However, only when taking central prosodic stimulus features into account did we reveal a set of distributed brain regions during explicit decoding of emotional prosody, as we will discuss below.
Though the task factor alone did not reveal a strong differentiation of brain regions, it showed a strong influence when combined with the factor emotion. The comparison of angry with neutral prosody revealed several temporal and frontal subregions, which become differentially active during task-independent, explicit, or implicit decoding of emotional compared with neutral prosody. The right hemisphere in particular revealed a temporal network of at least 4 regions, namely, regions in the fundus of the superior temporal sulcus (f-pSTS), pSTG, mSTG, and PP. All these regions were located within voice-sensitive areas, except for the PP. The f-pSTS has not frequently been reported as a region in the superior temporal cortex that is sensitive to emotional prosody but might correspond with posterior STS regions involved in the integration of multimodal emotional signals (Kreifelts et al. 2009). The f-pSTS was active independent of the task, whereas the pSTG and PP were similarly active during both tasks, and the mSTG was active only during explicit prosody discrimination. This posterior-to-anterior gradient from a task-independent and stimulus-driven decoding of emotional prosody to more explicit decoding in the superior temporal cortex is in accordance with a former observation (Grandjean et al. 2005) and resembles a proposed increase in levels of stimulus processing when information is fed forward to the more anterior superior temporal cortex (Schirmer and Kotz 2006) along a proposed pathway of auditory object recognition (Rauschecker and Scott 2009).
This temporal network showed a strong differentiation in the right hemisphere, whereas only 2 regions could be differentiated in the left hemisphere. A region in the left pSTG located in voice-sensitive areas was active during implicit processing and PP during both implicit and explicit processing. Although this left hemispheric pattern of activations follows the same posterior-to-anterior gradient, the right hemispheric dominance is in accordance with a stronger, but not exclusive, involvement of the right superior temporal cortex in decoding emotional prosody of vocalizations consisting of nonintelligible speech (Grandjean et al. 2005; Sander et al. 2005) or nonspeech stimuli (Fecteau et al. 2007).
Similar to the differentiation of functional activations in the superior temporal cortex, peak activations in the inferior frontal cortex could be located in different subregions and also showed task dependency. Angry compared with neutral voices elicited increased activity in the right fOP and IFG during implicit processing and in the left IFG during both implicit and explicit processing. The distinction of activation in the right inferior frontal cortex supports the notion that different subregions might subserve different functional roles. Activity in the fOP (BA 45) might subserve increased cognitive evaluation of emotional prosody (Leitman et al. 2010), whereas more anterior IFG regions (BA 47) are associated with outcome-related evaluations (Schirmer and Kotz 2006). Both processes seem to be especially increased during implicit decoding, when attention is not directly focused on the emotional prosody feature. Explicit decoding of emotional prosody engaged only the left IFG in BA 47, which is in accordance with recent studies (Bach et al. 2008; Ethofer, Kreifelts, et al. 2009) and with studies showing increased left IFG activations when explicit decoding is stressed by contextual factors (Schirmer et al. 2004; Mitchell 2006). Specific activity during explicit decoding of emotional prosody was also found in the ventromedial frontal cortex and might serve in similar elaborate decoding and appraisal processes of emotional prosody when individuals attend to emotional prosodic stimulus features (Sander et al. 2005).
Explicit decoding of emotional prosody also revealed activation in the left amygdala. Some recent studies also report activation in the amygdala for emotional compared with neutral voices (Sander et al. 2005; Fecteau et al. 2007; Bach et al. 2008; Ethofer, Kreifelts, et al. 2009; Wiethoff et al. 2009). However, though some studies report amygdala activations only during explicit processing of emotional prosody (Wiethoff et al. 2009; Leitman et al. 2010), other studies highlight the notion that the amygdala is more generally active independent of the level of processing (Sander et al. 2005) or specifically during the implicit processing of emotional prosody (Bach et al. 2008). In the latter case, the amygdala is supposed to act as a detector of important emotional information even when this information is presented outside the focus of attention. We also found, in accordance with the results of Sander et al. (2005), general left amygdala activation independent of the task. However, Bach et al. (2008) found amygdala activation during implicit processing only when comparing all emotional and neutral stimuli together with the explicit processing of both kinds of stimuli. In the present study, we specifically compared emotional and neutral stimuli within each task separately and did not find specific amygdala activity for emotional stimuli during implicit processing; rather, we found this activity during explicit processing.
Apart from activity in the amygdala, which might code the emotional value of emotional prosody during explicit processing, we found specific activity in the left basal ganglia during explicit decoding but also in the bilateral basal ganglia during implicit decoding of emotional prosody. Rather than coding the emotional value, the basal ganglia are assumed to code the temporal patterns of emotional acoustic cues, such as rhythms or variations in the auditory signal (Kotz and Schwartze 2010). Angry as compared with neutral prosody is especially characterized by a strong variation of the F0 and the energy, rapid speech onset and high speech rate that results in a specific and distinguished temporal acoustical pattern (Banse and Scherer 1996; Grandjean et al. 2006).
Whereas the basal ganglia seem to specifically code the temporal pattern of emotional acoustic cues, cortical brain regions seem to be sensitive to other acoustic cues (Wiethoff et al. 2008; Leitman et al. 2010). We tested whether the different temporal and frontal subregions that we found to be active during the explicit and implicit decoding of emotional prosody are sensitive to F0 and energy stimulus features of the angry compared with neutral prosody. When including mean and variation of the F0 or of the energy as a covariate in our analysis, we found that bilateral activations in the inferior frontal cortex were insensitive to these acoustic stimulus features. Moreover, during the prosody discrimination task, we found additional activity in a left anterior IFG region (Ethofer, Anders, Erb, et al. 2006; Bach et al. 2008), which might represent a frontal voice and prosody sensitive area (Fecteau et al. 2005). This left anterior IFG region seems not only to be insensitive to acoustic features of emotional prosody but also to have been covered by F0 and energy differences between conditions in the first analysis. Explicit attention to emotional prosody also revealed additional activity in a right posterior and left anterior voice-sensitive area, indicating that the explicit decoding of emotional prosody in general includes a temporofrontal network that can only be detected when central prosody features are included in the analysis, for which a different set of brain regions seems to be sensitive.
For the comparison of emotional with neutral prosody, we found that especially posterior regions in STG are sensitive to prosodic stimulus features. Except for activations in the right PP and mSTG, activations in the amygdala, the sgACC (only for the F0 as covariate), and most of the activations in the bilateral STG disappeared, indicating sensitivity to acoustic stimulus features. However, the right PP and mSTG were active only during explicit processing, which we found even when taking acoustical features into account. These activations might indicate some higher level auditory emotional representations independent of more basic stimulus features but depending on explicit attention to the emotional prosody. Furthermore, the interaction analysis of specific activations revealed widespread bilateral STG activations, again, especially during explicit decoding of angry prosody. We found activations in a left region in between the mSTG and pSTG (m-pSTG) as well as in the right HG and PT. Explicitly orienting attention to the prosodic features and presumably to emotion-specific acoustical features of angry compared with neutral stimuli might have led to this enhanced activity in these regions. These regions were not active during the first analysis, in which we did not take stimulus features into account. Therefore, these activations again might have been covered by strong F0 and energy differences between conditions. In particular, the m-pSTG was located in more pSTG regions for implicit decoding and in mSTG regions for explicit decoding. This finding complements the formerly discussed posterior-to-anterior gradient of voice-sensitive STG regions, where a more anterior location implies a greater sensitivity to emotional prosody during explicit decoding.
We obtained similar results when including mean and variation of the energy as a covariate in the analysis instead of the F0, the only exception being that the sgACC showed no sensitivity to energy-related stimulus features. The similarity of results could be based on the fact that F0 and energy modulations are highly correlated for angry expressions and can separate anger from other emotions (Banse and Scherer 1996; Leitman et al. 2010; Patel et al. 2011). Though it is assumed that the left hemisphere is usually more sensitive to the F0, whereas the right hemisphere is sensitive to energy (Zatorre et al. 2002), we could not confirm this hemispheric specialization. Regions in the bilateral STG and amygdala showed a similar sensitivity to F0 in addition to energy-related stimulus features.
A surprising finding was the sensitivity of amygdala activation to F0 and energy stimulus features. Activity in the amygdala is usually assumed to reflect the decoding of the emotional value of auditory stimuli rather than the decoding of basic acoustic cues (Scott et al. 1997; Sprengelmeyer et al. 1999; Wiethoff et al. 2009; Leitman et al. 2010). However, Bordi and LeDoux (1992) have shown a sensitivity of the amygdala to simple auditory sensory stimulation suggesting some sensitivity of the amygdala to basic acoustic stimulus features. Angry prosody consists of a unique combination of F0 and energy stimulus features, and activity in the amygdala in response to angry prosody (Grandjean et al. 2005; Sander et al. 2005) might partly rely on these basic acoustic stimulus cues (Leitman et al. 2010).
Taken together, the present data revealed 3 major findings. First, the common frontotemporal network of brain regions consists of several subregions. We were able to distinguish at least 4 subregions in the right STG and 2 subregions in the IFG when comparing emotional with neutral prosody but also during the prosody compared with gender discrimination task when taking stimulus features into account. Second, these subregions are differentially sensitive to the levels of processing. Implicit processing of emotional compared with neutral prosody engages more pSTG regions, the bilateral IFG, and bilateral basal ganglia, whereas explicit processing relies on more mSTG regions, the left IFG, amygdala, left basal ganglia, and sgACC. Third, a part of these regions also showed sensitivity to emotion-specific acoustic cues when comparing emotional with neutral prosody. This sensitivity was especially strong in the bilateral pSTG during implicit processing, whereas explicit processing revealed a widespread network of bilateral mSTG and pSTG regions, which were relatively independent of F0 and energy-related acoustic cues.
This study was supported by the Swiss National Science Foundation (SNSF 105314_124572/1) and by the National Center for Competence in Research in Affective Sciences at the University of Geneva.
Conflict of Interest: None declared.