Cross-modal integration of visual and auditory emotional cues is thought to be advantageous for the accurate recognition of emotional signals. However, the neural locus of cross-modal integration between affective prosody and unconsciously presented facial expressions in the neurologically intact population remains elusive. The present study examined the influence of unconsciously presented facial expressions on event-related potentials (ERPs) during emotional prosody recognition. In the experiment, fearful, happy, and neutral faces were rendered invisible by continuous flash suppression and presented simultaneously with voices containing laughter or a fearful shout. Conventional peak analysis revealed that the ERPs were modulated interactively by emotional prosody and facial expression at multiple latency ranges, indicating that audiovisual integration of emotional signals takes place automatically, without conscious awareness. In addition, the global field power during the late-latency range was larger for the shout than for laughter only when a fearful face was presented unconsciously. The neural locus of this effect was localized to the left posterior fusiform gyrus, supporting the view that this cortical region, traditionally considered a unisensory visual area, serves as a locus of audiovisual integration of emotional signals.
In everyday situations, emotional signals are often encoded redundantly in multiple perceptual modalities. For example, utterances with angry prosody tend to co-occur with the display of angry facial expressions. Cross-modal integration of such redundant emotional cues is thought to be advantageous, because drawing on information from multiple modalities increases the accuracy with which the encoded information is identified.
Consistent with this, an increasing number of studies have shown cross-modal interactions between auditory and visual emotional information (Klasen et al. 2012). One type of study presented congruent or incongruent information simultaneously in multiple modalities and examined the facilitatory effect of affective congruency (Massaro and Egan 1996; de Gelder et al. 1999; Pourtois et al. 2000; Dolan et al. 2001; Föcker et al. 2011; Klasen et al. 2011; Müller et al. 2011, 2012; Stienen et al. 2011), while another compared the processing efficiency of bimodally presented emotional signals with that of emotional signals presented in a single modality (Pourtois et al. 2005; Ethofer et al. 2006; Kreifelts et al. 2007; Collignon et al. 2008; Robins et al. 2009; Jessen and Kotz 2011; Jessen et al. 2012). Although the details of the results are not necessarily convergent, these studies have consistently shown cross-modal interactions between auditory and visual information in emotional processing (for reviews, see de Gelder and Bertelson 2003; Campanella and Belin 2007; Klasen et al. 2012).
From an evolutionary perspective, the automaticity or mandatory nature of emotional processing is of critical importance in increasing the odds of survival, because it ensures that emotional processing kicks in regardless of whether attentional or cognitive resources are allocated to the emotional stimulus itself. The automatic nature of affective information processing has been demonstrated for signals indicating the existence of threats in the environment (Öhman et al. 2001; Armony and Dolan 2002; Holmes et al. 2005; Pourtois et al. 2005, 2006). For example, Öhman et al. (2001) have shown that the speed of detecting threatening animals, such as snakes and spiders, within an array of visual stimuli is almost independent of display size, which is a diagnostic sign of preattentive visual search. Importantly, later studies have gone even further by revealing that biologically important emotional stimuli can induce physiological (Liddell et al. 2004; Carretié et al. 2005; Kiss and Eimer 2008; Pegna et al. 2008) and behavioral changes (Dimberg et al. 2000; Li et al. 2008; Tamietto et al. 2009; Stienen et al. 2011) even when they are rendered invisible. Furthermore, evidence has accumulated that subcortical structures, including the amygdala, detect emotional information without awareness (Morris et al. 1999; Killgore and Yurgelun-Todd 2004; Williams et al. 2004; but see, Pessoa et al. 2006).
Taking into account the advantage of mandatory processing, it seems likely that the cross-modal integration of auditory and visual emotional information also takes place automatically. Support for this conjecture comes from Vroomen et al. (2001), who showed that the influence of simultaneously presented facial expressions on emotional prosody recognition is robust to the cognitive load incurred by a secondary task. The ultimate test of the automaticity of audiovisual integration would be to see whether a synergistic interaction between auditory and visual emotional information is observed when either or both of these sources of information are presented without awareness. This test has been conducted in a series of studies by de Gelder et al. (2002, 2005) on blindsight patients, which showed that the unconscious presentation of affectively congruent facial expressions enhanced activation in the amygdala and superior colliculus (de Gelder et al. 2005) and increased the amplitude of auditory event-related potentials (ERPs) to emotional prosody (de Gelder et al. 2002).
The demonstration of cross-modal integration between prosody and unconsciously presented emotional stimuli constitutes especially strong support for the view that cross-modal integration of affectively congruent information can take place in a mandatory or automatic manner (de Gelder et al. 2002, 2005). At the same time, the neural mechanisms underlying cross-modal integration between emotional prosody and unconsciously presented facial expressions remain unresolved. More specifically, the existing studies (de Gelder et al. 2002, 2005) dealt with only a small number of patients with severe damage to their visual cortices. Although this characteristic of the participant group is advantageous in mitigating the potential influence of residual conscious awareness of visual stimuli (Tamietto and de Gelder 2010), it prevented researchers from examining the possibility that, in a neurologically normal population, cross-modal integration between emotional prosody and unconscious affective faces recruits visual cortices generally thought to be responsible for unisensory processing.
Support for this possibility comes from two recent lines of research. First, an increasing number of human and animal studies have shown that neural responses in supposedly unisensory regions can be modulated by the simultaneous presentation of stimuli from multiple sensory modalities (Schroeder and Foxe 2005; Kayser and Logothetis 2007; Driver and Noesselt 2008; Doehrmann et al. 2010; Lewis and Noppeney 2010). For example, a recent neuroimaging study revealed that temporal synchrony between auditory and visual stimuli affects activation of the primary visual cortex (Noesselt et al. 2007). Likewise, some human electrophysiological studies have indicated that activation of occipitotemporal regions can be increased nonlinearly within 40–110 ms after the onset of an audiovisual stimulus (Giard and Peronnet 1999; Joassin et al. 2004; Cappe et al. 2010).
The second line of support comes from research on the neural basis of visual consciousness. Several competing models have been proposed for the neural mechanism engendering visual consciousness. However, the majority of models agree that visual information below conscious awareness barely reaches regions outside the occipitotemporal areas (Lamme and Roelfsema 2000; Dehaene et al. 2006; Del Cul et al. 2007). On this basis, occipitotemporal areas are good candidates for the loci of cross-modal interaction between unconscious visual information and emotional prosody. For example, an influential model (Dehaene et al. 2006; Del Cul et al. 2007) proposes that, in order for visual information to reach conscious awareness, recurrent processing of this information, mediated by a feedback loop between visual and other cortical regions, has to be sustained for a prolonged period. This might imply that cortical regions outside the visual cortical areas do not spend sufficient time or processing resources on weak visual information below the threshold of conscious awareness. This conjecture has been partly supported by an electrophysiological investigation by Del Cul et al. (2007), which revealed that, when a visual stimulus was rendered invisible by visual masking with a short stimulus onset asynchrony, cortical activation was greatly reduced and short-lived everywhere except the occipitotemporal region, compared with when participants could consciously perceive the stimulus.
For the purpose of elucidating the neural mechanism of automatic integration of cross-modally presented affective information in neurologically intact participants, the present study investigated the modulation of the neural activation underlying affective prosody recognition by unconsciously presented facial expressions. In this experiment, we measured ERPs elicited while participants discriminated emotional prosody and simultaneously viewed unconsciously presented facial expressions. The facial expressions were rendered invisible by continuous flash suppression (CFS; Tsuchiya and Koch 2005; Tsuchiya et al. 2006; Yang et al. 2007). More specifically, static low-contrast pictures of facial expressions were projected to the nondominant eye, while dynamic nonsensical images were projected to the dominant eye. Under these conditions, the static information loses the interocular competition with the dynamic images and does not reach conscious awareness. One advantage of CFS over the oft-used visual-masking paradigm (Dimberg et al. 2000; Liddell et al. 2004; Eimer et al. 2008; Pegna et al. 2008) is that facial expression information can be presented unconsciously together with prosody information for a prolonged time, which presumably increases any effects of unconscious facial expressions on affective prosody processing. We included fearful, happy, and neutral expressions as unconscious visual stimuli. The auditory stimuli were voices of laughter and fearful shouts. By combining these auditory and visual emotional signals orthogonally, 6 types of audiovisual stimuli were presented to the participants. The effectiveness of the auditory and visual stimuli was evaluated in rating experiments, in which participants rated their emotional responses to the stimuli along the valence and arousal dimensions. These scales were used because valence and arousal are considered the fundamental dimensions defining emotional states (Cuthbert et al. 2000; Lang and Bradley 2010). The ratings for the stimuli should vary considerably along the valence (pleasant–unpleasant) dimension. As for the arousal dimension, many studies have shown that stimuli with negative valence generally tend to induce higher arousal than pleasant stimuli (Hamann 2003). Furthermore, threatening or aversive information is reported to be especially effective in mobilizing attentional and physiological responses such as autonomic and electrophysiological activations (Öhman et al. 2001; Hamann 2003; Vuilleumier 2005; Lang and Bradley 2010; Smith et al. 2013). On the basis of these findings, we chose fear-related stimuli that induce higher arousal than the happy and neutral stimuli.
At the first stage of ERP analysis, we conducted hypothesis-driven tests on conventional measures of peak amplitude and latency. Previous ERP studies on audiovisual integration have demonstrated the modulation of the peak amplitudes or latencies of early components elicited within 200 ms after stimulus onset (Giard and Peronnet 1999; Joassin et al. 2004; Stekelenburg and Vroomen 2007; Brefczynski-Lewis et al. 2009). As for the audiovisual integration of emotional information, de Gelder et al. (1999) showed that the effect of affective congruency between facial expression and emotional prosody manifests itself in a visual mismatch negativity elicited about 200 ms after stimulus onset. Likewise, Pourtois et al. (2000) revealed an increase in auditory N1 amplitude when prosody was paired with emotionally congruent facial expressions. These studies support the possibility that neural activation within the early latency range underlies the initial stages of cross-modal integration of emotional signals. Another reason for focusing on the early components is that ERP components within 200 ms after stimulus onset are reported to be sensitive to unconsciously presented emotional expressions. For example, unconsciously presented threatening stimuli (such as spiders and fearful faces) increase the amplitude of a frontomedial positivity elicited around 150 ms compared with nonthreatening stimuli (Carretié et al. 2005; Smith 2012). Furthermore, a centro-posterior negativity around 170–250 ms was augmented by a fearful face specifically when it was presented subliminally (Liddell et al. 2004; Kiss and Eimer 2008). On the basis of these findings, we first analyzed the modulatory effects of the pairing of emotional prosody and unconsciously presented facial expressions on the early ERP components: P1 and N2 at the occipitotemporal electrode sites (Rossignol et al. 2005; Vlamings et al. 2009) and N1 and P2 at the frontocentral electrode sites (Pourtois et al. 2000; de Gelder et al. 2002).
A number of ERP studies on emotion processing have revealed that the evaluation of "emotional meaning", rather than the extraction of emotion-specific perceptual features, takes place following unisensory cortex activation and is reflected in late-latency components (Cuthbert et al. 2000; Olofsson and Polich 2007; Schupp et al. 2007), suggesting that a qualitatively different stage of cross-modal integration may take place in this latency range. On this basis, we analyzed the modulation of long-latency components elicited around 300–700 ms after stimulus onset. The amplitude of long-latency components elicited around 400 ms after stimulus onset is reported to be sensitive to incongruities between human images and nonhuman sounds (Puce et al. 2007), but few studies have examined the effects of emotional congruency on ERPs in this latency range. As a notable exception, Paulmann et al. (2009) showed that the long-latency negativity N300 was decreased by bimodal compared with unimodal presentation of facial expressions and emotional prosody. Likewise, Jessen and Kotz (2011) revealed stronger alpha suppression in frontal regions in this latency range when affectively congruent information was presented multimodally. At the same time, de Gelder et al. (1999) failed to show any effects of emotional congruity between facial expression and emotional prosody on late-latency positivity. Therefore, although there are some indications that audiovisual integration continues beyond 300 ms after stimulus onset (Paulmann et al. 2009; Jessen and Kotz 2011), the nature of the processing in this latency range is yet to be clarified.
In addition to the conventional ERP analyses, the locus of cross-modal integration was examined in an exploratory manner by a hypothesis-blind analysis of ERP scalp field data. Although undoubtedly informative, the peak analysis of empirically defined ERP components has some shortcomings (Pourtois et al. 2005; Murray et al. 2008). First, a priori determination of the latency range during which each ERP component is analyzed can bias the conclusion drawn from the analysis, especially for long-latency components that often span several hundred milliseconds. Second, the selection of prominent ERP components carries a risk of missing potentially important modulations of low-amplitude waveforms. In order to avoid these pitfalls of conventional ERP analysis, the ERP waveforms were analyzed using an approach that combined multiple tests of scalp electrical fields (Lehmann and Skrandies 1980; Pourtois et al. 2005; Murray et al. 2008; Schulz et al. 2008), including nonparametric testing of global field power (GFP) and global dissimilarity (DISS). The neural source of the observed modulations was localized by standardized low-resolution electromagnetic tomography (sLORETA; Pascual-Marqui 2002). These hypothesis-blind approaches nicely complement conventional ERP analyses by reducing potential biases (Murray et al. 2008).
In summary, the primary focus of the present study is to investigate the neural underpinnings of cross-modal integration between emotional prosody and unconsciously presented facial expressions in the neurologically intact population. On the basis of the existing ERP studies on emotional information processing and visual consciousness, we hypothesized that the cross-modal integration takes place at the evaluative stage of emotion processing captured by long-latency ERP components as well as the perceptual stage in unisensory cortices. To examine this hypothesis, the modulation of several well-studied ERP components was examined by conventional analysis of ERP peak latency and amplitude. In addition to this analysis, the locus of cross-modal integration was investigated using hypothesis-blind analysis of the ERP scalp field to avoid the potential drawbacks of conventional analyses.
Ten males and five females (M = 21.5 ± 2.8 years old) participated in the present study after giving informed consent, in accordance with the Declaration of Helsinki. No participant had a history of mental illness or was on medication at the time of participation. All participants were right-handed, and all had normal or corrected-to-normal visual acuity. The participants' anxiety levels were measured with the Japanese translation of the State-Trait Anxiety Inventory (Spielberger et al. 1970; Cronbach's α = 0.85 for trait anxiety and 0.87 for state anxiety; Shimizu and Imae 1981). Trait and state anxiety levels ranged from 34 to 51, indicating that every participant had an anxiety level within ±1 SD of the distribution in the normative data (Shimizu and Imae 1981). Experimental procedures were approved by the institutional ethical committee of Nagasaki University.
Apparatus and Stimuli
The visual stimuli were generated on a 19-inch color monitor and viewed through colored glasses. An example of the visual stimuli is schematically depicted in Figure 1. In the visual display, a static picture of a facial expression was presented simultaneously with a dynamic nonsensical image comprising multiple patches, which alternated at a rate of 20 Hz. The facial expressions and nonsensical images were depicted in different color channels, that is, red and green. Thus, when seen through the colored glasses, each of these images was projected separately to one eye. In the experiment, the allocation of the colors was determined so that the image of the facial expression was projected onto the nondominant eye (Moradi et al. 2005). This mode of visual stimulus presentation induces CFS (Tsuchiya and Koch 2005; Tsuchiya et al. 2006; Yang et al. 2007), in which only the nonsensical dynamic images reach conscious awareness. The facial expressions and nonsensical images each subtended 5.5° in width and 5.1° in height.
The facial stimuli consisted of fearful, happy, and neutral expressions of 3 Japanese males and 3 females taken from the ATR DB99 database (ATR Promotions). The region covering the brows, eyes, nose, and mouth was cropped in a rectangular shape. Then, the contrast, luminance, and spatial frequency components were equated among the facial pictures using the SHINE toolbox (Willenbockel et al. 2010) to reduce the variation of low-level perceptual features. In contrast to some of the previous studies using CFS, in which the contrast of the to-be-invisible stimulus was increased gradually, the face picture was presented in full contrast from the onset of the visual stimulus.
The auditory stimuli were emotional prosody uttered by 3 males and 3 females taken from the Montreal Affective Voices database (Belin et al. 2008). The original recordings were edited to a length of about 630 ms (M = 632 ms, SD = 39 ms). The auditory stimuli were presented at about 60 dB by 2 loudspeakers flanking the monitor.
The visual and auditory stimuli were presented simultaneously, and the onsets of these stimuli were temporally aligned. The 6 speakers of emotional prosody and the 6 face models were paired randomly under the constraint that the speaker and face model were from the same gender. Every participant was presented with audiovisual stimuli of both male and female models. The stimulus set of speakers and face models each included 3 males and 3 females. Therefore, a total of 6 pairings (3 pairings for each gender) of speaker and face model were presented to each participant. The correspondence between the speaker and face model was changed across the participants, but was constant for each participant throughout the experiment.
The experiment comprised several distinct stages. First, ocular dominance was checked, and then the main experiment of electroencephalogram (EEG) recording started. After the completion of EEG recording, 3 behavioral experiments were conducted with the same participants to further check the validity of the experimental manipulation in the EEG recording. The behavioral experiments included the Direct Stimulus Detection Task, Cross-Modal Stimulus Evaluation, and Unimodal Stimulus Evaluation; these behavioral tasks were conducted in this fixed order for every participant.
Ocular Dominance Examination
After arriving at the laboratory, the dominant eye of each participant was examined. Specifically, a participant was instructed to hold a piece of paper with a small hole of 1 cm in diameter at the center of it about 30 cm away from his/her face, and an experimenter stood facing the participant with her index finger upright. Then, the participant adjusted the location of the paper so that the index finger of the experimenter could be seen through the hole binocularly. After the adjustment, the participant closed each eye independently and reported whether he/she could see the index finger with only one eye. The eye with which the participant could see the index finger monocularly was determined to be the dominant eye. According to this procedure, the left eye was determined to be the dominant one in 7 of the 15 participants.
After the ocular dominance examination, the participants put on the colored glasses and the EEG recording session started. During EEG recording, the participant was seated 110 cm from the screen in a dimly lit room. Each trial began with a 500 ms presentation of a fixation cross at the center of the screen, which was followed by the presentation of the stimulus. There was no intertrial interval, and the disappearance of the visual stimulus was immediately followed by the presentation of the fixation cross. The visual stimuli were presented for 1000 ms. Each of the Prosody (2) × Facial Expression (3) = 6 conditions was presented 96 times, yielding 576 experimental trials in total. The trials were divided into 8 experimental blocks.
To avoid contamination by oculomotor artifacts, each participant was asked to maintain fixation on the center of the screen during the trials and to blink only while the fixation cross was presented. They were instructed to report the emotional category of the auditory stimuli by pressing the "1" or "3" key as quickly as possible. The correspondence between emotional category and button was counterbalanced across participants: half the participants pressed 1 for the shout and 3 for laughter, and the other half pressed 3 for the shout and 1 for laughter. They were also instructed to press the 2 key if they consciously perceived even a fragment of the facial image. More precisely, we gave the following instruction to the participants. Note that we did not explicitly state that a facial stimulus was presented in every trial, so as to prevent the participants from intentionally trying to detect a facial image.
“Your task is to answer whether voice is shout or laughter by button-press. Please press 1 for shout/laughter and 3 for laughter/shout as fast and accurately as possible. In some trials, you may, or may not see a face-like pattern. If you think you saw even a tiny fragment of human face, please press 2 instead of 1 or 3.”
EEG signals were recorded from 64 electrode sites referenced to the vertex using a Geodesics system (Electrical Geodesics, Inc., OR, USA). The EEG signal was sampled at 1000 Hz, and the data were stored on a hard disk.
Direct Stimulus Detection Task
In order to ascertain whether the facial pictures were successfully rendered invisible, a direct stimulus detection task (Merikle et al. 2001; Snodgrass and Shevrin 2006) was administered right after the EEG recording. In the task, a visual image was presented to the nondominant eye simultaneously with dynamic nonsensical images presented to the dominant eye, as in the EEG recording. The visual image rendered invisible by CFS was either one of the facial images presented in the EEG recording or a spatial frequency-scrambled counterpart of it. The spatial frequency-scrambled image had the same power distribution of spatial frequency components as the facial expression, but the phase of each component was shifted randomly (Honey et al. 2008), resulting in a meaningless cloud-like pattern.
At the start of each trial, a fixation cross was presented for 500 ms at the center of the display. Thereafter, the visual stimuli were presented for 1000 ms. The participant's task was to answer by button-press, after the offset of the visual stimuli, whether a facial image had been presented. Each of the Image Type (Face–Scrambled) × Facial Expression (3) × Person (6) = 36 conditions was presented twice, yielding a total of 72 experimental trials.
Cross-Modal Stimulus Evaluation
The subjective feeling induced by the cross-modal stimulus was measured in a rating experiment. In each trial, one of the Emotional Prosody (3) × Facial Expression (3) × Person (6) = 54 types of audiovisual stimuli was presented for 1000 ms. The face stimuli were rendered invisible by CFS, just as in the EEG experiment. The participants wore the colored glasses throughout the cross-modal stimulus evaluation, and the color allocated to the face stimuli was the same as in the EEG recording, ensuring that the participants viewed the audiovisual stimuli under the same conditions as in the main experiment. Following this, participants were presented with two 9-point Likert scales for pleasantness and arousal ratings, and were instructed to rate their subjective feelings toward the emotional prosody. The order of stimulus presentation was determined pseudorandomly.
Unimodal Stimulus Evaluation
In order to ascertain whether the visual and auditory stimuli used in the EEG experiment adequately conveyed the intended emotions, stimulus evaluation was conducted separately for each perceptual modality. The order of the auditory and visual stimulus evaluations varied pseudorandomly across participants.
In the visual stimulus evaluation, each stimulus face was presented simultaneously with 2 track bars, one for arousal rating and the other for pleasantness rating. The participants were instructed to move the track bars so that the ratings reflected their feelings toward the presented stimulus and to click the button at the bottom of the monitor to register their evaluation. Clicking the button triggered the next trial. Each facial stimulus was presented only once during the stimulus evaluation period, and the presentation order was determined randomly. Note that, in contrast to the EEG experiment, the visual stimuli were presented consciously.
The stimulus display and procedure of the auditory stimulus evaluation were essentially the same as those of the visual stimulus evaluation, except that clicking the button at the bottom played the emotional prosodies presented in the EEG recording one at a time.
EEG Data Analysis
The EEG data were analyzed offline. The raw data were digitally filtered with a 0.2-Hz high-pass filter and a 30-Hz low-pass filter, re-referenced to the average reference, and then downsampled to 250 Hz. All EEG data were segmented into epochs ranging from 100 ms before to 800 ms after stimulus onset. The prestimulus window served as the baseline. Artifact rejection was performed automatically with a threshold of ±100 μV and checked visually afterward. Thereafter, the grand-averaged ERP waveforms were calculated. Only trials in which the participants correctly recognized the emotional valence of the prosody and did not report a conscious percept of the facial stimuli were incorporated into the calculation of the grand-averaged waveforms. The trials in which a participant reported having perceived a facial image amounted to less than 2% of all trials in each condition for every participant.
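The epoching, baseline correction, and amplitude-based artifact rejection described above can be sketched in NumPy. This is only an illustration of the logic, not the pipeline actually used; the function name, array shapes, and microvolt units are our assumptions.

```python
import numpy as np

def epoch_and_baseline(eeg, events, sfreq=250.0, tmin=-0.1, tmax=0.8,
                       reject_uv=100.0):
    """Cut continuous EEG (n_channels, n_samples, in microvolts) into
    epochs around each stimulus-onset sample, subtract the mean of the
    100-ms prestimulus window as baseline, and drop epochs whose
    absolute amplitude exceeds the rejection threshold."""
    pre = int(round(-tmin * sfreq))           # samples before onset (25)
    post = int(round(tmax * sfreq))           # samples after onset (200)
    kept = []
    for onset in events:
        ep = eeg[:, onset - pre:onset + post].astype(float)
        ep = ep - ep[:, :pre].mean(axis=1, keepdims=True)  # baseline
        if np.abs(ep).max() <= reject_uv:                  # artifact check
            kept.append(ep)
    return np.array(kept)          # (n_epochs, n_channels, n_times)
```

With the parameters above, each surviving epoch has 225 samples covering −100 to 796 ms at 250 Hz.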
Conventional ERP Analysis
On the basis of the previous ERP studies (Pourtois et al. 2000; de Gelder et al. 2002; Rossignol et al. 2005; Vlamings et al. 2009), 2 clusters of electrodes were defined in the frontocentral (E7, E16, E15, E14, E12 in the left and E54, E51, E53, E57, E60 in the right hemisphere) and occipitotemporal regions (E29, E30, E32, E27 in the left and E47, E44, E43, E45 in the right hemisphere) bilaterally. The locations of these clusters are presented in Figure 2.
In the frontocentral cluster, peak amplitude and latency analyses were conducted for N1, P2, and the late-latency negativity. N1 and P2 were defined as the peak activations in time windows from 120 to 160 and 160 to 260 ms after stimulus onset, respectively. The latency of each component was the time point at which the ERP reached its peak activation. The amplitude at the corresponding peak latency was computed as the peak amplitude. The peak latency of the late-latency negativity was the time at which the negativity reached its maximum between 300 and 700 ms after stimulus onset. Because the late-latency negativity spanned a wide latency range, the average amplitude within the 100-ms time window around the peak latency was entered into further analyses.
In the occipitotemporal cluster, P1 and N2 were defined as the peak activations from 120 to 160 and 160 to 260 ms after stimulus onset, respectively. The amplitude at the peak latency was entered into further analyses as the peak amplitude of these components. The late-latency positivity was defined as the peak positivity between 300 and 700 ms after stimulus onset. The peak amplitude was computed as the mean amplitude during the 100-ms latency window around the peak latency.
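The peak-picking rules described above can be sketched as follows; the helper names are hypothetical, and the actual analysis software is not specified in the text.

```python
import numpy as np

def peak_in_window(erp, times, t_min, t_max, polarity=+1):
    """Peak latency and amplitude of a component within a latency
    window. erp: cluster-averaged waveform (n_times,), times in ms;
    polarity=+1 for positive peaks (P1, P2), -1 for negative (N1, N2)."""
    idx = np.flatnonzero((times >= t_min) & (times <= t_max))
    peak = idx[np.argmax(polarity * erp[idx])]
    return times[peak], erp[peak]

def mean_around_peak(erp, times, peak_time, width=100.0):
    """Mean amplitude in a 100-ms window centered on the peak latency,
    as used here for the late-latency components."""
    mask = (times >= peak_time - width / 2) & (times <= peak_time + width / 2)
    return erp[mask].mean()
```

For example, N1 would be extracted with `peak_in_window(erp, times, 120, 160, polarity=-1)`, and the late-latency components with `peak_in_window` over 300–700 ms followed by `mean_around_peak`.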
The peak latencies and amplitudes were first entered into a 3-way analysis of variance (ANOVA) with the factors of Hemisphere (Left–Right), Prosody (Laughter–Shout), and Facial Expression (Fearful–Happy–Neutral). The degrees of freedom were corrected by the Huynh–Feldt procedure where necessary. Only corrected P-values are reported.
ERP Scalp Field Analysis
The effects of experimental manipulations on the ERP scalp field data were analyzed in a hypothesis-blind manner. The sources of ERP scalp field change can be broadly classified into 2 categories. One type of scalp field change does not involve topographic modulation of the scalp field map, but is characterized mainly by modulation of response strength. In order to evaluate the modulation of scalp field strength by the experimental factors, the effects of Prosody and Facial Expression on GFP (Lehmann and Skrandies 1980) at each time point were examined using statistical randomization tests (Koenig et al. 2011). GFP corresponds to the standard deviation of amplitudes across all electrode sites, and represents the strength of the response elicited by the stimulation, irrespective of topographic changes in the scalp field map.
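GFP, as defined above, has a simple closed form; a minimal NumPy sketch (the array shapes are illustrative):

```python
import numpy as np

def gfp(erp):
    """Global field power (Lehmann and Skrandies, 1980): the standard
    deviation of amplitudes across all electrodes at each time point,
    for an average-referenced ERP of shape (n_electrodes, n_times).
    Returns one nonnegative value per time point."""
    return erp.std(axis=0)
```

Note that doubling the response strength everywhere doubles GFP without changing the topography, which is exactly the kind of modulation this measure is meant to capture.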
The other type of ERP scalp field change corresponds to a topographic change of the scalp field map independent of the response strength represented by GFP. To evaluate this aspect of scalp field modulation, we tested the influence of the experimental factors on the normalized ERP scalp field data. The ERP data were first normalized by GFP at each time point. Then, the global dissimilarity (DISS; Lehmann and Skrandies 1980), the square root of the mean of the squared differences between the normalized scalp fields, was examined at each time point by statistical randomization tests.
Because GFP is blind to whether there is any topographic difference across conditions, the tests of GFP and DISS are considered complementary. It is generally accepted that topographic change independent of response strength modulation implies the activation of dissociable intracranial sources across conditions. In contrast, GFP modulation without any topographic change is considered to derive from differential activation of common neural regions (Murray et al. 2008). The randomization statistics were computed using the program library of the Ragu software (Koenig et al. 2011). The false-positive rate across the entire analysis window was corrected following the procedure of Koenig and Melie-García (2009). The level of significance was set to P < 0.05.
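Both scalp-field measures are simple to compute from multichannel ERP data. The following is a minimal numpy sketch (array shapes and function names are illustrative, not taken from the original analysis pipeline):

```python
import numpy as np

def gfp(erp):
    """Global field power: the standard deviation of the
    average-referenced amplitudes across electrodes at each time point.
    erp has shape (n_electrodes, n_timepoints)."""
    erp = erp - erp.mean(axis=0, keepdims=True)  # re-reference to the average
    return erp.std(axis=0)

def diss(erp_a, erp_b):
    """Global dissimilarity (DISS): the root mean square, across
    electrodes, of the difference between two GFP-normalized maps."""
    a = (erp_a - erp_a.mean(axis=0, keepdims=True)) / gfp(erp_a)
    b = (erp_b - erp_b.mean(axis=0, keepdims=True)) / gfp(erp_b)
    return np.sqrt(((a - b) ** 2).mean(axis=0))
```

Two identical maps yield DISS = 0 at every time point, and two maps of inverted polarity yield the maximum value of 2, regardless of their strength, which is what makes DISS a pure index of topography.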
When there was a significant modulation of the scalp field, the topographic distribution of the effect was localized on the scalp surface by the cluster-based permutation test implemented in FieldTrip (Oostenveld et al. 2011). In the first step of the cluster-based permutation method, the amplitudes in 2 focal conditions were compared by a dependent-samples t-test at each electrode site. Neighboring electrode sites at which a significant difference was obtained at the level of P < 0.05 were grouped into a single cluster, and the significance of the effect at each cluster was tested by the nonparametric permutation method (for more details, see Maris and Oostenveld 2007). In this way, the scalp distribution of the effect is localized to a cluster of electrode sites while controlling the familywise Type I error rate.
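The logic of this procedure can be illustrated with a stripped-down sketch for a paired design (the chain adjacency, the sign-flip scheme, and all names below are illustrative simplifications of the FieldTrip implementation, which additionally clusters over time and derives neighbours from sensor geometry):

```python
import numpy as np
from scipy import stats

def clusters(sig, adjacency):
    """Group significant electrode indices into connected clusters."""
    sig = set(np.flatnonzero(sig))
    out = []
    while sig:
        stack, comp = [sig.pop()], set()
        while stack:
            e = stack.pop()
            comp.add(e)
            for nb in adjacency[e]:
                if nb in sig:
                    sig.remove(nb)
                    stack.append(nb)
        out.append(sorted(comp))
    return out

def cluster_perm_test(diff, adjacency, n_perm=1000, alpha=0.05, seed=0):
    """diff: (n_subjects, n_electrodes) condition differences.
    Returns the largest observed cluster (electrode indices) and its
    sign-flip permutation p-value."""
    rng = np.random.default_rng(seed)
    n = diff.shape[0]
    crit = stats.t.ppf(1 - alpha / 2, n - 1)  # per-electrode threshold

    def max_cluster(d):
        t = d.mean(0) / (d.std(0, ddof=1) / np.sqrt(n))  # paired t per electrode
        cs = clusters(np.abs(t) > crit, adjacency)
        if not cs:
            return [], 0.0
        masses = [np.abs(t[c]).sum() for c in cs]  # cluster mass = summed |t|
        i = int(np.argmax(masses))
        return cs[i], masses[i]

    obs_cluster, obs_mass = max_cluster(diff)
    # Null distribution: randomly flip the sign of each subject's difference
    null = np.array([max_cluster(diff * rng.choice([-1, 1], (n, 1)))[1]
                     for _ in range(n_perm)])
    return obs_cluster, (null >= obs_mass).mean()
```

Because only the maximum cluster mass per permutation enters the null distribution, the resulting p-value is corrected for the multiple comparisons across electrodes.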
Source-Localization Analysis (sLORETA)
The neural locus responsible for the modulation of the ERP scalp field was localized by sLORETA (Pascual-Marqui 2002). sLORETA estimates the cerebral generator of electrophysiological responses using a distributed linear inverse solution. The inverse problem was solved by computing the smoothest distribution of current density by using a realistic head model (Fuchs et al. 2002) created on the basis of the MNI152 template with the 3D solution space restricted to cortical gray matter. The intracerebral volume was partitioned into 6239 voxels at a 5-mm spatial resolution.
Accuracy Rate and Reaction Time in EEG Recording
The interparticipant means of accuracy rate and reaction time (RT) in each condition are summarized in Table 1, together with their standard deviations. Accuracy rates and RTs were entered into an ANOVA with the factors of Prosody (2) and Facial Expression (3).
|RT (ms)||629.3 (114.9)||629.9 (116.5)||629.1 (117.8)||640.7 (117.9)||635.1 (122.3)||633.7 (125.8)|
|Accuracy rate (%)||93.8 (1.5)||93.2 (2.5)||92.9 (1.9)||92.4 (3.4)||92.0 (2.9)||92.7 (2.0)|
Standard deviations are given in parentheses; the first 3 columns correspond to the Laughter and the last 3 to the Shout Prosody conditions, each combined with one of the 3 Facial Expression conditions.
The ANOVA on the accuracy data revealed that the accuracy for the laughter was significantly higher than that for the fearful shout, F1,14 = 6.45, P < 0.05. Neither the main effect of Facial Expression, F2,28 = 0.55, P > 0.10, nor the interaction between Prosody and Facial Expression, F2,28 = 0.65, P > 0.10, reached significance. The ANOVA on RT data yielded no significant results, F's < 0.75, P's > 0.10.
Direct Stimulus Detection Task
In order to quantify the performance of direct detection of the face stimulus, a d-prime measure was calculated for each facial expression on the basis of the hit and false-alarm rates. The resultant d-primes were first compared with chance performance (d-prime = 0) using 2-tailed paired t-tests. The t-tests revealed that the participants could not detect the facial expression above chance level in any of the Facial Expression conditions, t's < 0.12, P's > 0.10. The d-prime data were entered into a 1-way ANOVA with the factor of Facial Expression (3). The ANOVA did not yield significant results, F2,28 = 0.08, P > 0.10.
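As a point of reference, the d-prime computation from hit and false-alarm counts can be sketched as follows (the loglinear correction shown is one common way of avoiding infinite z-scores when a rate is 0 or 1, and is an assumption, not necessarily the procedure used here):

```python
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Signal-detection sensitivity: z(hit rate) - z(false-alarm rate).
    Adding 0.5 to each cell (loglinear correction) keeps the inverse
    normal transform finite for extreme rates."""
    hr = (hits + 0.5) / (hits + misses + 1)
    far = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return norm.ppf(hr) - norm.ppf(far)
```

Chance performance, where the hit rate equals the false-alarm rate, yields d-prime = 0, which is the null value tested against above.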
Stimulus Evaluation Task
Cross-Modal Stimulus Evaluation
The interparticipant means of arousal and valence ratings for each type of audiovisual stimulus are summarized in Table 2.
|Pleasantness||4.8 (0.8)||4.9 (0.7)||4.8 (1.0)||2.5 (0.5)||2.8 (0.8)||2.7 (0.9)|
|Arousal||4.2 (0.7)||4.1 (0.8)||4.1 (0.9)||5.3 (1.0)||5.4 (0.8)||5.3 (0.9)|
Standard deviations are given in parentheses; the first 3 columns correspond to the Laughter and the last 3 to the Shout Prosody conditions, each combined with one of the 3 Facial Expression conditions.
The valence ratings for each type of stimulus were entered into a 2-way ANOVA with the factors of Prosody (2) and Facial Expression (3). The ANOVA revealed that the laughter was rated significantly more positively than the fearful shout, F1,14 = 47.90, P < 0.01. The main effect of Facial Expression also reached significance, F2,28 = 3.43, P < 0.05. Pairwise comparisons revealed that prosody combined with an unconscious fearful face was rated significantly more negatively than prosody combined with an unconscious happy face, t(28) = 2.62, P < 0.05. The other comparisons did not yield significant results, P's > 0.10. The interaction between Prosody and Facial Expression failed to reach significance, F2,28 = 0.51, P > 0.10.
The arousal ratings were entered into a 2-way ANOVA with the same factorial design as that described above. The ANOVA revealed that the shout was rated significantly more arousing than the laughter, F1,14 = 29.91, P < 0.01. Neither the main effect of Facial Expression, F2,28 = 0.41, P > 0.10, nor the interaction, F2,28 = 1.49, P > 0.10, reached significance.
Unimodal Stimulus Evaluation
The interparticipant means of arousal and valence ratings for each type of stimulus are summarized in Table 3.
| ||Pleasantness||Arousal|
|Fearful||3.4 (0.7)||6.9 (0.6)|
|Happy||6.1 (1.1)||5.3 (0.3)|
|Neutral||4.2 (0.59)||3.2 (0.7)|
|Laughter||4.4 (0.9)||2.4 (0.7)|
|Shout||3.1 (0.4)||3.7 (0.5)|
Standard deviations are given in parentheses.
The arousal ratings for the facial expressions were subjected to a 1-way ANOVA with the factor of Facial Expression (3). The main effect of Facial Expression reached significance, F2,28 = 167.3, P < 0.01. Pairwise comparisons revealed that the fearful expression was rated significantly more arousing than the happy expression, t(28) = 7.96, P < 0.01, and the neutral expression, t(28) = 18.25, P < 0.01. The happy expression was rated more arousing than the neutral expression, t(28) = 10.29, P < 0.01.
The ANOVA for valence rating for the facial expressions also yielded a significant main effect of Facial Expression, F2,28 = 81.68, P < 0.01. The fearful expression was rated significantly more negatively than the happy expression, t(28) = 12.45, P < 0.01, and the neutral expression, t(28) = 3.71, P < 0.01. The happy expression was rated more positively than the neutral expression, t(28) = 8.74, P < 0.01.
The arousal and valence ratings for the laughter and shout stimuli were compared using a paired t-test. The analysis revealed that the fearful shout was rated as significantly more arousing, t(15) = 10.62, P < 0.01, and more negatively than the laughter, t(15) = −6.95, P < 0.01.
Conventional ERP Peak Analysis
The grand-averaged waveforms in each condition in the frontocentral cluster are shown in Figure 3. The peak amplitudes and latencies were analyzed with a 3-way ANOVA using the factorial design described above.
The ANOVA failed to reveal any significant effects either in the analysis of N1 peak amplitude, F's < 1.42, P's > 0.10, or peak latency, F's < 2.45, P's > 0.05.
The interparticipant means of peak amplitude and latency of the P2 component are shown in Figure 4 together with standard errors.
The ANOVA on P2 peak amplitude revealed no significant findings, F's < 2.63, P's > 0.08.
The ANOVA on P2 peak latency revealed a significant interaction between Prosody and Facial Expression, F2,28 = 3.45, P < 0.05. No other main effects or interactions yielded significance, F's < 1.67, P's > 0.10.
In order to clarify the source of the significant interaction, simple main effect analyses were conducted. The analyses revealed a simple main effect of Facial Expression in the Laughter, F2,56 = 3.31, P < 0.05, but not in the Shout condition, F2,56 = 1.48, P > 0.10. Pairwise comparisons revealed that P2 peak latency to the laughter was significantly shorter when combined with the unconscious fearful than neutral expression, t(56) = 2.50, P < 0.05. No other pairwise comparisons reached significance, t's < 1.77, P's > 0.08. The simple main effect of Prosody did not reach significance either in the Fearful, F1,42 = 0.46, P > 0.10, or in the Happy Facial Expression condition, F1,42 = 0.04, P > 0.10.
The ANOVA on the peak amplitudes of the late-latency negativity revealed no significant main effects or interactions, F's < 2.62, P's > 0.05. The ANOVA on the peak latency of the late-latency negativity did not yield significant findings either, F's < 2.8, P's > 0.05.
The grand-averaged waveforms in each condition in the occipitotemporal cluster are shown in Figure 5. The peak amplitudes and latencies were analyzed with a 3-way ANOVA using the factorial design described above.
The interparticipant means of P1 amplitude and latency are shown in Figure 6a together with standard error.
In the analysis of P1 peak amplitude, no main effect or interaction reached significance, F's < 3.13, P's > 0.05.
The ANOVA on the P1 peak latency revealed a significant interaction between Prosody and Facial Expression, F2,28 = 3.53, P < 0.05. Further analyses revealed that the peak latency was significantly shorter to the shout than to the laughter in the Neutral Facial Expression condition, F1,42 = 4.64, P < 0.05. No such tendency was observed in the Fearful, F1,42 = 2.46, P > 0.10, or in the Happy Facial Expression condition, F1,42 < 0.01, P > 0.10. The simple main effect of Facial Expression did not reach significance either in the Shout, F2,56 = 2.64, P > 0.05, or in the Laughter condition, F2,56 = 1.07, P > 0.10.
The ANOVA on the N2 amplitude revealed a significant interaction between Hemisphere and Facial Expression, F2,28 = 3.65, P < 0.05. No other main effects or interactions reached significance, F's < 2.89, P's > 0.05.
In order to clarify the source of the interaction between Hemisphere and Facial Expression, simple main effect analyses were conducted. The analyses revealed no significant effects, F's < 2.72, P's > 0.05.
In the analysis of N2 latency, no main effects or interactions reached significance, F's < 2.54, P's > 0.10.
The ANOVA on the late-latency positivity amplitude revealed a significant interaction between Prosody and Facial Expression, F2,28 = 9.28, P < 0.01. No other main effects or interactions reached significance, F's < 3.16, P's > 0.05.
The interparticipant means of late-latency positivity amplitude and latency are shown in Figure 6b together with standard error. Further analyses revealed that the amplitude for the shout was significantly larger than that for the laughter in the Fearful Facial Expression condition, F1,42 = 11.66, P < 0.01. No such trend was observed for the happy, F1,42 = 1.75, P > 0.10, or the neutral facial expressions, F1,42 = 2.12, P > 0.10.
The simple main effect of Facial Expression reached significance in the Shout, F2,56 = 7.05, P < 0.01, but not in the Laughter Prosody condition, F2,56 = 2.11, P > 0.10. Pairwise comparisons revealed that the amplitude for the shout was significantly larger when combined with a fearful facial expression than with a happy, t(56) = 3.43, P < 0.01, or neutral expression, t(56) = 3.04, P < 0.01. The amplitude for the shout did not differ significantly between the Happy and Neutral Facial Expression conditions, t(56) = 0.39, P > 0.10. No other main effects or interactions reached significance, F's < 2.25, P's > 0.10.
For the latency of the late-latency positivity, the ANOVA revealed significant interactions between Hemisphere and Prosody, F1,14 = 7.18, P < 0.05, and between Hemisphere and Facial Expression, F2,28 = 4.86, P < 0.05. No other main effects or interactions reached significance, F's < 1.95, P's > 0.10.
In order to clarify the source of the significant interactions, simple main effect analyses were conducted. As for the interaction between Hemisphere and Prosody, the analyses revealed no significant simple main effects, F's < 2.5, P's > 0.10.
In the analysis of the interaction between Hemisphere and Facial Expression, the simple main effect of Facial Expression reached significance in the left, F2,56 = 4.59, P < 0.05, but not in the right hemisphere, F2,56 = 0.67, P > 0.10. Pairwise comparisons revealed that peak latency was significantly shorter in the Neutral Facial Expression condition than in the Fearful, t(56) = 2.35, P < 0.05, and Happy Facial Expression conditions, t(56) = 2.83, P < 0.05. There was no significant difference in the peak latency between the Fearful and the Happy Facial Expression conditions, t(56) = 0.48, P > 0.10. The simple main effect of Hemisphere did not reach significance for any of the facial expressions, F's < 2.14, P's > 0.10.
ERP Scalp Field Analysis
Statistical Randomization Tests of GFP and DISS
The effects of the experimental factors on GFP and DISS during 0–800 ms after stimulus onset were tested using the statistical randomization procedure (Koenig et al. 2011). For GFP, the interaction between Prosody and Facial Expression reached significance during 412–608 ms, while the main effects of Prosody and Facial Expression did not reach significance at any time point. The P-value of the interaction is plotted along the temporal axis in Figure 7a. For DISS, no main effects or interactions reached significance at any time point.
The interparticipant means of GFP during 412–608 ms in each condition are shown in Figure 7b together with standard errors. In order to further clarify the pattern of GFP modulation, the mean GFPs during this latency range in each condition were subjected to a 2-way ANOVA with factors of Prosody (2) and Facial Expression (3).
The ANOVA revealed a significant interaction between Prosody and Facial Expression, F2,28 = 4.53, P < 0.05. Neither the main effect of Prosody, F1,14 = 0.19, P > 0.10, nor Facial Expression, F2,28 = 0.01, P > 0.10, reached significance.
In order to elucidate the source of the interaction, simple main effect analyses were conducted. The analyses revealed that GFP for the shout was significantly larger than for the laughter when fearful expression was unconsciously presented, F1,42 = 7.39, P < 0.01. The simple main effect of Prosody did not reach significance either in the Happy, F1,42 = 0.15, P > 0.10, or the Neutral Facial Expression condition, F1,42 = 2.85, P > 0.10. The simple main effect of Facial Expression failed to reach significance in the Laughter, F2,56 = 2.63, P > 0.05, and in the Shout Prosody condition, F2,56 = 2.41, P > 0.10.
GFP for the fearful shout was larger than for the laughter when a fearful expression was presented unconsciously. The scalp distribution of this effect was localized by nonparametric testing. First, the averaged amplitudes during 412–608 ms were computed for the shout and the laughter in the Fearful Facial Expression condition at each electrode site. The electrode cluster where the mean amplitudes differed significantly between the shout and the laughter was determined by the cluster-based permutation test (Maris and Oostenveld 2007). The topography of t-values is depicted in Figure 7c with significant electrode sites marked by filled black circles. As can be seen from the figure, in comparison with the laughter, the fearful shout increased the amplitude at the left occipital electrode cluster.
The analyses above revealed a significant difference in GFP during 412–608 ms between the laughter and the fearful shout when combined with unconsciously presented fearful expressions. The neural generator of this effect was estimated using sLORETA by determining the voxels that were activated differentially by the shout and the laughter in the Fearful Facial Expression condition. sLORETA localized a significant voxel within the posterior region of the left fusiform gyrus (BA19; MNI coordinates X = −40, Y = −70, Z = −20), as shown in Figure 7d.
The present study investigated the locus of cross-modal integration between emotional prosody and unconsciously presented facial expressions in neurologically intact participants using ERP measures. The behavioral results showed that both the visual and the auditory stimuli induced the intended emotions, that is, higher arousal and more negative valence for the fear-related stimuli, and that the facial expressions were successfully rendered invisible throughout the experiment. Despite the participants' inability to consciously report the existence of the facial stimuli, the unconsciously presented fearful face still influenced the pleasantness ratings of emotional prosody, which ensures that the observed effects of facial expression are unconscious phenomena. The ERP analysis revealed that elicitation of the P1 component to the unconscious neutral expression was accelerated by the shout compared with the laughter. In addition, the latency of P2 in response to the laughter was shorter when fearful expressions were presented than when neutral faces were presented. Conventional ERP analyses also revealed a larger amplitude of the late-latency positivity at occipitotemporal electrodes for the fearful shout than for the laughter when combined with the unconscious presentation of fearful expressions. The results of the ERP scalp field analyses indicated that the interaction during the late-latency range derives not from topographic change but from modulation of scalp field strength as reflected in GFP. This modulatory effect was localized to the left posterior fusiform gyrus.
Conventional analysis of ERP peak latency revealed that the latency of early components, that is, P1 in the occipitotemporal cluster and P2 in the frontocentral cluster, was modulated interactively by facial expression and emotional prosody. These results concur with previous studies indicating audiovisual integration of emotional signals in an early latency range (de Gelder et al. 1999, 2002; Pourtois et al. 2000). At the same time, the details of the interactions in the present study differed from those in previous studies. Previous studies have found an effect of congruency of emotion category (angry–sad or fearful–happy) on the auditory N1 amplitude. In contrast to this, the present study did not find equivalent effects with our pairings. One potential explanation for the discrepancies is that neural correlates underlying cross-modal integration differ according to the emotional category of stimuli, although the present study cannot be compared directly with the previous ones due to a number of procedural differences. One might find it contradictory that a hypothesis-blind analysis did not find significant modulation of the ERP scalp field during the early latency range. However, the analyses of both GFP and DISS deal mainly with the spatial modulation of ERP responses at each temporal point, and as such are relatively insensitive to temporal jittering of peak activation. Together with the observation that the amplitudes of the early components were not susceptible to cross-modal influences, these results seem to indicate that the unconscious face modulates processing efficiency rather than activation strength in the early latency range.
The modulation patterns of the P1 and P2 latencies observed in the present study indicate that the combination of laughter and a neutral face delays the latency compared with combinations including either a fearful face or a fearful voice. This pattern of latency modulation was initially unexpected, so its underlying cause is not entirely clear at this point. However, we offer some tentative explanations for these findings.
The P1 component, which is considered to be among the earliest visual components, is suggested to be elicited more quickly by attention-grabbing stimuli (for a review, see Taylor 2002). On this basis, the P1 modulation seems to indicate that an unconsciously presented neutral face captures visuospatial attention more efficiently when combined with the fearful shout than with the laughter. A neutral facial expression is inherently ambiguous, but several studies have indicated that a neutral face is perceived as slightly dominating and threatening (Shah and Lewis 2003; Doi et al. 2010). Together with the proposition that exposure to threat information such as a fearful shout makes the organism more vigilant to potential threat in the environment (LeDoux 2003; Phelps and LeDoux 2005; Tamietto and de Gelder 2010), the present finding might indicate that more attentional resources were allocated to a potentially threatening neutral expression when a fearful shout was presented simultaneously. A fearful face may be so attention-grabbing in itself (Holmes et al. 2005; Pourtois et al. 2005, 2006) that the effect of the emotional prosody was less prominent for a fearful than for a neutral expression.
Following the P1 modulation, the elicitation of P2 in response to the laughter was accelerated when combined with a fearful face compared with a neutral face. Importantly, there was no discernible difference between the P2 latency to the laughter–happy combination and that to the laughter–neutral combination, indicating that the modulation pattern of P2 latency is not simply linked to emotional congruency between face and voice. One potential interpretation is that the unconscious presentation of a fearful face facilitated the processing of laughter. This proposition runs counter to the claim that the presentation of affectively congruent facial stimuli promotes the processing of voice in the neurologically intact population (Collignon et al. 2008; Föcker et al. 2011). However, the majority of these previous findings concern the cross-modal integration of auditory and visual stimuli that were both presented above the threshold of conscious awareness. Therefore, it remains to be examined more closely whether the processing of positive stimuli can be facilitated by the unconscious presentation of threat information.
GFP analysis revealed an interaction between emotional prosody and unconsciously presented facial expression 412–608 ms after stimulus onset. Together with the observation that the DISS statistical randomization test revealed no prominent change in scalp field topography, it can be said that the observed modulation of the ERP scalp field during the late-latency range largely stems from a quantitative change of activation in neural generators common across conditions. Further analysis revealed larger GFP for the fearful shout than for the laughter when combined with the unconscious presentation of a fearful expression. This finding indicates that the unconscious presentation of a fearful facial expression modulates the neural activations underlying emotional prosody. The effect of facial expression on affective prosody processing has been reported in a number of previous studies on the neural underpinnings of audiovisual integration (Pourtois et al. 2000; Dolan et al. 2001; Klasen et al. 2011; Müller et al. 2011, 2012). However, this is, to the best of our knowledge, the first study to clarify the neural activation underlying the integration of emotional prosody and unconscious facial expression in a neurologically intact population.
Previous electrophysiological and neuroimaging studies have identified multiple functional regions, including the posterior superior temporal sulcus, the insula, and the amygdala, as centers of audiovisual integration (for a review, see Klasen et al. 2012). Furthermore, valence incongruity across modalities produces larger perceptual and cognitive loads, leading to the additional recruitment of prefrontal conflict-resolution mechanisms such as the inferior frontal gyrus and the anterior cingulate cortex (Pourtois et al. 2002; Klasen et al. 2011). In contrast to these findings, the present study localized the cross-modal integration to the left posterior fusiform gyrus. The centroid of this activation was located within the occipital object-related area (Levy et al. 2001), posterior to the functionally localized fusiform face area (Grill-Spector et al. 2004). However, owing to the spatial smoothing inherent in sLORETA, it is conceivable that the activation of this voxel reflects activation of the face-selective area to some extent. If this is the case, the present finding might give tentative support to the view that an unattended or unconsciously perceived fearful face can modulate the activation of the face-selective area of the fusiform gyrus (Vuilleumier and Schwartz 2001; Jiang and He 2006; Morris et al. 2007). Whether the neural response to a fearful face in this region is influenced by spatial attention remains a topic of intense controversy (Pessoa et al. 2002). However, Vuilleumier and Schwartz (2001) raised the possibility that emotional faces can enhance the activation of temporal face-selective areas without awareness, on the basis of their finding that an emotional face presented in the extinct hemifield can capture visuospatial attention.
Furthermore, recent neuroimaging studies have shown that the fusiform gyrus sustains measurable activations to a fearful face when emotion recognition is not the task at hand (Winston et al. 2003), or even when the face is rendered invisible (Liddell et al. 2005; Jiang and He 2006).
The ventral occipital region, as localized in the present study, has long been considered to be a unisensory region recruited specifically in visual processing (Malach et al. 1995; Levy et al. 2001). However, an increasing number of studies have indicated that the temporo-occipital region extending from the primary visual cortex to the inferior temporal gyrus receives input from multiple perceptual modalities, and potentially serves as the locus of cross-modal integration. As for the integration between auditory and visual information, Lewis and Noppeney (2010) have shown that temporal synchrony between auditory and visual signals increases the activation of the lateral occipital region bilaterally. Likewise, Doehrmann et al. (2010) revealed that the activation of the left occipital region during audiovisual stimuli processing was enhanced by the repetition of auditory information, indicating the possibility that audiovisual integration occurs at the “unisensory” cortical regions.
The present findings indicate that the visual cortex, which has traditionally been considered unisensory, is recruited in the cross-modal integration of unconscious facial expression and affective prosody. However, the specific nature of the processing undertaken within the fusiform gyrus has yet to be clarified. On this point, we propose several potential models for the neural mechanism mediating the multisensory influence on fusiform gyrus activation. The first model postulates that the modulation of fusiform gyrus activation reflects the influence of a feedback loop from higher order multisensory cortices (Driver and Noesselt 2008). To be more specific, the auditory and visual information are first integrated in higher order multimodal association cortices, and the output from these cortices in turn influences the processing in the fusiform gyrus. This possibility sits well with previous neuroimaging research showing modulation of multimodal convergence zones, such as the superior temporal sulcus, by the affective congruity between face and voice (Robins et al. 2009; Müller et al. 2012). At the same time, some aspects of the data cannot readily be explained within this framework. First, we did not find differential activation of cortical regions outside the fusiform gyrus. In addition, previous studies on the neural mechanisms sustaining visual consciousness have shown that visual information below the conscious threshold induces only small activations, if any, outside the occipital and posterior temporal regions (Del Cul et al. 2007). Given this, it is quite conceivable that the multimodal cortices did not receive robust signals of facial expression in the present study.
The second, and seemingly more plausible, model for the mechanism underlying the cross-modal influence on fusiform gyrus activation is a direct corticocortical connection between the auditory cortex and the fusiform gyrus. According to this model, auditory information is relayed from the auditory cortex directly to the fusiform gyrus after the initial stages of auditory processing have been completed. The output from auditory processing then influences the processing of the unconscious facial information within the fusiform gyrus. In support of this model, several recent studies have demonstrated direct interactions between the visual and auditory cortices (Noesselt et al. 2007; Cappe et al. 2010; Lewis and Noppeney 2010). Of particular relevance to the present study, Blank et al. (2011) revealed a direct anatomical connection between the voice-sensitive region and the fusiform gyrus.
If it is truly the case that a direct connection between the fusiform gyrus and the auditory cortex mediates the cross-modal integration between emotional prosody and unconscious facial expression, what functional significance can such a mechanism possibly confer? One possibility is that the auditory signal helps the fusiform gyrus interpret the weak signal of a facial image presented below conscious awareness. Albeit speculative, this hypothesis has some resemblance to “the inverse-effectiveness principle” (Collignon et al. 2008; Kayser and Logothetis 2007) implying that one of the primary functions of cross-modal integration is to constrain the interpretation of weak sensory signal by the concomitant information from the other sensory modalities.
The cross-modal interaction in the left posterior fusiform gyrus was observed more prominently for fearful than for the other expressions. The present results also indicated the possibility that the unconscious presentation of a fearful face increased processing efficiency in the early latency range. In addition, the behavioral results showed that the unconscious presentation of fearful expressions lowered the pleasantness ratings of emotional prosody, in spite of the participants' inability to detect the presence of the facial stimuli in the direct stimulus detection task. Together, these findings seem to bolster the view that the neural system is hypersensitive to threat-related information (Öhman et al. 2001; Carretié et al. 2005; Holmes et al. 2005; Pourtois et al. 2006) in a manner that operates independently of conscious awareness and attention allocation. It is well established that the human visual system can detect fearful expressions even when they are presented unconsciously (for a review, see Tamietto and de Gelder 2010). The exact mechanism underlying such sensitivity to unconscious fearful expression remains a matter of ongoing controversy. However, a number of existing studies have indicated that threat information below conscious awareness is relayed from the superior colliculus to the amygdala via the thalamus (LeDoux 2003; Liddell et al. 2005; Phelps and LeDoux 2005; Tamietto and de Gelder 2010), thereby bypassing the pathway from the lateral geniculate nucleus to the striate cortex that is the primary circuit for processing consciously perceived visual stimuli. The amygdala is considered to play a pivotal role in making the organism vigilant to potential threat in the environment by sending excitatory projections to the sensory cortices (LeDoux 2003; Phelps and LeDoux 2005).
On the basis of these findings, it is conceivable that the putative projection from the limbic system to the fusiform gyrus underlies the prominent modulation of ERPs by the unconscious presentation of fearful expressions compared with happy or neutral expressions (LeDoux 2003; Phelps and LeDoux 2005; Tamietto and de Gelder 2010).
Although several studies have indicated the possibility that the amygdala detects pleasant and appetitive information (Killgore and Yurgelun-Todd 2004; Williams et al. 2004), the evidence of happy face detection in the amygdala is not as robust as the well-established sensitivity of the amygdala to fearful expressions. One might find it contradictory that we did not find significant activation of the amygdala in sLORETA, but we tentatively attribute this to the low spatial resolution of the present source-localization analysis. To be more specific, the activations of neighboring voxels are smoothed out in sLORETA, which might have made it difficult, if not impossible, to delineate activation in a small subnucleus of the amygdala. In addition, we did not register the electrode coordinates for each participant, which might have introduced some variance across participants in scalp electrode locations relative to brain anatomy. To examine the validity of this hypothesis, future studies should conduct an ROI analysis of amygdala activation measured by fMRI concomitantly with ERP recording. With regard to the potential contribution of the amygdala to the present findings, some previous studies have linked amygdala activation specifically to the arousal level irrespective of the valence dimension (Anderson et al. 2003; Britton et al. 2006; Sabatinelli et al. 2007). Together with the present behavioral finding that both the shout and the fearful face were rated higher in arousal than their happy and neutral counterparts, these studies raise the possibility that the high arousal level elicited by threat information, rather than its threatening quality per se, resulted in the prominent influence of fear-related information on ERPs in the present study.
However, the modulation pattern of the ERPs was not entirely determined by arousal level; for example, the ERPs were modulated interactively by prosody and unconscious facial expression, whereas no analogous interaction was observed in the arousal ratings of the audiovisual stimuli. On this basis, we do not think that the prominent effects of fear-related information in the present study can be explained solely by the modulation of arousal state.
In a closely related study, de Gelder et al. (2002) reported that the auditory N1 component elicited by an emotional voice was enlarged by the simultaneous presentation of an affectively congruent facial expression to the blind hemifield, indicating that the initial stage of audiovisual integration takes place in the auditory cortex in blindsight patients. In contrast, the present pattern of results suggests that the main locus of interaction between emotional prosody and unconscious facial expression lies in cortical regions traditionally linked to visual processing. Although there was some indication of auditory P2 modulation in the present study as well, this effect does not seem to reflect the detection of emotion congruency, in contrast to de Gelder et al. (2002). The discrepancy between de Gelder et al. (2002) and the present study might have stemmed partly from cortical reorganization after damage to the striate and extrastriate cortices in blindsight patients (Tamietto and de Gelder 2010). Considering the plasticity of the adult neural system (Kujala et al. 1997; Voss et al. 2006; Bridge et al. 2008), it is quite possible that the loss of the visual cortices is compensated for by other cortical regions, which might have changed the topography of the ERP scalp field during cross-modal integration in the participants of de Gelder et al. (2002). At the same time, other procedural differences might also have played critical roles in yielding the discrepancies across studies. For example, the facial stimuli were rendered invisible by CFS in the present study, which might have blocked conscious access to the facial image at a different processing stage than in de Gelder et al. (2002). Likewise, in contrast to de Gelder et al.
(2002), the distribution of spatial frequency power, which has been identified as a potential cue for face detection (Honey et al. 2008), was controlled across conditions in the present study. Another important factor is the onset timing of the visual and auditory stimuli. In the present study, the onsets of the unconscious face and the emotional prosody were temporally aligned, whereas de Gelder et al. (2002) presented the emotional prosody 900 ms after the onset of the facial expression. As indicated by Stekelenburg and Vroomen (2007), preceding visual information exerts a prominent influence on the process of audiovisual integration. Importantly, several previous ERP studies on the cross-modal integration of affective information (Pourtois et al. 2000; Jessen and Kotz 2011) also presented the visual information first, followed by an affective voice. Given these considerations, the discrepancies between the present study and several previous ERP studies, especially in the early latency ranges, might be at least partly explained by differences in the onset timing of the audiovisual stimuli. Elucidating the effects of these variables would clarify further details of the cross-modal integration between emotional prosody and unconscious facial expression.
The present study investigated the neural mechanism underlying the cross-modal integration between emotional prosody and unconsciously presented facial expression in neurologically intact participants using ERP measurement. The results revealed signs of cross-modal interaction at multiple latency ranges, indicating that audiovisual integration of emotional signals takes place automatically without conscious awareness. Unconscious presentation of fearful expressions exerted more prominent influences on emotional prosody recognition than happy expressions during the late-latency range, and the generator of this effect was localized to the left posterior fusiform gyrus. This finding supports the view that cortical regions traditionally considered unisensory regions for the visual modality can function as the locus of cross-modal integration of emotional signals.
This work was supported by JSPS KAKENHI 24791218 to H.D.
Conflict of Interest: None declared.