Speech production, both overt and covert, down-regulates the activation of auditory cortex. This is thought to be due to forward prediction of the sensory consequences of speech, contributing to a feedback control mechanism for speech production. Critically, however, these regulatory effects should be specific to speech content to enable accurate speech monitoring. To determine the extent to which such forward prediction is content-specific, we recorded the brain's neuromagnetic responses to heard multisyllabic pseudowords during covert rehearsal in working memory, contrasted with a control task. The cortical auditory processing of target syllables was significantly suppressed during rehearsal compared with control, but only when they matched the rehearsed items. This critical specificity to speech content enables accurate speech monitoring by forward prediction, as proposed by current models of speech production. The one-to-one phonological motor-to-auditory mappings also appear to serve the maintenance of information in phonological working memory. Further findings of right-hemispheric suppression in the case of whole-item matches and left-hemispheric enhancement for last-syllable mismatches suggest that speech production is monitored by 2 auditory-motor circuits operating on different timescales: Finer grain in the left versus coarser grain in the right hemisphere. Taken together, our findings provide hemisphere-specific evidence of the interface between inner and heard speech.
Speech perception is known to be modulated by speech production. For example, the identification of heard syllables is improved by production of concordant syllables (Sams et al. 2005). Speech production in the form of overt rehearsal has also been shown to facilitate the learning of phonological sequences, such as new words in a second language (Ellis and Beaton 1993). The effect of overt speech production on the identification of speech signals is proposed to be due to a change of activity in the auditory cortex as a result of forward prediction of the acoustic consequences of speech: A predictive efference copy (cf. Sperry 1950; von Holst and Mittelstaedt 1950) of speech information is thought to be sent from inferior frontal articulatory cortex to inferior parietal and further to auditory cortex (e.g., Houde et al. 2002). According to recent models (Guenther 1995; Guenther et al. 2006; Tian and Poeppel 2010; Hickok et al. 2011; Houde and Nagarajan 2011; Price et al. 2011; Tourville and Guenther 2011; Hickok 2012; Tian and Poeppel 2013), forward prediction plays an important role in the control of overt speech production: It is thought to be used for internal monitoring of speech output and providing feedback on its match (or mismatch) with the expected auditory signal (see Levelt 1989 for prearticulatory monitoring in psycholinguistic models). As a result of this monitoring, speech production can be corrected to match the sensory goal (Niziolek et al. 2013).
Remarkably, however, effects similar to those of overt speech have been observed even when speech is not overtly produced: Identification of speech is significantly improved even by covert articulatory rehearsal of concordant phonological sequences (Sams et al. 2005). As with overt speech, inner speech without articulation can change the perception of ambiguous external stimuli (Scott et al. 2013). In addition, hearing, imagined hearing, and imagined articulation of speech have been shown to have similar effects on brain activation (Tian and Poeppel 2010). These studies suggest partly overlapping mechanisms for overt and covert speech (for a review, see Tian and Poeppel 2012). Based on their similar effects on behavioral and brain responses, covert or inner nonarticulated speech has also been suggested to induce an efference copy or a corollary discharge (e.g., Sams et al. 2005; Tian and Poeppel 2010; Scott 2013; Scott et al. 2013). This allows forward mapping from inner speech, that is, the prediction of its sensory consequences, even though in this case motor commands are not executed and there are no acoustic consequences. Consequently, the corollary discharge has been proposed to provide the sensory content of inner speech, the “inner voice” (Scott 2013).
Forward prediction is likely to account for the known phenomenon of suppressed or delayed auditory cortex activity during overt or covert speech (e.g., Numminen and Curio 1999; Kauramäki et al. 2010; Tian and Poeppel 2010, 2013). Most often, this effect is manifest as a modification of the so-called N1 response, the most prominent component of auditory event-related potentials or corresponding magnetic fields (ERPs or ERFs), when probed with electro- or magnetoencephalography (EEG or MEG), respectively. This suppression effect has been observed for self-uttered, overtly produced vowels or syllables (Curio et al. 2000; Houde et al. 2002; Heinks-Maldonado et al. 2006; Aliu et al. 2009; Ventura et al. 2009; Chang et al. 2013; Niziolek et al. 2013), covert vowel production (Numminen and Curio 1999; Kauramäki et al. 2010), audiovisual speech (Klucharev et al. 2003), and lipreading (Jääskeläinen et al. 2004; Kauramäki et al. 2010). N1 suppression caused by covert articulation has even been observed for low-level auditory processing such as that of pure tones (Kauramäki et al. 2010), raising the possibility that the effect is not necessarily speech-specific. However, other studies have suggested that rather than being a nonspecific effect, N1 suppression is modulated by expectations, because altered auditory feedback reduces suppression during overt speech production (Houde et al. 2002; Heinks-Maldonado et al. 2006). Recently, feedback control of pitch was shown to suppress auditory responses to expected feedback, but to enhance responses to unexpected, altered feedback during vowel production (Chang et al. 2013). The question arises whether a similar feedback control mechanism is used for monitoring speech sounds and their combinations representing categorical units of language, namely, phonemes, syllables, and words. 
Both overt and covert speech improve the behavioral identification of concordant syllables but impair the perception of discordant syllables (Sams et al. 2005). Forward prediction triggered by either form of speech can therefore be expected to down-regulate the activation of auditory cortex to a different degree depending on whether auditory stimuli are phonologically concordant or discordant with the produced speech. Specifically, suppression effects should be stronger for auditory stimuli that match overt or covert speech production than for those that do not. Moreover, if recent models of speech production (Guenther 1995; Guenther et al. 2006; Tian and Poeppel 2010; Hickok et al. 2011; Houde and Nagarajan 2011; Price et al. 2011; Tourville and Guenther 2011; Hickok 2012) are correct in suggesting that forward prediction is used for internal monitoring and feedback in speech production, it should be able to represent even the smallest units of speech that can change word meanings; that is, it should be phoneme- or syllable-specific. Otherwise, such a monitoring/feedback mechanism cannot be efficiently used for correcting speech errors that involve only one phoneme yet can change the meaning of a word or make it incomprehensible. Suppression effects selective to stimuli matching overt or covert speech production would be strong evidence for an efficient predictive mechanism controlling one of the core aspects of speech production: the phonological code underlying articulation. Scott et al. (2013) have shown behavioral effects of inner speech at a subphonemic level, affecting the categorization of phonemically ambiguous stimuli. Niziolek et al. (2013) found efferent-driven suppression to be modulated by the prototypicality of spoken vowels in single syllables.
However, so far the experimental evidence for the control of the phonological code has been indirect or paradigms used have tapped later processing stages than N1 (Tian and Poeppel 2013). To our knowledge, longer phonological items have not been studied at all.
The first goal of the present study was to investigate whether suppression of auditory cortex activity caused by forward prediction is specific to the phonological content of speech, here covert speech. For this purpose, we measured neural spatio-temporal patterns with MEG during 2 working memory tasks with simultaneous auditory and visual stimulation (see Fig. 1). To induce suppression of brain responses to auditory stimuli, a rehearsal task involved covert mental rehearsal of spoken pseudowords each time an external stimulus was presented. To control for the processing of the same auditory stimuli in the absence of pseudoword rehearsal, a control task involved counting the number of visually presented symbols. When an auditory speech stimulus is covertly rehearsed in phonological working memory, the acoustic-phonetic code is thought to be converted into a phonological code that is used for speech production and is then further cycled between input and output buffers putatively located in temporal speech perception areas and frontal speech production areas, respectively (Jacquemot and Scott 2006). In the present study, covert rehearsal was assumed to project repeated forward prediction signals to the auditory cortex. The activation level of phonological representations was probed by presenting auditory stimuli that were either concordant or discordant with the covertly rehearsed tokens. We expected that, compared with the absence of pseudoword rehearsal, the presentation during rehearsal of an auditory pseudoword phonologically concordant with the simultaneously rehearsed pseudoword would cause suppression of auditory responses, most notably N1m (the neuromagnetic counterpart of N1), which is the earliest auditory response reported to show forward prediction effects. [Although the effect is not necessarily restricted to N1m, this is the most prominent auditory response and a hallmark of cortical sound processing.
Having a well-defined peak, N1m enables direct compatibility with previous reports demonstrating suppression effects at this latency.] In contrast to the expected suppression effect, if a stimulus phonologically discordant with the rehearsed pseudoword was presented during rehearsal, no N1m suppression but an enhancement would be expected (Chang et al. 2013). This pattern of results would suggest that N1m suppression and, consequently, forward prediction, are phoneme- or syllable-specific, enabling monitoring of speech production. Alternatively, if stimuli concordant and discordant with the rehearsed item are similarly suppressed in line with a previous finding for nonspeech sounds of different frequencies (Kauramäki et al. 2010), the result would suggest that forward prediction is not sufficiently specific to be utilized for accurate monitoring of speech production.
Previous studies on N1m suppression as a result of covert speech used isolated vowels or syllables as auditory stimuli and covert speech production targets (Numminen and Curio 1999; Kauramäki et al. 2010). However, given that N1m consists of both onset-related and stimulus-specific subcomponents (Mäkelä 2007), a considerable part of the N1m elicited by isolated vowels may consist of an onset-related N1m subcomponent that is not stimulus-specific, whereas the contribution of a stimulus-specific subcomponent is smaller. Longer stimuli with transitions within them eliciting N1m were expected to better reveal stimulus-specific effects. This is because the contribution of an onset-related subcomponent to an N1m elicited by an acoustically salient stimulus-internal transition is likely to be smaller, and that of a stimulus-specific N1m subcomponent larger, compared with N1m elicited by stimulus onsets (Pardo et al. 1999). Therefore, more complex speech stimuli, namely, tri-syllabic pseudowords, were used. The use of tri-syllabic pseudowords was also motivated by the second goal of this study. By concentrating on N1m suppression effects to the final, third syllable of the pseudowords, we aimed to determine whether forward prediction affects speech perception at the fine-grain (phoneme or syllable) level as opposed to the level of the full stimulus (potential words), and whether possible effects show differences in lateralization. Fine-grain predictive coding would be supported by results indicating that the suppression of N1m elicited by the third syllable is solely determined by the concordance of the critical third syllable with the content of rehearsal, whereas the concordance of the pseudoword beginning should not play a role. Based on the proposition of asymmetric temporal sensitivity of human auditory cortices (Schwartz and Tallal 1980; Zatorre et al. 2002; Poeppel 2003), such fine-grain coding could be hypothesized to be left-lateralized. 
Full-stimulus predictive coding, in turn, would be supported by results indicating that the suppression of N1m elicited by the third syllable is determined by information accumulating over the whole item. In this case, suppression would result from both the beginning of the stimulus and the critical third syllable being concordant with covert speech production. Such full-stimulus coding could be expected to be right-lateralized, because the right hemisphere has been suggested to prefer longer sounds (Boemio et al. 2005; McGettigan and Scott 2012).
The experiment was approved by the Research Ethics Committee of the Helsinki University Central Hospital. All participants signed a written informed consent before the experiment.
A total of 24 right-handed native speakers of Finnish (11 males and 13 females, mean age 23 years, age range 19–33 years) with normal hearing participated in the MEG recording. None of the participants reported neurological problems or speech- or language-related dysfunctions.
The auditory stimulus material included 30 different pseudowords (see Supplementary Material for details), each having 2 spoken variants (60 stimuli in total). The pseudowords had a 3-syllable CVCVCCV structure (C = consonant, V = vowel), where CC was always a long (i.e., geminate) stop consonant, thus having an extended silent closure time before the final syllable (see Fig. 1). The pseudowords complied with the phonotactic rules of Finnish but had no meaning and were thus unfamiliar to the participants, which served to minimize interactions with any semantic processes that could influence the response. The stimuli were produced at a normal speaking rate and with neutral prosody by a female native speaker of Finnish, and digitally recorded in an acoustically shielded room. The most prototypical experimental stimuli were chosen from several tokens on the basis of judgments of 3 naïve native speakers of Finnish. The selected pseudowords were further modified with the Praat software (Boersma and Weenink 2008) as follows: The intensity of the stimuli was scaled to 90% of maximum, and the durations of syllables within the stimuli were equalized while preserving their typical ratio (first and second syllable together 260 ms; the silent phase of the geminate stop 220 ms; third syllable following the silent phase 120 ms; 600 ms in total). The natural word-stress pattern of the pseudowords was preserved to ensure the correct segmentation of the pseudowords in the stimulus stream. In Finnish, primary word stress always falls on the first syllable of a word. It is cued by higher fundamental frequency (F0) and higher intensity when compared with unstressed syllables. This applies also to the present stimuli, where the F0 of the first syllable was on average 54 Hz higher than that of the third syllable (239 vs. 185 Hz, respectively), and the intensity maximum of the first syllable was on average 5 dB higher than that of the third syllable (85 vs. 80 dB, respectively).
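For concreteness, the duration and intensity normalization described above can be sketched in Python (a minimal numpy illustration of the logic only, not the authors' actual Praat pipeline; the 44.1-kHz sampling rate is an assumption, whereas the segment durations and the 90%-of-maximum peak scaling are taken from the text):

```python
import numpy as np

FS = 44_100  # assumed sampling rate (Hz); not stated in the text

# Target segment durations from the text (ms): syllables 1-2 together,
# the silent closure of the geminate stop, and the final syllable.
SEGMENTS_MS = {"syll_1_2": 260, "closure": 220, "syll_3": 120}

def scale_peak(wave: np.ndarray, fraction: float = 0.9) -> np.ndarray:
    """Scale the waveform so its absolute peak is `fraction` of full scale,
    mirroring the 'intensity scaled to 90% of maximum' step."""
    peak = np.abs(wave).max()
    return wave * (fraction / peak) if peak > 0 else wave

def segment_samples(fs: int = FS) -> dict:
    """Convert the per-segment target durations to sample counts."""
    return {name: int(round(fs * ms / 1000)) for name, ms in SEGMENTS_MS.items()}

# Sanity check: the three segments sum to the reported 600-ms total.
assert sum(SEGMENTS_MS.values()) == 600
```

The actual duration equalization was done in Praat (e.g., via its duration manipulation tools); the sketch only records the target layout and the peak-scaling convention.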
In addition to speech stimuli, the stimulus material included 2 additional sounds: A humming sound created by filtering a pseudoword stimulus [pαmup:α] with a 250-Hz low-pass filter, and a nonspeech stimulus of 75 ms duration and 500 Hz F0 with harmonic partials of 1000 and 1500 Hz, respectively. In both tasks, participants were presented with visual stimuli (black square, circle, triangle, or diamond on a gray background) that were projected on a screen placed in front of them. The auditory stimuli were presented at a comfortable sound level from a loudspeaker placed in front of the MEG device, above the screen on which the visual stimuli were projected. The stimulus presentation was controlled by a script written in Presentation 12.2 (Neurobehavioral Systems, Albany, USA).
The MEG experiment included 2 main tasks: rehearsal and control. Participants heard pseudoword stimuli and simultaneously saw visual symbols. In the rehearsal condition, participants were instructed to first memorize the first auditory pseudoword of a trial and then covertly rehearse it each time an auditory stimulus was heard and a simultaneous visual symbol was shown. To motivate efficient rehearsal, they were requested to pronounce the rehearsed pseudoword aloud at the end of each trial. These utterances were monitored by an experimenter to ensure that participants performed the tasks correctly, including correct segmentation of the pseudowords in the stream. If a participant had segmented the stimuli incorrectly or had not understood the task, the experiment would have been terminated. In the control condition, participants were instructed to count the number of times the visual symbol presented first in that trial occurred, and to say the result aloud at the end of the trial. Although memorizing the numbers may involve their covert rehearsal, the content of the rehearsal in the control condition did not resemble the auditory stimuli that were expected to reveal phonological rehearsal effects in the pseudoword rehearsal task (see Supplementary Material for rehearsed items). Therefore, despite the possibility that some items were covertly rehearsed in both tasks, hereafter we will use “rehearsal condition” to refer to the pseudoword rehearsal task and “control condition” to refer to the counting task. The 2 conditions were run in a counterbalanced order.
The tasks differed between the 2 conditions, but the simultaneous auditory and visual stimulation was identical with the exception that stimulus order was randomized anew. Each trial (see Fig. 1) began with a cross appearing on the screen, signaling the participant to get ready to perform the task. After 2 s, the participant heard a pseudoword and saw a symbol and had to either remember and covertly rehearse the pseudoword or begin counting the number of times the symbol occurred during the trial, depending on the task. Thereafter, 9 auditory stimuli were presented with an interstimulus interval (ISI) of 300 ms. The to-be-remembered stimulus was first followed by a humming sound (a low-pass filtered pseudoword), presented to set the rhythm for the covert rehearsal. A hum instead of another pseudoword was used to avoid immediately erasing the to-be-remembered pseudoword from echoic memory before participants began rehearsing it. After the hum, 4 random pseudowords discordant with the to-be-remembered pseudoword were presented. These auditory distractors were presented to rule out the possibility that acoustic echoic memory determined the processing of pseudowords. The distractors were followed by 4 types of pseudowords in a random order: (1) One with both a concordant beginning and a concordant ending with respect to the rehearsed pseudoword (this stimulus was not, however, acoustically identical to the first pseudoword of the sequence but another token of the same pseudoword), (2) one with a concordant beginning but a discordant ending, (3) one with a discordant beginning but a concordant ending, and (4) one with a discordant beginning and a discordant ending (see Fig. 1 for examples). Stimuli that served as distractors and probe stimuli in one trial served as targets of rehearsal in other trials, and vice versa, in a counterbalanced fashion.
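The trial structure described above can be sketched as a simple sequence generator (an illustrative Python sketch; the pseudoword labels and the concordance bookkeeping are hypothetical placeholders, not the actual stimulus lists):

```python
import random

# The 4 probe types: (beginning, ending) concordance with the rehearsed item.
PROBE_TYPES = [
    ("concordant", "concordant"),   # same item, different spoken token
    ("concordant", "discordant"),
    ("discordant", "concordant"),
    ("discordant", "discordant"),
]

def build_trial(target: str, pool: list, rng: random.Random) -> list:
    """Return the 10-stimulus auditory sequence of one trial:
    the target pseudoword, a rhythm-setting hum, 4 discordant distractors,
    then the 4 probe types in random order."""
    distractors = rng.sample([p for p in pool if p != target], 4)
    probes = PROBE_TYPES[:]
    rng.shuffle(probes)
    return ([("target", target), ("hum", None)]
            + [("distractor", d) for d in distractors]
            + [("probe", beg_end) for beg_end in probes])

rng = random.Random(0)
trial = build_trial("pamuppa", [f"pw{i}" for i in range(30)], rng)
assert len(trial) == 10  # 1 target + hum + 4 distractors + 4 probes
```

Counterbalancing across trials (distractors and probes serving as rehearsal targets elsewhere) would be handled at the level of the full trial list, which is omitted here.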
All onsets of auditory pseudowords were accompanied by simultaneous onsets of visual symbols, but the auditory and visual stimuli were not otherwise associated with each other. Visual symbols were presented on the screen for 900 ms, after which they were changed (ISI was 0 ms to minimize offset and onset responses). Each symbol was presented 2–4 times in a random order during a trial. After 10 simultaneous presentations of auditory and visual stimuli, a question mark was shown on the screen. This indicated that the participants should say aloud, depending on the task, either the rehearsed pseudoword or the number of counted symbols in the trial. A new trial began after 2.5 s from the occurrence of the question mark.
A hundred trials of each task condition were divided into 5 blocks separated by 10 s breaks. At the beginning of each block, prior to the actual tasks, 20 repetitions of an auditory nonspeech stimulus were presented without any task (ISI = 325 ms). Responses to these were recorded for the localization of auditory cortex, but they were not included in the analysis. A longer break was allowed between conditions. Instructions describing the task were given before each condition.
MEG Recording and Analysis
ERFs were recorded with a 306-channel Vectorview MEG device (Elekta Neuromag, Elekta Oy, Helsinki, Finland) comprising 204 planar gradiometers and 102 magnetometers. The participants sat in a magnetically shielded chamber with their head covered by the helmet of the MEG device. To reduce eye-movement artifacts, participants were instructed to refrain from excessive blinking during the experimental trials and to blink extensively during breaks. They were also instructed to minimize their head movements (even during the breaks). Before the experiment, 4 head-position indicator coils were attached to each participant's head, and their location with respect to anatomical landmarks (nasion and preauricular points) was determined by an Isotrak 3D digitizer (Polhemus, Colchester, VT, USA) to track the position of the head within the MEG helmet during the recording. MEG signals were recorded with a 600-Hz sampling rate and filtered with a band pass filter of 0.1–200 Hz.
A spatio-temporal signal space separation (tSSS) method of the MaxFilter™ software (Elekta Neuromag) was applied offline to the raw data to remove the effects of external interference and artifacts produced by nearby sources (Taulu and Simola 2006). The data were further analyzed using the BESA Research 5.3 software (BESA GmbH, Gräfelfing, Germany) and in-house Matlab scripts (MathWorks, Inc., Natick, MA, USA). They were band-pass filtered at 0.50–30 Hz. The MEG signals were averaged from 100 ms before to 900 ms after the stimulus onset separately for each stimulus type. Averaged MEG responses were baseline corrected to the 100-ms interval immediately preceding the stimulus onset. Trials with MEG amplitude exceeding 1200 fT/cm on gradiometers or 3000 fT on magnetometers were rejected automatically. An average of 94/100 (SD 7) artifact-free responses was obtained for each subject and each category used in the analysis. To determine response strength, 12 gradiometers (6 pairs) showing maximal responses above the temporal lobe of each hemisphere were chosen (see Fig. 2A), and the average areal mean signal (AMS) was computed from them. The latencies of third-syllable N1m peaks were measured from grand-average AMS waveforms. Then, mean AMS was measured from individuals' averaged waveforms by centering a 20-ms time window around this latency. The sources of the magnetic responses to the critical third syllable were estimated with equivalent current dipoles (ECDs) fitted to the whole-head gradiometer data. One bilateral dipole was fitted to individual data in this time window because principal component analysis suggested that one principal component explained most of the variance (99% for grand-average gradiometer data) in a 20-ms time window centered around the latency of the N1m grand-average AMS peaks.
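The epoching, baseline-correction, amplitude-based rejection, and AMS steps can be sketched as follows (a numpy illustration under simplifying assumptions: the AMS is computed here as the mean vector-sum amplitude over selected gradiometer pairs, one common convention that the text does not spell out; the 600-Hz sampling rate, 1200 fT/cm gradiometer threshold, 100-ms baseline, and 20-ms measurement window are from the text; the original analysis used MaxFilter, BESA, and Matlab, not this code):

```python
import numpy as np

FS = 600            # sampling rate (Hz), as in the recording
GRAD_REJECT = 1200  # gradiometer rejection threshold (fT/cm)

def baseline_correct(epochs: np.ndarray, fs: int = FS, baseline_ms: int = 100):
    """Subtract the mean of the 100-ms prestimulus interval from each channel.
    epochs: (n_epochs, n_channels, n_samples), stimulus onset at `baseline_ms`."""
    n_base = int(fs * baseline_ms / 1000)
    return epochs - epochs[:, :, :n_base].mean(axis=-1, keepdims=True)

def reject_epochs(epochs: np.ndarray, threshold: float = GRAD_REJECT):
    """Drop epochs whose absolute amplitude exceeds the threshold anywhere."""
    keep = np.abs(epochs).max(axis=(1, 2)) <= threshold
    return epochs[keep]

def areal_mean_signal(pair_x: np.ndarray, pair_y: np.ndarray):
    """AMS over planar gradiometer pairs: vector-sum amplitude per pair,
    averaged across the selected pairs (an assumed convention).
    pair_x, pair_y: (n_pairs, n_samples), the two pair orientations."""
    return np.sqrt(pair_x**2 + pair_y**2).mean(axis=0)

def window_mean(signal: np.ndarray, peak_idx: int, fs: int = FS, win_ms: int = 20):
    """Mean amplitude in a 20-ms window centered on the grand-average peak."""
    half = int(fs * win_ms / 1000) // 2
    return signal[max(0, peak_idx - half):peak_idx + half].mean()
```

Individual-subject amplitudes would then be obtained by applying `window_mean` to each subject's averaged AMS at the latency of the grand-average peak, as described above.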
Dipole strength in each individual was measured in a 20-ms time window centered around the latency of the grand-average N1m source waveform peaks. The mean goodness of fit of the dipole model was 70–80% depending on the condition.
Although the responses to the third syllable were of primary interest, AMSs for the pseudoword beginnings were also analyzed to characterize the effects of earlier predictions possibly affecting the third-syllable responses. The responses to the pseudoword beginnings were determined otherwise similarly to those for the third syllables, but in three 20-ms time windows: the N1m at around 120 ms; a later sustained response at around 220 ms, where the most prominent suppression seemed to take place; and a 460- to 480-ms window overlapping the silence immediately preceding the third-syllable onset.
To determine the differences in response strength between tasks and stimulus types, the AMSs in different time windows and the dipole strength data were submitted to separate 2 × 2 × 2 × 2 analyses of variance (ANOVAs) with Task (rehearsal or control), Pseudoword beginning (concordant or discordant), Pseudoword ending (concordant or discordant), and Hemisphere (left or right) as within-subjects factors. Separate ANOVAs of the same structure were also used to determine the differences between tasks and stimulus types in the location coordinates of third-syllable ECDs for the 3 dimensions (x-axis passing through the preauricular points, y-axis through the nasion, and z-axis through the vertex). Significant interactions involving Task were always followed up with Bonferroni-corrected pairwise comparisons.
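Because every factor has 2 levels, each effect in such a 2 × 2 × 2 × 2 within-subjects ANOVA has 1 numerator degree of freedom, so its F-statistic equals the squared t of the corresponding per-subject contrast tested against zero. A minimal numpy sketch for one such 2 × 2 interaction (illustrative only; the actual analysis was run in standard statistics software, and the data here are synthetic):

```python
import numpy as np

def within_subject_interaction_F(cell_means: np.ndarray):
    """F-test for a 2 x 2 within-subject interaction (e.g., Task x
    Pseudoword ending). cell_means: (n_subjects, 2, 2) array of
    per-subject condition means. For a 1-df within-subjects effect,
    F(1, n-1) = t**2 of the per-subject interaction contrast."""
    # Interaction contrast per subject: (A1B1 - A1B2) - (A2B1 - A2B2).
    c = ((cell_means[:, 0, 0] - cell_means[:, 0, 1])
         - (cell_means[:, 1, 0] - cell_means[:, 1, 1]))
    n = len(c)
    t = c.mean() / (c.std(ddof=1) / np.sqrt(n))
    return t**2, n - 1  # F value and denominator df
```

With the study's 24 participants, the denominator degrees of freedom are 23, matching the reported F(1, 23) statistics.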
Areal Mean Signals
Auditory responses were elicited in all conditions, and ERF and AMS strength for the 4 time windows could be successfully computed for all stimulus types (see Fig. 2A).
For the first-syllable N1m, the interaction between Task and Pseudoword beginning was significant (F1, 23 = 23.64, P < 0.001). According to further pairwise comparisons, N1m responses to pseudoword beginnings concordant with rehearsed items were significantly weaker in the rehearsal condition than in the control condition (P = 0.002), whereas those to discordant beginnings were stronger in the rehearsal condition than in the control condition (P = 0.007). For the later sustained response at around 220 ms, a significant interaction between Task and Pseudoword beginning was also found (F1, 23 = 16.15, P < 0.001). Pairwise comparisons indicated that the effect was driven by continued weaker responses to concordant beginnings in the rehearsal condition when compared with the control condition (P < 0.001), whereas responses to discordant beginnings no longer differed significantly between the two conditions. The responses were significantly stronger in the right than the left hemisphere regardless of the task (main effect of Hemisphere, F1, 23 = 8.68, P = 0.007). Interactions involving Task did not reach significance in the 460- to 480-ms time window, which overlapped with the silent occlusion phase of the geminate stop consonants and immediately preceded the third-syllable onset. However, the 3-way interaction among Task, Pseudoword beginning, and Hemisphere approached significance (F1, 23 = 3.84, P = 0.062), suggesting opposite effects of concordance and rehearsal in the 2 hemispheres (see Fig. 3).
For the third-syllable stimulus-internal N1m, ANOVA for AMS amplitude showed a significant main effect of Hemisphere (F1, 23 = 14.79, P < 0.001), indicating that N1m responses to the third syllable were stronger in the left than the right hemisphere. However, this effect was qualified by a significant 4-way interaction of Task, Pseudoword beginning, Pseudoword ending, and Hemisphere (F1, 23 = 5.71, P = 0.025). The interaction was followed up by Bonferroni-corrected pairwise comparisons, which revealed that when compared with the control condition, the N1m responses to concordant pseudoword endings were significantly suppressed in the rehearsal condition in both hemispheres (P < 0.05), whereas those for discordant pseudoword endings were not suppressed in either hemisphere. This differential effect was further substantiated by a significant interaction between Task and Pseudoword ending (F1, 23 = 35.60, P < 0.001). Rather than suppression, the N1m responses to the third syllables of pseudowords with concordant beginnings but discordant endings showed a significant enhancement in the rehearsal compared with the control condition. However, this N1m enhancement by rehearsal to phonological mismatches was found only in the left hemisphere (P = 0.002). Responses to discordant third syllables following discordant beginnings (i.e., fully discordant pseudowords) were neither significantly suppressed nor enhanced as a result of rehearsal. When pseudoword types were compared within each condition, no significant differences were found in the control condition. In the rehearsal condition, however, fully concordant pseudowords elicited significantly more suppressed N1m responses than pseudowords with discordant beginnings and concordant endings. This concordant whole-item suppression effect was significant only in the right hemisphere (P = 0.02). 
Sensitivity to phoneme/syllable-level detail was revealed by an enhancement effect as a result of a mismatch in one phoneme: Pseudowords with concordant beginnings but discordant endings elicited significantly stronger N1m responses than fully discordant pseudowords selectively in the left hemisphere (P = 0.04) (see Fig. 4C).
To summarize the results for the third syllable, responses to concordant endings were suppressed as a result of rehearsal (see Figs 2 and 4A–B). Concordant whole-item suppression was robust only in the right hemisphere. Response enhancement elicited by discordant endings following concordant beginnings was lateralized to the left hemisphere (see Figs 2A and 4C). A similar but bilateral pattern of suppression and enhancement effects was seen for the beginnings of the pseudowords (see Fig. 3).
ECD N1m sources for the third syllable were localized in the vicinity of the auditory cortex (see Fig. 2B and Supplementary Table 1). An ANOVA with the ECD location in the anterior-posterior dimension (y-axis, passing through nasion) as a dependent variable revealed a significant interaction between Task and Pseudoword ending (F1, 23 = 13.56, P = 0.001). Bonferroni-corrected pairwise comparisons further indicated that the interaction was driven by pseudowords with concordant endings: Their ECDs were significantly more posterior in the rehearsal than the control condition (P = 0.002), whereas those for the discordant endings did not differ between the tasks.
Another ANOVA with ECD strength as the dependent variable showed that the sources were overall stronger in the left than in the right hemisphere (main effect of Hemisphere F1, 23 = 20.83, P < 0.001). It also indicated a significant interaction between Task and Pseudoword ending (F1, 23 = 11.55, P = 0.002). According to further pairwise comparisons, ECD sources were weaker for concordant endings during rehearsal compared with control (P = 0.001), whereas no difference in ECD source strength was found for discordant endings between the tasks (see Figs 2B and 4B).
The current study examined the regulatory effects of forward prediction on the brain's neuromagnetic N1m responses to heard stimulus-internal transitions in pseudowords that overlapped with rehearsed pseudowords fully, partly, or not at all. The main findings suggested that forward prediction effects were determined by the content of the rehearsal. In particular, the right-hemispheric auditory cortex was sensitive to full-item overlap, as indicated by response suppression, whereas the left auditory cortex detected differences at the finer-grained phoneme or syllable level, as reflected in higher-amplitude responses to discordant phonemes/syllables. We will now consider these findings in more detail.
The first goal of this study was to examine whether the suppression of auditory cortex activity by forward prediction is specific to the predicted phonological units (phonemes or syllables). As indicated by AMS strength over the auditory cortices, covert rehearsal of pseudowords significantly suppressed the N1m to the syllables of auditory pseudowords when these were concordant, but not when they were discordant, with the rehearsed pseudoword (see Figs 2A, 3, and 4A). The AMS suppression result was corroborated by the ECD analysis: Significantly weaker dipolar sources were observed for concordant, but not for discordant, syllables during rehearsal compared with control (see Figs 2B and 4B). In addition, source localization revealed that the ECDs associated with concordant endings were significantly more posterior in the rehearsal than in the control condition, whereas the ECD location for discordant endings did not differ significantly between the 2 tasks. This difference in location is likely due to different contributions of the nonspecific, onset-related and the stimulus-specific N1m subcomponents, which can be represented by a single dipole (Mäkelä 2007). The general onset-related N1m has been found to be located more posteriorly than the stimulus-specific N1m (Sams et al. 1993; Mäkelä 2007). Consequently, the more posterior ECD locations for concordant endings during rehearsal likely reflect suppression of the anterior, stimulus-specific N1m, in agreement with the AMS and ECD strength results on N1m suppression during rehearsal. Because the N1m suppression depended on the rehearsal task, we can exclude the possibility that acoustic factors, such as differences in stimulus intensity, account for the N1m effect. Rather, the results are compatible with the view that a forward prediction suppressed the auditory responses (Houde et al. 2002; Guenther et al. 2006; Rauschecker and Scott 2009; Tian and Poeppel 2010; Hickok et al. 2011; Houde and Nagarajan 2011; Price et al. 2011; Hickok 2012; Chang et al. 2013). The forward prediction signal has been suggested to be inhibitory (Hickok 2012), which may account for the N1m suppression. Alternatively, the forward prediction signal may preactivate phonological long-term memory representations or, in the case of repeated covert rehearsal, maintain their activation for extended periods of time; without rehearsal, the activation would soon cease. In an active state, the neurons of a phonological representation would be less responsive to concordant auditory stimuli, analogously to repeated auditory stimuli, which typically elicit suppressed responses (Grill-Spector et al. 2006). Crucially, in none of the measures (AMS, ECD strength, and ECD location) was suppression observed for pseudoword endings differing from the rehearsed one only by the final vowel. Thus, the N1m suppression was specific to the rehearsed vowel, suggesting that forward prediction is specific to phonological units corresponding to speech content. This specificity enables accurate monitoring and feedback of speech production, as proposed by recent models of speech production (Guenther et al. 2006; Tian and Poeppel 2010; Hickok et al. 2011; Houde and Nagarajan 2011; Price et al. 2011; Hickok 2012).
In the present study, the code of the forward prediction was most likely phonological, for 3 reasons. First, the fact that no effect of stimulus type was seen in the control task suggests that the responses in the rehearsal task were not determined by echoic or sensory memory, apparently because of the intervening distractor stimuli. Second, the to-be-remembered pseudowords and the phonologically concordant probes were acoustically different tokens. Third, the code of phonological working memory used in covert rehearsal has been assumed to be phonological (Baddeley 1986; Jacquemot and Scott 2006). Thus, our experimental paradigm in all likelihood enabled us to tap long-term memory representations at a phonological level, that is, at a high level of abstraction. However, we cannot exclude the possibility that subphonemic codes are also involved in forward prediction (see Niziolek et al. 2013 and Scott et al. 2013 for subphonemic effects). As in speech perception research, different tasks may tap different levels of processing. The present results, obtained with a working memory task, speak to the functioning of the phonological loop component in the working memory framework of Baddeley and Hitch (1974; see also Jacquemot and Scott 2006). Specifically, motor-to-auditory mappings via efference copy signals created by inner speech (cf. Scott 2013), together with auditory-to-motor mappings via the dorsal route of speech processing (Hickok and Poeppel 2007), could form the neural correlate of the maintenance and manipulation of verbal information in phonological working memory. This is also compatible with the involvement of the same neural circuits, namely the inferior frontal articulatory cortex, the inferior parietal lobule, and the auditory cortex, in both phonological working memory (Paulesu et al. 1993; Davachi et al. 2001) and forward prediction via efference copy (Rauschecker and Scott 2009).
The second aim of the study was to determine whether the phonological specificity of forward prediction occurs at the fine-grain level (phoneme or syllable) as opposed to the level of the full stimulus (potential words). The pairwise comparisons explaining the significant interaction of Task, Pseudoword beginning, Pseudoword ending, and Hemisphere in the third-syllable AMS data suggested that the N1m for a pseudoword's final syllable was determined by both the beginning and the ending of the pseudoword. The first notable effect revealed by the interaction was that, when auditory pseudowords were fully concordant with the concurrently rehearsed pseudowords, the N1m was weaker than the N1m for pseudowords with discordant beginnings but concordant endings. This effect was significant in the right, but not in the left, hemisphere (see Figs 2A and 4C). Thus, the longer the sequence of concordant phonemes (up to the full stimulus), the more robust the suppression effect in the right hemisphere. We suggest that these findings reflect the asymmetric temporal sensitivity of the human auditory cortices (Schwartz and Tallal 1980; Zatorre et al. 2002; Poeppel 2003). Consistent with the “asymmetric sampling in time” hypothesis (Poeppel 2003), 2 or 3 distinct timescales or windows of integration—corresponding to low gamma (25–35 Hz), theta (4–8 Hz), and delta (1–3 Hz) oscillations—have been suggested to be used in speech analysis (Boemio et al. 2005; Giraud and Poeppel 2012). Slowly modulated signals associated with a relatively long time window of analysis have been proposed to be preferentially processed in the right hemisphere (Boemio et al. 2005; see also Abrams et al. 2008; McGettigan and Scott 2012). An asymmetric timescale of analysis in speech perception is also supported by previous MEG work using meaningful words (Kujala et al. 2002).
In that study, responses to word-medial syllables were overall stronger in the left hemisphere, and left-hemispheric processing was not modulated by word context or lack of it. In the right hemisphere, however, syllables elicited a larger magnetic mismatch negativity brain response when embedded in a word context when compared with presentation in isolation. Correspondingly, our results indicated more robust suppression in the right hemisphere as a function of the length of the phonological overlap between the rehearsed and the auditory stimuli. According to our interpretation of these findings, short and long timescales used in perceptual analysis (Boemio et al. 2005; Giraud and Poeppel 2012) also apply to forward prediction and the feedback control system in speech production.
The short and long timescales would be ideal for monitoring the production of syllables and of speech prosody, respectively. The stimuli that yielded the asymmetric effects did not differ in the rate of acoustic modulation, were neutral in terms of emotional prosody, and were identical in terms of linguistic prosody. Their roles as probe and distractor stimuli were counterbalanced in the experiment. Therefore, in our case, the greater robustness of the suppression effect in the right hemisphere cannot be due to prosodic features or the rate of acoustic modulation. Rather, the longer timescale of analysis per se may account for the observed pattern of results (see McGettigan and Scott 2012 for a similar conclusion in speech perception).
The 4-way interaction of Task, Pseudoword beginning, Pseudoword ending, and Hemisphere in the present third-syllable AMS data resulted from both suppression and enhancement effects. During rehearsal, stimuli with concordant beginnings but discordant endings elicited a significantly stronger N1m response in the left, but not in the right, hemisphere compared with fully discordant stimuli (see Figs 2A and 4C). The left-hemispheric N1m enhancement appeared to take place at the fine-grain (phoneme or syllable) level, because it was triggered by a mismatch of a single vowel between the auditory and rehearsed pseudowords. A possible account for this N1m enhancement is a prediction error signal (Friston 2005) additive to the N1m. However, this interpretation is valid only if the rehearsed pseudoword beginnings induced predictive coding similar to that reported for real-word beginnings (Gagnepain et al. 2012). In other words, to explain why discordant endings elicited stronger prediction error responses after concordant beginnings than after discordant beginnings, a concordant beginning must have generated a prediction of how the auditory pseudoword would end. Taken together with motor-to-auditory forward mappings, these string-internal predictions may thus form a hierarchical predictive coding mechanism (Wacongne et al. 2011) generating double predictions in the left hemisphere. In addition to inducing an enhancement in response to discordance, however, these double predictions, building up as the concordant beginnings unfold, should also induce stronger suppression for concordant endings compared with concordant endings following discordant beginnings. This pattern of enhancement and suppression, although significant only for the former, was indeed observed during rehearsal (see Fig. 4C), but only in the left hemisphere.
In contrast, no enhancement effects for discordance were found in the right hemisphere, suggesting that it deals only with motor-to-auditory predictions at a single level.
Alternatively, the modulatory effect of pseudoword beginnings on the processing of their endings could have been due to attention. Hearing concordant pseudoword beginnings after discordant pseudowords could have decreased participants' concentration on rehearsal. In this case, the discordant endings could have changed the focus of attention from covert rehearsal to the auditory stimuli. Since auditory selective attention to speech has been suggested to enhance responses to the attended signal (Hillyard et al. 1973; Alho et al. 2003; Ahveninen et al. 2006), this could have increased the N1m amplitude. However, attention does not account for discordance-induced enhancements in responses to pseudoword beginnings (see Fig. 3). In addition, other attentional switches from the internal-speech tasks to auditory stimuli would have decreased rather than increased the suppression effects. The results thus suggest that participants' attentional fluctuations do not fully account for either our suppression or enhancement findings.
The N1m to pseudoword onsets was assumed to be less sensitive to stimulus-specific effects and more sensitive to onset effects (Pardo et al. 1999). However, similarly to the results observed for the pseudoword ending, those for the pseudoword beginning are compatible with the view that forward prediction caused by covert rehearsal suppresses the N1m for concordant pseudoword beginnings but enhances the responses to discordant pseudoword beginnings, the latter likely reflecting a prediction error signal. Suppression and enhancement effects may thus serve as online motor-to-auditory monitoring signals. In a later time window (210–230 ms), only suppression for concordance, but no prediction error for discordance, was observed. Taken together with the finding of neither suppression nor enhancement for discordant pseudoword endings following discordant beginnings, this result suggests that the prediction error signal is elicited only once during a continuous discordance. This could be interpreted to indicate that the prediction error signal terminates further forward predictions. However, that reading is not supported by the fact that suppression effects do occur after a prediction error in the third-syllable N1m for pseudowords with discordant beginnings and concordant endings. Recall, however, that on the basis of the enhanced response to discordant endings following concordant beginnings, rehearsed pseudoword beginnings were suggested to induce prediction of their endings (cf. Gagnepain et al. 2012 for a similar effect in words). This would represent a different, string-internal level of predictive coding than motor-to-auditory forward prediction (cf. Wacongne et al. 2011 for hierarchical predictive coding). Perhaps the prediction error signal serves to terminate the unnecessary string-internal predictions in the case of discordance.
The use of relatively long stimuli in the present study enabled us to reveal dynamic changes in the lateralization of their processing. For the pseudoword beginning, no hemispheric differences were found for the N1m, whereas the sustained responses were stronger in the right hemisphere. These findings are in contrast with the left-lateralized auditory N1m responses observed for the third syllable. This pattern of results may be interpreted to suggest that the right hemisphere uses a dynamically extending time window for the integration of auditory input. First, at the very beginning of the pseudowords (the N1m at about 100 ms), the time windows in the 2 hemispheres are similar, resulting in bilateral processing. As the pseudowords unfold to a certain extent (about 200 ms), responses lateralize to the right. This could be accounted for by an extending time window in the right hemisphere: A longer excerpt of input is to be integrated, which may generate stronger responses. However, the onset-locked time window cannot be extended indefinitely without the incoming information receiving less and less weight. Eventually, at some point before 500 ms, new input is decreasingly able to affect the overall integration process, reducing the responsiveness of the right hemisphere. The fine-grain analysis, however, is not subject to the same integration limits and can keep going, resulting in lateralization to the left. Such 3-phased integration (bilateral−right-lateralized−left-lateralized) could be one of the factors explaining the mixed conclusions concerning the lateralization of speech processing (for a review, see McGettigan and Scott 2012): Stimuli of different lengths could produce different lateralization results. This effect of time window on lateralization seems to be driven by the auditory processing of our stimuli, as it is seen in main effects across tasks rather than in interactions with rehearsal.
Nevertheless, it is noteworthy that, as illustrated by Figure 3 (rightmost panel), the suppression associated with concordance tended to persist during rehearsal in the right, but not in the left, hemisphere throughout the later (>400 ms post-onset) occlusion phase of the geminate stops, when no sound was heard and there was therefore no need for online comparison of prediction and input. Although the interaction for this effect did not quite reach significance, it may be interpreted to support the view that speech production is monitored asymmetrically in the 2 hemispheres. This issue requires further research.
Combining the different response patterns of the 2 hemispheres during covert rehearsal into an “asymmetric speech monitoring hypothesis,” a possible account for the results is that forward predictions are projected from the left inferior frontal cortex to both auditory cortices, generating 2 auditory-motor monitoring and feedback circuits in speech production. We propose that the circuit extending to the right hemisphere is specialized for holistic, integrative monitoring of longer sequences. According to our data, the timescale of analysis in the right hemisphere is at least 300 ms. Owing to the 220-ms silent closure phase of the geminate stop consonants in our stimuli (see Fig. 1), the previous speech sound informative on concordance with the rehearsed pseudoword occurred >220 ms prior to the onset of the third syllable. Samples stretching over tens of milliseconds (or more) were likely needed from both before and after the silent phase for integrative analysis. Thus, in terms of oscillatory brain activity (Giraud and Poeppel 2012), the window of integration would have corresponded to delta oscillations (1–3 Hz) in the right hemisphere. It cannot, however, be ruled out that the right-hemispheric timescale is longer than 300 ms or even covered the whole pseudowords in our study, as suggested by the lateralization effects. The possible role of the left auditory cortex in holistic processing requires further research to better tease apart effects at different levels of linguistic units (e.g., onsets, rhymes, phonemes, and syllables) or different lengths of sequences.
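As a back-of-the-envelope check on the correspondence between oscillatory bands and integration windows invoked above, each band's frequency range can be converted into its single-cycle period. This is an illustration under a simplifying assumption (that the window of integration equals one oscillatory cycle, i.e., period = 1/f); the helper `band_to_window_ms` is hypothetical:

```python
# Convert an oscillation band (Hz) into the corresponding single-cycle
# integration-window range in milliseconds (period = 1000 / f).
def band_to_window_ms(f_low_hz, f_high_hz):
    # The higher frequency yields the shorter window, so the bounds swap.
    return 1000.0 / f_high_hz, 1000.0 / f_low_hz

bands = {
    "low gamma (25-35 Hz)": (25, 35),
    "theta (4-8 Hz)": (4, 8),
    "delta (1-3 Hz)": (1, 3),
}
for name, (lo, hi) in bands.items():
    short_ms, long_ms = band_to_window_ms(lo, hi)
    print(f"{name}: ~{short_ms:.0f}-{long_ms:.0f} ms window")
```

On this simplification, the delta band corresponds to windows of roughly 333–1000 ms, which is of the same order as the right-hemispheric timescale of at least 300 ms inferred from our stimuli.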
Recent models of speech production mostly agree that the production of syllables or monosyllabic words is monitored in a left-lateralized neural network (cf. Price et al. 2011; Tourville and Guenther 2011; Hickok 2012; Tian and Poeppel 2013). In line with this, we hypothesize that the left-hemispheric monitoring circuit is specialized for the timescale of phonemes or syllables. [In the present study, the N1m for concordant endings was suppressed in both hemispheres, but it is difficult to disentangle on the basis of suppression alone whether the monitoring of short segments such as phonemes or syllables is bilateral or left-lateralized: Given that processing seems to take place in longer windows of integration in the right hemisphere, right-hemispheric responses to any auditory item partially concordant with a rehearsed item should always show some degree of suppression. Therefore, in this case, responses to discordant endings may be better indicators of lateralization.] The proposition of a fine-grain timescale in the left hemisphere is compatible with the finding that the responses to pseudowords with concordant beginnings and discordant endings in the rehearsal condition were left-lateralized as a result of a single mismatching vowel between prediction and the auditory stimulus. The comparisons between the responses to the pseudoword types also suggested that, in the left hemisphere, speech is monitored at 2 hierarchical levels: Motor-to-auditory forward prediction and string-internal predictive coding (phonemes/syllables within potential words). Thus, the present asymmetric speech monitoring hypothesis extends recent models of speech production by suggesting 2 monitoring circuits with asymmetric sampling rates and different hierarchical structures. We propose that the left-hemispheric fine-grain monitoring circuit (cf. Price et al. 2011; Tourville and Guenther 2011; Hickok 2012; Tian and Poeppel 2013) includes 2 hierarchical levels: motor-to-auditory and string-internal. Another auditory-motor circuit is proposed to extend from the left inferior frontal cortex to the right auditory cortex. This single-level circuit is suggested to be used for the integrative monitoring of sequences longer than a syllable, which is ideal for the monitoring of speech prosody.
In conclusion, the results of the present study suggest that forward prediction of auditory language material induced by covert speech rehearsal is specific to fine-grained phonological detail, reflecting one-to-one motor-to-auditory mapping. Taken together with an auditory-to-motor mapping stream, this “inner voice” could also be used to maintain verbal sequences in phonological working memory. The observed hemispheric differences in forward prediction can be accounted for by assuming 2 auditory-motor monitoring circuits for speech production: The production of speech sounds or syllables is suggested to be monitored in the left hemisphere and that of longer sequences is proposed to be monitored holistically in the right hemisphere. This proposal is consistent with asymmetric temporal sensitivity of the auditory cortices.
This work was supported by the Academy of Finland (grant numbers 131963 and 110230), Natural Sciences and Engineering Research Council of Canada (NSERC), and UK Medical Research Council (core program U1055.04.014.00001.01 MC-A060-5PQ90).
The authors thank Prof. Mari Tervaniemi for discussions and Mr Miika Leminen and Mr Tommi Makkonen for technical assistance. Conflict of Interest: None declared.