Reading skills are indispensable in modern technological societies. In transparent alphabetic orthographies, such as Dutch, reading skills build on associations between letters and speech sounds (LS pairs). Previously, we showed that the superior temporal cortex (STC) of Dutch readers is sensitive to the congruency of LS pairs. Here, we used functional magnetic resonance imaging to investigate whether a similar congruency sensitivity exists in the STC of readers of the more opaque English orthography, where the mapping between letters and speech sounds is less reliable. Eighteen subjects passively perceived congruent and incongruent audiovisual pairs of different levels of transparency in English: letters and speech sounds (LS; irregular), letters and letter names (LN; fairly transparent), and numerals and number names (NN; transparent). In STC, we found congruency effects for NN and LN, but no effects in the predicted direction (congruent > incongruent) for LS pairs. These findings contrast with previous results obtained from Dutch readers. These data indicate that, through education, the STC becomes tuned to the congruency of transparent audiovisual pairs, but suggest a different neural processing of irregular mappings. The orthographic dependency of LS integration underscores cross-linguistic differences in the neural basis of reading and potentially has important implications for dyslexia interventions across languages.
Few skills have a greater impact on an individual's success in modern culture than the ability to read (UNESCO 2005). However, in contrast to the ease with which children learn some skills such as walking or speaking, learning to read is an effortful process that relies upon years of formal education (Vaessen and Blomert 2010). This discrepancy in the amount of effort needed to master spoken language and written language is thought to reflect the fact that, unlike spoken language, the emergence of reading as a cognitive skill is so phylogenetically recent that the forces of natural selection have most probably not had sufficient time to shape our brains specifically for reading (Rayner and Pollatsek 1989). Nonetheless, reading is a skill that is deeply rooted in the spoken language system (e.g., Liberman et al. 1974) and is thought to be constructed through an appropriation of more general linguistic and visual abilities and their corresponding neural circuits (Dehaene and Cohen 2007; Blomert 2011).
Because reading is constructed upon the phonological representation of spoken language, one of the most important building blocks of reading is the ability to unite visual (orthographic) and auditory (phonological) information into a single multimodal percept. Indeed, in the alphabetic languages, one of the most critical precursors of learning to read is the ability to link a written symbol (letter) with its corresponding speech sound (Marsh et al. 1981; Snowling 2000; Ehri 2005). The process of learning to associate the visual features of a letter with the auditory information it represents results in the integration of the information from both sensory modalities into a unified audiovisual percept. The resulting audiovisual letter–sound (LS) pairs become highly overlearned through the course of formal instruction and reading experience, which allows the visual features of the letter to automatically and effortlessly evoke its paired auditory information in skilled readers (Perfetti and Sandak 2000; Snowling and Hulme 2005).
Due to its primacy in learning to read, audiovisual processing has provided a unique point of departure for research into the neurobiological underpinnings of reading ability and disability across developmental time (Blomert and Froyen 2010). The neural correlates of audiovisual integration have, therefore, been studied extensively within the context of reading (see, e.g., Schlaggar and McCandliss 2007). Convergent evidence from a growing body of neuroimaging research has implicated superior temporal regions as the principal neural correlates underlying the integration of letters and speech sounds (Raij et al. 2000; Hashimoto and Sakai 2004; van Atteveldt et al. 2004, 2009), as well as in the audiovisual integration of monosyllabic words (Maurer et al. 2006; McNorgan et al. 2014). This includes regions of the posterior superior temporal gyrus (pSTG) and superior temporal sulcus (pSTS) as well as portions of the auditory association cortex (AAC) located on the transverse temporal gyrus. The involvement of these brain regions in LS processing has been shown to emerge slowly over developmental time (Froyen et al. 2009), and their disruption has been linked to difficulties in learning to read (Blau et al. 2009, 2010).
In our previous work, we have systematically studied the brain mechanisms underlying the audiovisual integration of letters and speech sounds in Dutch readers. In the first study (van Atteveldt et al. 2004), we took advantage of the overlearned quality of LS pairs in adult fluent readers and manipulated the congruency between auditory and visual information. Participants were asked to passively view letters and listen to speech sounds while their brains were imaged with functional magnetic resonance imaging (fMRI). Four conditions were administered. In the unimodal visual (UV) condition, participants viewed letters in silence. In the unimodal auditory (UA) condition, participants heard speech sounds with no corresponding visual information. In the bimodal conditions, participants were presented with visual letters and auditory speech sounds simultaneously. In addition, the congruency of the bimodal pairs was manipulated: In some blocks, the visual letter matched the heard speech sound (bimodal congruent, BC) while in other blocks the visual letter did not match the heard speech sound (bimodal incongruent, BI). The bimodal conditions, when statistically compared with the unimodal conditions, were associated with greater recruitment of the AAC and pSTG. The pSTG was responsive to UV and auditory stimulation, and also showed enhanced activation when participants were presented with a combination of auditory and visual information, whether congruent or incongruent. The AAC showed a stronger response to UA stimulation, but interestingly, was sensitive to the congruency of the bimodal condition: It showed a significantly larger response to congruent audiovisual pairs relative to speech sounds alone, while incongruent audiovisual pairs showed a reduced response, suggesting that the auditory response was modulated by visual information. 
Subsequent studies have revealed direct evidence that the pSTG is also sensitive to the congruency of the audiovisual speech pairs (van Atteveldt et al. 2010) and suggested that congruency is detected by the pSTG and fed back to modulate the auditory signal (van Atteveldt et al. 2009). The functional–behavioral importance of this neural sensitivity to audiovisual congruency in letters is highlighted by 2 recent studies that linked deviant congruency processing in these brain regions with dyslexia in both children (Blau et al. 2010) and adults (Blau et al. 2009).
In sum, the findings described above can be codified into the following account. As individuals learn to read an alphabetic script, they first learn to associate the speech sounds of their spoken language with a visual symbol to form the script code necessary to decipher words. This process is referred to as acquiring the alphabetic principle (Byrne et al. 1996), or phonological recoding (Share 1995). Over time, these symbol–sound associations form an audiovisual percept that is highly overlearned such that the visual symbol automatically elicits an auditory referent. On the neural level, regions known to be involved in multisensory processing (pSTS/pSTG) and speech sound processing (AAC) become tuned to the (congruency of) audiovisual LS pairs. As a result of this tuning, these regions come to respond differently to the congruent LS pairs that are standard to a given orthography (e.g., v - /v/) than to novel or incongruent pairs (e.g., v - /l/). This sensitivity to congruency is stimulus driven and can be automatically elicited in the absence of an active task. Moreover, individuals with reading difficulties show a concomitant lack of this automatic neural sensitivity that is likely a fundamental contributing factor to their reading disabilities. While such individuals activate the same brain regions in response to LS pairs as typically developing readers, the neural response in dyslexic individuals does not differentiate between congruent and incongruent pairings. Taken together, the evidence seems to suggest that orthographic–phonological audiovisual integration is one of the fundamental building blocks out of which successful reading is constructed (see Blomert and Froyen 2010; Blomert 2011, for reviews).
While the findings described above are of significant interest, it must be noted that all of the extant experiments characterizing the neurobiological correlates of integration at the LS level have been conducted in languages with relatively transparent orthographies. In these orthographies, each letter is, for the most part, consistently associated with a single speech sound, resulting in audiovisual LS pairs with very little ambiguity. A crucial open question is whether a similar pattern of neural activity would be found in readers of more opaque alphabetic orthographies, such as English, where a single letter or letter cluster typically represents several different speech sounds dependent upon the context of the other letters in the word in which the letter occurs (such as the letter cluster “ou” in though, through, tough, or ouch). Because English is less predictable in its LS pairs, it is possible that different processes underlie the learning of reading in English (Goswami et al. 2005; Ziegler and Goswami 2006). Put another way, if the congruency response reported in previous work depends upon the regularity of the orthography being read, one might expect that a similar response would be weaker or absent in relatively opaque orthographies, such as English. This notion is supported by the suggestion of different developmental stages during reading acquisition in English compared with more regular orthographies (Share 2008). This developmental difference could result in differences in the audiovisual processing of single LS pairs. This led us to question whether English LS pairs would show congruency-related modulation similar to that found in previous studies of audiovisual processing in readers of more transparent languages.
The potential difference between basic audiovisual processing in opaque and transparent orthographies may have far-reaching implications. If such a difference exists, it would support the proposition that learning to read an alphabetic orthography is not a monolithic process common across all languages, but rather one that is dependent upon differences in the orthography being learned (Ziegler and Goswami 2006). In addition, understanding audiovisual processing of orthographic–phonological information in English has practical implications. As described above, atypical audiovisual integration has been linked with dyslexia in the fairly transparent orthography of Dutch. Therefore, an understanding of similarities and differences between the neural correlates of audiovisual processing in English (more opaque) and transparent orthographies might prove instrumental in the development of appropriate remediation strategies for dyslexia. If the audiovisual processing of LS pairs depends upon the transparency of the orthography being read, this could suggest that different techniques for remediation should be used for each language.
Against the background of the above discussion, we extended the paradigm employed previously (van Atteveldt et al. 2004) to investigate the neural correlates underlying LS integration in English readers. Specifically, in addition to examining the neural correlates associated with audiovisual integration in LS, we also explored the integration response in 2 other audiovisual pairs: Letters and letter names (LN) and numerals and number names (NN). These conditions were included to provide contrasts that differed in transparency from LS pairs. NN pairs are completely transparent in Canadian English; each numeral has one name. However, the transparency status of LN pairs is less clear. Within the context of letter names, LN pairs are completely transparent. With the exception of the letter Z, which can be called zee or zed, all of the letters of the English alphabet have one name. However, the same visual form is associated with 2 different sets of auditory referents (names and speech sounds). This implies that when an individual sees a single letter, it is always ambiguous whether the associated auditory information of that letter is the transparent association of a letter name or the opaque association of a speech sound. In contrast, NN pairs have only one possible auditory referent. Therefore, NN and LN pairs can be seen as offering 2 different levels of audiovisual transparency that serve as contrasts to LS pairs.
In the experiment described below, we contrasted the neural response to LS pairs with the neural response to LN and NN pairs in native English speakers. Against the background of recent research highlighting important differences in the behavioral correlates of reading between transparent and opaque orthographies (Ziegler et al. 2010), we hypothesized that the audiovisual integration (congruency) response demonstrated in the described studies that were conducted in a transparent orthography would be weaker in the superior temporal cortex (STC) of English readers due to the more opaque nature of the English orthography that putatively results in weaker LS associations. Furthermore, we expected that the same participants would show the effect of congruency in other learned audiovisual pairs (letter names and number names) as these pairs, like LS pairs in transparent orthographies, have highly regular visual-to-auditory mappings, and might be more representative of the units important for reading acquisition in English.
Materials and Methods
Participants

Eighteen individuals (9 females, 9 males; age range: 19–35; mean age: 24) were paid to participate in this study. Participants were recruited from undergraduate and graduate faculties at the University of Western Ontario as well as from the surrounding community in London, Ontario. All participants reported normal or corrected-to-normal vision, no hearing problems, right-handedness, and Canadian English as their first and primary language. Participants gave informed consent as monitored by the Research Ethics Board at the University of Western Ontario.
Stimuli and Experimental Design
Stimuli consisted of 8 lowercase letters (b, h, j, k, l, p, r, v) and 8 single-digit numbers (1, 2, 3, 4, 5, 6, 8, 9) presented in both the visual and the auditory modality. The numeral 7 was not used because its auditory referent “seven” is 2 syllables while all other auditory referents used were 1 syllable. The letters were selected because they are the most transparent single-letter LS pairs in English orthography (Berndt et al. 1987). We reasoned that the transparent LS pairs are the least likely to differ from LN and NN pairs. Visual stimuli were presented in white 40-point Arial font and centered on a black background. Auditory stimuli consisted of number names, letter names (e.g., “bee,” “kay”), and phonemes (e.g., /b/ /k/, sounding like “buh,” “kuh”) spoken by a female Canadian English speaker. Auditory stimuli were digitally recorded with a sampling rate of 44.1 kHz with 16-bit quantization. Each phoneme used in the fMRI experiment was recognized correctly 100% of the time by 10 additional participants in a pilot experiment, who were asked to point out the corresponding written letter from a list of the 26 letters in English. Stimulus presentation time was 350 ms for the visual symbols. The sounds varied somewhat in exact duration, with an average of 332 ms (SEM 14 ms); by sound type, the average durations were 379 ms for number names (SEM 19 ms), 319 ms for letter names (SEM 25 ms), and 300 ms for letter speech sounds (SEM 18 ms).
Stimuli for each of the 3 audiovisual pair types (LS, LN, NN) were presented in 4 different conditions using E-Prime 1.2. In all conditions, participants were instructed to attentively watch and listen to the visual symbols and/or sounds, without making responses. In the UV condition, participants watched a series of either visual letters or numerals presented individually in silence. The UA condition presented auditory content (speech sounds, letter names, or number names) to the participants without corresponding visual information. In the bimodal conditions, visual and auditory information was presented simultaneously. The BC condition provided visual information that matched the corresponding auditory information, while the BI condition did not. Each condition (UV, UA, BC, BI) was presented separately for each audiovisual pair (LS, LN, NN) resulting in a total of 12 conditions overall.
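The factorial crossing of pair type and condition can be enumerated in a few lines; the sketch below is purely illustrative (the experiment itself was programmed in E-Prime 1.2, not Python) and simply reproduces the 12-cell design described above:

```python
from itertools import product

# The 3 audiovisual pair types crossed with the 4 presentation conditions
pair_types = ["LS", "LN", "NN"]        # letter-sound, letter-name, numeral-name
conditions = ["UV", "UA", "BC", "BI"]  # unimodal visual/auditory, bimodal (in)congruent

design = [f"{p}_{c}" for p, c in product(pair_types, conditions)]
print(len(design))  # 12 conditions overall
```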
The 12 conditions were each presented to the participants in separate blocks of 21 400 ms. Each unimodal block was presented once over the course of a run, and each bimodal block was presented twice over the course of a run. Each participant completed 2 runs and thus 4 blocks of each of bimodal conditions and 2 blocks of each of the unimodal conditions. The blocks were pseudo-randomized over the course of a run such that the same condition was never presented twice in a row. At the beginning and end of each run as well as between each block, a fixation period of 16 000 ms was presented.
Because most of our stimuli included an auditory signal, we employed a sparse sampling paradigm to reduce the confound of scanner background noise (Hall et al. 1999; Jäncke et al. 2002). Sparse sampling takes advantage of the temporal delay in the hemodynamic response function. Typically, and in this experiment, a stimulus or series of stimuli is presented in silence, followed by the recording of a single functional volume, which samples the hemodynamic response as it peaks from the preceding stimulation. Our use of this type of paradigm required us to divide our blocks of trials into separate mini blocks. Thus, for each mini block of trials, participants would be presented with 5 stimuli (350 ms each) of the relevant experiment condition in the absence of scanner noise followed by short fixation in which a single functional volume (1500 ms) was collected. The interstimulus interval, which might be described instead as a stimulus buffer, was 350 ms. Six stimulus buffers were included: One between each stimulus trial, one before the first stimulus, and one following the last stimulus. In total, the 5 stimuli (350 ms × 5 = 1750 ms; this timing was used also for the UA condition), the 6 buffers (350 ms × 6 = 2100 ms) and the single volume acquisition, TR (1500 ms) resulted in a mini block that was 5350 ms long. Four mini blocks were collected in each larger block, resulting in a total of 20 stimuli per block (5350 ms × 4 = 21400 ms). Please see Figure 1 for details. In total, then, we collected 2 runs, each consisting of 37 blocks (19 blocks of fixation: 16 000 × 19 = 304 000 ms; 18 blocks of trials: 21 400 × 18 = 385 200 ms) totaling roughly 11.5 min.
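The timing arithmetic above can be verified directly; this short Python sketch (illustrative only, using the durations stated in the text) reconstructs the mini-block, block, and run lengths:

```python
# Durations in ms, as stated in the text
STIM = 350          # single visual or auditory stimulus
BUFFER = 350        # interstimulus buffer
TR = 1500           # single-volume acquisition
FIXATION = 16_000   # fixation block
N_STIM = 5          # stimuli per mini block
N_BUFFERS = 6       # one between each trial, one before the first, one after the last

mini_block = N_STIM * STIM + N_BUFFERS * BUFFER + TR  # 5350 ms
block = 4 * mini_block                                # 21 400 ms, 20 stimuli
run = 19 * FIXATION + 18 * block                      # 689 200 ms per run
print(mini_block, block, run / 60_000)  # roughly 11.5 min per run
```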
MRI Data Acquisition
Functional and structural images were acquired in a 3-Tesla Siemens Tim Trio whole-body MRI scanner, using a Siemens 12-channel head coil. A gradient echo-planar imaging T2* sequence sensitive to the blood oxygenation level–dependent (BOLD) contrast was used to acquire 28 functional slices per volume, which were collected in an interleaved order (3 mm thickness, 64 × 64 matrix, repetition time (TR): 5350 ms, echo time (TE): 30 ms, flip angle: 78°) and covered the whole brain with the exception of the most anterior and inferior section of the temporal poles. 256 volumes were acquired for each functional run. High-resolution anatomical images were acquired with a T1-weighted MPRAGE sequence (1 × 1 × 1 mm, T1 = 2300 ms, TE = 4.25 ms, TR = 2300 ms, flip angle: 9°).
fMRI Data Preprocessing
All functional images were preprocessed using BrainVoyager QX 2.2.0 (Brain Innovation, Maastricht, the Netherlands; Goebel et al. 2006). The steps included slice scan time correction (cubic spline interpolation), correction for 3D head motion (trilinear motion detection and sinc motion correction), and temporal high-pass filtering (GLM-Fourier 2 cycles). Each functional image was then coregistered to the subject's anatomical image, transformed into Talairach space, and smoothed with a 6-mm full-width at half-maximum Gaussian smoothing kernel.
fMRI Analysis Strategy
The analysis for this study was adapted from van Atteveldt et al. (2004). For each participant, a design matrix was created with 12 predictors: each of the 4 conditions (UV, UA, BC, BI) for each of the 3 audiovisual pair types (LS, LN, NN). The resulting random-effects (RFX) whole-brain general linear model included these 12 predictors.
Recall that the central goal of this study was to investigate the neural correlates of audiovisual integration in readers of a relatively opaque orthography (English). With this in mind, we restricted our analysis to the 2 audiovisual (congruent and incongruent) conditions within the 3 types of audiovisual pairs. By focusing upon the congruent and incongruent conditions, our analysis was specifically targeted to probing the audiovisual integration of learned pairs rather than a more general multimodal processing. In other words, the presence of simultaneous auditory and visual information would be expected to elicit a hemodynamic response in both modality-specific and heteromodal regions of the cortex. However, only a subset of these regions should be modulated by the learned associations between visual and auditory information, which can be revealed by looking at the effect of the congruency of the bimodal pairs on brain activation. Thus, to detect learned audiovisual integration, it is not sufficient to simply examine regions that respond to both auditory and visual stimuli. One must, instead, look for neural activation that is dependent upon the presence of a learned or “correct” (congruent) audiovisual pair relative to the presence of an unlearned or “incorrect” (incongruent) pair. For more detailed discussion of this point, see van Atteveldt, Formisano, Blomert, et al. (2007) and Goebel and van Atteveldt (2009). The unimodal stimulation conditions (UV, UA) were used to inspect the unisensory responses within the voxels revealed by the congruency contrast, which will tell whether these voxels are modality-specific or heteromodal.
Against the background of this reasoning, we structured our analysis as follows. We initially conducted a 3 × 2 analysis of variance using pair type (LS, LN, NN) and congruency (congruent, incongruent) as within-subjects factors (this is the 2nd level of the RFX analyses). Within this analysis, we focused upon the interaction of pair type and congruency as a means to reveal any statistical differences in the congruency effect between the 3 pair types. Of secondary interest was the detection of any main effects (pair type or congruency). This 3 × 2 analysis of variance gives maps for main effects and interactions, but does not test for the presence of congruency-related modulation within each pair type independently. Therefore, we employed a subsequent series of 3 whole-brain t-tests (congruent vs. incongruent for each pair type) to verify and supplement the results of the initial analysis. All statistical maps (voxel-level P < 0.005) were corrected at the cluster level at P < 0.05 (Forman et al. 1995; Goebel et al. 2006). In the clusters revealed by the whole-brain 3 × 2 ANOVA and t-tests, z-standardized beta estimates were extracted for all 12 conditions to plot detailed response profiles of uni- and multisensory processing of all pair types.
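The per-pair-type congruent-versus-incongruent comparison at the core of this strategy reduces to a paired-samples t statistic on per-subject beta estimates. The sketch below is a pure-Python illustration on made-up beta values (hypothetical numbers, not the study's data, and not the actual BrainVoyager RFX analysis):

```python
import math
from statistics import mean, stdev

def paired_t(congruent, incongruent):
    """Paired-samples t statistic: per-subject (congruent - incongruent) beta differences."""
    diffs = [c - i for c, i in zip(congruent, incongruent)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Toy beta estimates for one cluster (hypothetical values for illustration)
cong = [1.0, 2.0, 3.0]
incong = [0.5, 1.0, 2.5]
print(round(paired_t(cong, incong), 2))  # 4.0
```

A positive t indicates the predicted congruency effect (congruent > incongruent); a negative t indicates a reverse congruency effect.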
Results

Interaction Between Congruency and Pair Type
A significant interaction between congruency and pair type was found in 3 regions of the STC. As illustrated in Figure 2, a bilateral region in the pSTG as well as a more anterior region in the left superior temporal gyrus (STG) showed a significant interaction. All 3 regions show a clear auditory-specific response (see bar charts in Fig. 2; note that the A and V responses are included only to show the modality specificity of the clusters, as these conditions were not part of the second-level interaction term). These bar charts also illustrate the pattern of the interaction.
In the right STG, the predicted congruency effect, characterized by greater activation in response to congruent relative to incongruent audiovisual pairs (see the 2 rightmost bars in each chart) is present only in the NN condition, t(19) = 4.0, P < 0.01. The LN condition showed no significant difference between congruent and incongruent conditions, t(19) = 0.99, P = 0.33. In response to the LS pairs, this region showed a reverse congruency effect, with greater activity in response to incongruent relative to congruent pairs, t(19) = −2.9, P < 0.05.
In the more posterior of the 2 left STG activations, a congruency effect was once again observed in the NN condition, t(19) = 3.5, P < 0.01. In response to LN pairs, this region showed no significant congruency effect, t(19) = 2.2, P = 0.37. The LS pairs again showed a significant reverse congruency effect, t(19) = −2.8, P < 0.05.
The more anterior left temporal activation showed a similar pattern. The NN condition showed a significant effect of congruency, t(19) = 3.5, P < 0.01. The LN condition showed no effect of congruency, t(19) = 1.3, P = 0.22. The LS condition showed a significant reverse congruency effect, t(19) = −2.6, P < 0.05.
Because the interaction is driven by patterns of activation that differ significantly between pair types, it does not detect congruency effects within each condition independently. Therefore, we performed whole-brain t-tests (congruent vs. incongruent) for each pair type independently to add further nuance to the interaction results. Figure 3 illustrates a significant congruency effect in the left STG in response to both the LN pairs and the NN pairs. No effects of congruency (congruent > incongruent or the reverse) were found for the letter–speech sound pairs, even at the exceedingly liberal threshold of P < 0.05, uncorrected. Additionally, this analysis showed that, although voxels revealed by the interaction analysis in the left and right STG showed significantly greater modulation to incongruent LS pairs relative to congruent LS pairs, the voxel-wise congruency contrast did not reveal significant effects of incongruency in these regions.
Main Effect of Pair Type
A main effect of pair type (LS, LN, NN) was found in 3 regions of the cortex (Fig. 4). The right supramarginal gyrus (SMG) showed a greater BOLD response during the NN condition relative to the conditions utilizing letters. Conversely, 2 regions in and around the left fusiform gyrus showed more activity in response to letters than numerals. One region spanning parts of the left lingual gyrus and left fusiform gyrus showed significant activation in the LN condition relative to the other 2 conditions. The second region, located ventrolateral to the former and encompassing portions of the inferior occipital gyrus and the fusiform gyrus, showed stronger response to the 2 letter conditions than the numeral condition. Please see Figure 4 for details.
Main Effect of Congruency
A main effect of congruency was found across an extensive bilateral fronto-parietal network of brain regions, reported in full in Figure 5 and Table 1. The pattern of activity in each of these regions is characterized by an incongruency effect, or greater activation during the incongruent relative to the congruent pairs, across all 3 pair types.
Table 1. Regions showing a main effect of congruency (Talairach coordinates)

| Region | x | y | z |
| --- | --- | --- | --- |
| Right inferior temporal gyrus | 60 | −53 | −3 |
| Right middle frontal gyrus | 39 | 28 | 39 |
| Right inferior frontal gyrus | 47 | −2 | 18 |
| Right intraparietal sulcus and surrounding parietal lobules | 32 | −44 | 39 |
| Left middle and inferior frontal gyri | −43 | 16 | 33 |
| Left intraparietal sulcus and surrounding parietal lobules | −40 | −56 | 42 |
| Left middle and inferior temporal gyri | −46 | −38 | 3 |
Discussion

A growing collection of neuroimaging research has implicated superior temporal regions in the integration of letters and their associated speech sounds. Specifically, these studies have demonstrated that auditory association regions in the STC, and more specifically in the STG and AAC, show greater activity in response to congruent LS pairs relative to pairs in which the letter does not match the sound. To date, however, most of these studies have examined LS integration in relatively transparent orthographies such as Dutch (van Atteveldt et al. 2004) or Finnish (Raij et al. 2000). It stands to reason that the transparency of the orthography could underlie the congruency-related neural activity reported, as LS pairs are the crucial and reliable building blocks in reading acquisition in transparent scripts, but possibly less so in the more opaque English orthography (Share 2008). It has been shown that reading acquisition produces different developmental (neuronal) processes in different writing systems (Ziegler and Goswami 2006; Ziegler et al. 2010; Brennan et al. 2012; but see Vaessen et al. 2010). In the present study, we addressed the important open question as to whether similar neural patterns of congruency-dependent activation are present in the brains of individuals who have learned to read more opaque orthographies, such as that of English, in which a given letter can have many different possible sounds depending upon the context of other letters in the word.
Against this background, we used a passive fMRI design, based on the previously reported Dutch studies (van Atteveldt et al. 2004). We tested whether English readers show a congruency-dependent response to letter–speech sound (LS) pairs, similar to that reported in Dutch readers. In addition, we examined congruency effects for 2 other types of audiovisual pairs: letter–letter name (LN) and numeral–number name (NN) pairs. In contrast with LS pairs, LN and NN pairs have a high degree of transparency in English and, therefore, provided an important contrast for the LS pairs. Evidence of congruency-dependent effects was detected by a significant congruency by pair type interaction in regions of the bilateral temporal cortex, localized to the AAC located on the transverse temporal gyrus. This interaction was characterized by a crossover pattern. As illustrated in the bar charts of Figure 2, in response to the NN pairs, these regions showed a pattern of greater activity during the congruent relative to the incongruent conditions. However, in response to the LS condition, these regions showed a pattern of greater activity during the incongruent relative to the congruent conditions. No difference was found between congruent and incongruent pairs in response to the LN pairs.
Additional whole-brain t-tests for each pair type separately provided a pattern of results similar to those detected by the analysis of variance. A significant congruency effect was found in the AAC for the NN condition (see Fig. 3). The series of t-tests also revealed results that were not found through the interaction analysis. The AAC showed a congruency effect in the LN condition (see Fig. 3). In addition, a similar congruency effect was found in the right transverse temporal gyrus in response to NN pairs (see Fig. 3). No evidence of congruency effects in either direction was found in response to the LS pairs (not pictured). Even at liberal, uncorrected thresholds, no region showed greater activation for congruent relative to incongruent pairings when LS pairs were presented, or vice versa. The absence of (in)congruency effects in this whole-brain LS congruency contrast indicates that the stronger response to incongruent LS pairs at the cluster level (in the clusters identified by the Congruency * Pair Type interaction) does drive the interaction, but is not strong enough to be significant in a whole-brain analysis, in contrast to the congruency effect for the LN and NN pairs, which was found to be significant at the whole-brain level of analysis.
Before outlining our interpretation of these results, 2 other salient findings should be mentioned. Irrespective of congruency, 3 regions of the cortex showed a main effect of pair type (see Fig. 4). The right SMG showed significantly more activation in response to audiovisual numbers than audiovisual letters. Conversely, a region spanning portions of the left inferior occipital gyrus and left fusiform gyrus showed a greater response to audiovisual letters than audiovisual numbers. Finally, a more dorsomedial activation was detected in the left lingual/left fusiform gyri, which showed greater activation to LN pairs relative to the other 2 types of audiovisual information. Convergent with previous studies, these effects could reflect processing demands specific to certain audiovisual symbols (Polk et al. 2002; Shum et al. 2013). In addition to the main effect of pair type, a main effect of congruency was found in a distributed network of frontal, parietal, and temporal regions, best characterized as an incongruency effect: each region in this network showed greater activation in response to the incongruent relative to the congruent condition across all 3 types of audiovisual pairs.
Returning to our central investigation, the fact that the congruency-dependent response in the AAC is seen in some types of audiovisual pairs, but not (or weakly in the opposite direction) in others, suggests that the difference in neural activation is specific to the nature of the audiovisual pairs. The congruency effect in the AAC was found for both letters and numerals, suggesting that the specific visual form cannot account for why some audiovisual pairs evoked a congruency response and others did not. Instead, the most salient difference between English LS pairs, on the one hand, and English LN and NN pairs, on the other hand, is the transparency of the audiovisual stimuli. Letters and numerals have completely transparent relationships with their associated names. In contrast, letters in the English orthography have a much more opaque and, therefore, unreliable relationship with English speech sounds. This difference in statistical regularity between the different types of audiovisual pairs likely underlies the differences in brain activation. This is supported by earlier work showing that speech processing is sensitive to statistical regularities within the auditory domain (Bonte et al. 2005). Interestingly, we used the LS pairs that are most regular in English (Berndt et al. 1987), which suggests that the general irregularity of mappings at the LS level in English leads to an absence of, or at least a much weaker, tuning for even the most regular LS pairs.
It has been argued that the presence of a congruency effect reflects a stimulus-driven, automatic, and learned sensitivity to the congruency of audiovisual information that could provide important clues to how audiovisual symbols are processed in the brain (Goebel and van Atteveldt 2009; Blomert 2011). The automatic nature of the congruency effect was suggested by its absence during active matching tasks (van Atteveldt, Formisano, Goebel, et al. 2007), and by the congruency effects observed using the mismatch negativity (MMN) (Froyen et al. 2008), which is generally believed to be an automatically generated auditory response (Näätänen et al. 2007). Taken together with data from Dutch participants (van Atteveldt et al. 2004, 2010), our findings strongly suggest that, through education, the AAC becomes tuned to audiovisual congruency only when the relationship between the auditory and visual information is highly regular. Our proposal does not imply that this region is uninvolved in the processing of irregular audiovisual relationships, such as those found in English LS mappings. Instead, what differs between transparent and opaque audiovisual pairings is the bottom-up modulation of speech processing in the AAC by the congruency of the visual information. When the audiovisual information of a symbol is highly overlearned and highly reliable, as it is in Dutch LS pairs and English NN pairs, the auditory association area is modulated by the congruency of the audiovisual pairs.
In contrast to the congruency effect, which was observed only in transparent audiovisual pairs, all 3 audiovisual types showed greater activation in response to incongruent relative to congruent pairs in a network or networks of frontal, parietal, and inferior temporal regions. Similar regions have been implicated in other studies of audiovisual processing (Naumer et al. 2008). In particular, the inferior frontal cortex has repeatedly been shown to be activated specifically by semantically incongruent audiovisual stimuli (Doehrmann and Naumer 2008), which has been attributed to increased demands on cognitive control. Several accounts have been offered to explain these activations. For example, van Atteveldt, Formisano, Goebel, et al. (2007) showed a similar fronto-parietal network, including anterior cingulate, bilateral inferior frontal sulcus, and right parietal regions, when participants performed an active audiovisual matching task. Drawing on supporting data from a multisensory integration study (Beauchamp et al. 2004), the authors argued that the fronto-parietal network reported in their study is likely involved in the top-down modulation needed to successfully complete the task. However, our data were observed in the absence of an active task, which makes an appeal to top-down modulation less compelling. Another possible interpretation of the fronto-parietal involvement shown in our study is that these regions aid in the detection and resolution of conflicting information. These regions have been implicated in the resolution of conflict across a variety of tasks and types of stimuli (Roberts and Hall 2008). Accordingly, the detection of these regions in our study most likely reflects the recruitment of domain-general processes of conflict detection and cognitive control (Doehrmann and Naumer 2008) rather than processes specific to audiovisual integration. Of particular note, these regions responded to all incongruent stimuli, including the LS pairs.
This implies that, although LS pairs did not show a congruency effect (congruent > incongruent), participants were nevertheless processing the congruency of LS pairs. If this were not the case, one would not see sensitivity to LS incongruency in these fronto-parietal regions. Moreover, the presence of an incongruency effect across all audiovisual pair types in these regions, coupled with a congruency effect in STC only for the LN and NN pairs, strengthens our assertion that the congruency effect is specific to audiovisual pairs with a large degree of orthographic transparency.
Taken together, the present findings hint at an intriguing possibility with broad implications for theories of reading development. Research on the foundations of reading has led to the contention that the establishment of fluent LS connections is the foundational principle upon which children learn to read (Ehri 2005). The importance of the brain circuitry underlying LS connections for reading skills has been empirically shown in studies of Dutch readers (Blau et al. 2009, 2010; Froyen et al. 2009). Blau et al. (2010) investigated LS integration directly as a function of reading ability and found that dyslexics as a group failed to show congruency sensitivity in STC, and across groups, the fMRI congruency effect was strongly correlated with reading skills. This supports the importance of intact integration of letters and speech sounds for fluent reading, at least in the transparent alphabetic Dutch script. Because the orthography of Dutch is fairly transparent, LS associations are relatively easy to learn and automatize, and importantly, reliable once learned. This feature of the Dutch orthography likely affects the neural circuits underlying reading ability. Specifically, readers of transparent orthographies can rely upon the predictable correspondence between visual letters and their associated speech sounds. In response to this regularity, regions in the AAC become tuned to the congruency of LS pairs. In other words, as individuals learn to read a transparent orthography, the auditory response of the AAC is boosted whenever a visual letter and an auditory speech sound “belong together” and suppressed if the auditory and visual information do not match. In contrast, the data presented above indicate that this modulation of the auditory response to speech sounds by the congruency of visual letters is weaker, and reversed in direction, in the brains of fluent readers of English.
The relatively opaque nature of the English orthography provides less regularity in LS connections. Accordingly, auditory association regions show no sensitivity to the difference between congruent and incongruent LS pairs. This is not to say that English readers cannot determine whether a given LS pair is congruent or incongruent. Indeed, our data suggest a high degree of commonality in the brain response to the incongruent condition across all 3 audiovisual pair types. Rather, our point is more subtle. We suggest that what differs between English and Dutch readers is not the knowledge of the correct mappings between letters and speech sounds, but rather whether this knowledge is accompanied by stimulus-driven congruency effects in superior temporal (auditory association) cortex. Indeed, developmental studies have demonstrated that Dutch-speaking children and adults are equally able to correctly indicate matching letters and speech sounds (Blomert 2011). However, the neural signal during audiovisual matching tasks differs between children and adults (Froyen et al. 2009). Specifically, while adults show an automatic modulation of the MMN to deviant speech sounds, dependent upon the congruency of the LS pairs, such congruency-dependent modulation is absent in children. These data suggest that the ability to behaviorally detect audiovisual congruency does not depend upon whether the (neural) detection is automatic. Against this background, we suggest that the weaker, oppositely directed congruency effect in STC in English readers might indicate that the audiovisual processing of LS pairs is less automatized than in Dutch readers. More direct comparisons of Dutch and English readers, for example, using an MMN paradigm, are needed to provide more conclusive answers regarding a difference in the automaticity of LS processing.
Although the current work shows no (or only very subtle and oppositely directed) sensitivity of STC to orthographic–phonological congruency at the smallest grain size (the LS pairs), other work suggests a role for STC in orthographic–phonological integration at a larger grain size in English (Booth et al. 2002, 2014). This is consistent with the congruency effects for LN and NN pairs in the current study, as LN and NN pairs are also mapped at a coarser phonological level. In sum, our findings indicate that although the STC seems to play an important role in reading across alphabetic languages, the level of orthographic–phonological units to which it becomes tuned is adaptive to the transparency of the script.
In conclusion, our study demonstrates that the neural correlates underlying the processing of learned audiovisual pairs in English show both similarities with and differences from the neural correlates of the same abilities in Dutch readers. For LN and NN pairs, which are characterized by highly transparent audiovisual correspondences, English readers showed congruency-related modulation of the left STG, a region implicated in the processing of congruency in the transparent LS associations of Dutch. However, only a weak modulation (in the reverse direction) was observed in response to the less transparent LS associations of English. These results add neurofunctional evidence to the suggestion that the basic building blocks of literacy may be quite different in transparent and opaque orthographies, corroborating work of others that points to orthographic–phonological mappings in STC at a larger grain size in English. Previous work has shown that LS integration is impaired in Dutch dyslexics (Blau et al. 2009, 2010), suggesting that letter–speech sound integration develops inadequately in dyslexic readers. Whereas this impaired LS integration may provide useful information to constrain the design of remedial interventions for Dutch dyslexics, our current findings indicate that this recommendation may not be universal. Instead, generalizations of the initial findings from the Dutch language must be made carefully, given that we do not find the same profile of brain responses during LS processing in the English language.
This research was supported by funding from the Canadian Institutes of Health Research (CIHR), the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Canada Research Chairs program (CRC) to D.A. N.V.A. is supported by the Dutch Organization for Scientific Research (NWO, grant # 451-07-020).
We thank Nadia Nosworthy and Christian Battista for their assistance with the recording of the sounds used. Conflict of Interest: None declared.