Abstract

How the brain extracts words from auditory signals is an unanswered question. We recorded approximately 150 single and multi-units from the left anterior superior temporal gyrus of a patient during multiple auditory experiments. Against low background activity, 45% of units robustly fired to particular spoken words with little or no response to pure tones, noise-vocoded speech, or environmental sounds. Many units were tuned to complex but specific sets of phonemes, which were influenced by local context but invariant to speaker, and suppressed during self-produced speech. The firing of several units to specific visual letters was correlated with their response to the corresponding auditory phonemes, providing the first direct neural evidence for phonological recoding during reading. Maximal decoding of individual phoneme and word identities was attained using firing rates from approximately 5 neurons within 200 ms after word onset. Thus, neurons in human superior temporal gyrus use sparse, spatially organized population encoding of complex acoustic–phonetic features to help recognize auditory and visual words.

Introduction

Delineating the acoustic–phonetic processing stages the brain uses to encode words is fundamental to understanding the neural processing of speech, yet these stages remain poorly understood. Although an early study (Creutzfeldt et al. 1989a) reported lateral temporal lobe neurons responding to speech, the specificity and coding properties of these cells remain unresolved: Do such neurons respond to equally complex nonspeech sounds (e.g. environmental sounds)? Are their receptive fields best described in terms of phonemes? Do they also respond to written words in a phonologically correlated manner?

Auditory cortical processing is hierarchical in nature; relatively simple features are progressively combined to build more complex representations in downstream areas (Hickok and Poeppel 2007; Hackett 2011), beginning with tonotopic frequency tuning in the primary auditory cortex (Howard et al. 1996; Bitterman et al. 2008) followed (in primates) by downstream neurons tuned to frequency-modulated sweeps and other complex spectrotemporal representations (Rauschecker 1998; Rauschecker and Scott 2009). In humans, hemodynamic neuroimaging studies suggest that this hierarchy applies to speech stimuli (Binder et al. 2000; Wessinger et al. 2001). However, these studies lack either the spatial resolution to determine the responses of single neurons to spoken words, or the temporal resolution to determine the sequence of neuronal firing. Thus, the hemodynamically observed spatial hierarchy may reflect feedback or recurrent activity rather than sequential feedforward processing.

The anatomical organization of this processing remains controversial. The traditional view suggests that posterior superior temporal cortex, near Wernicke's area, is the main region involved in processing speech sounds (Wernicke 1874; Geschwind and Levitsky 1968; Crone et al. 2001; Desai et al. 2008; Chang et al. 2010; Steinschneider et al. 2011). There is growing evidence, however, that anterior superior temporal cortex phonetically processes speech sounds and projects to more ventral areas comprising the auditory “what” stream (Scott et al. 2000, 2006; Arnott et al. 2004; Binder et al. 2004; Zatorre et al. 2004; Obleser et al. 2006, 2010; Warren et al. 2006; Saur et al. 2008; Rauschecker and Scott 2009; Perrone-Bertolotti et al. 2012).

Here, we report a detailed examination of the auditory and language responses of a large number of simultaneously recorded single units from the left anterior superior temporal gyrus (aSTG) of a 31-year-old right-handed man with epilepsy. A 96-channel microelectrode array recorded extracellular action potentials from over 140 layer III/IV neurons. The patient made semantic judgments of words referring to objects or animals in the auditory (SA) and visual (SV) modalities, compared spoken words with pictures (WN), repeated spoken words, and participated in spontaneous conversation. Controls included unintelligible vocoded speech, pure tones, environmental sounds, and time-reversed words.

Materials and Methods

Participant

A 31-year-old right-handed male with medically intractable epilepsy was admitted to the Massachusetts General Hospital for semichronic electrode implantation for surgical evaluation. The patient was left-hemisphere language dominant based on a Wada test and was a native English speaker with normal hearing, vision, and intelligence. His seizures were partial complex and typically began in mesial temporal depth electrode contacts. Surgical treatment removed the left anterior temporal lobe (including the site of the microelectrode implantation), left parahippocampal gyrus, left hippocampus, and left amygdala, resulting in the patient being seizure free at 1-year postresection. Formal neuropsychological testing 1-year postresection did not show any significant change in language functions, including naming and comprehension. The patient gave informed consent and was enrolled in this study under the auspices of Massachusetts General Hospital IRB oversight in accordance with the Declaration of Helsinki.

Electrodes and Recording

A microelectrode array (Blackrock Microsystems, Salt Lake City, UT, USA), capable of recording the action potentials of single units, was implanted in the left aSTG. This 4 × 4 mm array consists of 100 (96 active) penetrating electrodes, each 1.5 mm in length with a 20-µm exposed platinum tip, spaced 400 µm apart. Recordings were obtained by a Blackrock NeuroPort data acquisition system at 30 kHz with bandpass filtering from 0.3 Hz to 7.5 kHz. The decision to implant the array in the superior temporal gyrus was based on clinical considerations; this was a region that was within the expected resection area and was indeed resected on completion of the intracranial electroencephalography (iEEG) investigation. The region surrounding the array was removed en bloc and submitted for histological processing. Staining with hematoxylin and eosin revealed that the tips of the electrodes were at the bottom of cortical layer III, close to layer IV, and that the surrounding cortical tissue was histologically normal.

In addition to this microelectrode, clinical intracranial macroelectrodes were implanted based on clinical considerations alone. Electrodes consisted of an 8 × 8 grid of subdural macroelectrode contacts spaced 1 cm apart (Adtech Medical, Racine, WI, USA), covering the left lateral cortex including frontal, temporal, and anterior parietal areas. iEEG was continuously recorded from these clinical electrodes at 500 Hz with bandpass filtering from 0.1 to 200 Hz. All electrodes were localized with respect to the patient's reconstructed cortical surface using the method described in Dykstra et al. (2012).

Auditory Tasks

The patient performed several auditory tasks designed to examine different aspects of speech and nonspeech sound processing. In SA, the participant pressed a button to spoken words referring to animals or objects that were larger than one foot in any dimension. Words were spoken by a male speaker, normalized in power and length (500 ms), and presented with a 2200-ms stimulus onset asynchrony (SOA). Eight hundred randomly ordered trials were evenly split between novel words presented only once for the entire experiment (400 trials), and repeated words which consisted of a set of 10 words repeated 40 times each. The 10 repeated words were “claw,” “cricket,” “flag,” “fork,” “lion,” “medal,” “oyster,” “serpent,” “shelf,” and “shirt.” Half of the trials required a button press yielding a 2 × 2 balanced design. Sounds were presented binaurally using Etymotic ER-1 earphones (Elk Grove Village, IL, USA).

SV was identical to SA in all respects except that the words were presented visually on a computer screen (for 300 ms). See Dale et al. (2000), Marinkovic et al. (2003), Halgren et al. (2006), Chan, Baker, et al. (2011), and Chan, Halgren, et al. (2011) for further details and analysis of the SA and SV tasks.

In WN, the picture of an object was presented followed by a spoken word or noise. The picture (<5% visual angle) appeared for the entire trial duration of 1300 ms, and the auditory stimulus, either a congruously or incongruously paired word or noise stimulus, was presented 500 ms after picture onset. Words were single-syllable nouns recorded by a female native speaker. Noise stimuli were noise-vocoded, unintelligible, versions of the same words. Four conditions were presented in a random order: Picture matched-words (where the word referred to the picture), picture matched-noise (where the word used to generate the noise matched the picture), picture mismatched-words (the word referred to a different object than the picture), and picture mismatched-noise (the word used to generate the noise did not refer to the picture). The participant was asked to press a button to matches. To create the noise stimuli, band-passed and amplitude-modulated white noise was made to match the acoustic structure and sound level of a corresponding word. The power in each of 20 equal bands from 50 to 5000 Hz and the exact time versus power waveform for 50–247, 248–495, and 496–5000 Hz were matched between the noise and word stimuli (Shannon et al. 1995). Sounds (mean duration = 445 ± 63 ms; range = 304–637 ms; 44.1 kHz; normalized to 65 dB average intensity) were presented binaurally through Etymotic ER-1 earphones. A total of 1000 trials were presented. For more information, see Travis et al. (2012).
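
For readers who wish to follow the stimulus construction, below is a minimal sketch of noise-vocoding in the spirit of Shannon et al. (1995), assuming a word waveform `word` sampled at `fs` Hz. The three band edges follow the text; the filter order, Hilbert-transform envelope extraction, and normalization details are illustrative assumptions rather than the original stimulus-generation code.

```python
# Sketch: replace fine spectral structure with band-limited, envelope-modulated noise.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(word, fs, bands=((50, 247), (248, 495), (496, 5000))):
    """Return a noise-vocoded version of `word` that preserves the
    per-band amplitude envelope but destroys fine spectral detail."""
    rng = np.random.default_rng(0)
    out = np.zeros_like(word, dtype=float)
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, word)              # band-limited speech
        envelope = np.abs(hilbert(band))           # amplitude envelope of the band
        noise = sosfiltfilt(sos, rng.standard_normal(len(word)))
        noise *= envelope / (np.sqrt(np.mean(noise ** 2)) + 1e-12)
        out += noise
    # Roughly match the overall level of the original word
    out *= np.sqrt(np.mean(word ** 2)) / (np.sqrt(np.mean(out ** 2)) + 1e-12)
    return out
```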

Several sets of nonspeech sounds were also presented to the patient. To explore responses to simpler acoustic stimuli, 7.2-s sequences of randomly selected pure tones were presented. Tones were 100 ms in length, including 10 ms raised cosine-on and -off ramps, and were centered at 0.239, 0.286, 0.343, 0.409, 0.489, 0.585, 0.699, 0.836, 1, 1.196, 1.430, 1.710, 2.045, 2.445, 2.924, 3.497, 4.181, or 5.000 kHz. Tones were placed randomly in time and frequency within each band with an average within-band SOA of 800 ms (range: 100–1500 ms). Within each band, the exact frequency of any given tone was within one estimated equivalent rectangular bandwidth (ERB) of the center frequency, ERB = 24.7 × (4.37 × fc + 1), where fc is the center frequency of the band in kHz.
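
As a worked example of the bandwidth formula above, the following snippet evaluates the ERB (in Hz) for each of the 18 tone center frequencies; the helper name is illustrative.

```python
# ERB (Hz) for a center frequency fc given in kHz (Glasberg-and-Moore-style formula from the text).
def erb_hz(fc_khz):
    return 24.7 * (4.37 * fc_khz + 1.0)

centers_khz = [0.239, 0.286, 0.343, 0.409, 0.489, 0.585, 0.699, 0.836,
               1.0, 1.196, 1.430, 1.710, 2.045, 2.445, 2.924, 3.497,
               4.181, 5.000]
for fc in centers_khz:
    print(f"{fc:.3f} kHz -> ERB ~ {erb_hz(fc):.1f} Hz")
```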

In addition to the pure tones, the participant was presented the SA word stimuli and asked to repeat them out loud. The subject began speaking, on average, 411 ± 119 ms after the end of the stimulus, and the SOA was 3000 ms. The auditory word stimuli from the SA task were also time-reversed and presented to the participant (with an SOA of 2200 ms), who was asked to passively listen. Time-reversal of words preserves the spectral content of the stimuli, but changes the temporal structure of the sounds. A set of 30 environmental sounds, natural (e.g. birds chirping, waterfall) and manmade (e.g. clapping hands, breaking glass), was also presented to the patient. Each sound was presented 5 times in pseudorandom order with a 3500-ms SOA. Finally, a spontaneous conversation between the patient and researchers was recorded using a far-field microphone. The entire conversation was manually transcribed and all word-boundaries were marked.

Spike Sorting and Analysis

To extract spikes from the microelectrode recordings, continuous data were high-pass filtered at 250 Hz using a sixth-order Bessel filter, and an amplitude threshold of 4 standard deviations was used to detect action potential waveforms. Extracted spikes were manually sorted using Offline Sorter (Plexon, Dallas, TX, USA) in various feature spaces, including principal components, peak-valley amplitude, and nonlinear energy. Units were characterized as single or multi-units based on the quality of sorted clusters and the amplitude of waveforms. Initial analyses indicated no apparent differences between single and multi-units, so, unless indicated otherwise, reported results include both. Putative inhibitory interneurons were identified based on waveform shape, full-width at half-maximum, and valley-to-peak time (Bartho et al. 2004; Peyrache et al. 2012). Because experiments were performed over the course of 3 days, identified units varied across tasks. Sorting was performed simultaneously over multiple tasks when units were compared between them. Groups of spikes that demonstrated consistent clusters in principal component space, visually similar spike waveforms, and consistent auto/cross-correlations between multiple tasks were deemed to originate from the same unit. This grouping was only possible for putative single units.
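
The spike-detection front end described above can be sketched as follows, assuming `raw` is a single-channel voltage trace sampled at 30 kHz. The negative-going threshold polarity, the 1.6-ms waveform window, and zero-phase filtering are assumptions not specified in the text.

```python
# Sketch: high-pass filter at 250 Hz (sixth-order Bessel) and extract threshold-crossing waveforms.
import numpy as np
from scipy.signal import bessel, sosfiltfilt

def detect_spikes(raw, fs=30000, hp_hz=250.0, n_sd=4.0, win_ms=1.6):
    sos = bessel(6, hp_hz, btype="highpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, raw)              # zero-phase high-pass
    thresh = n_sd * np.std(filtered)              # 4-SD amplitude threshold
    half = int(win_ms / 2 * fs / 1000)            # half-window in samples
    # Negative-going threshold crossings (polarity is an assumption)
    crossings = np.flatnonzero(
        (filtered[1:] < -thresh) & (filtered[:-1] >= -thresh)) + 1
    waveforms = np.array([filtered[i - half:i + half]
                          for i in crossings
                          if half <= i < len(filtered) - half])
    return crossings / fs, waveforms              # spike times (s), waveform snippets
```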

To statistically determine and quantify the magnitude and latency of unit responses to particular stimuli, a nonparametric cluster-based statistical test was used. Similar to the nonparametric test utilized by Maris and Oostenveld (2007) for testing continuous data, statistics were computed for individual 30 ms bins of peristimulus time histograms (PSTHs). The binned firing rates of a prestimulus baseline period from −300 to 0 ms were compared with each poststimulus bin using a 2-sided t-test. Clusters of consecutive bins with significance of Pbin < 0.05 were found, and the summed T-statistic of each cluster was computed. The null distribution of cluster-level statistics was computed by randomly permuting the bins in time and recomputing clusters 1000 times. The cluster-level statistics were compared with the null distribution, and a cluster was deemed significant if the cluster-level probability was Pcluster < 0.05. The earliest bin in a statistically significant cluster was taken to be the response latency of that particular unit, and the magnitude of the response computed as the average firing rate within that cluster.
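
A minimal sketch of this cluster-based permutation test is given below, assuming `post` holds trials × bins spike counts in 30-ms poststimulus bins and `base` the counts from the −300 to 0 ms baseline. The unpaired t-test against the per-trial baseline mean and the maximum cluster statistic used for the null distribution are assumptions where the text is not explicit.

```python
# Sketch: bin-wise t-tests, clusters of consecutive significant bins, and a time-shuffled null.
import numpy as np
from scipy.stats import ttest_ind

def find_clusters(t, p, alpha=0.05):
    """Group consecutive significant bins and sum their t-statistics."""
    clusters, current = [], []
    for i, significant in enumerate(p < alpha):
        if significant:
            current.append(i)
        else:
            if current:
                clusters.append((current, t[current].sum()))
            current = []
    if current:
        clusters.append((current, t[current].sum()))
    return clusters

def cluster_perm_test(post, base, n_perm=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    base_rate = np.repeat(base.mean(axis=1, keepdims=True),
                          post.shape[1], axis=1)        # per-trial baseline level
    t, p = ttest_ind(post, base_rate, axis=0)           # one 2-sided test per bin
    t, p = np.asarray(t), np.asarray(p)
    observed = find_clusters(t, p, alpha)
    null_max = np.zeros(n_perm)
    for k in range(n_perm):
        order = rng.permutation(post.shape[1])          # shuffle bins in time
        # Per-bin tests are independent, so permuting t and p is equivalent
        # to recomputing them on time-shuffled bins.
        perm_clusters = find_clusters(t[order], p[order], alpha)
        null_max[k] = max((abs(s) for _, s in perm_clusters), default=0.0)
    return [(bins, stat, float(np.mean(null_max >= abs(stat))))
            for bins, stat in observed]
```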

To quantify responses to phonemes, all phoneme boundaries were manually marked for all relevant stimuli and PSTHs were computed by aligning phoneme start times. Formants were computed using Wavesurfer (http://www.speech.kth.se/wavesurfer/) using 20-ms windows overlapped by 10 ms. The Carnegie Mellon University Pronouncing Dictionary was used to obtain phonetic transcriptions of words in the SV task (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). Word frequency information for the SV task was obtained using the HAL corpus (Lund and Burgess 1996), resulting in a mean word frequency of 2849, median of 1422, and range of 2–37 798. Words below the median value of 1422 were grouped into the low-frequency class, and words above this median taken as the high-frequency class.
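
Aligning spikes to the manually marked phoneme onsets amounts to re-referencing spike times and histogramming, as in the sketch below; the window limits and the 30-ms bin width are illustrative choices.

```python
# Sketch: phoneme-aligned PSTH from spike times (s) and marked phoneme onsets (s).
import numpy as np

def phoneme_psth(spike_times, phoneme_onsets, t_pre=0.2, t_post=0.6, bin_s=0.030):
    edges = np.arange(-t_pre, t_post + bin_s, bin_s)
    counts = np.zeros(len(edges) - 1)
    for onset in phoneme_onsets:
        rel = np.asarray(spike_times) - onset        # spike times relative to this onset
        counts += np.histogram(rel, bins=edges)[0]
    rate = counts / (len(phoneme_onsets) * bin_s)    # spikes/s
    return edges[:-1] + bin_s / 2, rate              # bin centers, firing rate
```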

Spectrotemporal Receptive Field Estimation

The spectrotemporal receptive field (STRF) is the average time-course of spectral features preceding each action potential and characterizes that neuron's preferred time-frequency features. STRFs were calculated with 2 sets of spectral features: 1) Power in linearly spaced frequencies from 50 Hz to 4 kHz, and 2) Mel-frequency cepstral coefficients (MFCCs). MFCCs use a logarithmic frequency representation to approximate the human auditory system (Davis and Mermelstein 1980) and allow for the separation of the fundamental excitation frequency of the vocal cords (known as the “source,” defining speech pitch and thus the speaker) from the shape and properties of the articulatory chamber (known as the “filter,” defining phonological information and thus allowing for word discrimination). MFCCs were computed in MATLAB between 200 Hz and 8 kHz using a liftering exponent of 22. The first 13 coefficients, which carry most of the pitch-invariant phonological information, were extracted; these features were computed in 20-ms windows shifted by 10 ms.
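
A comparable MFCC front end (20-ms windows, 10-ms hop, 200 Hz–8 kHz, liftering exponent 22, first 13 coefficients) can be obtained in Python with librosa, as sketched below; exact values will differ from the original MATLAB implementation, and the function name is illustrative.

```python
# Sketch: compute 13 liftered MFCCs on 20-ms windows shifted by 10 ms.
import librosa

def word_mfccs(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)              # keep the native sampling rate
    return librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc, lifter=22,
        n_fft=int(0.020 * sr), hop_length=int(0.010 * sr),
        fmin=200, fmax=8000)                         # shape: (n_mfcc, n_frames)
```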

The method described by Theunissen et al. (2001) was used to estimate the STRFs for each unit. This method compensates for correlations within the stimuli to generate the optimal linear filter, which best characterizes the relationship between the stimulus and firing rate. To predict the firing rate of these units, firing rate was first smoothed using a 120-ms Gaussian kernel, and the STRFs were computed using the novel words in the SA task. The resulting STRF was convolved with the time-course of stimulus features generated for the repeated words, yielding a predicted PSTH. This PSTH prediction was performed for each of the 10 repeated words in the SA and time-reversed word tasks.
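
The logic of STRF estimation and PSTH prediction can be illustrated with ordinary ridge regression on a time-lagged feature matrix, as below, assuming `S` is a features × time array (spectrogram power or MFCCs) and `r` the smoothed firing rate sampled on the same 10-ms grid. Ridge regularization stands in here for the correlation-compensated estimator of Theunissen et al. (2001); the lag count and penalty are illustrative.

```python
# Sketch: linear STRF by regularized regression on lagged stimulus features.
import numpy as np

def lagged_design(S, n_lags):
    """Stack time-lagged copies of the features: rows are time points,
    columns are (lag, feature) pairs, lag 0 first."""
    n_feat, n_t = S.shape
    X = np.zeros((n_t, n_feat * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * n_feat:(lag + 1) * n_feat] = S[:, :n_t - lag].T
    return X

def fit_strf(S, r, n_lags=20, ridge=1.0):
    """Return an STRF of shape (features, lags) fit to firing rate r."""
    X = lagged_design(S, n_lags)
    w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ r)
    return w.reshape(n_lags, S.shape[0]).T

def predict_psth(strf, S):
    """Convolve new stimulus features with the STRF to predict a PSTH."""
    n_lags = strf.shape[1]
    X = lagged_design(S, n_lags)
    return X @ strf.T.reshape(-1)
```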

Decoding

To decode either repeated words or phonemes, a set of features was computed from the unit firing rates for each trial. For the classification of repeated words, all units which demonstrated a statistically significant response to auditory words were used. For each unit, a time window was computed in which its firing rate significantly changed from baseline. Subsequently, for each trial, the number of spikes occurring within that time window was used as one of the features for the classifier. For phoneme decoding, this window was fixed from 50 to 250 ms after phoneme onset. This window was selected based on the peak firing rates seen in the PSTHs generated to individual phonemes.

To examine changes in information over time, either sliding windows or cumulative windows were used to compute firing rates. For sliding windows, the firing was computed in a 50-ms window for each unit beginning at 0 ms (the window therefore covered 0–50 ms) and after performing the decoding analysis, the window was shifted 10 ms forward. For the cumulative window analysis, 25-ms windows were used, and instead of shifting the window, subsequent nonoverlapping windows were concatenated to the growing feature vector between each round of decoding. This allowed for the analysis of information in a time frame of 0–25 up to 0–1000 ms.
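
The sliding- and cumulative-window feature construction can be sketched as follows, assuming `trials` is a list (one entry per trial) of per-unit spike-time arrays, in seconds relative to stimulus onset; the names and the 1-s analysis horizon are illustrative.

```python
# Sketch: per-trial spike-count features from sliding and cumulative windows.
import numpy as np

def window_counts(trial_spikes, t0, t1):
    """Spike count of each unit in [t0, t1) for one trial."""
    return np.array([np.sum((st >= t0) & (st < t1)) for st in trial_spikes])

def sliding_features(trials, width=0.050, step=0.010, t_max=1.0):
    """One trials x units count matrix per 50-ms window position."""
    starts = np.arange(0.0, t_max - width + 1e-9, step)
    return [np.array([window_counts(tr, s, s + width) for tr in trials])
            for s in starts]

def cumulative_features(trials, width=0.025, t_max=1.0):
    """Trials x (units * windows) matrices grown by concatenating 25-ms windows."""
    edges = np.arange(0.0, t_max + 1e-9, width)
    blocks = [np.array([window_counts(tr, a, b) for tr in trials])
              for a, b in zip(edges[:-1], edges[1:])]
    return [np.hstack(blocks[:k + 1]) for k in range(len(blocks))]
```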

A Naïve Bayes Classifier was used to decode word or phoneme-specific information from the computed features. The Naïve Bayes Classifier assumes that all features, f1, f2, … ,fn, are independent, making the joint probability of all features the product of the marginal probabilities. 

$$P(f_1, f_2, \ldots, f_n) = P(f_1)\,P(f_2)\cdots P(f_n)$$

This allows the classifier to assign a probability that a new trial belongs to class Ci, based on a set of firing rates, f1, f2, …, fn. The predicted class for this new trial, ŷ, is chosen to be the class with the highest conditional probability.

$$\hat{y} = \underset{C_i}{\arg\max}\; P(C_i \mid f_1, f_2, \ldots, f_n) = \underset{C_i}{\arg\max}\; \frac{P(C_i)\prod_{j=1}^{n} P(f_j \mid C_i)}{\prod_{j=1}^{n} P(f_j)}$$

In the case of sliding windows, the features, f1, f2, …, fn, are the firing rates of neurons 1 to n in a chosen window, while for cumulative windows, the features are the firing rates of these neurons in M different nonoverlapping windows, resulting in nM total features.

To train the classifier, for each combination of unit and class, a Poisson distribution of spike counts was estimated (via maximum likelihood), allowing for the computation of probabilities of observed firing rates. Accuracies were estimated via 10-fold cross-validation.
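
Putting the preceding pieces together, a sketch of the Poisson naïve Bayes decoder with 10-fold cross-validation is shown below, assuming `X` is a trials × units matrix of window spike counts and `y` the word or phoneme labels. The small pseudocount and the stratified folds are assumptions added for numerical stability rather than details taken from the text.

```python
# Sketch: naive Bayes with per-unit, per-class Poisson likelihoods.
import numpy as np
from scipy.stats import poisson
from sklearn.model_selection import StratifiedKFold

def fit_poisson_nb(X, y, pseudo=0.1):
    """Fit one Poisson rate per (class, unit) by maximum likelihood (the mean count)."""
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    rates = np.array([X[y == c].mean(axis=0) + pseudo for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    return classes, rates, priors

def predict_poisson_nb(model, X):
    classes, rates, priors = model
    # log P(C) + sum_j log P(f_j | C) for each class
    loglik = np.stack([poisson.logpmf(X, r).sum(axis=1) + np.log(p)
                       for r, p in zip(rates, priors)], axis=1)
    return classes[np.argmax(loglik, axis=1)]

def cv_accuracy(X, y, n_folds=10, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    accs = []
    for train, test in skf.split(X, y):
        model = fit_poisson_nb(X[train], y[train])
        accs.append(np.mean(predict_poisson_nb(model, X[test]) == y[test]))
    return float(np.mean(accs))
```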

To test whether location in formant space provided information equivalent to phoneme identity, each phoneme was reclassified to the vowel that had the closest mean F1–F2 value in this formant space. All classes were balanced such that they had the same number of instances before and after reassignment.
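
The formant-based relabeling can be sketched as below, assuming `formants` is a tokens × 2 array of measured (F1, F2) values and `labels` the original vowel labels; class balancing is omitted for brevity.

```python
# Sketch: relabel each vowel token to the class with the nearest mean (F1, F2).
import numpy as np

def reassign_by_formants(formants, labels):
    formants, labels = np.asarray(formants), np.asarray(labels)
    classes = np.unique(labels)
    means = np.array([formants[labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(formants[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]
```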

To test whether phonetic context affected the firing rates of the identified neurons, we attempted to decode the first vowel in the repeated words in the SA task (where the context was consistent), compared with the first vowel in the new words (where the context varied across the words). A matched set of new words was chosen, so that an equivalent number of instances of each vowel were present. If these neurons only encoded information about the first vowel identity, equal decoding accuracy would be expected in both cases.

Results

Single-Unit Sorting Results

Units were manually sorted for each of the 7 experiments. Because experiments were performed over the course of 3 days, identified units varied from task to task. For analyses where units were compared between experiments, sorting was performed simultaneously over all tasks of interest. A total of 142 units were identified during the SA task, with 58 units characterized as likely single units and 84 potential multi-units. A total of 146 units were identified from the WN task (63 single and 83 multi-units), 166 during the presentation of the time-reversed words (77 single and 89 multi-units), 144 during the presentation of pure tone sequences (77 single and 67 multi-units), 169 during the repetition of auditory words (79 single and 90 multi-units), 171 during a spontaneous conversation (77 single and 94 multi-units), and 181 during the SV task (86 single and 95 multi-units). Single units demonstrated firing rates between 0.0004 and 11.1 spikes/s (mean = 0.37 spikes/s). 17% of all identified single units were putative inhibitory interneurons based on waveform shape, full-width at half-maximum, and valley-to-peak time (Bartho et al. 2004; Peyrache et al. 2012). The mean firing rate of these inhibitory cells was 1.96 spikes/s, compared with the mean firing rate of excitatory single units of 0.17 spikes/s. Responses were similar for single and multi-units, and for putative pyramidal cells and interneurons, and they are combined for subsequent analyses unless otherwise indicated.

Single-Unit Word Specificity

The patient was first asked to listen through headphones to a set of recorded spoken words that corresponded to concrete objects or animals and indicate if the item was larger than a foot in any dimension (SA task). Half of the trials involved 400 words that were only presented once during the experiment, while the other half involved 10 words that were presented 40 times each. Online visual inspection revealed that many units responded strongly to spoken word stimuli, with PSTHs demonstrating specificity for individual words. Two examples of such units are shown in Figure 1. These 2 cells showed bidirectional monosynaptic connectivity (based on cross-correlation, Fig. 1B) and strong word-specific responses to the repeated words in the SA task. Interestingly, the excitatory cell demonstrated narrower tuning than the inhibitory cell (Fig. 1C). Unit 6a, the putative inhibitory interneuron, demonstrated differences in the firing rate to the 10 repeated words with largest responses to “claw,” “cricket,” and “oyster.” For unit 6b, the putative pyramidal cell, while “claw” and “cricket” also evoked large responses, “oyster” did not.

Figure 1.

Units demonstrate differential firing to individual words. (A) PSTHs and raster plots of units 6a (top) and 6b (bottom) in response to word stimuli. (B) Autocorrelograms for units 6a and 6b (top and bottom, respectively) and crosscorrelogram for unit 6b in relation to unit 6a (middle) suggesting bidirectional, monosynaptic connectivity between an inhibitory interneuron (6a) and pyramidal cell (6b). (C) Firing rates for units 6a and 6b to each of the 10 repeated words demonstrate robust word-specific firing. (D) PSTHs for unit 6a in response to 3 example words with corresponding stimulus spectrogram and waveform plots below show differences in the magnitude and latency of response.

In total, 66 of the 141 units exhibited a statistically significant response to auditory words in SA (P < 0.05, Fig. 2B). Fifty-nine units increased firing in response to words, while 7 units decreased their firing. Baseline firing of responsive units varied from 0 to 5.16 spikes/s (mean = 0.31 spikes/s), with changes in the firing rate ranging from 0.03 to 12.4 spikes/s (mean = 0.75 spikes/s). Excluding units with a baseline firing rate of 0, the firing rate increased by 625% on average (range = 335–2567%; Supplementary Fig. 1). Peak firing ranged from 0.31 to 14.3 spikes/s (mean = 0.50 spikes/s). Response latencies varied from 20 to 940 ms (mean = 308 ms) after word onset. Thirty-one units demonstrated differential firing to the 10 repeated words in the SA task (P < 0.05, Kruskal–Wallis, between 100 and 900 ms). Unlike cells in the anteroventral temporal lobe (Chan, Baker, et al. 2011), no cells in the anterior STG responded differentially to words referring either to animals versus manmade objects or to novel versus repeated words.

Figure 2.

Units display spatial organization for response and tuning properties. (A) Number of identified units on each electrode of the array during the SA task, organized spatially. A total of 141 units were identified. (B) The number of units on each electrode which demonstrated a statistically significant response to word stimuli greater than baseline. A total of 66 units responded to auditory word stimuli. (C) The distribution of response latencies across each channel of the array. The mean latency is shown for electrodes with more than one unit. (D) Significant spatial correlation of any response to words was found up to 800 µm (green), for response latency up to 600 µm (blue), and for individual-word selective responses up to 400 µm (red; t-test, P < 0.05). Inset shows the aSTG location of the microelectrode array.

Tuning to a number of different features may lead to this specificity for particular words. At the lowest level, it is possible that these units simply respond to a particular frequency or sound intensity that is present in a subset of the presented words. It is also possible that these units are responding to specific acoustic features of spoken words, such as complex time-frequency components, combinations of formant frequencies, or even phoneme identity. At the highest levels, these units may encode the auditory representations of full words. We therefore tested the response of these units to a diverse set of acoustic stimuli that spanned a wide range of acoustic complexity.

Spatial Organization of Responses and Correlation with Gamma Activity

The regular spacing of the 10 × 10 microelectrode array allowed the spatial organization of unit response properties to be examined. Although identified units were uniformly distributed across the array (Fig. 2A), those responding to auditory word stimuli were clustered on one side (Fig. 2B). More specific properties, such as response latency (Fig. 2C) and word-specific response profiles, also demonstrated spatial correlations. To quantitatively explore this spatial organization, the spatial autocorrelations of various response properties (e.g. responsiveness, response latency, 10-word response profiles) were computed between units and plotted against the Euclidean distances separating the electrodes that recorded them in the 4 × 4 mm array. Correlation of responsiveness to words extended to recordings separated by 800 µm, correlation for response latencies was significant to 600 µm, and correlation for 10-word response profiles was observed up to distances of 400 µm (Fig. 2D). Despite these spatial correlations, macroelectrode electrocorticography (ECoG) directly over the microelectrode site failed to show language-specific responses (Supplementary Fig. 2).
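
The spatial analysis amounts to correlating a response property between every pair of units and binning the pairs by inter-electrode distance, as in the sketch below; the use of Pearson correlation and binning at multiples of the 400-µm pitch are assumptions.

```python
# Sketch: pairwise correlation of 10-word response profiles as a function of electrode distance.
import numpy as np
from scipy.stats import pearsonr

def spatial_profile_correlation(profiles, positions, pitch=400.0):
    """profiles: units x words firing rates; positions: units x 2 electrode coordinates (um)."""
    by_distance = {}
    n = len(profiles)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(positions[i] - positions[j])
            r, _ = pearsonr(profiles[i], profiles[j])
            by_distance.setdefault(round(d / pitch) * pitch, []).append(r)
    return {d: float(np.mean(rs)) for d, rs in sorted(by_distance.items())}
```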

This analysis shows that the response profiles across words were similar for multiple units recorded by a given microelectrode contact, and slightly correlated with the profiles of neurons recorded at adjacent contacts. In 11 cases, at least 2 units present on the same electrode had correlated 10-word response profiles. We also tested whether gamma power (30–100 Hz) between 100 and 700 ms at these contacts, taken as a measure of population activity, showed significantly different responses to different words. This was found in 2 of the 11 electrodes, thus indicating that the specificity for particular words can also be observed at the population level. These 2 electrodes also demonstrated significant increases in gamma power to words versus noise-vocoded speech. Furthermore, the firing rates of the units on these electrodes in response to the 10 repeated words were significantly correlated with changes in gamma power to those same words (Supplementary Fig. 3, r > 0.72, P < 0.001, Pearson). Broadband averaged local field potentials (LFPs) did not demonstrate significant word-specific responses on any electrode.

Responses to NonSpeech Sounds

To test whether these word-selective units were in fact responding to lower level features, we tested their response to a diverse set of acoustic stimuli that spanned a wide range of acoustic complexity. None of the units showed statistically significant responses when the patient listened to sequences of 100-ms pure tones ranging from 240 Hz to 5 kHz (P > 0.05, Fig. 3A).

Figure 3.

Units fail to respond to nonspeech sounds. (A) PSTHs for unit 6a to 100 ms pure tones at 18 different frequencies demonstrate no changes in the firing rate. (B) Unit 6a demonstrates a large response to auditory words but a much reduced response to word-matched noise. LFP waveforms are superimposed on the raster plot. (C) PSTHs to each of the 10 repeated words for the SA task played forward (black) or time-reversed (gray) demonstrate shifted latencies and relatively reduced responses in most of the cases. (D) Only unit 6a (black) and unit 36a (gray) demonstrated statistically significant responses to any of the presented environmental sounds (Monte-Carlo permutation test, P < 0.05). Unit 6a only responded to laughter, while unit 36a only responded to a baby crying, both human vocal sounds. Magnitude of responses was significantly lower than the response to spoken words.

The patient also listened to noise-vocoded speech that was acoustically matched to auditory word stimuli (WN). Noise-vocoded stimuli contained the same time-course of power in 3 frequency bands as the matched word, but the fine-scale spectral information within these bands was replaced with band-passed white noise. The subject decided if the word semantically matched the picture presented immediately beforehand. Only 3 of the 60 units that responded to words also responded to noise-vocoded speech. Furthermore, firing rates were 65% lower to noise-vocoded speech stimuli than to the matched words (Fig. 3B). No cell responded differentially to matched versus mismatched words.

The patient then passively listened to time-reversed auditory words from SA. Because time-reversing maintains frequency information and acoustic complexity, but changes the temporal structure of words, it has often been used in hemodynamic studies of word responses (Howard et al. 1992; Perani et al. 1996; Price et al. 1996; Hirano et al. 1997; Binder et al. 2000; Crinion et al. 2003). Vowel sounds are relatively preserved while consonants (especially stop consonants) are often distorted, and many of the sounds are not phonetically possible. Only 17% of the units responded to time-reversed words (compared with 47% responding to normal words), and the magnitude of the response was significantly smaller (0.21 spikes/s for time-reversed vs. 0.75 spikes/s for normal words, P < 0.05, Wilcoxon rank-sum). For several units, responses to time-reversed words were also significantly delayed (Fig. 3C). The responses to time-reversed words demonstrate that complete words are not necessary for activating these units; the small amplitude of these responses, as well as the lack of responses to tones and to environmental sounds (except, occasionally, vocalizations), is consistent with these cells being highly selective for speech sounds.

While these control stimuli elicited relatively smaller responses than spoken words, they are artificially constructed synthetic sounds. It is possible that the identified units respond equally well to naturally occurring environmental sounds. A set of 30 environmental sounds, both man-made and natural, was presented to the subject. Only 2 units demonstrated statistically significant responses to any of the stimuli (Fig. 3D), and these responded only to male laughter (unit 6a) or a baby crying (unit 36a). Interestingly, both of these stimuli are human vocalizations.

Responses to Phonemes

Formant values at the midpoint of each vowel were correlated to firing of each unit from 50 to 250 ms after vowel onset. Only unit 6a demonstrated significant correlations, with a moderate negative F1 correlation of ρ = −0.13 (Spearman, P < 0.01) and positive F2 correlation of ρ = 0.11 (Spearman, P < 0.05).

PSTHs were generated to each of the phonemes present in SA and WN. In one example, unit 6a clearly showed specific firing to several vowels beginning at approximately 70 ms and peaking at 100 ms (Fig. 4). These phonemes included the high-front vowels [ɪ], [i], [oɪ], [oʊ], and [u]. Several consonants, such as [p], [b], [t], and [f], also demonstrated increases in firing around 100 ms. Some of the phoneme PSTHs, such as that for [ŋ], demonstrated increases in firing before 0 ms, likely due to the common occurrence of specific preceding vowel sounds (e.g. [ɪ] as in words ending with “ing”). Overall, 22 units demonstrated significant responses to at least one phoneme, with 16 of the 22 units responding to more than one phoneme (Fig. 4C). Many of the unresponsive cells had firing rates too low to permit sensitive statistical tests. While more units exhibited significant responses to consonants (Fig. 4D), 9 of the 22 units responded to both consonants and vowels. Most responses were significantly modulated by phoneme position (first vs. second syllable) and/or amplitude (Supplementary Fig. 4).

Figure 4.

PSTHs demonstrate firing to a subset of phonemes. (A) PSTHs for unit 6a for each vowel phoneme approximately arranged in formant space. (B) PSTHs for each consonant. (C) Number of consonants or vowels each unit significantly responded to (P < 0.05). Each horizontal bar represents a different unit. Bars with an adjacent dot indicate single (vs. multi) units. (D) Distribution of units responding to each phoneme with vowels on the left and consonants on the right.

Spectrotemporal Receptive Fields

Formants and phonemes are well-established intermediate features between the acoustic signal and the word. We also attempted to characterize the units' responses in a manner less constrained by a priori categories by computing STRFs for unit 6a. We computed STRFs using power within linearly spaced frequencies from 50 Hz to 4 kHz, and using MFCCs. MFCCs approximate the human auditory system (Davis and Mermelstein 1980) and allow for the separation of speech pitch from phonological information. We utilized the first 13 MFCCs, thus focusing on pitch-invariant information and discarding speaker-specific information. An STRF computed with linear frequency features shows a complex combination of low- (50–500 Hz) and high- (∼2.5 kHz) frequency components between 0 and 100 ms, contributing to the firing of unit 6a (Fig. 5A). Similarly, the STRF computed with MFCCs demonstrates a wide range of cepstral components, at a similar latency, contributing to the firing of this unit (Fig. 5B).

Figure 5.

STRFs can predict unit firing responses to words. (A) STRF for unit 6a to the 400 novel words from SA computed for linear frequency features from 50 Hz to 4 kHz, where 0 is the firing of the unit. Power after zero indicates the frequencies that predict the cell firing at the indicated delay. (B) STRF for unit 6a computed using the first 13 MFCCs and an energy term. (C) Predicted versus actual firing rates of unit 6a to the repeated words in the SA task. MFCC features resulted in a better prediction (R2 = 0.42 vs. 0.16). (D) Prediction of firing rates for reversed words using the STRF computed from MFCCs results in overestimation of firing rates with R2 = 0.14.

These representations were also able to predict the unit firing to a set of spoken words (Fig. 5C). The STRFs generated using the MFCCs better predicted the actual firing rates to each of the 10-repeated words than the linear frequency representation (R2 = 0.42 and 0.16, respectively). Despite this, both sets of features consistently underestimate the actual firing rates of this unit. The computed MFCC STRF was also used to predict the firing to the time-reversed words (Fig. 5D). In this case, the predicted firing rates tended to overestimate the actual firing rates, resulting in R2 = 0.14. The fact that this set of acoustic features fails to adequately predict time-reversed words suggests that this unit is responding to features in words that are destroyed by time-reversal.

Responses to Written Words

The patient also performed a visual word size judgment task (SV) that was equivalent to the SA task, but used written words presented on a computer screen instead of spoken words. 26% of 177 units significantly responded to written words in SV. Forty-six units were present in both SA and SV, 18 units responded to auditory words only, 9 to both visual and auditory words, and 19 to neither (Fig. 6A); no cell responded to visual words only. When responding to both, the latency to visual words was 170 ± 31 ms longer than to auditory words (Fig. 6C). On average, auditory words elicited an 8.04-fold increase in firing over baseline, while visual words elicited a 3.02-fold increase over baseline.

Figure 6.

Single units also respond to written words based on phonemes in pronunciation. (A) PSTHs and raster plots for 2 units (24b and 28d) for the presentation of written (top) or spoken (bottom) words. (B) Firing rates (between 0 and 1000 ms) to words containing a given phoneme are significantly correlated between visual and auditory words (Spearman, unit 24b: ρ = 0.54, unit 28d: ρ = 0.56, P < 0.01). Solid line represents best linear fit. (C) Response latency to written words is delayed by an average of 170 ms compared with auditory words (dashed line). Solid line represents no delay between modalities. (D) Correlation of firing rates between auditory words containing given phonemes between 0 and 1000 ms and written words containing the same phonemes in 300-ms sliding time windows. Windows for both visual and auditory words begin at the time postword onset indicated on the x-axis. The onset of significant correlation for unit 24b was in the 175- to 475-ms time window, and peak correlation was from 330 to 630 ms. For unit 28d, significant correlation started at 325–625 and peaked from 450 to 750 ms.

To explore whether the responses to these visual words were due to phonological properties of the words, we correlated mean firing rates between responses to spoken and written words containing the same phonemes from 0 to 1000 ms postonset. The phoneme tuning properties of 2 of the 9 units showed significant correlations with ρ = 0.54 for unit 24b and ρ = 0.56 for unit 28d (Spearman, P < 0.01; Fig. 6B). Correlation estimates for the other 7 units had low statistical power, because their firing rates were <1 spike/s. Behavioral studies have suggested that phonological recoding is stronger for low-frequency words (Seidenberg 1985). Words were divided into low- and high-frequency words using the median HAL frequency of 1422, and the between-modality correlation of phonetic encoding patterns was recomputed. For low-frequency words, the correlation remained high at ρ = 0.44 (unit 24b) and ρ = 0.55 (unit 28d) (Spearman, P < 0.01). For high-frequency words, the correlation became insignificant for unit 24b (ρ = −0.11, P > 0.05) and dropped for unit 28d (ρ = 0.34, P < 0.05).

The latency of this correlation was examined by computing the correlation coefficient between the firing rate in response to auditory words from 0 to 1000 ms containing given phonemes, and the firing rate in response to visual words containing the same phonemes in 300-ms sliding windows starting from 0 to 700 ms after stimulus onset (Fig. 6D). Significant correlations began in the 175–475 ms window and peaked at 330–630 ms for unit 24b, and began at 325–625 ms and peaked at 450–750 ms for unit 28d.

Diversity of Unit Tuning Allows for Decoding of Words

To characterize the amount and diversity of information present in the firing rates of the identified units, we attempted to decode (i.e. individually discriminate) the 10 repeated words from the SA task using unit responses. A Naïve Bayes classifier achieved peak accuracy of 39.25% using 28 units (chance = 10%), with near-maximum performance using only 10 units (Fig. 7A). The temporal evolution of word-selective firing was tested by decoding word identity from cumulative 25-ms windows starting at 0 ms (Fig. 7C). Within 200 ms, 34% accuracy (chance = 10%) was reached when using the top 5 units; adding up to 30 units improved accuracies at longer latencies.

Figure 7.

Units provide diverse information that allows for decoding of individual words and phonemes. (A) Accuracy of decoding 10 words (chance = 10%) obtained by sequentially adding firing rate information from one unit at a time. The black line is obtained by maximizing the accuracy at each step (i.e. selecting the most informative unit). The gray line demonstrates the accuracy when units are randomly added. (B) The decoding of either phonemes (black) or discretized formant classes (gray) demonstrates that these units provide more information on phoneme identity than formant space. (C) The word decoding accuracy using a cumulative set of 25 ms windows and the top 5, 10, 20, or 30 units identified by the analysis performed in A. Decoding is effective beginning shortly after 100 ms and rises rapidly. (D) First vowel decoding accuracy using a 50-ms sliding window, from the repeated words (gray) and from a matched set of new words (black). Higher performance for repeated words indicates that much more than single phoneme information is being expressed.

The same analysis decoded vowels with a peak accuracy of 24.6% (chance = 13.6%; Fig. 7B). To test whether location in formant space provided equivalent information, each phoneme was reclassified to the vowel that had the closest mean F1F2 value in this formant space. All classes were balanced such that they had the same number of instances before and after reassignment. The accuracy of a decoder trained on this formant data yielded poorer accuracy at 22%. This suggests that these neurons encode phoneme identity better than formant values.

To test whether phonetic context affected the firing rates of the identified neurons, we attempted to decode the first vowel in the repeated words in the SA task (where the context was consistent), compared with the first vowel in the new words (where the context varied across the words). A matched set of new words was chosen, so that an equivalent number of instances of each vowel were present. If these neurons only encoded information about the first vowel identity, equal decoding accuracy would be expected in both cases. However, the classifier achieved 38% accuracy with the repeated words (18% above chance) when compared with 28% accuracy with new words (8% above chance). This superior classification accuracy with the consistent context of the repeated words implies that these neurons are encoding more than single phoneme identity (Fig. 7D).

Speaker Invariance and Spontaneous Speech

The speaker for the SA stimuli was male, while the speaker for the WN stimuli was female, with different fundamental frequencies (113 ± 15 vs. 165 ± 35 Hz) and vowel formants (Supplementary Fig. 5). To test for speaker invariance, we analyzed the 37 words that were present in both SA and WN for units 6a and 6b. The spiking rates between 100 and 900 ms were significantly correlated for the 37 words (Pearson, ρ = 0.38, P < 0.05). Moreover, a paired t-test failed to demonstrate any statistical difference in their firing (P = 0.96, mean difference = 0.032 spikes/s).

To further characterize speaker invariance, a 40-min segment of spontaneous speech between the patient, his family, and the researchers was transcribed, and all word-boundaries were manually marked. The profile of firing across the 50 most commonly produced words was significantly correlated between speakers (ρ = 0.41, P < 0.01), while a paired t-test failed to indicate any significant difference in the firing rates between speakers (P = 0.35).

Self-vocalization Auditory Suppression

Previous studies have demonstrated suppression of auditory responses to self-vocalization (Creutzfeldt et al. 1989b; Houde et al. 2002; Heinks-Maldonado et al. 2005, 2006; Flinker et al. 2010; Baess et al. 2011). During the repetition experiment, a total of 162 units were identified, of which 42 responded to external speech. Of these 42 units, 30 showed no significant response to self-produced speech, 5 showed a reduced, but still significant, response (Wilcoxon rank-sum, P < 0.05), and 7 showed no difference (Fig. 8). On average, the peak firing rate to self-produced speech was 2.43 spikes/s lower than to external speech (range = 0.21–13.1, corresponding to 11–100% reduction; Wilcoxon rank-sum, P > 0.05). The 42 units responding to external speech included 5 putative inhibitory interneurons. All 5 units decreased firing to self-produced speech relative to external speech (2.36 vs. 1.32 spikes/s). Additionally, averaged LFP of the corresponding electrodes demonstrated minimal responses to self-produced speech while showing large responses to external speech at all latencies (Fig. 8A).

Figure 8.

Suppression during self-initiated speech production. (A) Representative PSTH and raster plots of auditory responses during self-initiated speech (gray) versus external speech (black) during the repetition task for an excitatory cell (left) and inhibitory cell (right). Below, the averaged LFP and high-gamma power (70–100 Hz, bottom) for the corresponding electrodes demonstrate minimal changes to self-produced speech, suggesting suppression in earlier areas. (B) Firing rates of responding units during external speech versus self-produced speech. The unity line indicates no change in the firing rate between the 2 conditions with points lying above the line, indicating a reduction in firing during self-produced speech.

Discussion

We simultaneously recorded from over 140 single units in the human left aSTG. Many cells demonstrated highly selective responses to spoken words, with little or no response to pure tones, environmental sounds, or vocoder-transformed speech. Typical cells fired specifically to particular words or subsets of phonemes, demonstrated the spatial organization of tuning properties, and were suppressed during self-produced speech. STRFs predicted responses to spoken words, and spontaneous conversation demonstrated invariance to speaker. Some units showed correlated responses to phonemic properties of written and spoken words.

Classically, spoken word processing begins in Heschl's gyrus, and moves posteriorly toward Wernicke's area as processing becomes more complex (Wernicke 1874; Boatman et al. 1995; Geschwind and Levitsky 1968; Crone et al. 2001; Arnott et al. 2004; Desai et al. 2008; Chang et al. 2010; Pasley et al. 2012). In contrast, more general dual-pathway models also include an anteroventrally directed “what” stream (Arnott et al. 2004; Hickok and Poeppel 2007; Saur et al. 2008; Rauschecker and Scott 2009), analogous to the ventral visual stream (Tanaka 1996). The tuning of single units in human aSTG to speech is consistent with the dual-stream model, challenges the classical model, and provides important data bridging single-unit animal neurophysiology and noninvasive human work. Speech-selective units in aSTG may thus be analogous to object-selective cells in inferotemporal cortex (Tanaka 1996); like those cells, we observed apparent columnar organization of the aSTG responses.

The “what” stream likely continues to anteroventral temporal neurons that fire differentially to words referring to objects versus animals (Chan, Baker, et al. 2011), as well as to particular words (Heit et al. 1988). It is striking that, although highly and selectively responsive to spoken words, firing by aSTG neurons does not reflect semantic modulations, suggesting a lack of top-down modulatory influences on speech recognition. Such influences are clearly visible behaviorally (Warren 1970; Ganong 1980), and many models posit strong top-down projections from the lexico-semantic stage to at least the phonemic stage (Morton 1969; McClelland and Elman 1986). However, other models find such feedback unnecessary (Norris et al. 2000), or suppose that it does not occur until after the first phoneme is identified (Marslen-Wilson 1987). Consistent with recent magnetoencephalography (MEG) findings (Travis et al. 2012), our results suggest that such influences may be absent, at least in this stage of processing, and those that lead to it.

The cells reported here demonstrate significant responses to specific subsets of phonemes, and in the most robust cell, to a subset of vowel sounds. Unit firing is invariant to speaker, regardless of differences in the fundamental (F0) and formant (F1 and F2) frequencies underlying these vowel sounds. This suggests that these cells are tuned in vowel-space. Furthermore, decoding of phonemes resulted in significantly higher accuracies than that of formant-derived classes.

The results from the STRF analysis provide additional evidence that low-level acoustic features fail to fully characterize the response properties of these units. Power in frequency bands is a poor predictor of the firing rate when compared with MFCCs that model the phonetic and discard speaker-specific information in speech. However, even MFCCs fail to robustly predict the firing rate in time-reversed speech. This may suggest that MFCCs do not completely capture some other high-level acoustic features of words, and that time-reversal destroys or reduces the salience of these features. Phoneme identity is one such feature; time-reversing the acoustic waveform associated with a phoneme, especially consonants, often produces a percept that is nonphonemic in nature. It is possible that these units are tuned to high-level properties of phonemes that we are unable to completely characterize.

Another explanation for the limited ability of the MFCCs to predict the response to reversed speech is an insensitivity to nonlinear contextual effects, which augment their response to the phonemes in their receptive fields. Furthermore, any contextual effects that MFCCs do encode would not be triggered because they would occur in reversed words after the encoded phonemes rather than before. The inability of MFCCs to predict firing to reversed words is thus indirect evidence for contextual effects. Further evidence is the finding that decoding of the first vowel was over twice as strong (compared with baseline) when the context was consistent compared with when it changed over trials. Contextual effects could include not only adjacent phonemes, but also latency within the word. Cells varied widely in their response latencies to words, and many responded differentially to their preferred phonemes depending on when they occurred within the word. Another possible contextual effect was stress, which was confounded with latency in our data. Thus, our data suggest a population of units that respond to different sets of phonemes, modulated by their context and timing. These receptive fields could serve as building blocks for representing any word.

The current results also bear on the long-standing controversy regarding whether written words must activate their phonological representations in order to be understood. One theory posits that written words have direct access to meaning as well as indirect access via phonological recoding (Seidenberg 1985; Coltheart et al. 1993; Yvert et al. 2012). On this view, skilled readers reading high-frequency words access lexical representations via a purely visual pathway before phonological information has an opportunity to contribute. Other models hold that written words necessarily undergo phonological processing before lexical identification, regardless of word frequency or task demands (Frost 1998). Studies have compared scalp event-related potentials (ERPs) to written words, pseudohomophones (nonwords that sound like actual words), and control nonwords (Newman and Connolly 2004; Braun et al. 2009). The differential response to pseudohomophones is taken to represent a conflict between orthographic and phonological information, and its presence before 200 ms as evidence for obligatory early phonological recoding of written words (Braun et al. 2009). However, the observed effects are quite small, several studies have either failed to find them (Ziegler et al. 1999; Newman and Connolly 2004) or estimated their latency at approximately 300 ms (Niznikiewicz and Squires 1996), and their localization is unclear. The fusiform visual word form area produces its major response at 150–200 ms and is highly sensitive to the combinatorial frequency of letters (Binder et al. 2006; Vinckier et al. 2007), raising the possibility that the ERP differences to pseudohomophones reflect visual characteristics.

While MEG has localized activity to visual words in the posterior superior temporal cortex beginning at around 200 ms (Dale et al. 2000; Marinkovic 2004), there is no evidence that this activity represents phonological recoding. Conversely, although functional magnetic resonance imaging identifies this general region as activated by written words in tasks that require phonological recoding (Fiebach et al. 2002), it cannot determine whether such activation occurs before or after lexical access. A recent intracranial EEG study demonstrated gamma-band responses in auditory areas to visual words, but at a latency of 700 ms (Perrone-Bertolotti et al. 2012). In contrast to these previous studies, our results directly demonstrate phonological recoding: unit firing that is correlated between spoken phonemes and the phonemes present in the idealized pronunciation of visually presented words. Lexico-semantic access for visual words is thought to occur by approximately 240 ms (Halgren et al. 2002; Kutas and Federmeier 2011). In our data, the firing of one cell reflecting phonological recoding of written words began slightly earlier, approximately 175 ms after word onset. Furthermore, high-frequency words showed reduced correlation between responses to phonemes in visual and auditory words, presumably reflecting a smaller need for phonological recoding. Thus, these data are consistent with the dual-route hypothesis of phonological recoding, in that neural activity with the expected characteristics occurs in the aSTG at a latency that may allow it to contribute to word identification. However, it is possible that phonological recoding is not evoked by all words, since the auditory–visual correlation was greatly decreased for high-frequency words.
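
The auditory–visual correlation described here can be illustrated with a minimal Python sketch using simulated stand-in firing rates (the phoneme labels, rates, and permutation test below are hypothetical and are not our statistical procedure): for one unit, its mean firing to each spoken phoneme is correlated with its mean firing to written words whose idealized pronunciation contains that phoneme.

    # Minimal sketch (stand-in data): correlate a unit's auditory phoneme tuning
    # with its responses to written words containing those phonemes.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)

    phonemes = ["AA", "IY", "UW", "EH", "S", "T", "K", "M"]        # illustrative set
    auditory = {p: rng.gamma(2.0, 2.0) for p in phonemes}          # stand-in mean rates (Hz)
    visual = {p: rng.gamma(2.0, 2.0) for p in phonemes}            # stand-in mean rates (Hz)

    a = np.array([auditory[p] for p in phonemes])
    v = np.array([visual[p] for p in phonemes])

    r_obs, _ = pearsonr(a, v)
    # Permutation test: shuffle the visual profile to build a null distribution of r.
    null = np.array([pearsonr(a, rng.permutation(v))[0] for _ in range(10000)])
    p_perm = np.mean(np.abs(null) >= abs(r_obs))
    print(f"r = {r_obs:.2f}, permutation p = {p_perm:.4f}")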

The microelectrode array used in this study allowed examination of a spatial organization that previous studies could not explore. Interestingly, we found that nearby cells often had correlated response properties, but this correlation disappeared at distances greater than 1 mm. This suggests that, even in high-order processing areas without a clear spatial or spectral map (such as orientation or frequency), nearby cortical columns perform similar processing tasks. More general response characteristics (e.g. whether the unit responded to auditory word stimuli at all) were correlated over broader distances than more specific characteristics (e.g. the 10-word response profile). This is similar to the columnar organization of object selectivity among inferotemporal visual cells (Tanaka 1996). The consistency of firing profiles across the set of words for different units recorded at a given contact could be reflected in the population activity (high-gamma power) recorded at the same contact. However, the spatial extent of such population activity is limited, inasmuch as macroelectrode ECoG recordings directly over the microelectrode site failed to show language-specific responses (Supplementary Fig. 2).
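
The distance-dependence of response similarity can be sketched as follows, assuming a grid of electrode positions with 0.4 mm pitch and per-unit 10-word firing profiles; all values below are simulated placeholders rather than our recorded data.

    # Minimal sketch: relate similarity of units' word-response profiles to the
    # distance between their recording electrodes.
    import numpy as np

    rng = np.random.default_rng(1)
    n_units, n_words = 140, 10
    profiles = rng.gamma(2.0, 1.0, size=(n_units, n_words))    # stand-in 10-word firing profiles
    positions = rng.integers(0, 10, size=(n_units, 2)) * 0.4   # electrode (x, y) in mm

    # Pairwise Pearson correlations of profiles and pairwise inter-electrode distances.
    z = (profiles - profiles.mean(1, keepdims=True)) / profiles.std(1, keepdims=True)
    corr = (z @ z.T) / n_words
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)

    iu = np.triu_indices(n_units, k=1)
    for lo, hi in [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0), (2.0, 5.0)]:
        m = (dist[iu] >= lo) & (dist[iu] < hi)
        print(f"{lo:.1f}-{hi:.1f} mm: mean r = {corr[iu][m].mean():.3f} (n = {m.sum()} pairs)")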

The proportion of single units identified as putative inhibitory interneurons (17%) is consistent with previous anatomical and physiological observations in animals and humans (Bartho et al. 2004; Peyrache et al. 2012). However, the total number of spikes produced by putative inhibitory cells was greater, owing to their higher firing rate (1.96 spikes/s) compared with pyramidal cells (0.17 spikes/s), again similar to observations in sleeping humans (Peyrache et al. 2012). The mean overall firing rate was similar to prior reports from semichronic human recordings using fixed electrodes (Ravagnati et al. 1979), but about 10-fold lower than in acute recordings with movable microelectrodes (Ojemann and Schoenfield-McNeill 1999), suggesting that those studies may have recorded from a different population of neurons. This has critical importance for calculations of energy utilization in the human brain (Attwell and Laughlin 2001; Lennie 2003). The average firing rate we found is about 50-fold lower than has been assumed (Attwell and Laughlin 2001), but is consistent with theoretical predictions based on the high energy cost of action potentials and consequent synaptic events (Lennie 2003). Low background firing is sometimes cited as an indication that cortical cells use a sparse encoding strategy for associative memory and complex stimuli (Olshausen and Field 2004). Sparse encoding was strongly suggested by our observation that maximal or near-maximal prediction of phoneme or word identity could be achieved using the activity of only about 5 of the approximately 150 isolated cells.
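
The shape of that observation, accuracy saturating after a handful of informative units, can be illustrated with a simple greedy selection sketch in Python. The classifier, trial counts, and simulated spike counts below are hypothetical stand-ins, not the decoder used in our analyses.

    # Minimal sketch: greedily add units to a word classifier and watch decoding
    # accuracy saturate after only a few cells.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(2)
    n_trials, n_units, n_words = 400, 150, 10
    labels = rng.integers(0, n_words, size=n_trials)
    # Stand-in data: a few units carry word information, the rest fire sparsely at random.
    counts = rng.poisson(0.2, size=(n_trials, n_units)).astype(float)
    for u in range(8):
        counts[:, u] += rng.poisson(2.0, size=n_trials) * (labels % (u + 2) == 0)

    def accuracy(unit_idx):
        """Cross-validated word-decoding accuracy from the selected units' counts."""
        return cross_val_score(GaussianNB(), counts[:, unit_idx], labels, cv=5).mean()

    selected, remaining = [], list(range(n_units))
    for step in range(10):
        best = max(remaining, key=lambda u: accuracy(selected + [u]))
        selected.append(best)
        remaining.remove(best)
        print(f"{step + 1} units: accuracy = {accuracy(selected):.2f}")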

Studies using positron emission tomography, EEG, MEG, and intraoperative microelectrodes have shown that auditory cortex is suppressed during self-produced speech compared with the perception of external speech, and it has been suggested that this reflects speech-feedback monitoring (Creutzfeldt et al. 1989b; Paus et al. 1996; Numminen et al. 1999; Curio et al. 2000; Houde et al. 2002; Heinks-Maldonado et al. 2005, 2006; Christoffels et al. 2007; Tourville et al. 2008; Baess et al. 2011). These studies have suggested that the phenomenon occurs globally across auditory cortex; however, units in the primary auditory cortex of primates show a diversity of responses to self-produced vocalizations (Eliades and Wang 2005). A recent study has shown that neurons in the superior temporal region (including insula) demonstrate nonspecific vowel tuning during speech production (Tankus et al. 2012), and ECoG studies in humans have shown that different regions of auditory cortex demonstrate varying degrees of suppression (Flinker et al. 2010; Greenlee et al. 2011). Here, we show that this variability is present at an even finer spatial scale: single units within a 4 × 4 mm area demonstrated variable amounts of suppression by self-produced speech. Our additional findings that putative inhibitory interneurons also exhibit reduced firing, and that LFPs to self-produced speech are suppressed from their onset, suggest that the suppression begins at an earlier processing stage and that decreased local firing is due to decreased input.
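
Per-unit suppression of this kind is commonly summarized with a suppression index and a paired test across units; the following sketch uses simulated stand-in rates (the index definition and test shown are generic conventions, not a description of our statistics).

    # Minimal sketch (stand-in data): per-unit suppression during self-produced
    # speech relative to listening, and a paired test across the population.
    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(3)
    n_units = 140
    listen = rng.gamma(2.0, 1.0, size=n_units)             # stand-in mean rates while listening (Hz)
    speak = listen * rng.uniform(0.2, 1.1, size=n_units)   # variable suppression while speaking

    suppression_index = (listen - speak) / (listen + speak)
    stat, p = wilcoxon(listen, speak)
    print(f"median suppression index = {np.median(suppression_index):.2f}, "
          f"Wilcoxon p = {p:.2g}; range {suppression_index.min():.2f} to {suppression_index.max():.2f}")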

It is important to note that these recordings come from the unique case of a single patient with epilepsy. However, the cortical location containing the microelectrode array was included in the final resection, and subsequent staining and histology failed to find abnormal pathology at the array site. Furthermore, the patient's seizures were found to start in medial temporal sites, making it less likely that the aSTG was actively involved in seizure initiation. Nonetheless, we cannot rule out the possibility that medications or long-standing epilepsy affected the responses we recorded.

Taken together, these data suggest that the aSTG contains a spatially organized processing module specialized for extracting lexical identity from acoustic stimuli, lying midway between acoustic input in medial Heschl's gyrus and supramodal semantic representations in the anteroventral temporal cortex. This module encodes high-order acoustic–phonetic information during the perception of both spoken and written words, suggesting that the aSTG participates in phonological recoding during reading. Single units robustly represent perceptual phonemic information, and a small population of cells, each encoding a different set of phonemes in different phonological contexts, could represent the acoustic form of any specific word.

Supplementary Material

Supplementary material can be found at: http://www.cercor.oxfordjournals.org/.

Notes

Conflict of Interest: None declared.

References

Arnott SR, Binns MA, Grady CL, Alain C. 2004. Assessing the auditory dual-pathway model in humans. NeuroImage. 22:401-408.
Attwell D, Laughlin SB. 2001. An energy budget for signaling in the grey matter of the brain. J Cereb Blood Flow Metab. 21:1133-1145.
Baess P, Horvath J, Jacobsen T, Schroger E. 2011. Selective suppression of self-initiated sounds in an auditory stream: an ERP study. Psychophysiology. 48:1276-1283.
Bartho P, Hirase H, Monconduit L, Zugaro M, Harris KD, Buzsaki G. 2004. Characterization of neocortical principal cells and interneurons by network interactions and extracellular features. J Neurophysiol. 92:600-608.
Binder JR, Frost JA, Hammeke TA, Bellgowan PS, Springer JA, Kaufman JN, Possing ET. 2000. Human temporal lobe activation by speech and nonspeech sounds. Cereb Cortex. 10:512-528.
Binder JR, Liebenthal E, Possing ET, Medler DA, Ward BD. 2004. Neural correlates of sensory and decision processes in auditory object identification. Nat Neurosci. 7:295-301.
Binder JR, Medler DA, Westbury CF, Liebenthal E, Buchanan L. 2006. Tuning of the human left fusiform gyrus to sublexical orthographic structure. NeuroImage. 33:739-748.
Bitterman Y, Mukamel R, Malach R, Fried I, Nelken I. 2008. Ultra-fine frequency tuning revealed in single neurons of human auditory cortex. Nature. 451:197-201.
Boatman D, Lesser RP, Gordon B. 1995. Auditory speech processing in the left temporal lobe: an electrical interference study. Brain Lang. 51:269-290.
Braun M, Hutzler F, Ziegler JC, Dambacher M, Jacobs AM. 2009. Pseudohomophone effects provide evidence of early lexico-phonological processing in visual word recognition. Hum Brain Mapp. 30:1977-1989.
Chan AM, Baker JM, Eskandar E, Schomer D, Ulbert I, Marinkovic K, Cash SS, Halgren E. 2011. First-pass selectivity for semantic categories in human anteroventral temporal lobe. J Neurosci. 31:18119-18129.
Chan AM, Halgren E, Marinkovic K, Cash SS. 2011. Decoding word and category-specific spatiotemporal representations from MEG and EEG. NeuroImage. 54:3028-3039.
Chang EF, Rieger JW, Johnson K, Berger MS, Barbaro NM, Knight RT. 2010. Categorical speech representation in human superior temporal gyrus. Nat Neurosci. 13:1428-1432.
Christoffels IK, Formisano E, Schiller NO. 2007. Neural correlates of verbal feedback processing: an fMRI study employing overt speech. Hum Brain Mapp. 28:868-879.
Coltheart M, Curtis B, Atkins P, Haller M. 1993. Models of reading aloud: dual-route and parallel-distributed-processing approaches. Vol. 100. Washington (DC): American Psychological Association.
Creutzfeldt O, Ojemann G, Lettich E. 1989a. Neuronal activity in the human lateral temporal lobe. I. Responses to speech. Exp Brain Res. 77:451-475.
Creutzfeldt O, Ojemann G, Lettich E. 1989b. Neuronal activity in the human lateral temporal lobe. II. Responses to the subjects own voice. Exp Brain Res. 77:476-489.
Crinion JT, Lambon-Ralph MA, Warburton EA, Howard D, Wise RJ. 2003. Temporal lobe regions engaged during normal speech comprehension. Brain. 126:1193-1201.
Crone NE, Boatman D, Gordon B, Hao L. 2001. Induced electrocorticographic gamma activity during auditory perception. Brazier Award-winning article, 2001. Clin Neurophysiol. 112:565-582.
Curio G, Neuloh G, Numminen J, Jousmaki V, Hari R. 2000. Speaking modifies voice-evoked activity in the human auditory cortex. Hum Brain Mapp. 9:183-191.
Dale AM, Liu AK, Fischl BR, Buckner RL, Belliveau JW, Lewine JD, Halgren E. 2000. Dynamic statistical parametric mapping: combining fMRI and MEG for high-resolution imaging of cortical activity. Neuron. 26:55-67.
Davis S, Mermelstein P. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Acoustics Speech Signal Process. 28:357-366.
Desai R, Liebenthal E, Waldron E, Binder JR. 2008. Left posterior temporal regions are sensitive to auditory categorization. J Cogn Neurosci. 20:1174-1188.
Dykstra AR, Chan AM, Quinn BT, Zepeda R, Keller CJ, Cormier J, Madsen JR, Eskandar EN, Cash SS. 2012. Individualized localization and cortical surface-based registration of intracranial electrodes. NeuroImage. 59:3563-3570.
Eliades SJ, Wang X. 2005. Dynamics of auditory-vocal interaction in monkey auditory cortex. Cereb Cortex. 15:1510-1523.
Fiebach CJ, Friederici AD, Muller K, von Cramon DY. 2002. fMRI evidence for dual routes to the mental lexicon in visual word recognition. J Cogn Neurosci. 14:11-23.
Flinker A, Chang EF, Kirsch HE, Barbaro NM, Crone NE, Knight RT. 2010. Single-trial speech suppression of auditory cortex activity in humans. J Neurosci. 30:16643-16650.
Frost R. 1998. Toward a strong phonological theory of visual word recognition: true issues and false trails. Psychol Bull. 123:71-99.
Ganong WF III. 1980. Phonetic categorization in auditory word perception. J Exp Psychol Hum Percept Perform. 6:110-125.
Geschwind N, Levitsky W. 1968. Human brain: left-right asymmetries in temporal speech region. Science. 161:186-187.
Greenlee JD, Jackson AW, Chen F, Larson CR, Oya H, Kawasaki H, Chen H, Howard MA III. 2011. Human auditory cortical activation during self-vocalization. PLoS ONE. 6:e14744.
Hackett TA. 2011. Information flow in the auditory cortical network. Hear Res. 271:133-146.
Halgren E, Dhond RP, Christensen N, Van Petten C, Marinkovic K, Lewine JD, Dale AM. 2002. N400-like magnetoencephalography responses modulated by semantic context, word frequency, and lexical class in sentences. NeuroImage. 17:1101-1116.
Halgren E, Wang C, Schomer DL, Knake S, Marinkovic K, Wu J, Ulbert I. 2006. Processing stages underlying word recognition in the anteroventral temporal lobe. NeuroImage. 30:1401-1413.
Heinks-Maldonado TH, Mathalon DH, Gray M, Ford JM. 2005. Fine-tuning of auditory cortex during speech production. Psychophysiology. 42:180-190.
Heinks-Maldonado TH, Nagarajan SS, Houde JF. 2006. Magnetoencephalographic evidence for a precise forward model in speech production. Neuroreport. 17:1375-1379.
Heit G, Smith ME, Halgren E. 1988. Neural encoding of individual words and faces by the human hippocampus and amygdala. Nature. 333:773-775.
Hickok G, Poeppel D. 2007. The cortical organization of speech processing. Nat Rev Neurosci. 8:393-402.
Hirano S, Naito Y, Okazawa H, Kojima H, Honjo I, Ishizu K, Yenokura Y, Nagahama Y, Fukuyama H, Konishi J. 1997. Cortical activation by monaural speech sound stimulation demonstrated by positron emission tomography. Exp Brain Res. 113:75-80.
Houde JF, Nagarajan SS, Sekihara K, Merzenich MM. 2002. Modulation of the auditory cortex during speech: an MEG study. J Cogn Neurosci. 14:1125-1138.
Howard D, Patterson K, Wise R, Brown WD, Friston K, Weiller C, Frackowiak R. 1992. The cortical localization of the lexicons. Positron emission tomography evidence. Brain. 115(Pt 6):1769-1782.
Howard MA III, Volkov IO, Abbas PJ, Damasio H, Ollendieck MC, Granner MA. 1996. A chronic microelectrode investigation of the tonotopic organization of human auditory cortex. Brain Res. 724:260-264.
Kutas M, Federmeier KD. 2011. Thirty years and counting: finding meaning in the N400 component of the event-related brain potential (ERP). Ann Rev Psychol. 62:621-647.
Lennie P. 2003. The cost of cortical computation. Curr Biol. 13:493-497.
Lund K, Burgess C. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behav Res Methods. 28:203-208.
Marinkovic K. 2004. Spatiotemporal dynamics of word processing in the human cortex. Neuroscientist. 10:142-152.
Marinkovic K, Dhond RP, Dale AM, Glessner M, Carr V, Halgren E. 2003. Spatiotemporal dynamics of modality-specific and supramodal word processing. Neuron. 38:487-497.
Maris E, Oostenveld R. 2007. Nonparametric statistical testing of EEG- and MEG-data. J Neurosci Methods. 164:177-190.
Marslen-Wilson WD. 1987. Functional parallelism in spoken word-recognition. Cognition. 25:71-102.
McClelland JL, Elman JL. 1986. The TRACE model of speech perception. Cogn Psychol. 18:1-86.
Morton J. 1969. Interaction of information in word recognition. Psychol Rev. 76:165-178.
Newman RL, Connolly JF. 2004. Determining the role of phonology in silent reading using event-related brain potentials. Brain Res. 21:94-105.
Niznikiewicz M, Squires NK. 1996. Phonological processing and the role of strategy in silent reading: behavioral and electrophysiological evidence. Brain Lang. 52:342-364.
Norris D, McQueen JM, Cutler A. 2000. Merging information in speech recognition: feedback is never necessary. Behav Brain Sci. 23:299-325; discussion 325-370.
Numminen J, Salmelin R, Hari R. 1999. Subject's own speech reduces reactivity of the human auditory cortex. Neurosci Lett. 265:119-122.
Obleser J, Boecker H, Drzezga A, Haslinger B, Hennenlotter A, Roettinger M, Eulitz C, Rauschecker JP. 2006. Vowel sound extraction in anterior superior temporal cortex. Hum Brain Mapp. 27:562-571.
Obleser J, Leaver AM, Vanmeter J, Rauschecker JP. 2010. Segregation of vowels and consonants in human auditory cortex: evidence for distributed hierarchical organization. Front Psychol. 1:232.
Ojemann GA, Schoenfield-McNeill J. 1999. Activity of neurons in human temporal cortex during identification and memory for names and words. J Neurosci. 19:5674-5682.
Olshausen BA, Field DJ. 2004. Sparse coding of sensory inputs. Curr Opin Neurobiol. 14:481-487.
Pasley BN, David SV, Mesgarani N, Flinker A, Shamma SA, Crone NE, Knight RT, Chang EF. 2012. Reconstructing speech from human auditory cortex. PLoS Biol. 10:e1001251.
Paus T, Perry DW, Zatorre RJ, Worsley KJ, Evans AC. 1996. Modulation of cerebral blood flow in the human auditory cortex during speech: role of motor-to-sensory discharges. Eur J Neurosci. 8:2236-2246.
Perani D, Dehaene S, Grassi F, Cohen L, Cappa SF, Dupoux E, Fazio F, Mehler J. 1996. Brain processing of native and foreign languages. Neuroreport. 7:2439-2444.
Perrone-Bertolotti M, Kujala J, Vidal JR, Hamame CM, Ossandon T, Bertrand O, Minotti L, Kahane P, Jerbi K, Lachaux J-P. 2012. How silent is silent reading? Intracerebral evidence for top-down activation of temporal voice areas during reading. J Neurosci. 32:17554-17562.
Peyrache A, Dehghani N, Eskandar EN, Madsen JR, Anderson WS, Donoghue JA, Hochberg LR, Halgren E, Cash SS, Destexhe A. 2012. Spatiotemporal dynamics of neocortical excitation and inhibition during human sleep. Proc Natl Acad Sci USA. 109:1731-1736.
Price CJ, Wise RJ, Warburton EA, Moore CJ, Howard D, Patterson K, Frackowiak RS, Friston KJ. 1996. Hearing and saying. The functional neuro-anatomy of auditory word processing. Brain. 119(Pt 3):919-931.
Rauschecker JP. 1998. Cortical processing of complex sounds. Curr Opin Neurobiol. 8:516-521.
Rauschecker JP, Scott SK. 2009. Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nat Neurosci. 12:718-724.
Ravagnati L, Halgren E, Babb TL, Crandall PH. 1979. Activity of human hippocampal formation and amygdala neurons during sleep. Sleep. 2:161-173.
Saur D, Kreher BW, Schnell S, Kummerer D, Kellmeyer P, Vry MS, Umarova R, Musso M, Glauche V, Abel S, et al. 2008. Ventral and dorsal pathways for language. Proc Natl Acad Sci USA. 105:18035-18040.
Scott SK, Blank CC, Rosen S, Wise RJ. 2000. Identification of a pathway for intelligible speech in the left temporal lobe. Brain. 123(Pt 12):2400-2406.
Scott SK, Rosen S, Lang H, Wise RJ. 2006. Neural correlates of intelligibility in speech investigated with noise vocoded speech—a positron emission tomography study. J Acoust Soc Am. 120:1075-1083.
Seidenberg MS. 1985. The time course of phonological code activation in two writing systems. Cognition. 19:1-30.
Shannon RV, Zeng FG, Kamath V, Wygonski J, Ekelid M. 1995. Speech recognition with primarily temporal cues. Science. 270:303-304.
Steinschneider M, Nourski KV, Kawasaki H, Oya H, Brugge JF, Howard MA III. 2011. Intracranial study of speech-elicited activity on the human posterolateral superior temporal gyrus. Cereb Cortex. 21:2332-2347.
Tanaka K. 1996. Inferotemporal cortex and object vision. Ann Rev Neurosci. 19:109-139.
Tankus A, Fried I, Shoham S. 2012. Structured neuronal encoding and decoding of human speech features. Nat Commun. 3:1015.
Theunissen FE, David SV, Singh NC, Hsu A, Vinje WE, Gallant JL. 2001. Estimating spatio-temporal receptive fields of auditory and visual neurons from their responses to natural stimuli. Network. 12:289-316.
Tourville JA, Reilly KJ, Guenther FH. 2008. Neural mechanisms underlying auditory feedback control of speech. NeuroImage. 39:1429-1443.
Travis KE, Leonard MK, Chan AM, Torres C, Sizemore ML, Qu Z, Eskandar E, Dale AM, Elman JL, Cash SS, et al. 2012. Independence of early speech processing from word meaning. Cereb Cortex.
Vinckier F, Dehaene S, Jobert A, Dubus JP, Sigman M, Cohen L. 2007. Hierarchical coding of letter strings in the ventral stream: dissecting the inner organization of the visual word-form system. Neuron. 55:143-156.
Warren JD, Scott SK, Price CJ, Griffiths TD. 2006. Human brain mechanisms for the early analysis of voices. NeuroImage. 31:1389-1397.
Warren RM. 1970. Perceptual restoration of missing speech sounds. Science. 167:392-393.
Wernicke C. 1874. Der aphasische Symptomencomplex: eine psychologische Studie auf anatomischer Basis. Breslau: Cohn & Weigert.
Wessinger CM, VanMeter J, Tian B, Van Lare J, Pekar J, Rauschecker JP. 2001. Hierarchical organization of the human auditory cortex revealed by functional magnetic resonance imaging. J Cogn Neurosci. 13:1-7.
Yvert G, Perrone-Bertolotti M, Baciu M, David O. 2012. Dynamic causal modeling of spatiotemporal integration of phonological and semantic processes: an electroencephalographic study. J Neurosci. 32:4297-4306.
Zatorre RJ, Bouffard M, Belin P. 2004. Sensitivity to auditory object features in human temporal neocortex. J Neurosci. 24:3637-3642.
Ziegler JC, Benraiss A, Besson M. 1999. From print to meaning: an electrophysiological investigation of the role of phonology in accessing word meaning. Psychophysiology. 36:775-785.

Author notes

Eric Halgren and Sydney S. Cash are co-senior authors.