How the brain extracts words from auditory signals is an unanswered question. We recorded approximately 150 single and multi-units from the left anterior superior temporal gyrus of a patient during multiple auditory experiments. Against low background activity, 45% of units robustly fired to particular spoken words with little or no response to pure tones, noise-vocoded speech, or environmental sounds. Many units were tuned to complex but specific sets of phonemes, which were influenced by local context but invariant to speaker, and suppressed during self-produced speech. The firing of several units to specific visual letters was correlated with their response to the corresponding auditory phonemes, providing the first direct neural evidence for phonological recoding during reading. Maximal decoding of individual phoneme and word identities was attained using firing rates from approximately 5 neurons within 200 ms after word onset. Thus, neurons in the human superior temporal gyrus use sparse, spatially organized population encoding of complex acoustic–phonetic features to help recognize auditory and visual words.
Delineating the acoustic–phonetic processing stages the brain uses to encode words is fundamental to understanding the neural processing of speech, but these stages remain little understood. Although an early study (Creutzfeldt et al. 1989a) reported lateral temporal lobe neurons responding to speech, the specificity and coding properties of these cells remain unresolved: Do such neurons respond to equally complex nonspeech sounds (e.g. environmental sounds)? Are their receptive fields best described in terms of phonemes? Do they also respond to written words in a phonologically correlated manner?
Auditory cortical processing is hierarchical in nature; relatively simple features are progressively combined to build more complex representations in downstream areas (Hickok and Poeppel 2007; Hackett 2011), beginning with tonotopic frequency tuning in the primary auditory cortex (Howard et al. 1996; Bitterman et al. 2008) followed (in primates) by downstream neurons tuned to frequency-modulated sweeps and other complex spectrotemporal representations (Rauschecker 1998; Rauschecker and Scott 2009). In humans, hemodynamic neuroimaging studies suggest that this hierarchy applies to speech stimuli (Binder et al. 2000; Wessinger et al. 2001). However, these studies lack either the spatial resolution to determine the responses of single neurons to spoken words, or the temporal resolution to determine the sequence of neuronal firing. Thus, the hemodynamically observed spatial hierarchy may reflect feedback or recurrent activity rather than sequential feedforward processing.
The anatomical organization of this processing remains controversial. The traditional view suggests that posterior superior temporal cortex, near Wernicke's area, is the main region involved in processing speech sounds (Wernicke 1874; Geschwind and Levitsky 1968; Crone et al. 2001; Desai et al. 2008; Chang et al. 2010; Steinschneider et al. 2011). There is growing evidence, however, that anterior superior temporal cortex phonetically processes speech sounds and projects to more ventral areas comprising the auditory “what” stream (Scott et al. 2000, 2006; Arnott et al. 2004; Binder et al. 2004; Zatorre et al. 2004; Obleser et al. 2006, 2010; Warren et al. 2006; Saur et al. 2008; Rauschecker and Scott 2009; Perrone-Bertolotti et al. 2012).
Here, we report a detailed examination of the auditory and language responses of a large number of simultaneously recorded single units from the left anterior superior temporal gyrus (aSTG) of a 31-year-old right-handed man with epilepsy. A 96-channel microelectrode array recorded extracellular action potentials from over 140 layer III/IV neurons. The patient made semantic judgments of words referring to objects or animals in the auditory (SA) and visual (SV) modalities, compared spoken words with pictures (WN), repeated spoken words, and participated in spontaneous conversation. Controls included unintelligible vocoded speech, pure tones, environmental sounds, and time-reversed words.
Materials and Methods
A 31-year-old right-handed male with medically intractable epilepsy was admitted to the Massachusetts General Hospital for semichronic electrode implantation for surgical evaluation. The patient was left-hemisphere language dominant based on a Wada test and was a native English speaker with normal hearing, vision, and intelligence. His seizures were partial complex and typically began in mesial temporal depth electrode contacts. Surgical treatment removed the left anterior temporal lobe (including the site of the microelectrode implantation), left parahippocampal gyrus, left hippocampus, and left amygdala, resulting in the patient being seizure free at 1-year postresection. Formal neuropsychological testing 1-year postresection did not show any significant change in language functions, including naming and comprehension. The patient gave informed consent and was enrolled in this study under the auspices of Massachusetts General Hospital IRB oversight in accordance with the Declaration of Helsinki.
Electrodes and Recording
A microelectrode array (Blackrock Microsystems, Salt Lake City, UT, USA), capable of recording the action potentials of single units, was implanted in the left aSTG. This 4 × 4 mm array consists of 100 (96 active) penetrating electrodes, each 1.5 mm in length with a 20-µm exposed platinum tip, spaced 400 µm apart. Recordings were obtained by a Blackrock NeuroPort data acquisition system at 30 kHz with bandpass filtering from 0.3 Hz to 7.5 kHz. The decision to implant the array in the superior temporal gyrus was based on clinical considerations; this was a region that was within the expected resection area and was indeed resected on completion of the intracranial electroencephalography (iEEG) investigation. The region surrounding the array was removed en bloc and submitted for histological processing. Staining with hematoxylin and eosin revealed that the tips of the electrodes were at the bottom of cortical layer III, close to layer IV, and that the surrounding cortical tissue was histologically normal.
In addition to this microelectrode, clinical intracranial macroelectrodes were implanted based on clinical considerations alone. Electrodes consisted of an 8 × 8 grid of subdural macroelectrode contacts spaced 1 cm apart (Adtech Medical, Racine, WI, USA), covering the left lateral cortex including frontal, temporal, and anterior parietal areas. iEEG was continuously recorded from these clinical electrodes at 500 Hz with bandpass filtering from 0.1 to 200 Hz. All electrodes were localized with respect to the patient's reconstructed cortical surface using the method described in Dykstra et al. (2012).
The patient performed several auditory tasks designed to examine different aspects of speech and nonspeech sound processing. In SA, the participant pressed a button to spoken words referring to animals or objects that were larger than one foot in any dimension. Words were spoken by a male speaker, normalized in power and length (500 ms), and presented with a 2200-ms stimulus onset asynchrony (SOA). Eight hundred randomly ordered trials were evenly split between novel words presented only once for the entire experiment (400 trials), and repeated words which consisted of a set of 10 words repeated 40 times each. The 10 repeated words were “claw,” “cricket,” “flag,” “fork,” “lion,” “medal,” “oyster,” “serpent,” “shelf,” and “shirt.” Half of the trials required a button press yielding a 2 × 2 balanced design. Sounds were presented binaurally using Etymotic ER-1 earphones (Elk Grove Village, IL, USA).
SV was identical to SA in all respects except that the words were presented visually on a computer screen (for 300 ms). See Dale et al. (2000), Marinkovic et al. (2003), Halgren et al. (2006), Chan, Baker, et al. (2011), and Chan, Halgren, et al. (2011) for further details and analysis of the SA and SV tasks.
In WN, the picture of an object was presented followed by a spoken word or noise. The picture (<5% visual angle) appeared for the entire trial duration of 1300 ms, and the auditory stimulus, either a congruously or incongruously paired word or noise stimulus, was presented 500 ms after picture onset. Words were single-syllable nouns recorded by a female native speaker. Noise stimuli were noise-vocoded, unintelligible, versions of the same words. Four conditions were presented in a random order: Picture matched-words (where the word referred to the picture), picture matched-noise (where the word used to generate the noise matched the picture), picture mismatched-words (the word referred to a different object than the picture), and picture mismatched-noise (the word used to generate the noise did not refer to the picture). The participant was asked to press a button to matches. To create the noise stimuli, band-passed and amplitude-modulated white noise was made to match the acoustic structure and sound level of a corresponding word. The power in each of 20 equal bands from 50 to 5000 Hz and the exact time versus power waveform for 50–247, 248–495, and 496–5000 Hz were matched between the noise and word stimuli (Shannon et al. 1995). Sounds (mean duration = 445 ± 63 ms; range = 304–637 ms; 44.1 kHz; normalized to 65 dB average intensity) were presented binaurally through Etymotic ER-1 earphones. A total of 1000 trials were presented. For more information, see Travis et al. (2012).
Several sets of nonspeech sounds were also presented to the patient. Sequences (7.2 s) of randomly selected pure tones were presented to explore responses to simpler acoustic stimuli. Tones were 100 ms in length, including 10-ms raised cosine on- and off-ramps, and were centered at 0.239, 0.286, 0.343, 0.409, 0.489, 0.585, 0.699, 0.836, 1, 1.196, 1.430, 1.710, 2.045, 2.445, 2.924, 3.497, 4.181, or 5.000 kHz. Tones were placed randomly in time and frequency within each band with an average within-band SOA of 800 ms (range: 100–1500 ms). Within each band, the exact frequency of any given tone was within an estimated equivalent rectangular bandwidth (ERB) of the center frequency, where ERB = 24.7 × (4.37 × fc + 1) Hz and fc is the center frequency of a given band in kHz.
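The band-jitter rule above can be sketched in a few lines. The ERB formula is taken directly from the text; drawing the tone frequency uniformly within one ERB of the band center is an assumption about how "within an estimated ERB" was implemented:

```python
import random

def erb_hz(fc_khz):
    """Glasberg-Moore equivalent rectangular bandwidth (Hz) for a band
    centered at fc_khz (in kHz), as given in the text."""
    return 24.7 * (4.37 * fc_khz + 1.0)

def jitter_tone(fc_khz, rng=random.Random(0)):
    """Draw a tone frequency (Hz) uniformly within one ERB of the band
    center (assumed implementation of the within-band jitter)."""
    fc_hz = fc_khz * 1000.0
    half = erb_hz(fc_khz) / 2.0
    return rng.uniform(fc_hz - half, fc_hz + half)
```

For the 1-kHz band, for example, the ERB works out to roughly 133 Hz, so jittered tones would fall between about 934 and 1066 Hz.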
In addition to the pure tones, the participant was presented the SA word stimuli and asked to repeat them out loud. The subject began speaking, on average, 411 ± 119 ms after the end of the stimulus, and the SOA was 3000 ms. The auditory word stimuli from the SA task were also time-reversed and presented to the participant (with an SOA of 2200 ms), who was asked to passively listen. Time-reversal of words preserves the spectral content of the stimuli, but changes the temporal structure of the sounds. A set of 30 environmental sounds, natural (e.g. birds chirping, waterfall) and manmade (e.g. clapping hands, breaking glass), was also presented to the patient. Each sound was presented 5 times in pseudorandom order with a 3500-ms SOA. Finally, a spontaneous conversation between the patient and researchers was recorded using a far-field microphone. The entire conversation was manually transcribed and all word-boundaries were marked.
Spike Sorting and Analysis
To extract spikes from the microelectrode recordings, continuous data were high-pass filtered at 250 Hz using a sixth-order Bessel filter, and an amplitude threshold of 4 standard deviations was used to detect action potential waveforms. Extracted spikes were manually sorted using Offline Sorter (Plexon, Dallas, TX, USA) in various feature spaces, including principal components, peak-valley amplitude, and nonlinear energy. Units were characterized as single or multi-units based on the quality of sorted clusters and the amplitude of waveforms. Initial analyses indicated no apparent differences between single and multi-units, so, unless indicated otherwise, reported results include both. Putative inhibitory interneurons were identified based on waveform shape, full-width at half-maximum, and valley-to-peak time (Bartho et al. 2004; Peyrache et al. 2012). Because experiments were performed over the course of 3 days, identified units varied across tasks. Sorting was performed simultaneously over multiple tasks when units were compared between them. Groups of spikes that demonstrated consistent clusters in principal component space, visually similar spike waveforms, and consistent auto/cross-correlations between multiple tasks were deemed to originate from the same unit. This grouping was only possible for putative single units.
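The filter-and-threshold step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the text does not specify whether crossings were negative-going, positive-going, or absolute, so negative-going crossings (typical for extracellular spikes) are assumed here.

```python
import numpy as np
from scipy.signal import bessel, filtfilt

def detect_spikes(trace, fs=30000, hp_hz=250.0, n_sd=4.0):
    """Sketch of spike extraction: sixth-order Bessel high-pass at
    250 Hz, then threshold crossings at 4 SD of the filtered trace.
    Returns sample indices where the trace first crosses below -4 SD
    (negative-going crossings are an assumption)."""
    b, a = bessel(6, hp_hz, btype='highpass', fs=fs)
    filt = filtfilt(b, a, trace)          # zero-phase filtering
    thr = n_sd * filt.std()
    below = filt < -thr
    # keep only the first sample of each contiguous crossing
    onsets = np.flatnonzero(below & ~np.r_[False, below[:-1]])
    return onsets
```

In practice each detected waveform would then be clipped out and passed to manual sorting, as described above.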
To statistically determine and quantify the magnitude and latency of unit responses to particular stimuli, a nonparametric cluster-based statistical test was used. Similar to the nonparametric test utilized by Maris and Oostenveld (2007) for testing continuous data, statistics were computed for individual 30 ms bins of peristimulus time histograms (PSTHs). The binned firing rates of a prestimulus baseline period from −300 to 0 ms were compared with each poststimulus bin using a 2-sided t-test. Clusters of consecutive bins with significance of Pbin < 0.05 were found, and the summed T-statistic of each cluster was computed. The null distribution of cluster-level statistics was computed by randomly permuting the bins in time and recomputing clusters 1000 times. The cluster-level statistics were compared with the null distribution, and a cluster was deemed significant if the cluster-level probability was Pcluster < 0.05. The earliest bin in a statistically significant cluster was taken to be the response latency of that particular unit, and the magnitude of the response computed as the average firing rate within that cluster.
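A minimal sketch of this cluster-based test follows. Pairing the t-test across trials (each post-stimulus bin against that trial's mean baseline rate) and taking the maximum summed |t| over permuted bin orders as the null statistic are assumptions about details the text leaves open:

```python
import numpy as np
from scipy.stats import ttest_rel

def _clusters(tvals, pvals, alpha):
    """Runs of consecutive significant bins and each run's summed t."""
    out, start = [], None
    for j, p in enumerate(pvals):
        if p < alpha and start is None:
            start = j
        elif not (p < alpha) and start is not None:
            out.append((start, j - 1, tvals[start:j].sum()))
            start = None
    if start is not None:
        out.append((start, len(pvals) - 1, tvals[start:].sum()))
    return out

def cluster_test(binned, n_base=10, alpha=0.05, n_perm=1000, seed=0):
    """binned: trials x bins spike counts in 30-ms bins; the first
    n_base bins are the -300..0 ms baseline. Returns a list of
    (start_bin, end_bin, cluster_p) over post-stimulus bins."""
    rng = np.random.default_rng(seed)
    base = binned[:, :n_base].mean(axis=1)   # per-trial baseline rate

    def find(post):
        t, p = zip(*(ttest_rel(post[:, j], base)
                     for j in range(post.shape[1])))
        return _clusters(np.array(t), np.array(p), alpha)

    post = binned[:, n_base:]
    obs = find(post)
    null = np.empty(n_perm)
    for i in range(n_perm):                  # permute bins in time
        cl = find(post[:, rng.permutation(post.shape[1])])
        null[i] = max((abs(c[2]) for c in cl), default=0.0)
    return [(s, e, float((null >= abs(t)).mean())) for s, e, t in obs]
```

The earliest bin of a significant cluster then gives the unit's response latency, per the procedure above.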
To quantify responses to phonemes, all phoneme boundaries were manually marked for all relevant stimuli and PSTHs were computed by aligning phoneme start times. Formants were computed using Wavesurfer (http://www.speech.kth.se/wavesurfer/) using 20-ms windows overlapped by 10 ms. The Carnegie Mellon University Pronouncing Dictionary was used to obtain phonetic transcriptions of words in the SV task (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). Word frequency information for the SV task was obtained using the HAL corpus (Lund and Burgess 1996), resulting in a mean word frequency of 2849, median of 1422, and range of 2–37 798. Words below the median value of 1422 were grouped into the low-frequency class, and words above this median taken as the high-frequency class.
Spectrotemporal Receptive Field Estimation
The spectrotemporal receptive field (STRF) is the average time-course of spectral features preceding each action potential and characterizes that neuron's preferred time-frequency features. STRFs were calculated with 2 sets of spectral features: 1) Power in linearly spaced frequencies from 50 Hz to 4 kHz, and 2) Mel-frequency cepstral coefficients (MFCCs). MFCCs use a logarithmic frequency representation to approximate the human auditory system (Davis and Mermelstein 1980) and allow for the separation of the fundamental excitation frequency of the vocal cords (known as the “source,” defining speech pitch and thus the speaker) from the shape and properties of the articulatory chamber (known as the “filter,” defining phonological information and thus allowing word discrimination). MFCCs were computed in MATLAB between 200 Hz and 8 kHz using a liftering exponent of 22. The first 13 coefficients, which carry most of the pitch-invariant phonological information, were extracted; these features were computed in 20-ms windows shifted by 10 ms.
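The stated MFCC settings (20-ms windows, 10-ms hop, 200 Hz–8 kHz, 13 coefficients, liftering exponent 22) can be reproduced with a standard filterbank-plus-DCT pipeline. The original analysis was done in MATLAB, so the filterbank size, Hamming window, and FFT length below are assumptions, not the authors' exact parameters:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=16000, win=0.020, step=0.010, n_filt=26,
         n_ceps=13, fmin=200.0, fmax=8000.0, lifter=22, nfft=512):
    """Minimal MFCC sketch: framed power spectrum -> mel filterbank ->
    log -> DCT -> first 13 coefficients with sinusoidal liftering."""
    fmax = min(fmax, fs / 2.0)
    nwin, nstep = int(win * fs), int(step * fs)
    # mel-spaced triangular filterbank between fmin and fmax
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    frames = [signal[s:s + nwin] * np.hamming(nwin)
              for s in range(0, len(signal) - nwin + 1, nstep)]
    spec = np.abs(np.fft.rfft(frames, nfft)) ** 2      # power spectrum
    logmel = np.log(spec @ fb.T + 1e-10)               # log mel energies
    ceps = dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]
    lift = 1 + (lifter / 2.0) * np.sin(np.pi * np.arange(n_ceps) / lifter)
    return ceps * lift                                 # frames x 13
```

Keeping only the first 13 coefficients is what discards the fine pitch (source) structure while retaining the filter shape, as described above.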
The method described by Theunissen et al. (2001) was used to estimate the STRFs for each unit. This method compensates for correlations within the stimuli to generate the optimal linear filter, which best characterizes the relationship between the stimulus and firing rate. To predict the firing rate of these units, firing rate was first smoothed using a 120-ms Gaussian kernel, and the STRFs were computed using the novel words in the SA task. The resulting STRF was convolved with the time-course of stimulus features generated for the repeated words, yielding a predicted PSTH. This PSTH prediction was performed for each of the 10 repeated words in the SA and time-reversed word tasks.
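As a rough stand-in for the Theunissen et al. (2001) estimator, the same idea (an optimal linear filter that compensates for correlations within the stimulus) can be sketched with ridge-regularized regression on a lagged design matrix. The ridge penalty here replaces their normalization details and is an assumption:

```python
import numpy as np

def fit_strf(stim, rate, n_lags=10, ridge=1.0):
    """stim: time x features (spectrogram or MFCC frames); rate: the
    smoothed firing rate at the same frame rate. Solves the regularized
    normal equations for the optimal linear stimulus-response filter."""
    T, F = stim.shape
    # design matrix of stimulus history: row t holds the last n_lags frames
    X = np.zeros((T, n_lags * F))
    for lag in range(n_lags):
        X[lag:, lag * F:(lag + 1) * F] = stim[:T - lag]
    y = rate - rate.mean()
    w = np.linalg.solve(X.T @ X + ridge * np.eye(n_lags * F), X.T @ y)
    return w.reshape(n_lags, F)               # lags x features STRF

def predict_psth(strf, stim):
    """Convolve the STRF with a new stimulus to get a predicted PSTH,
    as done for the 10 repeated and time-reversed words."""
    n_lags, F = strf.shape
    T = stim.shape[0]
    pred = np.zeros(T)
    for lag in range(n_lags):
        pred[lag:] += stim[:T - lag] @ strf[lag]
    return pred
```

Fitting on the novel words and predicting the repeated words, as in the text, would amount to calling `fit_strf` on one stimulus set and `predict_psth` on the other.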
To decode either repeated words or phonemes, a set of features were computed from the unit firing rates for each trial. For the classification of repeated words, all units which demonstrated a statistically significant response to auditory words were used. For each unit, a time window was computed in which its firing rate significantly changed from baseline. Subsequently, for each trial, the number of spikes occurring within that time window was used as one of the features for the classifier. For phoneme decoding, this window was fixed from 50 to 250 ms after phoneme onset. This window was selected based on the peak firing rates seen in the PSTHs generated to individual phonemes.
To examine changes in information over time, either sliding windows or cumulative windows were used to compute firing rates. For sliding windows, the firing was computed in a 50-ms window for each unit beginning at 0 ms (the window therefore covered 0–50 ms) and after performing the decoding analysis, the window was shifted 10 ms forward. For the cumulative window analysis, 25-ms windows were used, and instead of shifting the window, subsequent nonoverlapping windows were concatenated to the growing feature vector between each round of decoding. This allowed for the analysis of information in a time frame of 0–25 up to 0–1000 ms.
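The two windowing schemes can be sketched as follows; the helper names and the per-unit spike-time representation are illustrative, not from the original analysis code:

```python
import numpy as np

def sliding_counts(spikes, width=0.050, step=0.010, t_max=1.0):
    """Spike counts per unit in 50-ms windows stepped by 10 ms.
    spikes: list of per-unit spike-time arrays (s, re stimulus onset).
    Returns windows x units; each row is one decoding feature vector."""
    starts = np.arange(0.0, t_max - width + 1e-9, step)
    return np.array([[np.sum((u >= s) & (u < s + width)) for u in spikes]
                     for s in starts])

def cumulative_counts(spikes, width=0.025, t_max=1.0):
    """Counts in consecutive nonoverlapping 25-ms windows (units x
    windows); concatenating the first M columns gives the growing
    0 to 25*M ms cumulative feature vector."""
    edges = np.arange(0.0, t_max + 1e-9, width)
    return np.array([np.histogram(u, bins=edges)[0] for u in spikes])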
A Naïve Bayes Classifier was used to decode word- or phoneme-specific information from the computed features. The Naïve Bayes Classifier assumes that all features, f1, f2, …, fn, are independent, making the joint probability of all features the product of the marginal probabilities: P(f1, f2, …, fn | Ci) = P(f1 | Ci) × P(f2 | Ci) × … × P(fn | Ci).
This allows the classifier to assign a probability that a new trial belongs to class Ci, based on a set of firing rates, f1, f2, …, fn. The predicted class for this new trial, ĉ, is chosen to be the class with the highest conditional probability: ĉ = argmaxi P(Ci | f1, f2, …, fn).
In the case of sliding windows, the features, f1, f2, … ,fn, are the firing rates of neurons 1 to n in a chosen window, while for cumulative windows, the features are the firing rates of these neurons in M different nonoverlapping windows, resulting in nM total features.
To train the classifier, for each combination of unit and class, a Poisson distribution of spike counts was estimated (via maximum likelihood), allowing for the computation of probabilities of observed firing rates. Accuracies were estimated via 10-fold cross-validation.
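Putting these pieces together, a Poisson Naive Bayes decoder of this form can be sketched as below. The small rate floor for otherwise-silent units is an added assumption to keep log-likelihoods finite; it is not mentioned in the text:

```python
import numpy as np
from scipy.special import gammaln

class PoissonNB:
    """Naive Bayes with per-class, per-unit Poisson likelihoods; the
    maximum-likelihood rate estimate is simply the class mean count."""

    def fit(self, X, y):
        # X: trials x units spike counts; y: class labels
        self.classes_ = np.unique(y)
        # tiny floor keeps log(rate) finite for silent units (assumption)
        self.rates_ = np.array([X[y == c].mean(axis=0) + 1e-6
                                for c in self.classes_])
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        return self

    def predict(self, X):
        # log P(x|c) = sum_i [x_i log(l_ci) - l_ci - log(x_i!)]
        ll = (X @ np.log(self.rates_).T
              - self.rates_.sum(axis=1)
              - gammaln(X + 1).sum(axis=1, keepdims=True)
              + np.log(self.priors_))
        return self.classes_[np.argmax(ll, axis=1)]
```

Wrapping `fit`/`predict` in a 10-fold cross-validation loop over trials would reproduce the accuracy-estimation procedure described above.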
To test whether location in formant space provided equivalent information as phoneme identity, each phoneme was reclassified to the vowel that had the closest mean F1–F2 value in this formant space. All classes were balanced such that they had the same number of instances before and after reassignment.
To test whether phonetic context affected the firing rates of the identified neurons, we attempted to decode the first vowel in the repeated words in the SA task (where the context was consistent) and compared this with decoding the first vowel in the new words (where the context varied across words). A matched set of new words was chosen so that an equivalent number of instances of each vowel was present. If these neurons encoded only the identity of the first vowel, equal decoding accuracy would be expected in both cases.
Single-Unit Sorting Results
Units were manually sorted for each of the 7 experiments. Because experiments were performed over the course of 3 days, identified units varied from task to task. For analyses where units were compared between experiments, sorting was performed simultaneously over all tasks of interest. A total of 142 units were identified during the SA task, with 58 units characterized as likely single units and 84 potential multi-units. A total of 146 units were identified from the WN task (63 single and 83 multi-units), 166 during the presentation of the time-reversed words (77 single and 89 multi-units), 144 during the presentation of pure tone sequences (77 single and 67 multi-units), 169 during the repetition of auditory words (79 single and 90 multi-units), 171 during a spontaneous conversation (77 single and 94 multi-units), and 181 during the SV task (86 single and 95 multi-units). Single units demonstrated firing rates between 0.0004 and 11.1 spikes/s (mean = 0.37 spikes/s). 17% of all identified single units were putative inhibitory interneurons based on waveform shape, full-width at half-maximum, and valley-to-peak time (Bartho et al. 2004; Peyrache et al. 2012). The mean firing rate of these inhibitory cells was 1.96 spikes/s, compared with the mean firing rate of excitatory single units of 0.17 spikes/s. Responses were similar for single and multi-units, and for putative pyramidal cells and interneurons, and they are combined for subsequent analyses unless otherwise indicated.
Single-Unit Word Specificity
The patient was first asked to listen through headphones to a set of recorded spoken words that corresponded to concrete objects or animals and indicate if the item was larger than a foot in any dimension (SA task). Half of the trials involved 400 words that were only presented once during the experiment, while the other half involved 10 words that were presented 40 times each. Online visual inspection revealed that many units responded strongly to spoken word stimuli, with PSTHs demonstrating specificity for individual words. Two examples of such units are shown in Figure 1. These 2 cells showed bidirectional monosynaptic connectivity (based on cross-correlation, Fig. 1B) and strong word-specific responses to the repeated words in the SA task. Interestingly, the excitatory cell demonstrated narrower tuning than the inhibitory cell (Fig. 1C). Unit 6a, the putative inhibitory interneuron, demonstrated differences in the firing rate to the 10 repeated words with largest responses to “claw,” “cricket,” and “oyster.” For unit 6b, the putative pyramidal cell, while “claw” and “cricket” also evoked large responses, “oyster” did not.
In total, 66 of the 141 units exhibited a statistically significant response to auditory words in SA (P < 0.05, Fig. 2B). Fifty-nine units increased firing in response to words, while 7 units decreased their firing. Baseline firing of responsive units varied from 0 to 5.16 spikes/s (mean = 0.31 spikes/s), with changes in the firing rate ranging from 0.03 to 12.4 spikes/s (mean = 0.75 spikes/s). Excluding units with a baseline firing rate of 0, the firing rate increased by 625% on average (range = 335–2567%; Supplementary Fig. 1). Peak firing ranged from 0.31 to 14.3 spikes/s (mean = 0.50 spikes/s). Response latencies varied from 20 to 940 ms (mean = 308 ms) after word onset. Thirty-one units demonstrated differential firing to the 10 repeated words in the SA task (P < 0.05, Kruskal–Wallis, between 100 and 900 ms). Unlike cells in the anteroventral temporal lobe (Chan, Baker, et al. 2011), no cells in the anterior STG responded differentially to words referring either to animals versus manmade objects or to novel versus repeated words.
Tuning to a number of different features may lead to this specificity for particular words. At the lowest level, it is possible that these units simply respond to a particular frequency or sound intensity that is present in a subset of the presented words. It is also possible that these units are responding to specific acoustic features of spoken words, such as complex time-frequency components, combinations of formant frequencies, or even phoneme identity. At the highest levels, these units may encode the auditory representations of full words. We therefore tested the response of these units to a diverse set of acoustic stimuli that spanned a wide range of acoustic complexity.
Spatial Organization of Responses and Correlation with Gamma Activity
The regular spacing of the 10 × 10 microelectrode array allowed the spatial organization of unit response properties to be examined. Although identified units were uniformly distributed across the array (Fig. 2A), those responding to auditory word stimuli were clustered on one side (Fig. 2B). More specific properties, such as response latency (Fig. 2C) and word-specific response profiles, also demonstrated spatial correlations. To quantitatively explore this spatial organization, the spatial autocorrelations of various response properties (e.g. responsiveness, response latency, 10-word response profiles) were computed between units and plotted against the Euclidean distances separating the electrodes that recorded them in the 4 × 4 mm array. Correlation of responsiveness to words extended to recordings separated by 800 µm, correlation for response latencies was significant to 600 µm, and correlation for 10-word response profiles was observed up to distances of 400 µm (Fig. 2D). Despite these spatial correlations, macroelectrode electrocorticography (ECoG) directly over the microelectrode site failed to show language-specific responses (Supplementary Fig. 2).
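The pairwise analysis behind these distance curves can be sketched as follows; treating "spatial autocorrelation" as the correlation of response profiles for every unit pair, binned by inter-electrode distance, is an assumption about the exact computation:

```python
import numpy as np

def similarity_vs_distance(pos, profiles):
    """For every pair of units, correlate their response-property
    vectors (e.g. 10-word firing profiles) and record the Euclidean
    distance between their electrodes. Binning sims by distance then
    gives a spatial autocorrelation curve.
    pos: units x 2 electrode coordinates (um); profiles: units x k."""
    pos = np.asarray(pos, float)
    profiles = np.asarray(profiles, float)
    dists, sims = [], []
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            dists.append(np.hypot(*(pos[i] - pos[j])))
            sims.append(np.corrcoef(profiles[i], profiles[j])[0, 1])
    return np.array(dists), np.array(sims)
```

On a 400-µm grid, significance of the mean similarity at 400, 800 µm, etc., could then be assessed against distance-shuffled pairings.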
This analysis shows that the response profiles across words were similar for multiple units recorded by a given microelectrode contact, and slightly correlated with the profiles of neurons recorded at adjacent contacts. In 11 cases, at least 2 units present on the same electrode had correlated 10-word response profiles. We also tested whether gamma power (30–100 Hz) between 100 and 700 ms at these contacts, taken as a measure of population activity, showed significantly different responses to different words. This was the case for 2 of the 11 electrodes, indicating that specificity for particular words can also be observed at the population level. These 2 electrodes also demonstrated significant increases in gamma power to words versus noise-vocoded speech. Furthermore, the firing rates of the units on these electrodes in response to the 10 repeated words were significantly correlated with changes in gamma power to those same words (Supplementary Fig. 3, r > 0.72, P < 0.001, Pearson). Broadband averaged local field potentials (LFPs) did not demonstrate significant word-specific responses on any electrode.
Responses to NonSpeech Sounds
To test whether these word-selective units were in fact responding to lower-level features, we tested their response to a diverse set of acoustic stimuli that spanned a wide range of acoustic complexity. None of the units showed statistically significant responses when the patient listened to sequences of 100-ms pure tones ranging from 240 Hz to 5 kHz (P > 0.05, Fig. 3A).
The patient also listened to noise-vocoded speech that was acoustically matched to auditory word stimuli (WN). Noise-vocoded stimuli contained the same time-course of power in 3 frequency bands as the matched word, but the fine-scale spectral information within these bands was replaced with band-passed white noise. The subject decided if the word semantically matched the picture presented immediately beforehand. Only 3 of the 60 units that responded to words also responded to noise-vocoded speech. Furthermore, firing rates were 65% lower to noise-vocoded speech stimuli than to the matched words (Fig. 3B). No cell responded differentially to matched versus mismatched words.
The patient then passively listened to time-reversed auditory words from SA. Because time-reversing maintains frequency information and acoustic complexity, but changes the temporal structure of words, it has often been used in hemodynamic studies of word responses (Howard et al. 1992; Perani et al. 1996; Price et al. 1996; Hirano et al. 1997; Binder et al. 2000; Crinion et al. 2003). Vowel sounds are relatively preserved while consonants (especially stop consonants) are often distorted, and many of the sounds are not phonetically possible. Only 17% of the units responded to time-reversed words (compared with 47% responding to normal words), and the magnitude of the response was significantly smaller (0.21 spikes/s for time-reversed vs. 0.75 spikes/s for normal words, P < 0.05, Wilcoxon rank-sum). For several units, responses to time-reversed words were also significantly delayed (Fig. 3C). The responses to time-reversed words demonstrate that complete words are not necessary for activating these units; the small amplitude of these responses, as well as the lack of responses to tones and to environmental sounds (except, occasionally, vocalizations), is consistent with these cells being highly selective for speech sounds.
While these control stimuli elicited smaller responses than spoken words, they are artificially constructed synthetic sounds. It is possible that the identified units respond equally well to naturally occurring environmental sounds. A set of 30 environmental sounds, both man-made and natural, was presented to the subject. Only 2 units demonstrated statistically significant responses to any of the stimuli (Fig. 3D), and these responded only to male laughter (unit 6a) or a baby crying (unit 36a). Interestingly, both of these stimuli are human vocalizations.
Responses to Phonemes
Formant values at the midpoint of each vowel were correlated with the firing of each unit from 50 to 250 ms after vowel onset. Only unit 6a demonstrated significant correlations, with a moderate negative F1 correlation (ρ = −0.13, Spearman, P < 0.01) and a positive F2 correlation (ρ = 0.11, Spearman, P < 0.05).
PSTHs were generated for each of the phonemes present in SA and WN. In one example, unit 6a clearly showed specific firing to several vowels beginning at approximately 70 ms and peaking at 100 ms (Fig. 4). These phonemes included the vowels [ɪ], [i], [oɪ], [oʊ], and [u]. Several consonants, such as [p], [b], [t], and [f], also demonstrated increases in firing around 100 ms. Some of the phoneme PSTHs, such as [ŋ], demonstrated increases in firing before 0 ms, likely due to the common occurrence of specific preceding vowel sounds (e.g. [ɪ] in words ending with “ing”). Overall, 22 units demonstrated significant responses to at least one phoneme, with 16 of the 22 units responding to more than one phoneme (Fig. 4C). Many of the unresponsive cells had firing rates too low to permit sensitive statistical tests. While more units exhibited significant responses to consonants (Fig. 4D), 9 of the 22 units responded to both consonants and vowels. Most responses were significantly modulated by phoneme position (first vs. second syllable) and/or amplitude (Supplementary Fig. 4).
Spectrotemporal Receptive Fields
Formants and phonemes are well-established intermediate features between the acoustic signal and the word. We also attempted to characterize the units' responses in a manner less constrained by a priori categories by computing STRFs for unit 6a. We computed STRFs using power within linearly spaced frequencies from 50 Hz to 4 kHz, and using MFCCs. MFCCs approximate the human auditory system (Davis and Mermelstein 1980) and allow for the separation of speech pitch from phonological information. We utilized the first 13 MFCCs, thus focusing on pitch-invariant information and discarding speaker-specific information. An STRF computed with linear frequency features shows a complex combination of low- (50–500 Hz) and high- (∼2.5 kHz) frequency components between 0 and 100 ms, contributing to the firing of unit 6a (Fig. 5A). Similarly, the STRF computed with MFCCs demonstrates a wide range of cepstral components, at a similar latency, contributing to the firing of this unit (Fig. 5B).
These representations were also able to predict the unit firing to a set of spoken words (Fig. 5C). The STRFs generated using the MFCCs better predicted the actual firing rates to each of the 10 repeated words than the linear frequency representation (R2 = 0.42 and 0.16, respectively). Despite this, both sets of features consistently underestimated the actual firing rates of this unit. The computed MFCC STRF was also used to predict the firing to the time-reversed words (Fig. 5D). In this case, the predicted firing rates tended to overestimate the actual firing rates, resulting in R2 = 0.14. The fact that this set of acoustic features fails to adequately predict responses to time-reversed words suggests that this unit is responding to features in words that are destroyed by time-reversal.
Responses to Written Words
The patient also performed a visual word size judgment task (SV) that was equivalent to the SA task but used written words presented on a computer screen instead of spoken words. Of 177 units, 26% responded significantly to written words in SV. Forty-six units were present in both SA and SV: 18 responded to auditory words only, 9 to both visual and auditory words, and 19 to neither (Fig. 6A); no cell responded to visual words only. For units responding to both, the latency to visual words was 170 ± 31 ms longer than to auditory words (Fig. 6C). On average, auditory words elicited an 8.04-fold increase in firing over baseline, while visual words elicited a 3.02-fold increase.
To explore whether the responses to these visual words were due to phonological properties of the words, we correlated mean firing rates between responses to spoken and written words containing the same phonemes, from 0 to 1000 ms postonset. The phoneme tuning properties of 2 of the 9 units showed significant correlations (ρ = 0.54 for unit 24b and ρ = 0.56 for unit 28d; Spearman, P < 0.01; Fig. 6B). Correlation estimates for the other 7 units had low statistical power because their firing rates were <1 spike/s. Behavioral studies have suggested that phonological recoding is stronger for low-frequency words (Seidenberg 1985). Words were therefore divided into low- and high-frequency groups at the median HAL frequency of 1422, and the between-modality correlation of phonetic encoding patterns was recomputed. For low-frequency words, the correlation remained high at ρ = 0.44 (unit 24b) and ρ = 0.55 (unit 28d) (Spearman, P < 0.01). For high-frequency words, the correlation became nonsignificant for unit 24b (ρ = −0.11, P > 0.05) and dropped for unit 28d (ρ = 0.34, P < 0.05).
The latency of this correlation was examined by computing the correlation coefficient between the firing rate to auditory words (0–1000 ms) containing given phonemes and the firing rate to visual words containing the same phonemes in 300-ms sliding windows, starting from 0 to 700 ms after stimulus onset (Fig. 6D). Significant correlations began at 175–475 ms and peaked at 330–630 ms for unit 24b, and began at 325–625 ms and peaked at 450–750 ms for unit 28d.
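The sliding-window analysis can be sketched as follows: a fixed per-phoneme auditory firing profile is correlated (Spearman) with visual-word firing averaged in 300-ms windows. All data here are synthetic, with the cross-modal match confined to 400–700 ms by construction.

```python
import numpy as np
from scipy.stats import spearmanr

def sliding_crossmodal_corr(aud_profile, vis_rates, t, win=0.300, step=0.025):
    """aud_profile: (P,) mean auditory rate per phoneme (fixed window)
    vis_rates:   (P, T) visual-word firing per phoneme over time axis t
    Returns window start times and the Spearman rho for each window."""
    starts = np.arange(t[0], t[-1] - win, step)
    rhos = np.array([spearmanr(aud_profile,
                               vis_rates[:, (t >= s) & (t < s + win)].mean(axis=1))[0]
                     for s in starts])
    return starts, rhos

# Synthetic example: 20 phonemes; visual responses mirror the auditory
# profile only between 400 and 700 ms after word onset
rng = np.random.default_rng(2)
t = np.arange(0.0, 1.0, 0.01)
aud = rng.uniform(1, 10, size=20)
vis = rng.normal(0, 2.0, size=(20, len(t)))
vis[:, (t >= 0.4) & (t < 0.7)] += aud[:, None]

starts, rhos = sliding_crossmodal_corr(aud, vis, t)
i0 = np.argmin(np.abs(starts - 0.0))   # window with no overlap with the response
i1 = np.argmin(np.abs(starts - 0.4))   # window covering the response
```

Windows overlapping the response period should show high rho while non-overlapping windows hover near zero, giving a latency profile like that in Fig. 6D.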
Diversity of Unit Tuning Allows for Decoding of Words
To characterize the amount and diversity of information present in the firing rates of the identified units, we attempted to decode (i.e. individually discriminate) the 10 repeated words from the SA task using unit responses. A Naïve Bayes classifier achieved peak accuracy of 39.25% using 28 units (chance = 10%), with near-maximum performance using only 10 units (Fig. 7A). The temporal evolution of word-selective firing was tested by decoding word identity from cumulative 25-ms windows starting at 0 ms (Fig. 7C). Within 200 ms, 34% accuracy (chance = 10%) was reached when using the top 5 units; adding up to 30 units improved accuracies at longer latencies.
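A minimal Gaussian Naïve Bayes decoder of this kind can be written directly in NumPy. The word-by-trial firing rates below are synthetic (10 words × 8 trials × 30 units, with only 5 word-selective units), and the variance floor is a stabilizing assumption for the small number of trials per word.

```python
import numpy as np

def nb_train(X, y):
    """Gaussian Naive Bayes: per-class mean and variance of each unit's rate."""
    classes = np.unique(y)
    mu = np.array([X[y == c].mean(axis=0) for c in classes])
    var = np.array([X[y == c].var(axis=0) + 1.0 for c in classes])  # variance floor
    return classes, mu, var

def nb_predict(X, classes, mu, var):
    # Log-likelihood of each trial under each class, assuming unit independence
    ll = -0.5 * (((X[:, None, :] - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(-1)
    return classes[ll.argmax(axis=1)]

rng = np.random.default_rng(3)
n_words, n_trials, n_units = 10, 8, 30
y = np.repeat(np.arange(n_words), n_trials)
tuning = np.ones((n_words, n_units))                     # background of ~1 spike/s
tuning[:, :5] += rng.uniform(2, 12, size=(n_words, 5))   # 5 word-selective units
X = rng.poisson(tuning[y]).astype(float)

# Decoding accuracy as a function of the number of units included
train = np.tile([True] * 6 + [False] * 2, n_words)  # 6 train / 2 test per word
accs = []
for k in (1, 5, 30):
    cls, mu, var = nb_train(X[train][:, :k], y[train])
    pred = nb_predict(X[~train][:, :k], cls, mu, var)
    accs.append((pred == y[~train]).mean())
```

With chance at 10%, accuracy should rise steeply once the few selective units are included, mirroring the shape of the unit-count curve in Fig. 7A.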
The same analysis decoded vowels with a peak accuracy of 24.6% (chance = 13.6%; Fig. 7B). To test whether location in formant space provided equivalent information, each phoneme was reassigned to the vowel with the closest mean F1–F2 value in formant space. All classes were balanced so that they contained the same number of instances before and after reassignment. A decoder trained on these formant-derived classes yielded a poorer peak accuracy of 22%, suggesting that these neurons encode phoneme identity better than formant values.
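The reassignment step can be illustrated with a nearest-neighbor lookup in F1–F2 space. The formant table below uses textbook-style approximate values for a few English vowels, not the study's measured formants.

```python
# Approximate mean (F1, F2) values in Hz, for illustration only
VOWEL_FORMANTS = {"i": (280, 2250), "ɪ": (400, 1900),
                  "ɛ": (550, 1770), "æ": (690, 1660),
                  "u": (310, 870),  "ɑ": (710, 1100)}

def nearest_formant_vowel(f1, f2, table=VOWEL_FORMANTS):
    """Reassign a vowel token to the class whose mean (F1, F2) is closest
    in Euclidean distance, as in the formant-based relabeling above."""
    return min(table, key=lambda v: (table[v][0] - f1) ** 2 + (table[v][1] - f2) ** 2)

label = nearest_formant_vowel(300, 2200)  # a token near the [i] centroid
```

Comparing a decoder trained on these relabeled classes against one trained on true phoneme labels is the contrast reported above.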
To test whether phonetic context affected the firing rates of the identified neurons, we attempted to decode the first vowel in the repeated words of the SA task (where the context was consistent) versus the first vowel in the new words (where the context varied across words). A matched set of new words was chosen so that an equivalent number of instances of each vowel was present. If these neurons encoded only first-vowel identity, equal decoding accuracy would be expected in both cases. However, the classifier achieved 38% accuracy with the repeated words (18% above chance) versus 28% with the new words (8% above chance). This superior classification accuracy with the consistent context of the repeated words implies that these neurons encode more than single-phoneme identity (Fig. 7D).
Speaker Invariance and Spontaneous Speech
The speaker for the SA stimuli was male, while the speaker for the WN stimuli was female, with different fundamental frequencies (113 ± 15 vs. 165 ± 35 Hz) and vowel formants (Supplementary Fig. 5). To test for speaker invariance, we analyzed the 37 words present in both SA and WN for units 6a and 6b. The spiking rates between 100 and 900 ms were significantly correlated across the 37 words (Pearson, ρ = 0.38, P < 0.05), and a paired t-test failed to demonstrate any statistical difference in their firing (P = 0.96, mean difference = 0.032 spikes/s).
To further characterize speaker invariance, a 40-min segment of spontaneous speech between the patient, his family, and the researchers was transcribed, and all word boundaries were manually marked. The profile of firing across the 50 most commonly produced words was significantly correlated between speakers (ρ = 0.41, P < 0.01), while a paired t-test failed to indicate any significant difference in firing rates between speakers (P = 0.35).
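Both statistics used here, a correlation of per-word firing profiles across speakers and a paired t-test on the same rates, can be sketched on synthetic data, where a shared word tuning plus independent noise stands in for the two speakers.

```python
import numpy as np
from scipy.stats import spearmanr, ttest_rel

# Synthetic: one unit's mean rate for 50 common words, measured separately
# for two speakers sharing the same word tuning
rng = np.random.default_rng(4)
word_tuning = rng.gamma(2.0, 2.0, size=50)
speaker_a = word_tuning + rng.normal(0, 1.0, 50)
speaker_b = word_tuning + rng.normal(0, 1.0, 50)

rho, p_corr = spearmanr(speaker_a, speaker_b)     # similar word profiles?
t_stat, p_diff = ttest_rel(speaker_a, speaker_b)  # same overall rate?
```

A significant correlation together with a nonsignificant paired difference is the pattern that indicates speaker-invariant word tuning.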
Self-vocalization Auditory Suppression
Previous studies have demonstrated suppression of auditory responses during self-vocalization (Creutzfeldt et al. 1989b; Houde et al. 2002; Heinks-Maldonado et al. 2005, 2006; Flinker et al. 2010; Baess et al. 2011). During the repetition experiment, a total of 162 units were identified, of which 42 responded to external speech. Of these 42 units, 30 showed no significant response to self-produced speech, 5 showed a reduced but still significant response (Wilcoxon rank-sum, P < 0.05), and 7 showed no difference (Fig. 8). On average, the peak firing rate to self-produced speech was 2.43 spikes/s lower than to external speech (range = 0.21–13.1, corresponding to an 11–100% reduction; Wilcoxon rank-sum, P < 0.05). The 42 units responding to external speech included 5 putative inhibitory interneurons; all 5 decreased their firing to self-produced speech relative to external speech (2.36 vs. 1.32 spikes/s). Additionally, the averaged LFP at the corresponding electrodes demonstrated minimal responses to self-produced speech while showing large responses to external speech at all latencies (Fig. 8A).
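The per-unit suppression comparison can be sketched with a rank-sum test on trial-wise firing rates; the rates below are synthetic, with a ~75% suppression chosen purely for illustration.

```python
import numpy as np
from scipy.stats import ranksums

# Synthetic trial-wise rates (spikes/s) for one unit: external speech
# versus strongly suppressed self-produced speech
rng = np.random.default_rng(5)
external = rng.gamma(4.0, 1.0, size=40)     # mean ~4 spikes/s
self_prod = rng.gamma(4.0, 0.25, size=40)   # mean ~1 spike/s

stat, p = ranksums(external, self_prod)
suppression = 1 - self_prod.mean() / external.mean()  # fractional reduction
```

Run per unit, this yields the distribution of suppression magnitudes and significance levels summarized in Fig. 8.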
We simultaneously recorded from over 140 single units in the human left aSTG. Many cells demonstrated highly selective responses to spoken words, with little or no response to pure tones, environmental sounds, or vocoder-transformed speech. Typical cells fired specifically to particular words or subsets of phonemes, demonstrated spatially organized tuning properties, and were suppressed during self-produced speech. STRFs predicted responses to spoken words, and responses during spontaneous conversation demonstrated invariance to speaker. Some units showed correlated responses to phonemic properties of written and spoken words.
Classically, spoken word processing begins in Heschl's gyrus and moves posteriorly toward Wernicke's area as processing becomes more complex (Wernicke 1874; Geschwind and Levitsky 1968; Boatman et al. 1995; Crone et al. 2001; Arnott et al. 2004; Desai et al. 2008; Chang et al. 2010; Pasley et al. 2012). In contrast, more general dual-pathway models also include an anteroventrally directed “what” stream (Arnott et al. 2004; Hickok and Poeppel 2007; Saur et al. 2008; Rauschecker and Scott 2009), analogous to the ventral visual stream (Tanaka 1996). The tuning of single units in human aSTG to speech is consistent with the dual-stream model, challenges the classical model, and provides important data bridging single-unit animal neurophysiology and noninvasive human work. Speech-selective units in aSTG may thus be analogous to object-selective cells in inferotemporal cortex (Tanaka 1996); like those cells, we observed apparent columnar organization of the aSTG responses.
The “what” stream likely continues to anteroventral temporal neurons that fire differentially to words referring to objects versus animals (Chan, Baker, et al. 2011), as well as to particular words (Heit et al. 1988). It is striking that, although highly and selectively responsive to spoken words, firing by aSTG neurons does not reflect semantic modulation, suggesting a lack of top-down modulatory influences on speech recognition. Such influences are clearly visible behaviorally (Warren 1970; Ganong 1980), and many models posit strong top-down projections from the lexico-semantic stage to at least the phonemic stage (Morton 1969; McClelland and Elman 1986). However, other models find such feedback unnecessary (Norris et al. 2000), or suppose that it does not occur until after the first phoneme is identified (Marslen-Wilson 1987). Consistent with recent magnetoencephalography (MEG) findings (Travis et al. 2012), our results suggest that such influences may be absent, at least at this stage of processing and at the stages leading to it.
The cells reported here demonstrated significant responses to specific subsets of phonemes and, in the most robust cell, to a subset of vowel sounds. Unit firing was invariant to speaker, despite differences in the fundamental frequency (F0) and vowel formants (F1 and F2) underlying these vowel sounds, suggesting that these cells are tuned in vowel space. Furthermore, decoding of phonemes yielded significantly higher accuracies than decoding of formant-derived classes.
The results of the STRF analysis provide additional evidence that low-level acoustic features fail to fully characterize the response properties of these units. Power in frequency bands is a poor predictor of the firing rate compared with MFCCs, which model the phonetic content of speech while discarding speaker-specific information. However, even MFCCs fail to robustly predict the firing rate to time-reversed speech. This may indicate that MFCCs do not completely capture some other high-level acoustic features of words, and that time-reversal destroys or reduces the salience of these features. Phoneme identity is one such feature; time-reversing the acoustic waveform associated with a phoneme, especially a consonant, often produces a percept that is nonphonemic in nature. It is possible that these units are tuned to high-level properties of phonemes that we are unable to completely characterize.
Another explanation for the limited ability of the MFCCs to predict the response to reversed speech is their insensitivity to nonlinear contextual effects, which augment the units' responses to the phonemes in their receptive fields. Furthermore, any contextual effects that MFCCs do encode would not be triggered, because in reversed words they would occur after the encoded phonemes rather than before. The inability of MFCCs to predict firing to reversed words is thus indirect evidence for contextual effects. Further evidence is the finding that decoding of the first vowel was over twice as accurate (relative to chance) when the context was consistent as when it changed over trials. Contextual effects could include not only adjacent phonemes, but also latency within the word. Cells varied widely in their response latencies to words, and many responded differentially to their preferred phonemes depending on when they occurred within the word. Another possible contextual effect was stress, which was confounded with latency in our data. Thus, our data suggest a population of units that respond to different sets of phonemes, modulated by their context and timing. These receptive fields could serve as building blocks for representing any word.
The current results also bear on the long-standing controversy regarding whether written words must activate their phonological representations in order to be understood. One theory posits that written words have direct access to meaning as well as indirect access via phonological recoding (Seidenberg 1985; Coltheart et al. 1993; Yvert et al. 2012). On this account, skilled readers reading high-frequency words access lexical representations via a purely visual pathway before phonological information has an opportunity to contribute. Other models suggest that written words necessarily undergo phonological processing before lexical identification, regardless of word frequency or task demands (Frost 1998). Studies have compared scalp event-related potentials (ERPs) to written words, pseudohomophones (nonwords that sound like actual words), and control nonwords (Newman and Connolly 2004; Braun et al. 2009). The differential response to pseudohomophones is taken to represent a conflict between orthographic and phonological information, and its presence before 200 ms as evidence for obligatory early phonological recoding of written words (Braun et al. 2009). However, the observed effects are quite small; several studies have failed to find them (Ziegler et al. 1999; Newman and Connolly 2004) or have estimated their latency at approximately 300 ms (Niznikiewicz and Squires 1996); and their localization is unclear. The fusiform visual word form area produces its major response at 150–200 ms and is highly sensitive to the combinatorial frequency of letters (Binder et al. 2006; Vinckier et al. 2007), raising the possibility that ERP differences to pseudohomophones reflect visual characteristics.
While MEG has localized activity in the posterior superior temporal cortex to visual words beginning at around 200 ms (Dale et al. 2000; Marinkovic 2004), there is no evidence that this activity represents phonological recoding. Conversely, although functional magnetic resonance imaging identifies this general region as activated by written words in tasks that require phonological recoding (Fiebach et al. 2002), it is not possible to know whether such activation occurs before or after lexical access. Another recent study demonstrated gamma-band iEEG responses in auditory areas to visual words at a latency of 700 ms (Perrone-Bertolotti et al. 2012). In contrast to these previous studies, our results directly demonstrate phonological recoding: unit firing that is correlated between spoken phonemes and the phonemes present in the idealized pronunciation of visual words. Lexico-semantic access for visual words is thought to occur by approximately 240 ms (Halgren et al. 2002; Kutas and Federmeier 2011). In our data, the firing of one cell reflecting phonological recoding of written words began slightly earlier, approximately 175 ms after word onset. Furthermore, high-frequency words demonstrated reduced correlation between phonemes in visual and auditory words, presumably reflecting a smaller need for phonological recoding. Thus, these data are consistent with the dual-route hypothesis of phonological recoding, in that we demonstrate neural activity with the expected characteristics in the aSTG at a latency that may allow it to contribute to word identification. However, it is possible that phonological recoding is not evoked by all words, since the auditory–visual correlation was greatly decreased for high-frequency words.
The use of the microelectrode array in this study allowed the examination of spatial organization that previous studies were unable to explore. Interestingly, nearby cells often had correlated response properties, but this correlation disappeared at distances over 1 mm. This may suggest that even in high-order processing areas without a clear spatial or spectral map (such as orientation or frequency), nearby cortical columns perform similar processing tasks. Notably, general response characteristics (e.g. whether the unit responded to auditory word stimuli at all) showed broader spatial correlations than more specific response characteristics (e.g. the 10-word response profile). This is similar to the columnar organization of object selectivity in inferotemporal visual cells. The consistency of firing profiles across the set of words for different units recorded at a given contact could be reflected in the population activity (high-gamma power) recorded by the same contact. However, the extent of such population activity is limited, inasmuch as macroelectrode ECoG recordings directly over the microelectrode site failed to show language-specific responses (Supplementary Fig. 2).
The proportion of single units identified as putative inhibitory interneurons (17%) is consistent with previous anatomical and physiological observations in animals and humans (Bartho et al. 2004; Peyrache et al. 2012). However, the total number of spikes produced by putative inhibitory cells was greater, owing to their higher firing rate (1.96 spikes/s) compared with pyramidal cells (0.17 spikes/s), again similar to what has been observed in sleeping humans (Peyrache et al. 2012). The mean overall firing rate was similar to prior reports of semichronic human recordings using fixed electrodes (Ravagnati et al. 1979), but was about 10-fold lower than in acute recordings with movable microelectrodes (Ojemann and Schoenfield-McNeill 1999), suggesting that those studies may have recorded from a different population of neurons. This has critical importance for calculations of energy utilization in the human brain (Attwell and Laughlin 2001; Lennie 2003). The average firing rate we found is about 50-fold lower than what has been assumed (Attwell and Laughlin 2001), but is consistent with previous theoretical predictions based on the high energy cost of action potentials and consequent synaptic events (Lennie 2003). Low background firing is sometimes cited as an indication that cortical cells use a sparse encoding strategy for associative memory and complex stimuli (Olshausen and Field 2004). Sparse encoding was strongly suggested by our observation that maximal or near-maximal prediction of phoneme or word identity could be achieved using the activity of only about 5 cells out of the approximately 150 that were isolated.
Studies using positron emission tomography, EEG, MEG, and intraoperative microelectrodes have shown that auditory cortex is suppressed during self-produced speech compared with the perception of external speech, and it has been suggested that this suppression reflects speech-feedback monitoring (Creutzfeldt et al. 1989b; Paus et al. 1996; Numminen et al. 1999; Curio et al. 2000; Houde et al. 2002; Heinks-Maldonado et al. 2005, 2006; Christoffels et al. 2007; Tourville et al. 2008; Baess et al. 2011). These studies have suggested that the phenomenon occurs globally across auditory cortex; however, units in the primary auditory cortex of primates have demonstrated a diversity of responses to self-produced vocalizations (Eliades and Wang 2005). A recent study has shown that neurons in the superior temporal region (including insula) demonstrate nonspecific vowel tuning during speech production (Tankus et al. 2012), and ECoG studies in humans have shown that different regions of auditory cortex demonstrate varying degrees of suppression (Flinker et al. 2010; Greenlee et al. 2011). Here, we show that this variability is present at an even finer spatial scale: single units within a 4 × 4 mm area demonstrate variable amounts of suppression by self-produced speech. Our additional findings that putative inhibitory interneurons also exhibit reduced firing, and that LFPs to self-produced speech are suppressed from their onset, suggest that the suppression begins at an earlier processing stage and that decreased local firing is due to decreased input.
It is important to note that these recordings come from the unique case of a single patient with epilepsy. However, the cortical location containing the microelectrode was included in the final resection, and subsequent staining and histology failed to find abnormal pathology at the array site. Furthermore, the patient's seizures were found to start in medial temporal sites, making it less likely that aSTG was actively involved in seizure initiation. However, we cannot rule out the possibility that medications or long-standing epilepsy affected the responses we recorded.
Taken together, these data suggest that the aSTG contains a spatially organized processing unit specialized for extracting lexical identity from acoustic stimuli, lying midway between acoustic input in medial Heschl's gyrus, and supramodal semantic representations in the anteroventral temporal cortex. This module encodes high-order acoustic–phonetic information during the perception of both spoken and written words, suggesting that aSTG is involved in phonological recoding during reading. Single units robustly represent perceptual phonemic information, and it is possible that a small population of cells, each encoding a different set of phonemes in different phonological contexts, could represent the acoustic form of a specific word.
Conflict of Interest: None declared.