The timbre of a sound plays an important role in our ability to discriminate between behaviorally relevant auditory categories, such as different vowels in speech. Here, we investigated, in the primary auditory cortex (A1) of anesthetized guinea pigs, the neural representation of vowels with impoverished timbre cues. Five different vowels were presented with durations ranging from 2 to 128 ms. A psychophysical experiment involving human listeners showed that identification performance was near ceiling for the longer durations and degraded close to chance level for the shortest durations. This was likely due to spectral splatter, which reduced the contrast between the spectral profiles of the vowels at short durations. Effects of vowel duration on cortical responses were well predicted by the linear frequency responses of A1 neurons. Using mutual information, we found that auditory cortical neurons in the guinea pig could be used to reliably identify several vowels for all durations. Information carried by each cortical site was low on average, but the population code was accurate even for durations where human behavioral performance was poor. These results suggest that a place population code is available at the level of A1 to encode spectral profile cues for even very short sounds.
Timbre is formally defined as what allows us to distinguish 2 sounds that have otherwise the same pitch, loudness, and duration (American Standards Association 1960; Plomp 1970). More importantly, timbre plays a critical role in the ability of human listeners to recognize acoustic events (Helmholtz and Ellis 1895; Town and Bizley 2013). As a simplified case for timbre in speech, vowels have been used in behavioral and neurophysiological investigations. Vowels are produced by modulating the shape of the vocal tract and differ according to the position of a small number of spectral resonances called formants (Fant 1960). Behaviorally, the first 2 formants are the most important ones to identify a vowel (e.g., Sakayori et al. 2002; Assmann and Nearey 2008), even though the timbre dimensions of vowels may be richer (Bloothooft and Plomp 1988). Recent behavioral evidence shows that mammals (rats, cats, ferrets) can distinguish between vowels (Hienz et al. 1998; Bizley et al. 2013; Perez et al. 2013). Thus, the timbre cues relevant to vowels are encoded not only in humans, but also in mammals, perhaps as a consequence of the similar functional organization of auditory pathways. Neural investigations have indeed documented detailed representations of vowels at many stages of the auditory system, from the auditory nerve (Sachs and Young 1979; Delgutte and Kiang 1984) to the auditory cortex (ACx) (Mesgarani et al. 2008; Walker et al. 2011) in animals and humans (Mesgarani et al. 2014).
In the present study, we used vowels with degraded timbre cues, using a technique termed “gating.” Gating involves extracting very short segments from longer duration sounds. Gray (1942) found that segments as short as 3–4 ms already lead to performance above chance, a result subsequently confirmed by Robinson and Patterson (1995a). Furthermore, timbre identification of musical instruments can also be achieved above chance on a similar time scale (Robinson and Patterson 1995b; Suied et al. 2014). So, even if longer segment durations (10–30 ms) are required for near-ceiling vowel identification (Powell and Tosi 1970; Suen and Beddoes 1972), very short snippets of sound already contain timbre cues sufficient to partially identify vowels.
This poses a challenge for the neural representation of vowels and timbre in general. First, when sounds are shortened, line components in the spectrum of long-duration sounds, corresponding to individual harmonics, are replaced by the convolution between the component and the gating window. This so-called “spectral splatter” can reduce the acoustic contrast between sounds. Second, even though the ringing of cochlear filters will lengthen the effective duration of short stimuli (Uppenkamp et al. 2001), it can be expected that very short vowels will elicit only a few spikes in auditory neurons. This puts strong constraints on the neural code underlying perceptual decisions, ruling out for instance an average rate code in individual neurons (Thorpe et al. 1996).
In the present study, we recorded responses of neurons in the primary ACx of guinea pigs to 5 vowels (with a duration of 2–128 ms) and compared the resulting representation with human identification performance. We found that human performance was close to chance for a duration of 2 ms, and then gradually increased up to a plateau at 16 ms and longer, independent of pitch. In contrast, populations of ACx neurons remained able to discriminate vowels far above chance at all durations, probably because of cues located in the high-frequency range of vowels. Moreover, the neural responses could be predicted with good accuracy from simple linear modeling of spectro-temporal receptive fields (STRFs).
Materials and Methods
Seven participants (4 women), aged between 24 and 36 years, took part in the experiment. All had self-reported normal hearing and provided informed consent to participate in the study. The study was conducted in accordance with the guidelines of the Declaration of Helsinki.
The stimuli of this experiment were taken from the RWC Database (Goto et al. 2003). They consisted of recordings of 4 talkers, each singing 5 vowels (/a/, /e/, /i/, /o/, /u/). Vowels were sung at 12 different pitches, from A3 (220 Hz) to G#4 (415 Hz). Only one pitch was selected at random on each trial. Stimuli were then gated by using a raised-cosine (Hanning) window. The duration of the gate could be 2, 4, 8, 16, 32, 64, or 128 ms. The starting point of the gating was chosen randomly between 0 and 100 ms after the original sample onset, so that the segment of the sound that was presented was essentially unique to each trial. The stimuli were normalized by their root-mean-square values and further scaled by the square root of their durations, to achieve approximately equal loudness for the whole set (see details in Suied et al. (2014) and the original method in Robinson and Patterson (1995a, 1995b)).
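As an illustration, the gating and level-normalization steps just described can be sketched in Python (this is not the original stimulus-generation code; in particular, referencing the square-root-of-duration scaling to the longest, 128-ms gate is our assumption):

```python
import numpy as np

def gate_vowel(signal, fs, duration_ms, onset_ms):
    """Extract a Hanning-gated segment from a longer vowel recording,
    RMS-normalize it, then rescale by the square root of its duration."""
    n = int(round(duration_ms * fs / 1000.0))
    start = int(round(onset_ms * fs / 1000.0))
    segment = signal[start:start + n] * np.hanning(n)
    segment = segment / np.sqrt(np.mean(segment ** 2))   # RMS normalization
    # sqrt-of-duration scaling (referenced here to the 128-ms gate; the
    # choice of reference duration is an assumption, not stated in the text)
    return segment * np.sqrt(duration_ms / 128.0)
```

With this scaling, a 2-ms segment ends up with an RMS that is sqrt(2/128) of that of a 128-ms segment, attenuating the shortest gates to approximately equalize loudness.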
Stimuli were presented through an RME Fireface digital-to-analog converter at a 16-bit resolution at a 44.1-kHz sample rate. They were presented diotically through Sennheiser HD 250 Linear II headphones. The presentation level was set at 70 dB (A), as calibrated for the 128-ms sounds with a Bruel & Kjaer (2250) sound level meter and ear simulator (B&K 4153). Listeners were tested individually in a double-walled Industrial Acoustics (IAC) sound booth. They provided their response using the computer keyboard.
Participants had to identify the target that they had just heard out of 5 possible choices (5AFC: /a/, /e/, /i/, /o/, or /u/). The experiment started with a training session that included only the longest versions of the sounds (128 ms). The training was terminated when the participant reached a minimum of 90% recognition for the long sounds. All participants reached this performance in <30 min. Then the main experiment started, in which all durations were presented randomly interleaved. Visual feedback (“Right” or “Wrong”) was provided after each response. Fifty repetitions for each target were collected per participant.
The number of times across trials where each response was assigned to each vowel defines a confusion matrix of size 5 × 5 (possible responses × vowels). From this confusion matrix, performance was evaluated by the Mutual Information (MI) given by Shannon's formula: MI = Σv,r p(v,r) log2[p(v,r)/(p(v)p(r))], where p(v,r) is the joint probability of presenting vowel v and observing response r, estimated from the confusion matrix, and p(v) and p(r) are the corresponding marginal probabilities.
In fact, MI = 0 (equivalently, a 20% correct rate) is only an asymptotic chance level, because of the limited number of repetitions in the protocol: a random distribution of responses is unlikely to be perfectly uniform across vowels. The true significance level of the average MI across participants was therefore assessed as follows. Simulated confusion matrices with completely random assignment of the responses were generated, with the same number of repetitions per target (50) as in the experiment, and the MI was computed for each. The process was repeated 10 000 times to obtain the true chance distribution of MI (which has a gamma-like shape), with mean m = 0.048 bits and standard deviation (SD) σ = 0.006 bits. By the central limit theorem, the average of n chance-level MI values approximately follows a Gaussian law, so the chance level, that is, the upper limit of the confidence interval for such an average MI, is m + Φ−1(1−α)*σ/√n, where Φ−1 is the inverse cumulative normal distribution function and α is the chosen risk. We chose the conservative level α = 2.5%/7 because comparisons were made at all 7 vowel durations (Bonferroni correction). With n = 7 participants, this gave a threshold of 0.054 bits for the average MI across participants. This threshold defines the chance level in the subsequent report of behavioral results.
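A minimal Python sketch of this Monte Carlo procedure follows (an illustration of the principle, not the original analysis code; with 10 000 simulations it should yield values close to the reported m = 0.048 bits and σ = 0.006 bits):

```python
import numpy as np
from scipy.stats import norm

def mi_bits(cm):
    """Mutual information (bits) of a confusion matrix (responses x vowels)."""
    p = cm / cm.sum()
    pr = p.sum(axis=1)                      # response marginals
    pv = p.sum(axis=0)                      # vowel marginals
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / np.outer(pr, pv)[nz])))

def chance_mi_distribution(n_stim=5, n_rep=50, n_sim=10_000, seed=0):
    """Null distribution of MI when responses are assigned at random."""
    rng = np.random.default_rng(seed)
    mis = np.empty(n_sim)
    for i in range(n_sim):
        resp = rng.integers(0, n_stim, size=(n_stim, n_rep))  # row per vowel
        cm = np.stack([np.bincount(r, minlength=n_stim) for r in resp], axis=1)
        mis[i] = mi_bits(cm.astype(float))
    return mis

def chance_threshold(mis, n, alpha):
    """Upper confidence bound for the mean of n chance-level MI values."""
    return mis.mean() + norm.ppf(1 - alpha) * mis.std() / np.sqrt(n)
```

The same `mi_bits` estimator applies to any confusion matrix; for a perfectly diagonal 5 × 5 matrix it returns log2(5) ≈ 2.32 bits, the maximum for 5 equiprobable vowels.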
Recordings were obtained in the left primary ACx of 12 adult pigmented guinea pigs (10 males, 2 females). Animals, weighing from 450 to 1100 g (3–10 months old), came from our own colony housed in a humidity (50–55%) and temperature (22–24 °C)-controlled facility on a 12 h/12 h light/dark cycle (light on at 7:30 AM) with free access to food and water. All surgical procedures were performed in compliance with the guidelines determined by the national (JO 887–848) and European (86/609/EEC) legislations on animal experimentation, which are similar to those described in the Guidelines for the Use of Animals in Neuroscience Research of the Society for Neuroscience. The experiments were performed using the procedures 32 and 34 validated by the Ethics committee N°59 (Paris Centre et Sud). Two to four days before each experiment, the animal's audiogram was determined by testing auditory brainstem responses (ABR) under isoflurane anesthesia (2.5%) as described in Gourévitch et al. (2009). The ABR was obtained by differential recordings between 2 subdermal electrodes (SC25-NeuroService) placed at the vertex and behind the mastoid bone. Averages of 500 responses were collected at 9 frequencies (between 0.5 and 32 kHz) presented between 70 and 0 dB SPL to obtain the audiogram. All the animals used in the present study showed audiograms in the range previously reported for healthy guinea pigs (Robertson and Irvine 1989; Gourévitch et al. 2009; Gourévitch and Edeline 2011).
The animal was anesthetized by an initial injection of urethane (1.2 g/kg, i.p.) supplemented by additional doses of urethane (0.5 g/kg, i.p.) when reflex movements were observed after pinching the hind paw (usually 4 times during the experiment). A single dose of atropine sulfate (0.06 mg/kg, s.c.) was given to reduce bronchial secretions. After placing the animal in a stereotaxic frame, skin and overlying muscle were removed to expose the skull and a local anesthetic (Xylocain 2%) was liberally injected in the wound. A large craniotomy was performed above the left temporal cortex. The opening was 8 mm wide, starting at the intersection between the parietal and temporal bones, and 8–10 mm high. The dura above the ACx was removed under binocular control and the cerebrospinal fluid was drained through the cisterna to prevent the occurrence of oedema. After the surgery, a pedestal of dental acrylic cement was built to allow an atraumatic fixation of the animal's head during the experiment. The stereotaxic frame supporting the animal was placed in a sound-attenuating chamber (IAC, model AC1). At the end of the experiment, a lethal dose of pentobarbital (>200 mg/kg, i.p.) was administered to the animal.
Data are from multiunit recordings collected in the primary ACx (area AI). Extracellular recordings were obtained from arrays of 16 tungsten electrodes (ø: 33 µm, <1 MΩ) composed of 2 rows of 8 electrodes separated by 1000 µm (350 µm between electrodes of the same row). A silver wire, used as ground, was inserted between the temporal bone and the dura mater on the contralateral side. The location of the primary ACx was estimated based on the pattern of vasculature observed in previous studies (Edeline and Weinberger 1993; Manunta and Edeline 1999; Wallace et al. 2000; Edeline et al. 2001). The raw signal was amplified 10 000 times (TDT Medusa) and then processed by an RX5 multichannel data acquisition system (TDT). The signal collected from each electrode was filtered (610–10 000 Hz) to extract multiunit activity (MUA). The trigger level was set for each electrode to select the largest action potentials from the signal. On-line and off-line examination of the waveforms indicated that the MUA collected here was made of action potentials generated by 3–8 neurons in the vicinity of the electrode. At the beginning of each experiment, we set the position of the electrode array in such a way that the 2 rows of 8 electrodes sampled neurons responding from low to high frequency when progressing in the rostro-caudal direction (see example in Fig. 1 of Gaucher et al. (2012) and in Fig. 6A of this paper). Typically, 3–4 electrode positions (each called a “recording session” in the following) were tested for a given animal. The final recording depth was 500–1000 µm, which corresponds to layer III and the upper part of layer IV according to Wallace and Palmer (2008).
Acoustic stimuli were generated in Matlab, transferred to a RP2.1-based sound delivery system (TDT) and sent to a Fostex speaker (FE87E). The speaker was placed at 2 cm from the guinea pig's right ear, a distance at which the speaker produced a flat spectrum (±3 dB) between 140 Hz and 36 kHz. Calibration of the speaker was made using noise and pure tones recorded by a Bruel & Kjaer microphone 4133 coupled to a preamplifier B&K 2169 and a digital recorder Marantz PMD671.
The time–frequency response profile (TFRP) of neurons was determined using 64 tone frequencies, covering 8 octaves (0.14–36 kHz), and presented at 75 dB SPL. Each frequency was repeated 8 times at a rate of 2.35 Hz in pseudorandom order. Tones of 6 possible durations (1, 2, 4, 8, 16, and 32 ms) were used, the duration being measured at the half-amplitude of the sinusoidal envelope of the tone. All tones were normalized by their RMS and the 32-ms tones were presented at 75 dB SPL crest.
The same 5 vowels as in the psychoacoustic task were used, with identical normalization but with some differences due to practical constraints. Seven gate durations were used, ranging from 2 to 128 ms, and 3 onset times for the gate were chosen (at random) for each duration. Vowel identity, duration, and onset time were selected randomly at each presentation. Since the guinea pig audiogram typically ranges between 100–200 Hz and 40 kHz at 55–75 dB SPL, we only used the highest pitch, G#4 (415 Hz), of the vowels presented in the behavioral study. Presentations were separated by a 500-ms silent gap. A single duration of a given vowel was presented 27 times. The presentation level was set at 75 dB (SPL), as calibrated for the 128-ms sounds with a Bruel & Kjaer (2235) sound level meter.
Inserting an array of 16 electrodes in the cortical tissue almost systematically induces a deformation of the cortex. At least a 15-min recovery time was allowed for the cortex to return to its initial shape, then the array was slowly lowered. Responses to pure tones (32-ms duration) were used to assess the quality of our recordings and to adjust the electrode depths. When clear tuning was obtained for at least 8 of the 16 electrodes, the recording session started. Acoustic stimuli were presented in the following order: the pure tones described previously to determine the TFRP of neurons at durations 32–2 ms (in that order), followed by 3 min of spontaneous activity, followed by vowels and another TFRP with 32-ms tones. The stability of our recordings was assessed by comparing the neural response to the 32-ms tones presented at the beginning versus at the end of a session.
Quantification of responses to pure tones
Time–frequency response profiles (TFRPs) were obtained from MUA by constructing poststimulus time histograms (PSTHs) for each frequency with 1-ms time bins. All spikes falling in the averaging time window (starting at stimulus onset and lasting 100 ms) were counted. Thus, TFRPs were matrices of 100 bins in abscissa (time) multiplied by 64 bins in ordinate (frequency). They were smoothed with a uniform 3 × 3 bin window. The TFRP for 32-ms tones (the longest duration used here for pure tones) represents an approximation of the standard STRF for the cortical site, as STRFs are usually derived with long stimuli.
Peaks of significant response in TFRPs were automatically identified using the following procedure: a positive peak in the MU-based TFRP was defined as firing rates above the average level of the baseline activity plus 6 times the SD of the baseline activity. The baseline activity was estimated from the first 10 ms of TFRPs, a latency too short to contain evoked activity in the guinea pig primary ACx (see Wallace and Palmer 2008; Huetz et al. 2009). For a given site and a given tone duration, 2 measures were extracted from the peaks. First, the “bandwidth” was defined as the sum of all peak widths in octaves. Second, the “response duration” was the time difference between the first and last spike of the significant peaks. A cortical site was categorized as “sustained” if there was a significant correlation (correlation test, P < 0.05) between the response duration and the tone duration, and as “onset” otherwise. For each site, the best frequency (BF) was defined as the frequency where the highest firing rate was recorded in the TFRP for the 32-ms tones.
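The peak-detection procedure can be sketched as follows (a simplified Python illustration, not the original analysis code: bandwidth is approximated here by counting significant frequency rows, and response duration by the span of significant time bins rather than actual spike times):

```python
import numpy as np

def significant_mask(tfrp, baseline_bins=10, k=6.0):
    """Bins of a TFRP (n_freq x n_time, 1-ms time bins) exceeding the
    baseline mean plus k standard deviations (baseline = first 10 ms)."""
    base = tfrp[:, :baseline_bins]
    return tfrp > base.mean() + k * base.std()

def bandwidth_octaves(mask, octaves_per_bin=8.0 / 64):
    """Total width, in octaves, of frequency rows containing a significant
    peak (64 frequency bins spanning 8 octaves)."""
    return mask.any(axis=1).sum() * octaves_per_bin

def response_duration_ms(mask):
    """Time between the first and last significant bin, in ms."""
    t = np.where(mask.any(axis=0))[0]
    return 0 if t.size == 0 else int(t[-1] - t[0] + 1)
```

Classifying a site as “sustained” then amounts to testing the correlation between `response_duration_ms` and tone duration across the 6 tone durations.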
Quantification of responses to vowels
Response duration to vowels was quantified as follows: PSTHs were first constructed as histograms of spikes across trials with a binwidth of 1 ms and then smoothed by a rectangular 3-bin window. Thresholding was then applied to estimate response duration. The threshold for each PSTH was the average bin value in the first 10 ms, plus 6 times the SD of these bin values. The total width of the above-threshold regions defined the response duration of a cortical site to a given vowel.
The MI between neural responses and stimuli was then estimated. By analogy with the behavioral task, we used an indirect method (Rolls et al. 1997; Nelken et al. 2005; Schnupp et al. 2006) to build a confusion matrix and then compute the amount of information (Shannon 1948) contained in the cortical responses to vowels. This method quantifies how well the vowel's identity can be inferred from the neuronal responses. As this method is exhaustively described in Schnupp et al. (2006), we present here only the main principles.
The method relies on a pattern recognition algorithm that is designed to guess which stimulus evoked a particular response pattern, by going through the following steps: for a given cortical site, a single response (test pattern) to a vowel, say /a/, was extracted and represented as a PSTH with a given bin size (different sizes were considered as discussed below). A mean response pattern was computed for each vowel (excluding the test pattern for /a/). The test pattern was assigned to the vowel of the closest mean response pattern (average squared difference between patterns, i.e., Euclidean distance). This operation was repeated for all the single responses available, generating a confusion matrix where each response was assigned to a given vowel. From this confusion matrix, the MI was derived by Shannon's formula as previously presented.
Whatever the vowel duration, we selected the first 200 ms of the responses to evaluate MI, in order to compare spike trains of identical duration. Neuronal responses can be represented using different time scales, ranging from the whole response (firing rate) to millisecond precision (precise temporal patterns). Here, the classification algorithm was applied at 9 different bin sizes ranging from 200 to 3 ms and the best MI across all bin sizes was retained.
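The core of the pattern-recognition step, at a single bin size, can be sketched in Python (an illustration of the principle, not the original code; the loop over bin sizes and the MI computation from the resulting confusion matrix are omitted here):

```python
import numpy as np

def decode_confusion_matrix(psths):
    """Leave-one-out nearest-template classification of single-trial PSTHs.
    psths: array (n_vowels, n_trials, n_bins). Returns a confusion matrix
    with decoded vowels in rows and presented vowels in columns."""
    n_vow, n_trials, _ = psths.shape
    cm = np.zeros((n_vow, n_vow))
    sums = psths.sum(axis=1)
    for v in range(n_vow):
        for t in range(n_trials):
            test = psths[v, t]
            templates = sums / n_trials
            # exclude the test trial from its own class mean
            templates[v] = (sums[v] - test) / (n_trials - 1)
            dist = ((templates - test) ** 2).sum(axis=1)   # Euclidean distance
            cm[np.argmin(dist), v] += 1
    return cm
```

When single-trial responses are perfectly stereotyped and vowel-specific, every trial is assigned to its own vowel and the confusion matrix is diagonal; noisier or less selective responses spread the counts off the diagonal and lower the MI.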
In addition to the individual MI associated with a cortical site, we also estimated a population MI: instead of using spike trains from a given electrode, we applied all of the above computation steps to a concatenation of all the spike trains simultaneously recorded by several electrodes. Following the same procedure as for the behavioral study, we estimated m = 0.09 bits and SD σ = 0.01 bits for the distribution of MI for random confusion matrices, which applies to both individual and population MI. As described above, the chance level for the average MI depends on the number n of averaged values: it was 0.092 bits when all of the n = 410 recorded cortical sites were taken into account, and 0.096 bits for the population MI when the n = 37 recording sessions were used. The chance level for a single cortical site, that is, one confusion matrix, was estimated from the original MI distribution to be 0.123 bits. In the following, the average MI will be computed for various values of n and the corresponding chance level will be indicated in the figure legend.
We aimed at using the same confusion matrix method to estimate MI for both neural and behavioral data. It should be acknowledged that several other estimators of MI are available, but none has proved obviously superior (Strong et al. 1998; Borst and Theunissen 1999; Victor 2002; Nemenman et al. 2004; Barbieri et al. 2004; Nelken et al. 2005; Nelken and Chechik 2007). Moreover, it is important to note that the confusion matrix generated by clustering spike trains is not intended to correspond to the actual neural mechanisms used by the auditory system, nor is it necessarily the best use of the available information. Still, the MI derived in such a way is an objective evaluation of the information that is available in the spike trains.
To assess the best possible performance on neural vowel discrimination, for a group of simultaneously recorded cortical neurons in ACx, simulated data were constructed. Excitatory TFRPs were first simulated using Gaussian bell curves in both log2-frequency and time dimensions (bin 0.5 ms × 1/12th of octave). The Gaussian bell curve in frequency was centered at the BF of the cortical site. The SD in frequency was chosen according to the desired bandwidth of the response. Typically, an SD of 0.45 in log2-frequency was used to obtain a bandwidth of 2 octaves. An SD of 4 ms was always used for the time dimension. The log spectrogram of vowels was then convolved with the simulated TFRPs for each frequency, and then summed across frequencies to obtain a firing rate probability as a function of time, similar to a PSTH. Finally, the probability maximum across channels, vowels, and vowel durations was set to a maximum of 0.11 in a bin and a baseline firing rate probability of 0.01 was added as noise. These 2 values were chosen to match the overall maximum evoked firing rate (median 222 sp/s, SD 314 sp/s) and average spontaneous firing rate (median 22.2 sp/s, SD 36 sp/s) obtained in real data for a bin size of 2 ms. Indeed, over 200 ms of response with time bins of 0.5 ms, 222 and 22 sp/s correspond to a probability per bin of 0.11 and 0.01, respectively. We then generated spikes from these probabilities by the classical acceptance-rejection sampling method (generate a spike when a uniform random number is lower than the given probability) for 27 simulated trials, the same number of repeats as in the experimental data.
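The main ingredients of this simulation can be sketched in Python (an illustration, not the original code; the spectrogram-convolution step is omitted, and the response latency used in the Gaussian receptive field is an arbitrary choice for the example):

```python
import numpy as np

def gaussian_tfrp(freqs_log2, times_ms, bf_log2,
                  sd_oct=0.45, sd_ms=4.0, t0_ms=12.0):
    """Separable Gaussian receptive field in log2-frequency and time.
    t0_ms (response latency) is an assumption for illustration."""
    g_f = np.exp(-0.5 * ((freqs_log2 - bf_log2) / sd_oct) ** 2)
    g_t = np.exp(-0.5 * ((times_ms - t0_ms) / sd_ms) ** 2)
    return np.outer(g_f, g_t)

def drive_to_probability(drive, p_max=0.11, p_base=0.01):
    """Scale a nonnegative drive so its peak bin probability is p_max,
    then add the baseline spontaneous probability p_base."""
    return p_base + drive * (p_max / drive.max())

def sample_spikes(p, n_trials=27, seed=0):
    """Acceptance-rejection sampling: a spike occurs in a bin whenever a
    uniform random number falls below that bin's firing probability."""
    rng = np.random.default_rng(seed)
    return (rng.random((n_trials, p.size)) < p).astype(int)
```

For 0.5-ms bins, a constant probability of 0.11 per bin yields an expected rate of about 220 sp/s, matching the calibration described above.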
A population of 16 cortical sites having different BFs was simulated. Based on the characteristic frequencies presented on Figure 2A–D in Wallace et al. (2000), who sampled the guinea pig primary ACx, we used the following distribution of BFs: (0.14–0.56) kHz: 11%; (0.56–2.25) kHz: 24%; (2.25–9) kHz: 23%; (9–36) kHz: 42%. From these simulated population data, we also obtained confusion matrices and MI estimates by the same method as for the experimental data.
Data Reporting and Statistical Testing
In all of the electrophysiology results, data are presented as mean values ± standard error of the mean (SEM). All statistical tests are paired t-tests unless otherwise specified. We used a significance level of 0.05 with Bonferroni corrections for multiple tests.
Human Performance for Short Vowels Identification
The behavioral responses were analyzed in terms of correct discrimination percentage derived from confusion matrices, expressed in terms of MI (see Materials and Methods). Results are shown on Figure 2. As expected, there was a large effect of duration on the MI. Although performance was close to chance for a duration of 2 ms, a comparison of the results with simulated random confusion matrices (see Materials and Methods) showed that the MI value was still significantly above chance at 2 ms. Interestingly, there is a diagonal structure to the confusion matrix at 2 ms, suggesting that even if errors were made, they were not equally spread across vowel labels. Performance then increased for longer durations. At 8 ms, subjects were able to identify the correct vowel in more than 60% of cases (MI = 0.72 bits). Performance then plateaued, with about 90% correct identification and an MI of about 2.15 bits for 16 ms and longer. Figure 2C gives a breakdown of the performance, this time measured as the percent of correct identifications (Hits), according to the pitch that was presented (selected at random on each trial). There was no obvious effect of pitch on performance: a repeated-measures ANOVA, with pitch and duration as within-subjects variables, showed no significant main effect of pitch [F1,6 = 4.16; P = 0.09], nor a significant interaction between pitch and duration [F7,42 = 2.03; P = 0.08]. This is consistent with Suied et al. (2014), who also failed to observe an effect of pitch on identification performance using a different set of gated target sounds. Furthermore, there was little evidence for any “learning” effect in our data (Supplementary Fig. 1): there was no significant influence of block number on the overall performance (see ANOVA tests in legend of Supplementary Fig. 1), and only the very first block of 5 presentations per vowel led to a trend towards lower performance.
Neural Receptive Fields for Short Pure Tones
To understand the neural responses to very short sounds, electrophysiological recordings were performed. Twelve pigmented guinea pigs provided 410 cortical sites over 37 sessions (a session was a set of recordings obtained for one position of the electrode array in AI, see Materials and Methods). First, pure tones between 1 and 32 ms long were used as stimuli (see Materials and Methods).
Figure 3 displays examples of TFRPs obtained at different pure-tone durations. The spectral splatter at short durations, which corresponds acoustically to a broadening of the pure-tone spectral representation, had visible consequences on the estimated TFRPs. In the example displayed in Figure 3A, a neuron having a low BF (0.56 kHz) showed a broader frequency response for short tone durations. In contrast, the TFRP of a high-frequency neuron (Fig. 3B) appeared hardly changed. These observations generalize to the overall neural population when comparing the bandwidth of TFRPs (see Materials and Methods) for 1 and 32 ms tone durations, as a function of the BF of the cortical site (Fig. 3C): the increase in bandwidth is larger at low than at high frequencies. We interpret these observations with simple acoustic considerations: the spectral shape of the gating window has a fixed width on a linear frequency scale, but neural TFRPs are measured on a logarithmic scale. On such a logarithmic scale, the widening of the stimulus imposed by the gating window appears relatively larger for lower center frequencies than for higher ones.
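This scale argument can be checked numerically: a Hanning window of duration T spreads a tone's energy over a main spectral lobe of roughly ±2/T Hz, a fixed width in linear frequency. The sketch below (our illustration; the floor applied to the lower lobe edge is an arbitrary choice to keep the logarithm defined at very low center frequencies) converts that width into octaves:

```python
import numpy as np

def splatter_width_octaves(cf_hz, dur_ms):
    """Approximate octave width of the main spectral lobe of a pure tone
    gated by a Hanning window of duration dur_ms (lobe ~ cf +/- 2/T Hz).
    The lower edge is floored at cf/16 (an arbitrary choice) so the
    logarithm stays defined when 2/T exceeds cf."""
    half_hz = 2.0 / (dur_ms / 1000.0)
    lo = max(cf_hz - half_hz, cf_hz / 16.0)
    return float(np.log2((cf_hz + half_hz) / lo))
```

For 1-ms tones, this lobe spans several octaves around a 0.56-kHz center frequency but less than one octave around 16 kHz, consistent with the bandwidth increase being confined mostly to low-BF sites in Figure 3C.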
Overall, the examples and group results of Figure 3 give a first indication that low-frequency neurons should respond to a broader frequency range with short-duration sounds. Such low-frequency neurons should be less informative of the fine spectral content of the stimuli. The electrophysiological data with pure tones thus seem predictable from a simple linear analysis, that is, the convolution of the “true” receptive field with the stimulus even at short duration. This will be more formally tested in the section below, “Linear model of spectral processing for vowels.”
Primary Auditory Cortex Responses to Short Vowels
Figure 4 displays individual examples of responses of 4 cortical sites to vowels at 2, 8, and 128 ms. Similar to what was observed with pure tones, high-frequency neurons showed little modification of their response with respect to vowel duration (Fig. 4A). Also, these high-frequency neurons responded differently to /o/ compared with /e/ (see the rasters in Fig. 4A). This is because the vowel /o/ contains almost no energy in frequencies above 1.5 kHz, in contrast to /e/, which has energy above 1.5 kHz. Based on this coarse cue alone, high-frequency neurons should discriminate at least 2 of the 5 vowels even at very short durations. This was confirmed by the relatively high value of MI at 2 ms (Fig. 4A, curve in the bottom left).
Neurons with very low BFs (i.e., BF <0.5 kHz) exhibited another type of behavior. Because /a/ contains only little energy at F0 (see Fig. 1A), these neurons responded weakly to it, so neural discrimination of this vowel against the others was possible even at short durations (Fig. 4B).
Finally, neurons with BFs between 0.5 and 2 kHz showed similar responses for all vowels at short durations (Fig. 4C). As their BFs fell in the frequency band where all vowels contain most of their energy, and as spectral splatter made short-duration vowels acoustically similar, such neurons saw a regular decrease in their ability to encode vowel identity with decreasing duration. For a last class of neurons, which displayed multipeaked TFRPs (e.g., see Sutter and Schreiner (1991)), a large range of firing rates and latencies were obtained in the responses to vowels which led to very high MIs for long vowel durations (Fig. 4D). Supplementary Figures 2 and 3 propose an interpretation for the latency difference between responses to 128-ms-long vowels /o/ and /e/ (Fig. 4C,D). Two main factors can be identified. First, cortical sites had a higher threshold in the lower frequency range, excited by /o/, compared with the thresholds in the frequency range excited by /e/. This induces a delay before reaching the intensity required to evoke a response when /o/ is used as a stimulus. Second, a delay is always present (due to cortical and subcortical processing) when a neuron is activated by a stimulus far from its BF, which is the case for /o/ but not for /e/ and its broad spectral content. Overall, these examples suggest that BF is the determining factor in a neuron's ability to encode vowel identity at different durations.
Figure 5 summarizes the effect of BF on MI for all recording sites. Neurons having BFs in the low-frequency range, defined here as (0.14–2.25) kHz, showed the lowest MI for 2-ms-long vowels, compared with neurons with higher BFs (unpaired t-test between BFs below and above 2.25 kHz: P < 10−4). This is consistent with the individual examples of Figure 4. The MI increased with duration for all BFs above 0.56 kHz (ANOVA, P < 10−4 for all ranges [0.56–2.25] kHz, [2.25–9] kHz, [9–36] kHz; P = 0.25 for range [0.14–0.56] kHz). Although significant, the increase in MI from 2 to 128 ms was reduced by half for BFs in the highest range (9–36) kHz (Fig. 5A). In fact, high-frequency neurons only responded to /e/ and /a/ at most durations and therefore did not carry much more information for longer durations compared with short ones, as shown in the example of Figure 4A. Rather, neurons having BFs in the range (2.25–9) kHz, that is, corresponding to the higher formant frequencies of the 5 vowels, showed the best vowel discrimination abilities at all durations, when tested against neurons with BFs in other frequency ranges (for all unpaired t-tests, P < 0.01).
On average, neurons generating sustained responses (N = 119) led to higher MI than “onset” neurons (N = 291), but only for vowel durations of 16 ms or more (Fig. 5B, unpaired t-tests, P < 0.001). Finally, while longer vowels led to longer neural responses (Fig. 5C), the time windows for spike train analysis that led to the highest MI remained unchanged with vowel duration, being mostly in the range (9–25) ms (Fig. 5D).
The previous analyses involved individual cortical sites, so they do not assess the representation available in primary ACx as a whole. We therefore computed the MI derived from simultaneously recorded cortical sites (population MI, see Materials and Methods). As an illustration of population MI, Figure 6A collates the recordings obtained from 8 cortical sites (among 16 sites recorded), aligned across the tonotopic map. When vowels were short (2 ms), cortical sites with high BFs were more informative: sites with low BFs responded to all vowels, whereas sites with high BFs responded only to /e/, the only vowel containing high-frequency formants (Fig. 1B). In contrast, at the longer duration of 128 ms, all cortical sites were informative, responding selectively to only 1–3 vowels (Fig. 6B, bottom row). Accordingly, the confusion matrix and the MI computed for the population of the 16 recordings of Figure 6A showed reasonably good MI even at the shortest duration, and an increase in MI with vowel duration (Fig. 6C,D). Thanks to our sampling of BFs across virtually the entire tonotopic map at each position of the electrode array, the population MI was significantly greater than the individual MI for all vowel durations (Fig. 7B, for all unpaired t-tests, P < 10−5; see also the average confusion matrix over experiments in Fig. 7A).
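For readers who wish to reproduce the information measure, the step from a confusion matrix to MI can be sketched as follows (a minimal illustration in Python, not the authors' analysis code; the function name and the 27-trials-per-vowel example are ours):

```python
import numpy as np

def mi_from_confusion(conf):
    """Mutual information (bits) between presented and decoded vowel,
    treating the confusion-matrix counts as a joint distribution."""
    joint = conf / conf.sum()              # P(stimulus, response)
    ps = joint.sum(axis=1, keepdims=True)  # P(stimulus), column vector
    pr = joint.sum(axis=0, keepdims=True)  # P(response), row vector
    nz = joint > 0                         # skip zero cells (0 * log 0 = 0)
    return float((joint[nz] * np.log2(joint[nz] / (ps @ pr)[nz])).sum())

# Perfect identification of 5 vowels, 27 trials each:
perfect = np.eye(5) * 27
print(mi_from_confusion(perfect))  # prints 2.3219... (= log2 5)
```

A fully confused matrix (all cells equal) yields 0 bits, so the measure ranges from 0 to log2 5 ≈ 2.32 bits for 5 vowels.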
Finally, we compared the MI derived from the behavioral task with the MI derived from the cortical recordings. As with the behavioral data, individual as well as population analyses of the neural data led to average MI significantly above chance level for all vowel durations, even the shortest ones (see Materials and Methods for the statistical significance assessment of average MI). A direct comparison showed that for vowel durations between 8 and 128 ms, the average MI of a population of ACx neurons was far below the MI displayed by human subjects (Fig. 7B). Indeed, the near-perfect performance of humans at 64 and 128 ms could only be approached by some subsets of population recordings (dashed line). Even the best cortical site was unable to reach human levels of MI (dotted line). At the other end of the duration scale, it was possible to discriminate very short vowels (2 and 4 ms) above the behavioral level of performance based on the discharge patterns of a subpopulation of ACx neurons (Fig. 7B). This last result follows from the ability of high-frequency neurons to encode the unique high-frequency content of /e/. Thus, there were 2 types of mismatch between the neural and behavioral data: neural discrimination based on a population spike code outperformed behavior at short durations, but underperformed behavior at long durations.
Behavioral and Neural Error Patterns
We further attempted to compare the details of the neural and behavioral response patterns, and to relate them to the acoustics of the stimuli. A “spectral correlation” measure of similarity between a pair of vowels was computed as the linear correlation coefficient between the spectra (with amplitude and frequency on log scales) of these 2 vowels. This measure was then correlated with the behavioral and neural performance measures.
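The similarity measure can be sketched as follows (an illustrative Python version, not the authors' code; the frequency grid, band limits, and function names are our assumptions):

```python
import numpy as np

def log_spectrum(signal, fs, n_points=128, fmin=100.0, fmax=8000.0):
    """Magnitude spectrum in dB, resampled onto a log-spaced frequency
    axis so that both amplitude and frequency are on log scales."""
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    mag_db = 20 * np.log10(np.abs(np.fft.rfft(signal)) + 1e-12)
    log_f = np.geomspace(fmin, fmax, n_points)
    return np.interp(log_f, freqs, mag_db)

def spectral_correlation(sig1, sig2, fs):
    """Linear (Pearson) correlation coefficient between the log
    spectra of two vowel waveforms sampled at rate fs."""
    s1, s2 = log_spectrum(sig1, fs), log_spectrum(sig2, fs)
    return float(np.corrcoef(s1, s2)[0, 1])
```

By construction the measure equals 1 for identical spectra and decreases as the spectral profiles of the two vowels diverge.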
The first 2 rows of Figure 8 display the average confusion (i.e., the percentage of responses assigned to a vowel V1 when vowel V2 was presented), derived from either behavioral responses or population cortical activity, as a function of spectral correlation. Behavioral performance was not predicted by the estimated spectral correlation. However, neural performance was significantly and positively correlated with the spectral correlation for all vowel durations (Fig. 8, second row).
The last row of Figure 8 displays the correlation between neural and behavioral confusions. As could be predicted from the first 2 rows of Figure 8, there was little relationship between the neural and behavioral confusion patterns. The vowel pairs that were the hardest to discriminate neurally were generally those with the highest spectral correlation, but this was not the case for behavioral discrimination. This suggests that neural discrimination of vowels in the primary ACx of the guinea pig was determined by a close-to-linear encoding of acoustic features, whereas behavioral discrimination in humans was not.
Linear Model of Spectral Processing for Vowels
Finally, to formally assess the observation that the neural encoding of short vowels was largely predictable from the linear STRF model, we produced artificial TFRPs and assessed simulated neural discrimination performance for all durations. Briefly, we first simulated excitatory TFRPs with Gaussian shapes in both the frequency and time dimensions. A frequency bandwidth of 2 octaves (the average of our data at 75 dB SPL) and an SD of 4 ms for response duration (see Materials and Methods) were used as typical values for the TFRP. Then, we convolved the log spectrogram of the vowels with the simulated TFRPs to obtain a firing rate probability across time, mimicking a PSTH. We generated spikes from this probability function for 27 trials. The response of a population of 16 sites was typically simulated using a set of BFs matching the BF distribution reported by Wallace et al. (2000) in field A1 of the guinea pig. We finally applied to these spike trains the same clustering method as for the real data to obtain confusion matrices and MI measures.
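The simulation of a single site can be sketched as follows (an illustrative Python reimplementation under stated assumptions: Gaussian spectral tuning with a 2-octave bandwidth, a 4-ms temporal SD, and Poisson spike generation; the gain, grid, and function names are our choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_site(log_spec, freqs_hz, bf_hz, dt=0.001,
                  bw_oct=2.0, tau_sd=0.004, gain=0.02):
    """Linear STRF-like model of one cortical site.
    log_spec: (n_freq, n_time) log spectrogram; freqs_hz: frequency axis.
    Returns a (27, n_time) array of spike counts, one row per trial."""
    # Gaussian spectral tuning around the BF, bandwidth in octaves
    oct_dist = np.log2(freqs_hz / bf_hz)
    w_freq = np.exp(-0.5 * (oct_dist / (bw_oct / 2)) ** 2)
    drive = w_freq @ log_spec                    # project onto tuning curve
    # Gaussian temporal kernel (SD ~4 ms) smears the drive over time
    t = np.arange(-3 * tau_sd, 3 * tau_sd, dt)
    kernel = np.exp(-0.5 * (t / tau_sd) ** 2)
    rate = gain * np.convolve(np.maximum(drive, 0), kernel, mode="same")
    # Poisson spike generation, 27 trials as in the text
    return rng.poisson(rate[None, :].repeat(27, axis=0))
```

A population response is then a stack of such simulations with BFs drawn from the tonotopic distribution, and the resulting spike trains can be fed to the same clustering/MI pipeline as the real data.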
Figure 9A–C displays an example of simulated evoked responses for 8 cortical sites as well as their discrimination performance. We varied the frequency bandwidth of the TFRPs, the BF distribution, and the population size in the model (Fig. 9D–F). This led to the following observations. First, predictably, the absence of spectral overlap between a TFRP and a vowel spectrogram leads to an absence of evoked response in the linear model, whatever the vowel duration. This sometimes contrasts with the real data, suggesting that at least at the cortical level, the onset of acoustic inputs may excite neurons whose frequency tuning curve does not overlap with the stimulus' spectral content: for instance, the high-frequency neuron in Figure 4A shows an evoked response to the 2-ms-long vowel /o/ although its TFRP and the vowel spectrum (Fig. 1C) do not overlap at all. Second, using 16 sites with a bandwidth of 2 octaves, we observed vowel discrimination performance significantly above chance at 2 ms (Fig. 9D, plain black line), consistent with the real data (Fig. 7B). Third, performance increased with vowel duration, in line with the results shown in Figure 6A. Fourth, using the guinea pig hearing range (0.14–36) kHz or the human range (0.07–18) kHz had almost no effect on performance (Fig. 9D; see Supplementary Fig. 4A for the same result using real data), but considering only low-frequency neurons largely reduced the MI (Fig. 9D, dotted line). Considering only high-frequency neurons had a less dramatic influence on the MI, except at 64 and 128 ms (Fig. 9D, dashed line). This latter result can be understood from simple acoustic considerations: high-frequency neurons carried information about /a/, /e/, and /i/, so performance was not degraded at short durations; at 64 and 128 ms, however, performance was limited to discriminating among these 3 vowels, because high-frequency neurons never responded to either /o/ or /u/.
Fifth, there was a clear effect of population size on MI (Fig. 9E). This effect was not obvious in the real data. We suggest that this is because of a high level of redundancy between cortical responses (Chechik et al. 2006; Gaucher et al. 2013), whereas the simulated neurons were fully independent of each other. More precisely, it is likely that the high levels of coincident activity found between neighboring neurons or between neurons with overlapping frequency tuning (Eggermont and Smith 1996; Brosch and Schreiner 1999) reduce the variety of firing rate and latency patterns, which in turn reduces the discrimination abilities of cortical sites. Thus, redundancy was presumably a limiting factor in the real-data population code. Another way to test this hypothesis is to increase the bandwidth of the simulated neurons: this induces more overlap between the frequency receptive fields and therefore increases the redundancy between channels. Indeed, this manipulation degraded population discrimination performance, especially for long-duration vowels (Fig. 9F). Interestingly, this occurred even though an increased bandwidth allows an individual site to modulate its response to more vowels (as in the example of Fig. 4D) and thus to be individually more informative about vowels (see Supplementary Fig. 4B for real data).
We collected human behavioral data and neural recordings in the guinea pig primary ACx in a discrimination paradigm, using 5 vowels at different sound durations. Discrimination accuracy decreased with decreasing sound duration in both cases. However, neural discrimination outperformed behavior at very short durations, whereas behavior outperformed neural discrimination at longer durations. Specifically, human listeners were only just above chance for durations of 2 ms but almost perfect for durations >16 ms (Fig. 2B). Neural assemblies distinguished accurately between some of the vowels even at the shortest durations, but, even at 128 ms, neural performance never matched human performance. Simulations showed that simple linear processing of the frequency content of vowels by a population of neurons may explain the main results from the primary ACx recordings. Neither the linear model nor the neural recordings seem able to explain human behavioral performance in detail.
Human Discrimination of Brief Sounds
Using a similar gating paradigm with a different sound set, Suied et al. (2014) measured the minimum duration needed for human listeners to recognize a target sound category (voices, string instruments, or percussion instruments) among a set of distractor sounds. Results showed minimum durations ranging from 4 to 16 ms, depending on the task and sound set, with generally better performance for the voice targets. Here, we observed a minimum duration of 2 ms for above-chance performance in a 5AFC task involving vowel discrimination. This very short minimal duration could indicate smaller acoustic variability within the present sound set compared with that of Suied et al. (2014). Alternatively, it could reflect the fact that vowel identification is an overtrained task for human listeners.
Other studies focusing on vowel identification also found minimum durations in the range reported here. Gray (1942) found that a single glottal pulse, that is, a 2.5-ms segment in his experiments, was sufficient to identify vowels at better than chance levels. This was confirmed by Robinson and Patterson (1995a). Obviously, above-chance performance is not equivalent to good identification, as this is only achieved with durations >10 ms (Powell and Tosi 1970; Suen and Beddoes 1972). Nevertheless, our behavioral results confirm that human listeners are able to robustly identify vowels even in degraded acoustic signals.
Subcortical and Cortical Response to Vowels
Classical studies investigating the neural processing of vowels focused on auditory nerve recordings (Sachs and Young 1979; Sinex and Geisler 1983; Delgutte and Kiang 1984; Palmer et al. 1986; Palmer 1990; Miller et al. 1997). They showed that the average spectrum of the PSTHs of all AN fibers was dominated by the formant frequencies (at least those below 4 kHz). In particular, responses of fibers with BFs near a formant frequency were phase locked to the largest harmonic near that formant. Moreover, units with a BF between 2 formant peaks showed modulation at F0, reflecting the beating between several harmonics falling within the bandwidth of the peripheral auditory filters.
Recently, the ability of neurons from different cortical auditory areas to discriminate between 4 vowels was investigated (Bizley et al. 2009; Walker et al. 2011). Primary auditory cortical areas (AAF and A1) were found to be the most informative. In these studies, the onset responses provided the highest MI, and informative neurons were found across a range of BFs in AAF and A1. Here, in contrast, we found that sustained responses and mid-frequency BFs were more informative. However, methodological differences might account for part of this discrepancy: for example, in Walker et al. (2011), MI was computed from the joint spike rate × stimulus distribution, rather than from a pattern recognition algorithm as here. In awake cats, the cortical response to human vowels and conspecific vocalizations was investigated with the same clustering method as ours (Ma et al. 2013): the discrimination performance for 15 stimuli improved markedly, from about 1.9 bits of MI when considering individual neurons to above 3.17 bits when using more than 10 simultaneously recorded neurons.
Still, why did the performance of ACx neurons never approach human performance levels? As suggested by our simulations (Fig. 9), the size of the recorded neuronal population (16 cortical sites) was probably not large enough to reach maximum performance: only when the population contained >64 independent cortical sites did the simulated neuronal performance approach the human one. However, neural responses are typically heavily correlated with each other, both for variations of the input signals (typically when receptive fields overlap) and for noise (fluctuations in the response of one neuron around its average). Such correlations can strongly reduce the amount of information in a population code (Averbeck et al. 2006). Indeed, in a previous study using the same electrode implantation but a set of heterospecific and conspecific vocalizations instead of vowels, we noticed that the population MI started to increase very slowly once more than 5 cortical sites were involved in the computation (Fig. 11 in Gaucher et al. 2013). Thus, as ACx neurons display largely overlapping TFRPs, it is likely that many more than 64 neurons are actually required to approach human performance levels. It is also possible that the neural code tested here to assess whether neurons discriminate between vowels is not the one underlying our perceptual abilities. The fact that our animals were anesthetized is another possible factor. Finally, it is likely that the processing performed by neurons in primary auditory areas is further refined by the belt and parabelt areas, which are much more developed in humans than in guinea pigs. Consistent with imaging studies implicating several auditory associative areas in vowel representation or discrimination (Formisano et al. 2008; Rauschecker and Scott 2009), it is probably in the subsequent stages of cortical processing that the cortical representation of the acoustic cues and the learned representation of vowel meaning are merged to provide the neural representations underlying human discrimination performance.
Neural Codes for Very Short Vowel Discrimination
A major difference between our study and the investigations of the neural encoding of vowels cited above is our use of short-duration vowels to parametrically degrade the timbre cues. We observed that the neural code for short-duration sounds could largely be predicted from response functions obtained at longer durations (simulated TFRPs), which suggests that a “place” code predicted by linear frequency tuning remains valid for short-duration sounds. As a result, the spectral distance between vowels was correlated with cortical discrimination accuracy (Fig. 8), as was also found by Perez et al. (2013).
Interestingly, even for the shortest vowel durations, the neural response was still sufficiently long to support a neural code based on firing rate. Indeed, based on MU activity, the mean duration of the response to the 2-ms vowels was 11.5 ms, increasing gradually to 24 ms for the 128-ms vowels (Fig. 5C). These long responses may be due to filter ringing at low frequencies, or to subcortical and cortical processing at higher frequencies. Thus, even for a 2-ms-long vowel, neural discrimination appears possible based on the spike count in the first 15–25 milliseconds of the response, as is the case for longer vowels (Perez et al. 2013).
What kind of mechanism could explain the increase in performance for longer sound durations, and hence longer neural responses? At least 2 coding strategies can be suggested. First, the integration of acoustical information could occur over a longer time window, reducing spectral splatter. Second, the code could improve by combining successive fixed-size windows. Our results favor the second hypothesis: the time windows leading to the highest MI remained unchanged with vowel duration, lying mostly in the (8–25) ms range for all vowel durations between 2 and 128 ms (Fig. 5D). This window size is in line with several studies in which the temporal resolution generating the highest MI was found to be between 4 and 20 ms (Schnupp et al. 2006; Engineer et al. 2008; Walker et al. 2008; Huetz et al. 2009; Gaucher et al. 2013).
Cue Weighting in Vowel Processing
If we assume that the processing performed by the tonotopic maps at the cortical and subcortical levels provides similar information for the shortest vowels in humans and guinea pigs, then why is human performance for 2-ms vowels poorer than the neural predictions? In other studies, neurometric and psychophysical curves were found to be in good agreement for consonant discrimination (Engineer et al. 2008) or vowel discrimination in rats (Perez et al. 2013).
Our simulations suggest that the discrepancy here is not a matter of a difference in hearing range between the 2 species (Fig. 9D; see also Supplementary Fig. 4A for real data), nor is it an effect of differences in tuning width (Fig. 9F). A methodological explanation could be that humans had to ignore random pitch variations to perform the vowel identification task, whereas all neural recordings were performed with a single pitch. In the behavioral data, pitch had no influence on discrimination accuracy (Fig. 2C), suggesting that human listeners were equally accurate for all pitch values, including the one used for the physiology. However, we cannot rule out that performing the behavioral task with only one pitch might increase human performance. Conversely, because of the frequency selectivity of the neurons, pitch changes would introduce an additional source of variance into the neural responses, possibly limiting timbre identification performance (Bizley et al. 2009). Another possible explanation is that humans gave less weight to high-frequency cues in the vowel identification task. Even though timbre cues distinguishing between vowels were present in the responses of cells with BFs >2.25 kHz, these cues may not have been interpreted by the human brain as contributing to a vowel's identity. Consistent with this idea, the first 2 formants seem sufficient to identify vowels (Fry et al. 1962; Pols et al. 1969; Nearey 1989). Furthermore, when energy above F2 is removed, vowels can still be correctly identified with a performance of 95% (Nusbaum and Morin 1992; Halberstam and Raphael 2004).
In summary, human performance at short durations may be considered suboptimal compared with the neural code, as it ignores timbre cues in the high-frequency regions, but it could also be considered well adapted to the acoustics of vowels, which do not in general carry critical discrimination information in the high-frequency range. Such weighting of acoustic cues toward the first 2 formants has also been found in birds (Ohms et al. 2012), a group in which vocal learning is also fundamental, whereas monkeys give more weight to the first formant (Sinnott et al. 1987). Humans, and other animal species, have to learn during development which aspects of the signal serve as salient cues and how to use each cue. Thus, the specialization to detect specific formants might be a feature shared by many species that extensively use vocalizations for social interactions.
This work was supported by a grant from the National Research Agency (ANR2011 grant HearFin) to J.M.E. and an attractivity fellowship from Paris-Sud University to B.G. F.O. was supported by a fellowship from the Ministère de l'Éducation Nationale et de la Recherche (MENR).
We thank J.J. Eggermont and G. Shaw for their generous help with the acquisition software. Special thanks to Nathalie Samson, Pascale Leblanc-Veyrac, Fabien Lhericel, and Céline Dubois for taking care of the guinea pig colony. We also thank 2 anonymous reviewers for their helpful comments. Conflict of Interest: None declared.