Frequency and frequency modulation share the same predictive encoding mechanisms in human auditory cortex

Expectations can substantially influence perception. Predictive coding is a theory of sensory processing that aims to explain the neural mechanisms by which expectations shape sensory processing. Its main assumption is that sensory neurons encode prediction error with respect to expected sensory input. Neural populations encoding prediction error have previously been reported in the human auditory cortex (AC); however, most studies focused on the encoding of pure tones and induced expectations by stimulus repetition, potentially confounding prediction error with effects of neural habituation. Here, we systematically studied prediction error to pure tones and fast frequency-modulated (FM) sweeps across different auditory cortical fields in humans. We conducted two fMRI experiments, each using one type of stimulus. We measured BOLD responses across the bilateral auditory cortical fields Te1.0, Te1.1, Te1.2, and Te3 while participants listened to sequences of sounds. We induced subjective expectations about the incoming sounds independently of stimulus repetition using abstract rules. Our results indicate that pure tones and FM-sweeps are encoded as prediction error with respect to the participants' expectations across auditory cortical fields. The topographical distribution of neural populations encoding prediction error to pure tones and FM-sweeps was highly correlated in left Te1.1 and Te1.2, and in bilateral Te3, suggesting that predictive coding is the general encoding mechanism in AC.


Introduction
Subjective expectations influence our perception of the world [1]. They facilitate perceiving noisy [2,3] or ambiguous [4,5] sensory input, and bias perception when inputs are overly expected [6]. Understanding the mechanisms integrating expectations with sensory input is an essential prerequisite for understanding perception. The predictive coding framework is a theory of sensory processing aiming to explain these mechanisms. Its main tenet is that sensory neurons encode prediction error with respect to an internal generative model of the sensory world [7][8][9][10].
Neurons [11][12][13][14] and neural populations [15,16] of the auditory cortex (AC) encode pure tones as prediction error. Prediction error is typically elicited using oddball paradigms, where predictable repetitions of a standard sound are rarely interrupted by a deviant. Individual neurons in the AC show reduced responses to repeated standards and recovered responses to deviants, a phenomenon that is called stimulus-specific adaptation (SSA) and typically interpreted as prediction error [17].
However, whether SSA truly represents prediction error is unclear: its phenomenology can be explained by habituation to local stimulus statistics [18][19][20] (see [21] for a review, and [22] for a different interpretation).

Participants
All participants were neurotypical, normal-hearing German native speakers (see [23] and [59] for further details on the inclusion criteria). Nineteen participants (12 female) between the ages of 24 and 34 (average 26.6) participated in the pure tone study; eighteen participants (12 female) between the ages of 19 and 31 (average 24.6) participated in the FM-sweep study.

Stimuli
In the pure tone experiment, there were three pure tone stimuli of 50 ms duration (including 5 ms onset/offset ramps) and frequencies of 1455 Hz, 1500 Hz, and 1600 Hz. The tones were combined into six pairings of standard and deviant tones. Each of the resulting oddball sequences was consequently characterized by one of three possible absolute frequency differences between standards and deviants (∆ = |f_std − f_dev|): 145 Hz, 100 Hz, or 45 Hz. See Figure 1A for a visualization of the sound stimuli and sequences. In the FM-sweep experiment, there were three sinusoidal FM-sweeps with a duration of 50 ms (5 ms onset/offset ramps) and starting frequencies of 1000 Hz, 1070 Hz, or 1280 Hz. The FM-sweeps ended at 1080 Hz, 1170 Hz, or 1200 Hz, respectively. The sweeps were again combined into six pairings of standard and deviant sounds. An FM-sweep could deviate from the standard in its FM direction, FM rate, or both. Since all sweeps had the same duration, the defining property of an FM-sweep was its frequency span ∆f, the difference between starting and ending frequencies. Each of the sequences was consequently characterized by one of three possible absolute frequency span differences between standards and deviants (∆ = |∆f_dev − ∆f_std|). See Figure 1B for an exemplary illustration of the FM-sweep stimuli and sequences [59]. Figure 1D shows the two anticipated outcomes of the experiment: under h1 (habituation), high-level subjective expectations do not affect deviant responses; under h2 (prediction error), responses are scaled by stimulus predictability and thus represent prediction error with respect to subjective expectations.

Experimental Design
The design was a variation of the oddball paradigm in which abstract rules were used to manipulate participants' high-level expectations about the upcoming stimuli independently of the local statistical regularity of the presented sound sequences. Specifically, participants listened to sequences of eight sounds: seven repeated standards and one deviant that could occur in position four, five, or six. The stimuli were separated by a 700 ms inter-stimulus interval. The inter-trial interval was jittered (minimum: 1500 ms, maximum: 11 s). All sound combinations were used equally often across runs, ensuring that all sound types served as standards and as deviants the same number of times.
Participants were told explicitly that all sequences would contain a deviant, and that the deviant would occur in one of the three aforementioned positions. Participants were instructed to respond to the presentation of the deviant via button press as fast and as accurately as possible. At the beginning of each trial, the deviant was equally likely to occur in each of the three positions. Thus, the probability of a deviant in position four after hearing three standards was 1/3. However, if the deviant was not in position four, then, since each sequence contains exactly one deviant, the probability of a deviant in position five after hearing four standards was 1/2. If the deviant was in neither position four nor five, the probability of a deviant in position six after hearing five standards was 1 [23,59].
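These conditional probabilities follow directly from the task rules. As a minimal illustration (a Python sketch, not part of the original analysis code), the hazard rate at each candidate position is one over the number of positions still possible:

```python
from fractions import Fraction

def deviant_hazard(positions=(4, 5, 6)):
    """Conditional probability that the deviant occurs at each candidate
    position, given that it has not occurred at an earlier one.  A priori,
    every position is equally likely and each sequence has one deviant."""
    hazard = {}
    remaining = len(positions)
    for pos in positions:
        hazard[pos] = Fraction(1, remaining)  # 1 of the remaining candidates
        remaining -= 1
    return hazard

hazard = deviant_hazard()  # {4: 1/3, 5: 1/2, 6: 1}
```

Any prediction error account of the task must respect these hazard rates; they reappear below as the deviant amplitudes of the h2 model.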
The pure tone experiment comprised four runs, which were completed by all participants. The FM-sweep data were collected in three sessions with three runs each; most participants completed nine runs, but one participant completed only eight for technical reasons. In both experiments, a run contained six blocks of ten trials. Deviant positions were pseudo-randomized so that each occurred 20 times per run. Runs lasted around ten minutes and were separated by one-minute breaks. Practice trials were presented at the beginning of the first run to ensure task understanding. Interspersed null events were used to optimize the fit of the GLMs [60]. Further details can be found in [23] (pure tones) and [59] (FM-sweeps).

FMRI data acquisition
FMRI data were collected using EPI sequences and partial FoVs. Magnetic field strength and image resolution differed between data sets. Data from the pure tone experiment was collected using a Siemens Magnetom 7 Tesla scanner (Siemens Healthineers, Erlangen, Germany) with an 8-channel head coil and a voxel size of 1.5 mm isotropic; data from the FM-sweep experiment was collected using a Siemens Trio 3 Tesla scanner (Siemens Healthineers, Erlangen, Germany) with a 32-channel head coil and a voxel size of 1.75 mm isotropic. Interleaved slice acquisition was used in both data sets.
Physiological data (heart rate and respiration in the pure tone experiment, heart rate in the FM-sweep experiment) were collected and processed for use as regressors of no-interest during model estimation for both sound modalities.
All anatomical data were resampled to a resolution of 1 mm isotropic. We computed the boundaries between gray and white matter using Freesurfer's recon-all. These boundaries were later used for coregistration of the functional data to the participants' structural images. In the case of the pure tone experiment, we first computed a brain mask excluding voxels containing air, cerebrospinal fluid, scalp, and skull. This was necessary because MP2RAGE (but not MPRAGE) yields noisy signals outside the brain that interfere with the automatic processes of recon-all. The mask was calculated using FSL's BET and SPM's Segment and was applied using FSL's fslmaths. Then, Freesurfer's recon-all was used to obtain gray and white matter boundaries, and ANTs was used to calculate the coregistration matrix between the anatomical data and the MNI152 symmetric template.

FMRI data
We used SPM's FieldMap Toolbox to calculate distortions due to magnetic field inhomogeneity. Then, motion and distortion correction was performed on the functional data separately for each session (SPM Realign and Unwarp). Nipype's rapidart was used to detect artifacts in the realigned functional data; these served as regressors of no-interest in our design matrix during GLM estimation. The resulting functional data were smoothed (SPM Smooth) using a 2 mm FWHM Gaussian kernel. In the case of the pure tone data, the derivatives (i.e., log-evidences and beta maps) were registered to the anatomical space after fitting (see GLM Estimation and Bayesian Model Comparison). For FM-sweeps, the realigned functional data were registered to the anatomical space using Freesurfer's ApplyVolTransform before model estimation to ensure all data were available in the same space during model fitting.
The transformation matrix between functional and structural data was computed with Freesurfer's BBRegister, using the white and gray matter boundaries computed as described above and the whole-brain EPI as an intermediate stage.

Regions of Interest
Te1.0 and Te1.1 have been proposed to correspond mostly to BA 41; Te1.2 also overlaps with BA 42 [67]. Comparing human and primate auditory fields, it has been assumed that Te1.0 corresponds to the auditory core, while Te1.1 and Te1.2 represent the medial junction and lateral belt [67]. Te3 lies on the lateral surface of the superior temporal gyrus, is an auditory association area, is part of BA 22, and might correspond to parabelt areas in primates [67]. However, functional differences between human auditory fields and their correspondence to primates' auditory fields are still unclear; e.g., [67][68][69].

GLM Estimation
First level analyses were performed using SPM's EstimateModel. Statistical analyses at the participant- and group-level were conducted in MATLAB (The MathWorks Inc., Version 2020b) using custom code.
We estimated one GLM per participant. The model included six task regressors: std0 (the first standard in a sequence), std1 (standards before the deviant), std2 (standards after the deviant), dev4, dev5, and dev6 (deviants in positions four, five, and six). The first standard was modeled as a separate regressor to test for adaptation by comparing the estimates corresponding to std0 and std1/std2. std1 and std2 were parametrically modulated in a linear fashion according to their position relative to the deviant: values corresponding to std1 were assigned amplitudes from one to the total number of std1 in the sequence and std2 were assigned amplitudes from one to the total number of std2 in the sequence. This was done to account for a slight recovery of standard responses after the occurrence of a deviant. Thus, for example, in a sequence with a deviant in position four, std1 were assigned the amplitudes amp 1 = [1, 2] and std2 were assigned the amplitudes amp 2 = [1, 2, 3, 4]. Choosing increasing (i.e., amp = 1, 2, . . . ) instead of decreasing (amp = 4, 3, . . . ) amplitudes does not change the results since the parametric modulator is used as a regressor of no-interest. All amplitudes were z-standardized before model fitting.
Note that SPM does not prime positive over negative regressors (i.e., the results are symmetric under linear transformations of the amplitudes). In addition, we added physiological data, artifact regressors, and realignment parameters to the design matrix as regressors of no interest.
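The construction of the std1/std2 parametric modulators described above can be sketched as follows (an illustrative Python version of the scheme; the actual models were estimated in SPM/MATLAB):

```python
import numpy as np

def std_modulators(dev_pos, seq_len=8):
    """Linear parametric modulators for standards before (std1) and
    after (std2) the deviant; std0 (position 1) is modeled separately."""
    n_std1 = dev_pos - 2        # standards between std0 and the deviant
    n_std2 = seq_len - dev_pos  # standards after the deviant
    amp1 = np.arange(1, n_std1 + 1, dtype=float)
    amp2 = np.arange(1, n_std2 + 1, dtype=float)
    return amp1, amp2

def zscore(x):
    # amplitudes are z-standardized before model fitting
    return (x - x.mean()) / x.std()

amp1, amp2 = std_modulators(4)  # deviant in position four
# amp1 covers positions 2-3, amp2 covers positions 5-8
```

Because the modulator enters the GLM as a regressor of no-interest and estimation is symmetric under linear transformations, increasing and decreasing amplitude schemes yield identical fits.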
Model estimations for the pure tone data were done using the smoothed data in the native space of the functional data of the individual participants. For FM-sweeps, we estimated the models in the space of the participants' anatomical scans. After model estimation, the spatial transformations calculated before were applied to the resulting statistical maps. The statistical maps of the pure tone data were first registered to the participants' anatomical scans using Freesurfer's ApplyVolTransform and subsequently to the MNI152 symmetric template using ANTs' ApplyTransforms. For FM-sweeps, statistical maps were registered directly to MNI space.
The resulting beta estimates were z-standardized according to participant, experimental run, and ROI before the second level analyses to reduce variance specific to participants, runs, and ROIs.

Identifying Voxels Showing SSA
To localize voxels showing SSA, we first identified voxels within the anatomical ROIs showing adaptation (reduced responses to repeated standards) and deviant detection (stronger response to deviants compared to standards). Adapting voxels were identified using the contrast std0 > 0.5 std1 + 0.5 std2 and deviant detecting voxels were identified using the contrast dev4 > 0.5 std1 + 0.5 std2. We only included dev4 in the latter contrast since this is the only deviant for which predictive coding and habituation make the same prediction. We tested both contrasts using right-tailed rank-sum tests. Before conducting the rank-sum tests, we averaged single-voxel beta estimates for each experimental condition across all experimental runs.
We defined SSA regions for each stimulus type as the set of voxels showing significant adaptation and deviant detection: We computed voxel-wise p-values for SSA as the maximum of the uncorrected p-values for adaptation and deviant detection in each voxel; p SSA = max(p adaptation , p deviant detection ).
All voxels' p-values were subsequently controlled for the false discovery rate (FDR) using the Benjamini-Hochberg method [70] and thresholded at α = 0.05. Peak-level p-values were corrected for the FWE rate: we corrected for the number of voxels per ROI using Bonferroni-correction and for the total number of comparisons using the Holm-Bonferroni method [71].
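The voxel selection rule above (a voxel must pass both contrasts, so its SSA p-value is the worse of the two, followed by FDR control) can be sketched in Python with toy p-values; the Benjamini-Hochberg implementation here is a generic illustration, not the authors' code:

```python
import numpy as np

def benjamini_hochberg(p, alpha=0.05):
    """Boolean mask of hypotheses rejected under Benjamini-Hochberg
    false discovery rate control."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank passing its threshold
        reject[order[: k + 1]] = True
    return reject

# toy voxel-wise p-values for the two contrasts
p_adapt = np.array([0.001, 0.2, 0.03, 0.5])
p_dev = np.array([0.004, 0.01, 0.02, 0.6])
p_ssa = np.maximum(p_adapt, p_dev)  # a voxel must pass BOTH tests
ssa_mask = benjamini_hochberg(p_ssa, alpha=0.05)
```

Taking the maximum of the two p-values is a conservative conjunction: a voxel is only as significant as its weaker effect.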

Quantifying SSA Magnitude
To quantify SSA magnitude in each voxel of the anatomical ROIs, we computed the standardized voxelwise index of SSA (iSSA) following the procedure described in previous research [17,23]: We normalized the beta estimates for dev4, std1, and std2 to a range from zero to one, averaged these values in each voxel across participants and runs, and computed the index of SSA as iSSA = (dev4 − 0.5 std1 − 0.5 std2)/(dev4 + 0.5 std1 + 0.5 std2). To test if our voxel-wise results were reproducible across sound modalities, we computed Pearson correlations between SSA magnitude for pure tones and FM-sweeps in each anatomical ROI.
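The iSSA computation can be sketched as follows (a Python illustration assuming the condition-wise beta estimates have already been averaged across participants and runs; the [0, 1] normalization is applied across voxels):

```python
import numpy as np

def norm01(x):
    """Normalize beta estimates to the [0, 1] range across voxels."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def issa(dev4, std1, std2):
    """Voxel-wise index of SSA from normalized condition estimates.
    Positive values indicate stronger responses to the deviant than to
    the repeated standards."""
    d, s1, s2 = norm01(dev4), norm01(std1), norm01(std2)
    return (d - 0.5 * s1 - 0.5 * s2) / (d + 0.5 * s1 + 0.5 * s2)
```

Note that the index is undefined in a voxel where all three normalized estimates are zero; with real data this edge case is unlikely but worth guarding against.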

Classical Analysis
For both sound modalities, we conducted a classical statistical analysis to test for differences between responses to deviants in positions four, five, and six. To specifically investigate the mechanisms driving SSA, we restricted this analysis to SSA clusters with significant (p < 0.05, FWE-corrected) peak-level p-values (the SSA ROIs).
We tested the pairwise differences between responses to deviants in different positions (dev4 > dev5, dev4 > dev6, and dev5 > dev6) in each SSA ROI using one-sided Wilcoxon signed-rank tests at the group-level. Before testing the contrasts, we averaged the data corresponding to each experimental condition across runs and voxels within each participant and SSA ROI. In line with the idea of prediction error encoding, we expected deviant responses to be stronger when deviants were less expected.
We also measured the effect size of adaptation and deviant detection by testing the contrasts std0 > std2 and dev4 > std2. Note that these contrasts are not independent of the contrasts used for SSA voxel selection; however, we included them to quantify the size of both effects. Additionally, we compared dev6 and std2 using two-tailed Wilcoxon signed-rank tests to test whether responses to fully predictable deviants were comparable to standard responses.
In line with predictive coding, we expected no statistically significant difference between responses to dev6 and std2. That is because both types of sound are fully predictable given our experimental design.
All p-values were corrected for multiple comparisons using the Holm-Bonferroni method [71].

Correlational Analysis and Linear Mixed-Effects Model
To investigate the hypothesized negative relationship between deviant predictability and deviant responses further, we estimated a linear mixed-effects model in each SSA ROI for each data set at the group-level. For pure tones, the model included deviant predictability as a fixed effect and random intercepts and slopes for experimental runs and participants:

beta ∼ 1 + predictability + (1 + predictability | run) + (1 + predictability | participant)

For FM-sweeps, we used the same model but added experimental session as an additional random effect. All p-values were Bonferroni-corrected for the total number of SSA ROIs. To test whether the group-level results were replicated at the participant-level, we computed Spearman's rank correlation between deviant predictability (1/3 for dev4, 1/2 for dev5, and 1 for dev6) and the standardized beta estimates for dev4, dev5, and dev6 in each participant. Before computing the correlation coefficient ρ, beta estimates for the different deviant conditions in each voxel were averaged across experimental runs.
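The participant-level check reduces to a rank correlation over the three deviant conditions. A minimal Python sketch with made-up beta values (the predictability values are those from the task; for the tie-free data here a direct rank formula suffices):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data (sufficient here:
    the three predictability values 1/3, 1/2, and 1 are distinct)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

predictability = [1/3, 1/2, 1.0]  # dev4, dev5, dev6
betas = [0.8, 0.4, 0.1]           # illustrative run-averaged beta estimates
rho = spearman_rho(predictability, betas)  # monotonically decreasing -> -1
```

A perfectly monotonic decrease of responses with predictability yields ρ = −1, the pattern predicted by prediction error encoding.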

Bayesian Model Comparison
We constructed two models, each representing one potential encoding mechanism driving SSA (see Figure 1D). The models were defined using parametric amplitude modulation vectors that specified the predicted responses to all tones in each trial. h1) Habituation: SSA is based on stimulus repetition. Responses to standards undergo habituation over time and recover slightly after the deviant. Deviant responses are fully recovered and do not differ between deviants of differing predictability. h2) Prediction error: responses are scaled by the conditional probability of each tone given the task rules.
The h1 model (see Figure 1D, left) was specified by assigning the amplitude 1 to std0 and to the deviant of a sequence. Standards before the deviant were assigned the amplitudes 1/n and standards after the deviant were assigned the amplitudes 1/(n − 1), where n is the position of the standard within the sequence (see Table 1 for the exact amplitudes). The h2 model (see Figure 1D, right) used an amplitude of 0.5 for std0 and an amplitude of p (p = probability of stimulus occurrence) for the rest of the tones. For example, a sequence with a deviant in position five was assigned the amplitude vector amp = [1/2, 1, 1, 2/3, 1/2, 1, 1, 1]. Thus, the standard in position four was assigned a value of 2/3 since a standard in that position is expected with a probability of 2/3, and the deviant in position five was assigned a value of 1/2 because deviants in position five are expected with a probability of 1/2 after hearing four standards. See Table 1 for all amplitudes of h2. Since model estimation using parametric modulation is symmetric with respect to linear transformations of the amplitudes, the above-described amplitudes are equivalent to assuming a decreasing response with increasing sound predictability.
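The amplitude vectors of both models follow deterministic rules and can be generated programmatically; a Python sketch of the rules described above (exact fractions, before z-standardization):

```python
from fractions import Fraction

def h1_amplitudes(dev_pos, seq_len=8):
    """Habituation model: std0 and the deviant get amplitude 1; a standard
    at position n gets 1/n before the deviant and 1/(n - 1) after it."""
    amp = []
    for n in range(1, seq_len + 1):
        if n == 1 or n == dev_pos:
            amp.append(Fraction(1))
        elif n < dev_pos:
            amp.append(Fraction(1, n))
        else:
            amp.append(Fraction(1, n - 1))
    return amp

def h2_amplitudes(dev_pos, seq_len=8, positions=(4, 5, 6)):
    """Prediction error model: std0 gets 1/2; every other tone gets its
    conditional probability of occurrence under the task rules."""
    amp = [Fraction(1, 2)]
    remaining = len(positions)
    for n in range(2, seq_len + 1):
        if n < positions[0] or n > dev_pos:
            amp.append(Fraction(1))              # tone fully predictable
        elif n == dev_pos:
            amp.append(Fraction(1, remaining))   # deviant probability
        else:
            amp.append(Fraction(remaining - 1, remaining))  # standard probability
            remaining -= 1
    return amp

amp = h2_amplitudes(5)  # [1/2, 1, 1, 2/3, 1/2, 1, 1, 1]
```

Note that under h2 a deviant in position six receives amplitude 1: it is fully predictable, so it should evoke the same response as a standard.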
For each subject, we computed the log-evidence of the two models in each voxel of all anatomical ROIs using SPM's Bayesian Estimation functions in Nipype. Before model fitting, the amplitudes of all models were z-standardized according to experimental runs.
For pure tones, all models were estimated using the smoothed functional data in their native space. The log-evidence maps were registered to the individual T1 scans and then to the MNI152 symmetric template using Freesurfer's ApplyVolTransform and ANTs' ApplyTransforms, respectively. We combined the log-evidence-maps of all participants and calculated posterior probability maps for each model using custom code by Tabas et al. (2020) [23] following the methodology described in [72,73].
For FM-sweeps, all models were estimated using the smoothed functional data across sessions in the space of the participants' anatomical scans. The resulting log-evidence-map of each participant was registered to the MNI152 symmetric template using ANTs' ApplyTransforms and posterior probability maps were calculated as described before.
Then, we computed the Bayes factor K for both models in each voxel of our anatomical ROIs. To test if the spatial distribution of prediction error encoding is similar for pure tones and FM-sweeps, we computed Pearson correlations between the Bayes factor maps of both sound modalities in each anatomical ROI.
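Since log-evidences of independent data sets add, a group-level Bayes factor comparing h2 against h1 reduces to a difference of summed log-evidences. An illustrative sketch with made-up log-evidence values (a simplification of the posterior-probability-map approach actually used):

```python
import math

def group_log_bayes_factor(log_evs_h2, log_evs_h1):
    """Group-level log Bayes factor for h2 over h1: log-evidences of
    independent participants sum, so log K is a difference of sums.
    log K > 0 favors the prediction error model (h2)."""
    return sum(log_evs_h2) - sum(log_evs_h1)

# made-up voxel log-evidences for two participants
logK = group_log_bayes_factor([-100.0, -98.5], [-103.0, -101.0])
K = math.exp(logK)  # logK = 5.5: strong evidence for h2 in this toy case
```

Working in log space avoids numerical underflow, since raw model evidences are vanishingly small numbers.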

Results
Topography of SSA to pure tones and FM-sweeps in AC
There was significant SSA to pure tones in bilateral Te1.0, bilateral Te1.1, and right Te3 (p < 0.008, FWE-corrected, see Table 2). Significant SSA clusters were present in all anatomical ROIs for FM-sweeps (p < 0.03, FWE-corrected, see Table 2). SSA magnitude was similar across all anatomical ROIs and both sound modalities.
Significant SSA clusters formed coherent fields, indicating a systematic spatial encoding pattern. For pure tones, the localization of SSA voxels was lateral within bilateral Te1.0, superior within bilateral Te1.1, and predominantly posterior in right Te3 (Figure 2A). For FM-sweeps, the majority of voxels in bilateral Te1.0 and Te1.1 showed significant SSA. SSA voxels in bilateral Te1.2 were localized posterolaterally. In Te3, SSA voxels were mostly found in posterior areas, mirroring the findings from the pure tone experiment (Figure 2B).

Magnitude of SSA across cortical fields for pure tones and FM-sweeps
The SSA magnitude iSSA was similarly distributed across all anatomical ROIs and both sound modalities (Figure 3). The topographic distributions of iSSA for pure tones and FM-sweeps showed a slight but significant positive correlation in left Te1.0, left Te1.1, and bilateral Te3 (r ≥ 0.04, p < 0.01, corrected for 8 comparisons; Table 3).

Table 3: Correlation between the voxel-wise SSA magnitude for pure tones and FM-sweeps in each anatomical ROI. r: Pearson correlation coefficient. We corrected p-values for eight comparisons using Bonferroni-correction.

Subjective expectations drive responses to pure tone and FM-sweep deviants
To test whether subjective expectations modulated the responses to deviants in different positions, we examined the relationship between deviant predictability and deviant responses. Generally, beta estimates decreased qualitatively with increasing deviant predictability in both sound modalities, in accordance with the predictive coding hypothesis. We corroborated the effect quantitatively by conducting pair-wise statistical comparisons between the responses to each deviant position at the group-level (Supplementary Tables S1 and S2). These pair-wise differences were further confirmed by estimating an LMM at the group-level in each SSA ROI. In line with our hypothesis, we found a significant negative effect of deviant predictability on deviant responses in most SSA ROIs of both sound modalities (Table 4).

Neural responses to pure tones and FM-sweeps are best explained by predictive coding
For both stimulus types, the prediction error model (h2) outperformed the habituation model (h1), in line with our hypothesis (see Figure 4 for the distribution of the Bayes factor K of both models for each anatomical ROI and stimulus type). Voxel-wise maps of the Bayes factor K for h2 are shown in Supplementary Figure S5 for pure tones and Supplementary Figure S6 for FM-sweeps. Descriptively, voxels with higher Bayes factors formed spatially coherent fields, indicating a functional topographic organization.

Similar neural populations encode prediction error to pure tones and FM-sweeps in some but not all cortical fields
Computing correlations between the K-maps of both stimulus types for both models, we found significant positive correlation coefficients for h1 in all ROIs but left Te1.0 (r > 0.15, p < 10⁻⁴, Bonferroni-corrected for 8 comparisons), and for h2 in left Te1.1, left Te1.2, and bilateral Te3 (r > 0.14, p < 10⁻⁵, Bonferroni-corrected for 8 comparisons; Table 5).

Discussion
The main tenet of the predictive coding framework is that sensory neurons encode prediction error with respect to an internal generative model of the world [7][8][9][10]. Here, we tested two hypotheses: first, whether prediction error in different fields of the human AC is encoded with respect to a generative model informed by the task instructions and the encoding of subjective expectations; second, whether the encoding of prediction error is equivalent for two stimulus types: pure tones and FM-sweeps. We conducted two fMRI experiments, one for each stimulus type. We used a modified oddball paradigm where we manipulated participants' expectations independently of local stimulus statistics. There were three key findings. First, we found significant SSA to pure tones and FM-sweeps across cortical fields, indicating that neural populations in the AC adapt to both pure tones and FM-sweeps. Second, we found that neural adaptation was driven by the participants' expectations, demonstrating that SSA reflects prediction error with respect to a generative model informed by the task instructions. Third, we found that the populations encoding prediction error to pure tones and FM-sweeps overlap significantly in bilateral Te3, left Te1.1, and left Te1.2, demonstrating that, at least in those fields, both stimulus types share a common mechanism for the computation of prediction error. Together, our results suggest that predictive coding is the general encoding mechanism of acoustic information in AC.
Our results are the first robust evidence for prediction error encoding of FM-sweeps in human AC. Previous studies investigating prediction error encoding of FM-sweeps reported mixed results. Three studies reported a significant MMN to deviating FM-sweeps [56][57][58], but did not localize the source of the response. One study investigated sources in the AC, but found no significant result [53]. Three other studies reported increasing neuromagnetic responses to repeated FM-sweeps [52,54,55], in direct contradiction with predictive coding. One of these studies [55] reported the effect specifically in the AC. Differences in the stimulus features and ISIs used in these studies might have contributed to these contradictory results [54].
In animal research, SSA to FM direction was reported in A1 of rats [74]. SSA to FM is also present in the inferior colliculus of bats [75]. However, since FM sounds are essential for echolocation in bats [76], it is unclear whether this result is transferable to humans.
Our results indicate that similar neural populations in the human AC encode prediction error to pure tones and FM-sweeps: First, all SSA clusters formed spatially coherent fields across cortical ROIs for both pure tones and FM-sweeps, suggesting a systematic functional organization of SSA in the AC. Second, we showed that similar neural populations encode prediction error to pure tones and FM-sweeps in left Te1.1, left Te1.2, and bilateral Te3. Given that the data was collected using different scanners and different participants, our results suggest that prediction error is a stimulus-independent encoding mechanism in the AC.
Our results further showed that overlapping neural populations encode prediction error to two of the three auditory information-bearing units (IBUs): pure tones and FM-sweeps. The IBUs are the basic auditory elements that make up information-carrying acoustic signals [41]. This overlap may suggest that the same predictive encoding mechanisms underlie the processing of all information-carrying acoustic signals in the AC.
Animal studies have reported stronger SSA in secondary compared to primary auditory cortical areas [11,13]. However, recent evidence suggests that this pattern might be more complex than previously assumed: single-cell recordings across auditory fields in rats showed that prediction error was largest in the posterior auditory field, a secondary auditory area, whereas primary areas and the secondary suprarhinal auditory field showed stronger effects of stimulus repetition [77].
If the functional organization of the AC were similar in rodents and humans, we would expect stronger SSA in Te1.2 and Te3 compared to the remaining ROIs. However, we found similar SSA magnitudes across fields. These discrepancies may stem from a poor correspondence between auditory fields in humans and rodents [66,67]. Indeed, our previous results showed that the distributions of SSA in the subcortical auditory pathway, which are closely replicated across mammals [78], also differed between humans and rodents [23,59].
Our results did however suggest a potential differential role of primary and higher-level human auditory cortical areas: First, we did not find significant SSA clusters in bilateral Te1.2 for pure tones; second, the proportion of SSA-voxels was comparably lower in Te3 for both stimulus types. Last, we found evidence for a reliable prediction error encoding topography across stimulus types in bilateral Te3, but not in Te1.0, and only unilaterally in Te1.1 and Te1.2. Together, these results suggest that the encoding of prediction error is stronger and potentially more modality-specific in primary auditory cortical areas, and more general in higher-order regions.
Formulations of the predictive coding framework disagree on whether predictions from generative model units inform prediction errors only at the immediate lower stage [10,79] or also at subsequent stages of the processing hierarchy (see [21] for a review of the empirical evidence on both standpoints). MMN studies showed that prediction error is elicited with respect to high-level expectations; namely: by the violation of complex statistical regularities (see [80] for review), the omission of expected sounds [81][82][83], and abstract expectations about the occurrence of deviating sounds [82]. However, since the generators of the MMN are partly located in the frontal cortex [80], MMN research is not suitable to clarify whether subjective expectations are used to compute prediction errors at lower levels of the auditory processing hierarchy.
We found prediction error encoding to be the dominant encoding principle for both stimulus types in all anatomical ROIs. This result suggests that high-level predictions informed by the task instructions, putatively computed in regions at higher processing stages than the sensory cortices, are used to compute prediction errors in the primary AC. We had previously shown that these same predictions are also used to compute prediction error in the human auditory midbrain and thalamus [23,59]. Previous studies also showed that prediction error in the AC was computed with respect to language-specific expectations (e.g., [84][85][86][87]). Together, the empirical evidence supports the hypothesis that high-level predictions are used to compute prediction errors along the entire processing hierarchy.
We have previously shown that SSA and prediction error encoding of pure tones and FM-sweeps are already present in the auditory thalamus and midbrain [23,59]. Since auditory cortical areas receive direct bottom-up input from the auditory thalamus [88], our results here might simply reflect ascending input from prediction error units in subcortical structures. On the other hand, subcortical SSA and prediction error signals might as well reflect top-down cortical input. Animal studies have shown that SSA in the auditory midbrain and thalamus persists under deactivation of the AC [37,89], at least in regions with strong SSA [90]. Further work is needed to clarify the interplay of bottom-up and top-down signalling in the computation of prediction error.
Our results suggest that predictive coding is the general mechanism underlying the encoding of information-bearing acoustic signals in AC. Impaired predictive processes in AC have been linked to speech processing disorders and clinical conditions such as developmental dyslexia (e.g., [91][92][93]), stuttering [94], autism spectrum disorder [95], psychosis [96], and schizophrenia [97]. Investigating how predictive coding is implemented in the human AC is essential for a mechanistic understanding of these disorders.

Supplementary Material
Figure S3: Spearman's rank correlation between deviant predictability and standardized beta estimates for each participant of the pure tone experiment. Deviant predictability is shown on the x-axis (1/3 for deviants in position four, 1/2 for deviants in position five, and 1 for deviants in position six). The y-axis shows the respective mean standardized beta estimates.
Table S1: Statistics of the group-level Wilcoxon signed-rank tests for the pure tone data. The indicated hypotheses refer to the alternative hypotheses of the tests. The comparison of dev6 and std2 was conducted using two-tailed signed-rank tests; all other contrasts were tested using one-sided signed-rank tests. All p-values were corrected for 30 comparisons using the Holm-Bonferroni method. Effect size d: Cohen's d.