Overestimating environmental volatility increases switching behavior and is linked to activation of dorsolateral prefrontal cortex in schizophrenia

Department of Psychiatry and Psychotherapy, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health; Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig; Department of Child and Adolescent Psychiatry, Psychotherapy and Psychosomatics, University of Leipzig, Leipzig; Center for Social and Affective Neuroscience, Linköping University, Linköping; Scuola Internazionale Superiore di Studi Avanzati (SISSA), Trieste, Italy; Wellcome Trust Centre for Neuroimaging, Institute of Neurology, University College London, London; Max Planck UCL Centre for Computational Psychiatry and Ageing Research, London; Translational Neuromodeling Unit (TNU), Institute for Biomedical Engineering, University of Zurich & ETH Zurich, Switzerland; Berlin School of Mind and Brain, Humboldt-Universität zu Berlin, Berlin; Berlin Institute of Health, Berlin, Germany; Cluster of Excellence NeuroCure, Charité – Universitätsmedizin Berlin; *equal contribution


Introduction
Cognitive and motivational deficits represent important characteristics of schizophrenia that are associated with clinical and social outcomes (1-5). Flexible reward-based learning and decision-making require the integration of cognition and motivation and are impaired in schizophrenia (6,7). These impairments are present at the onset of the disorder, are independent of lower general IQ, remain stable over time (8,9) and have been proposed as neurocognitive markers with potential clinical utility (10). However, the underlying cognitive mechanisms and associated neural signatures remain to be understood.
Flexible reward-based learning and decision-making can be probed via variants of reversal learning. In such tasks, schizophrenia is characterized by increased switching between choice options (8,11-16). The mechanisms underlying this unstable behavior remain unknown but can be targeted by computational modeling (17). For example, in 'reinforcement learning' (RL; 18), choices are selected based on expected values, which are learned trial by trial by weighting reward prediction errors (RPEs) with a learning rate. RPEs are closely aligned with phasic dopamine transients (19,20).
Considering the enhanced presynaptic dopamine synthesis capacity in schizophrenia (21,22), this could translate into enhanced phasic dopamine transients, which in turn might result in increased learning rates (23). Although this could theoretically account for the unstable behavior observed in patients, increased learning rates have not been found in schizophrenia (e.g. 16; for review see 17,23,24).
Theories of predictive coding (25) and hierarchical Bayesian inference have been used to postulate mechanisms underlying the symptoms of schizophrenia (26-29). Symptoms are understood as false inference about the world due to altered precision attributed to beliefs at different hierarchical levels. Dysfunction at higher levels, which are thought to extract and represent the general and stable features of the environment, might lead to experiencing the world as more volatile, i.e. less stable and more surprising. This framework has been applied with a strong focus on positive symptoms (30) and has received empirical support (e.g. 31). Applied to reward-based learning, beliefs about the probability of rewards are learned at lower levels but are also determined by learning about the volatility of reward probabilities at higher levels (32). This higher-level environmental volatility directly influences trial-by-trial learning from lower-level RPEs by scaling the belief update (via a ratio of precisions from both levels). Thus, a higher-order belief about environmental volatility could induce rapid updates of lower-level beliefs about reward probabilities and promote enhanced switching behavior in patients.
In schizophrenia, striatal and prefrontal activation is reduced during reward anticipation and receipt (33-35), but this can be ameliorated by antipsychotics (36). Reduced striatal RPE activity was observed in unmedicated (16) but not in medicated patients (14,37). Here, we used a modified reward-based reversal-learning task during fMRI in schizophrenia patients and healthy controls. Detailed computational modeling was applied to the behavioral data by comparing RL models and a hierarchical Bayesian learning model, the Hierarchical Gaussian Filter (HGF; 41,42). Learning trajectories from computational modeling informed the analysis of the fMRI data. We hypothesized that enhanced switching during decision-making in schizophrenia relates to higher-order beliefs about the volatility of the environment, and we examined the associated neural signatures.

Materials and Methods
Participants and instruments. 46 medicated schizophrenia patients and 43 healthy controls (HC) were included in the study (see Table 1).
Task. Participants performed a two-choice decision-making task requiring flexible behavioral adaptation (Figure 1A; 45-47). In 160 trials, individuals decided between two cards, each showing a different visual stimulus. The right or left location of the stimuli was randomized over trials. After a left or right button press (max. response time 1.5s), the selected card was highlighted and a monetary win (10 Eurocent coin) or a monetary loss (crossed 10 Eurocent coin) was shown for 0.5s. During the inter-trial interval, a fixation cross was presented (distributed jitter, range 1s-12.5s). If no response occurred in time, the message "too slow" appeared. One of the two cards was initially assigned a reward probability of 80% and a punishment probability of 20%, and vice versa for the other card. Thus, the task had a simple higher-order structure (Figure 1B): a perfect anti-correlation between the reward probabilities associated with the two choices; whenever one card was associated with a high reward probability of 80%, the other card was associated with a low reward probability of 20%. Reward contingencies were stable for the first 55 trials ('pre-reversal') and for the last 35 trials ('post-reversal'). During the 'reversal' phase, reward contingencies changed four times, after 15 or 20 trials each. In an instruction and training session before MRI scanning, participants were informed that one of the two cards had a superior chance of winning money. They were told that, depending on their choice, they could either win 0.10€ or lose 0.10€ per trial, and to win as much as possible as the total gain was paid out. 20 training trials were performed with a different set of cards and without reversal.
After training, participants were instructed that reward probabilities could change over time and to track such changes in order to win as much as possible. No other information about reversals or the anti-correlated task structure was given.
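The reward schedule described above can be sketched in a few lines; the flip points below are illustrative values consistent with the phase descriptions, not the exact experimental schedule:

```python
def make_schedule(n_trials=160, p_high=0.8, flip_points=(55, 70, 90, 105)):
    """Sketch of the anti-correlated reward schedule: card A starts at an
    80% reward probability, contingencies flip at each point in
    `flip_points` (illustrative values), and card B always carries the
    complementary probability."""
    p_a, p = [], p_high
    for t in range(n_trials):
        if t in flip_points:       # contingencies reverse at each flip point
            p = 1.0 - p
        p_a.append(p)
    p_b = [1.0 - p for p in p_a]   # perfect anti-correlation between cards
    return p_a, p_b
```

Because the two probabilities always sum to 1, tracking one card is in principle sufficient to know both, which motivates the 'double-update' model variants described in the Supplemental Methods.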
Analysis of choice behavior. Behavioral performance was quantified by counting 'correct' choices of the stimulus with the high (80%) reward probability and was analyzed using repeated-measures ANOVA including the between-subject factor 'group' and the within-subject factor 'phase' (pre-reversal, reversal, post-reversal). Another repeated-measures ANOVA was used to test the effect of previous feedback on subsequent choices, i.e. repeating choices after reward ('win-stay') and after punishment ('lose-stay').
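As an illustration, win-stay and lose-stay proportions can be computed per subject as follows (a hypothetical helper, not the authors' analysis code):

```python
def stay_probabilities(choices, rewards):
    """Proportion of trials on which the previous choice was repeated,
    split by whether the previous trial was rewarded (win-stay) or
    punished (lose-stay). `choices` and `rewards` are equal-length
    sequences; rewards are coded 1 (win) and 0 (loss)."""
    win_stay, win_n, lose_stay, lose_n = 0, 0, 0, 0
    for prev_c, prev_r, cur_c in zip(choices, rewards, choices[1:]):
        stayed = int(cur_c == prev_c)
        if prev_r == 1:
            win_stay += stayed
            win_n += 1
        else:
            lose_stay += stayed
            lose_n += 1
    return (win_stay / win_n if win_n else float('nan'),
            lose_stay / lose_n if lose_n else float('nan'))
```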
Computational models of learning. Computational modeling comprised a model-free RL approach and a Bayesian learning model, the HGF. The HGF is grounded in the "Bayesian brain" hypothesis and reflects the principles of predictive coding (25). For all models, an RPE δ is used to update expectations for the chosen stimulus. Using the notation of RL, $Q_c^{(t)}$ represents the expectation of receiving reward or punishment when choosing card c in trial t, and $R^{(t)}$ denotes the received outcome. The RPE reflects the difference between received outcome and expectation:

$$\delta^{(t)} = R^{(t)} - Q_c^{(t)}$$

The update of $Q_c^{(t)}$ is equal to the RPE weighted by a parameter α, the learning rate:

$$Q_c^{(t+1)} = Q_c^{(t)} + \alpha \, \delta^{(t)}$$

The learning rate α reflects a constant rate of change in values throughout the series of trials. Thus, with learning rates close to 1, PEs strongly affect expectations, while learning rates close to 0 lead to little influence of PEs on expectations.
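A minimal sketch of this delta-rule update:

```python
def rl_update(q, reward, alpha):
    """One Rescorla-Wagner step: the RPE is the outcome minus the current
    expectation, and the expectation moves toward the outcome by a
    fraction alpha of that error."""
    delta = reward - q           # reward prediction error
    return q + alpha * delta, delta
```

With alpha near 1, the expectation jumps toward the latest outcome; with alpha near 0, it changes slowly across trials.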
The HGF describes learning as a process of inductive inference under uncertainty. This model deploys hierarchically organized states, in which learning at a higher-level state determines learning at a lower-level state by dynamically adjusting the lower level's learning rate. In our case, the top level represents a trial-by-trial computation of environmental volatility, an estimate of how likely a change in action-outcome contingencies is to occur. This top-level estimate dynamically modulates the learning rate of lower-level RPEs trial by trial (see Figure 2A for a graphical illustration). More formally, the evolution of each state $x_i$ is defined as a Gaussian random walk, where the change in each $x_i$ can be inferred using a variational Bayesian scheme, giving the posterior mean $\mu_i^{(t)}$ and variance $\sigma_i^{(t)}$ at each level (with i being an index of the level). In our implementation, action-outcome contingencies $x_2^{(t)}$ evolved as a Gaussian random walk on the logit scale; the probability of a choice being rewarded (i.e., $x_1 = 1$) in a given trial is given by the logistic sigmoid function:

$$p\big(x_1^{(t)} = 1\big) = s\big(x_2^{(t)}\big) = \frac{1}{1 + \exp\big(-x_2^{(t)}\big)}$$

Crucially, the step size of the Gaussian random walk of $x_2$ depends on the next higher level $x_3$:

$$x_2^{(t)} \sim \mathcal{N}\Big(x_2^{(t-1)},\; \exp\big(\kappa x_3^{(t)} + \omega\big)\Big)$$

The width of the Gaussian is defined by the parameters κ and ω. The term $\kappa x_3^{(t)}$ expresses the influence of the third-level environmental volatility $x_3^{(t)}$ on the second level, and κ captures inter-individual differences in the coupling of the levels. $x_3^{(t)}$ evolves in the same manner, except that the variance of its Gaussian random walk is a constant ϑ (because there is no higher level):

$$x_3^{(t)} \sim \mathcal{N}\big(x_3^{(t-1)},\; \vartheta\big)$$

In our implementation, ω was fixed at the second level because we were particularly interested in learning about environmental volatility at the third level (ϑ) and its influence on the lower level (κ). Variational inversion of the HGF shows that trial-by-trial updates of the posterior means at each level
are proportional to the PE from the level below, weighted by a precision ratio (compare equation 2 for the RL equivalent):

$$\Delta\mu_i^{(t)} \propto \frac{\hat{\pi}_{i-1}^{(t)}}{\pi_i^{(t)}}\,\delta_{i-1}^{(t)}$$

For an exact derivation of the precision weights and the precision-weighted PEs, we refer to previous methodological papers (41,42). In addition to the three-level HGF (HGF3), we also included a two-level variant (HGF2) to test the hypothesized superiority of HGF3.
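The generative model that the HGF inverts can be simulated in a few lines; the parameter values below (kappa, omega, theta) are illustrative, not the fitted ones:

```python
import math
import random

def hgf_generative_step(x2, x3, kappa=1.0, omega=-4.0, theta=0.5,
                        rng=random.Random(0)):
    """One generative step of the three-level HGF: x3 performs a Gaussian
    random walk with constant variance theta; x2 performs a random walk
    whose step variance exp(kappa * x3 + omega) is controlled by x3; the
    binary outcome is Bernoulli with probability sigmoid(x2)."""
    x3 = rng.gauss(x3, math.sqrt(theta))
    x2 = rng.gauss(x2, math.sqrt(math.exp(kappa * x3 + omega)))
    p_reward = 1.0 / (1.0 + math.exp(-x2))   # logistic sigmoid
    outcome = int(rng.random() < p_reward)
    return x2, x3, outcome
```

The key structural feature is visible in the second random walk: when x3 drifts upward, exp(kappa * x3 + omega) grows, so the reward contingency x2 changes faster, which is exactly the volatility coupling that inference inverts.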
HGF and RL provide different ways to learn expectations about rewards.
Decision models. For each trial t, $Q^{(t)}$ (RL) or the first-level prediction derived from $\mu_2^{(t)}$ (HGF) was transformed to choice probabilities by using a logistic (softmax) function:

$$p\big(c^{(t)} = c\big) = \frac{1}{1 + \exp\!\big(-\beta\,[V_c^{(t)} - V_{\bar c}^{(t)}] - \rho\,\mathrm{rep}^{(t)}\big)}$$

In binary choice tasks with anti-correlated reward probabilities, there is strong autocorrelation of choices (perseveration), which is captured by estimating the parameter ρ. 'rep' indicates repetition of the previous action, and ρ is split depending on whether the outcome of the previous action was a reward or a punishment; thus, ρ_win and ρ_loss reflect differences in choice perseveration after rewards and after punishments separately. The parameter β captures how tightly choice probabilities follow learned reward expectations, i.e. whether individuals tend to exploit these expectations; it was fixed to 1. Other decision models, e.g. estimating β as a free parameter, were tested but had lower evidence (see Supplemental Material).
For between-group second-level analysis, a random-effects model including t-contrasts of all five modeling-based trajectories (precision-weighted PEs ε2 and ε3, precision weights ψ2 and ψ3, and the third-level volatility µ3) was estimated.
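A sketch of the resulting choice rule, assuming the perseveration bias rho is added inside the sigmoid in favor of repeating the previous choice (an illustrative coding of 'rep'):

```python
import math

def p_repeat(v_prev, v_other, rho, beta=1.0):
    """Probability of repeating the previous choice under a logistic
    (softmax) rule with a perseveration bias: rho shifts the inflection
    point of the sigmoid, so rho > 0 biases toward staying and rho < 0
    toward switching, independent of the learned values. In the model
    above, separate values (rho_win, rho_loss) are used depending on the
    previous feedback, and beta is fixed to 1."""
    return 1.0 / (1.0 + math.exp(-(beta * (v_prev - v_other) + rho)))
```

With equal values for both cards, the choice is a coin flip unless rho pushes it toward staying or switching, which is how the model separates perseveration from learning.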

Results
Sample. The groups did not differ in age, gender or handedness (Table 1).
Behavioral data. Repeated-measures ANOVA on choices of the stimulus with 80% reward probability revealed that participants' performance differed between phases (dropping in the reversal phase, main effect of phase F=23.74, p<.001). Patients chose the better card less frequently irrespective of task phase (Figure 1C, main effect of group F=14.52, p<.001, phase x group interaction F=1.87, p=.16). Thus, the factor phase was dropped from further analyses.
Repeated-measures ANOVA on the probability of choice repetition showed that participants were overall more likely to stay with the previous action after rewards compared to losses (main effect of feedback F=369.80, p<.001). Patients were less likely to stick to the previous action, but this did not depend on the feedback received in the previous trial (Figure 1D, main effect of group F=27.77, p<.001, feedback x group interaction F=.02, p=.89). The mechanism underlying this unstable behavior in schizophrenia remains unknown, and the following computational modeling aims to bridge this gap.
Model comparison. Random-effects Bayesian model selection favored the three-level HGF with double update and volatility (HGF3-DU-V) across the whole sample (Table 2). While this model was clearly superior in HC alone (PP=64.4%, PXP=100%) and remained first-ranking in patients with a PP of 22.6%, the differences between models were marginal in the patient group (PXP=12.6%).
15 patients and 1 HC were identified as not being fit better than chance by any of the models (S-Figure 1A). This led to the inclusion of 31 patients and 42 HC for all further modeling-based analyses and altered BMS only marginally (Table 2). Repeating the analysis of win-stay and lose-stay behavior with three groups (HC, patients-fit, patients-nofit) revealed a group x feedback interaction (F=20.68, p<.001), showing that reduced lose-stay behavior characterized both patients-fit and patients-nofit, while a pronounced reduction of win-stay behavior was present only in patients-nofit (see Supplemental Material, S-Figure 1B-D).
Comparison of parameters (Table 3, Figure 2B and 2C) revealed that patients showed an elevated initial belief about third-level environmental volatility (µ3) and an increased coupling κ of volatility estimates to lower-level learning. Notably, between-group findings on behavioral data were reproduced when running the task based on the inferred parameters, which represents an important validation of the model's ability to capture the observed behavioral data (see Supplemental Material).
Replication in unmedicated patients. In an independent sample of unmedicated patients (n=24) and HC (n=24), who performed a comparable reversal-learning task (16), we replicated the between-group findings on learning parameters when fitting the same variant of the HGF (HGF3-DU-V). For statistics, see Table 3. The replication remained significant when excluding participants not fit better than chance (23 HC, 13 patients).
Relationship with symptom dimensions. We explored the relationship of the two learning parameters that differed between groups with measures of cognition (6 measures) and psychopathology (7 measures) within patients (Table 1). Regarding cognition, we found negative correlations with the initial estimate of the third-level environmental volatility (S-Table 2).

FMRI - task effects (pooled across groups).
Activity related to ε2 (a precision-weighted RPE) peaked in bilateral ventral striatum and ventromedial PFC among other regions (p-FWE-wholebrain <.05, Figure 3A, S-Table 3) and also in the midbrain (p-FWE-midbrain-voi <.05, S-Table 9), a well-known network associated with RPEs. In a conjunction analysis, there was large overlap between ε2 and the RPE from RL (S-Table 8; S-Figure 4). In contrast, the third-level precision-weighted PE (ε3) was associated with activation in prefrontal and parietal regions as well as in the left insula (Figure 3A, S-Table 4).
Environmental volatility (µ3) co-varied with the BOLD signal in areas similar to the precision estimates ψ2 and ψ3, i.e. bilateral insula, cingulate cortex, parietal cortex and thalamus (S-Figure 5). However, there was additional activation associated with µ3 in prefrontal regions located in the superior, middle and inferior frontal gyri, as well as in the middle temporal gyrus and globus pallidus (Figure 4A, S-Table 7, S-Figure 5).

FMRI -between-group effects.
For environmental volatility µ3, a group difference between HC and patients was found in the right dlPFC (F-contrast, corrected for the main effect of µ3 over all participants, [x=34, y=44, z=24], F=19.89, z=4.24, p-FWE=.038, Figure 4B). Post-hoc analysis revealed stronger volatility-related activity in the dlPFC of patients compared to HC (t=4.46, z=4.4, p-FWE=.019, Figure 4C). There were no other significant group differences for any of the other regressors.

Discussion
To the best of our knowledge, this is the first study to apply hierarchical Bayesian learning to choice and fMRI data of schizophrenia patients during reward-based decision-making. We present two core findings: First, medicated patients both overestimated their prior higher-level belief about the volatility of the environment and exhibited an increased influence of volatility estimates on lower-level learning of action-outcome contingencies. This provides a computational explanation of the increased switching behavior seen in patients with schizophrenia, both in this and in previous studies (8,11-16). We replicated this finding in an independent cohort of unmedicated patients. Second, medicated patients displayed higher dlPFC activity related to beliefs about environmental volatility. This points towards a prominent role of this region in promoting unstable behavior in schizophrenia.
No differences were observed for other learning signatures such as lower- and higher-level precision weights and precision-weighted PEs from the HGF, or for the RPE from RL. While precision-weighted RPEs have not been investigated in schizophrenia so far, results regarding RPE activity in schizophrenia are mixed. In medicated patients, no significantly different RPE BOLD signal was observed in two recent studies (13,14), in line with our current finding. In unmedicated patients, on the other hand, striatal RPE activity was reduced (16), which suggests an effect of antipsychotic medication similar to that on reward anticipation (36,57,58).

Supplemental Methods
Computational models of learning. In the case of the applied task, both HGF and RL provide different ways to learn expectations about rewards. In the main text, we describe how both algorithms update expectations of the chosen card c only ("single-update", SU), which implies that there is no update of the expectation about the unchosen card $\bar c$:

$$Q_{\bar c}^{(t+1)} = Q_{\bar c}^{(t)}$$

Correspondingly, for the HGF:

$$\mu_{2,\bar c}^{(t+1)} = \mu_{2,\bar c}^{(t)}$$

Based on the anti-correlated task structure, one can implement a variant of each learning model that updates the unchosen card simultaneously, i.e., an increase in the expectation of the chosen card implies a decrease for the unchosen card $\bar c$ ("double-update", DU):

$$Q_{\bar c}^{(t+1)} = Q_{\bar c}^{(t)} - \alpha\,\delta^{(t)}$$

and, for the HGF on the logit scale, the belief about the unchosen card mirrors that of the chosen card:

$$\mu_{2,\bar c}^{(t+1)} = -\mu_{2,c}^{(t+1)}$$

Both SU and DU variants of each learning model were fit to the choice data.
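A minimal sketch of the DU variant for the RL model:

```python
def double_update(q_chosen, q_unchosen, reward, alpha):
    """DU sketch: the chosen card's value is updated by the RPE as usual,
    while the unchosen card's value is pushed in the opposite direction,
    exploiting the anti-correlated task structure."""
    delta = reward - q_chosen           # reward prediction error
    return q_chosen + alpha * delta, q_unchosen - alpha * delta
```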
Decision models. In binary choice tasks with anti-correlated reward probabilities, there is strong autocorrelation of choices (perseveration). In our decision model (equation 5, main manuscript), this is captured by estimating a parameter ρ, which changes the inflection point of the sigmoid function, thereby biasing toward an overall tendency to stay or switch irrespective of the learned expectations; we split this parameter into ρ_win and ρ_loss to reflect differences in choice perseveration after rewards and after punishments separately.
Classification of subjects not fit better than chance. In choice tasks like ours, a subject can be classified as fit better than chance when the average predictive probability per trial, given by exp(-LL/n_trials), exceeds .55, corresponding to p<.05 in a binomial test. This procedure was applied to all individuals to avoid the possibility that between-group differences in model parameters are confounded by differences in model fit (1,3-6).
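These two quantities can be sketched as follows (hypothetical helper names):

```python
import math
from math import comb

def avg_predictive_prob(neg_ll, n_trials):
    """exp(-LL/n): the geometric mean of the per-trial choice probabilities
    under the fitted model; 0.5 corresponds to coin-flip predictions."""
    return math.exp(-neg_ll / n_trials)

def binom_p_vs_chance(n_predicted, n_trials):
    """One-sided binomial test of 'trials predicted correctly' against
    chance performance (p = 0.5)."""
    return sum(comb(n_trials, k)
               for k in range(n_predicted, n_trials + 1)) / 2 ** n_trials
```

For example, a model that assigns probability 0.5 to every observed choice yields an average predictive probability of exactly 0.5, i.e. no better than chance.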
Functional Magnetic Resonance Imaging. fMRI was performed using a 3 Tesla Siemens Trio scanner to acquire gradient-echo T2*-weighted echo-planar images with blood-oxygenation-level-dependent (BOLD) contrast.
Preprocessing of fMRI data. fMRI data were analyzed using SPM8 (http://www.fil.ion.ucl.ac.uk/spm/software/spm8/). For preprocessing, images were corrected for the delay of slice-time acquisition. Voxel-displacement maps were estimated based on field maps. All images were realigned to correct for motion and were also corrected for distortion and the interaction of distortion and motion. The images were spatially normalized into Montreal Neurological Institute (MNI) space using the normalization parameters generated during the segmentation of each subject's anatomical T1 scan; spatial smoothing was applied with an isotropic Gaussian kernel of 6mm full width at half maximum. Prior to first-level statistical analysis, data were high-pass filtered with a cutoff of 128s.

Supplemental Results
Behavioral data in HC, SZ-fit and SZ-nofit. Repeating the analysis of win-stay and lose-stay behavior with three groups (HC-fit, SZ-fit, SZ-nofit) revealed that all participants were more likely to repeat the previous action after rewards versus losses (main effect of feedback F=371.91, p=8.65e-33) and that patients were overall less likely to stay independent of the feedback received in the previous trial (significant main effect of group F=32.99, p=2.48e-11). However, as mentioned in the main manuscript, there was also a significant group x feedback interaction (F=20.68, p=4.81e-08), showing that reduced lose-stay behavior characterized both SZ-fit and SZ-nofit (S-Figure 1). A reduction of win-stay behavior was pronounced in SZ-nofit only (S-Figure 1). Performing the same analysis with two groups (HC-fit, SZ-fit; main effect of feedback F=636.30, p=3.60e-37; main effect of group F=10.22, p=.0021) confirmed a significant group x feedback interaction (F=6.79, p=.0111). Thus, in these patients, the tendency to switch was particularly driven by enhanced switching after negative feedback. This is in line with, and corroborated by, the computational modeling, which showed that these patients had a reduced tendency to stay after punishments but not after rewards, an effect that remained significant beyond the differences in learning (see Results section in the main manuscript). In addition, the latter differences in learning provide a mechanistic explanation for the switching behavior observed in this and previous studies. For discussion, see the main manuscript.
Clinical distinction between subgroups. When further exploring the subgroups, SZ-nofit patients did not differ from SZ-fit patients in any of the seven measures of positive and negative symptoms.
RPE from RL (S-Figure 3A). There was no significant difference between the groups for this signal. However, it largely overlapped with the precision-weighted PE at the second level of the HGF (Figure 3A, S-Table 3, S-Figure 3B), which represents, in the case of our task, a precision-weighted RPE.
S-Figure 1. A) Classification above (black dots) and below (red crosses) chance and its influence on B) overall stay behavior, C) win-stay behavior and D) lose-stay behavior.