To explain the high level of flexibility in primate decision-making, theoretical models often invoke reinforcement-based mechanisms, performance monitoring functions, and core neural features within frontal cortical regions. However, the underlying biological mechanisms remain unknown. In recent models, part of the regulation of behavioral control is based on meta-learning principles, for example, driving exploratory actions by varying a meta-parameter, the inverse temperature, which regulates the contrast between competing action probabilities. Here we investigate how complementary processes between lateral prefrontal cortex (LPFC) and dorsal anterior cingulate cortex (dACC) implement decision regulation during exploratory and exploitative behaviors. Model-based analyses of unit activity recorded in these 2 areas in monkeys first revealed that adaptation of the decision function is reflected in a covariation between LPFC neural activity and the control level estimated from the animal's behavior. Second, dACC more prominently encoded a reflection of outcome uncertainty useful for control regulation based on task monitoring. Model-based analyses also revealed higher information integration before feedback in LPFC, and after feedback in dACC. Overall the data support a role of dACC in integrating reinforcement-based information to regulate decision functions in LPFC. Our results thus provide biological evidence on how prefrontal cortical subregions may cooperate to regulate decision-making.

## Introduction

When searching for resources, animals can adapt their choices by reference to the recent history of successes and failures. This progressive process leads to improved predictions of future outcomes and to the adjustment of action values. However, to be efficient, adaptation requires dynamic modulations of behavioral control, including a balance between choices known to be rewarding (exploitation), and choices with unsure, but potentially better, outcome (exploration).

The prefrontal cortex is required for the organization of goal-directed behavior (Miller and Cohen 2001; Wilson et al. 2010) and appears to play a key role in regulating exploratory behaviors (Daw et al. 2006; Cohen et al. 2007; Frank et al. 2009). The lateral prefrontal cortex (LPFC) and the dorsal anterior cingulate cortex (dACC, or strictly speaking the midcingulate cortex, [Amiez et al. 2013]) play central roles, but it is unclear which mechanisms underlie the decision to explore and how these prefrontal subdivisions participate.

Computational solutions often rely on the meta-learning framework, where shifting between different control levels (e.g., shifting between exploration and exploitation) is achieved by dynamically tuning meta-parameters based on measures of the agent's performance (Doya 2002; Ishii et al. 2002; Schweighofer and Doya 2003). When applied to models of the prefrontal cortex's role in exploration (McClure et al. 2006; Cohen et al. 2007; Krichmar 2008; Khamassi et al. 2011), this principle predicts that the expression of exploration is associated with decreased choice-selectivity in the LPFC (a flat action probability distribution producing stochastic decisions), whereas exploitation is associated with increased selectivity (a peaked probability distribution resulting in a winner-take-all effect). However, such online variations during decision-making have yet to be shown experimentally. Moreover, current models often restrict the role of dACC to conflict monitoring (Botvinick et al. 2001), neglecting its involvement in action valuation (MacDonald et al. 2000; Kennerley et al. 2006; Rushworth and Behrens 2008; Seo and Lee 2008; Alexander and Brown 2010; Kaping et al. 2011). dACC activity shows correlates of the adjustment of action values based on measures of performance such as reward prediction errors (Holroyd and Coles 2002; Amiez et al. 2005; Matsumoto et al. 2007; Quilodran et al. 2008), outcome history (Seo and Lee 2007), and error likelihood (Brown and Braver 2005). Variations of activity in dACC and LPFC between exploration and exploitation suggest that both structures contribute to the regulation of exploration (Procyk et al. 2000; Procyk and Goldman-Rakic 2006; Landmann et al. 2007; Rothe et al. 2011).

The present work assessed the complementarity of dACC and LPFC in behavioral regulation. We previously developed a neurocomputational model of the dACC–LPFC system to synthesize the data reviewed above (Khamassi et al. 2011, 2013). One important feature of the model was to include a regulatory mechanism by which the control level is modulated as a function of changes in the monitored performance. As reviewed above such a regulatory mechanism should lead to changes in prefrontal neural selectivity. This work thus generated experimental predictions that are tested here on actual neurophysiological data.

We recorded LPFC single-unit activities and made comparative model-based analyses with these data and dACC recordings that had previously been analyzed only at the time of feedback (Quilodran et al. 2008). We show that information related to different model variables (reward prediction errors, action values, and outcome uncertainty) is multiplexed in different trial epochs both in dACC and LPFC, with higher integration of information before the feedback in LPFC, and after the feedback in dACC. Moreover, LPFC activity displays higher mutual information with the animal's choice than dACC, supporting its role in action selection. Importantly, as predicted by prefrontal cortical models, we observe that LPFC choice selectivity covaries with the control level measured from behavior. Taken together with recent data (Behrens et al. 2007; Rushworth and Behrens 2008), our results suggest that the dACC–LPFC dyad is implicated in the online regulation of learning mechanisms during behavioral adaptation, with dACC integrating reinforcement-based information to regulate decision functions in LPFC.

## Materials and Methods

Monkey housing, surgical, electrophysiological and histological procedures were carried out according to the European Community Council Directive (1986) (Ministère de l'Agriculture et de la Forêt, Commission nationale de l'expérimentation animale) and Direction Départementale des Services Vétérinaires (Lyon, France).

### Experimental Set Up

Two male rhesus monkeys (monkeys M and P) were included in this experiment. During recordings animals were seated in a primate chair (Crist Instrument Company Inc., USA) within arm's reach of a tangent touch-screen (Microtouch System) coupled to a TV monitor. In the front panel of the chair, an opening allowed the monkey to touch the screen with one hand. A computer recorded the position and accuracy of each touch. It also controlled the presentation via the monitor of visual stimuli (colored shapes), which served as visual targets (CORTEX software, NIMH Laboratory of Neuropsychology, Bethesda, Maryland). Eye movements were monitored using an Iscan infrared system (Iscan Inc., USA).

### Problem Solving Task

We employed a problem solving task (PS task; Fig. 1*A*) where the subject has to find by trial and error which of 4 targets is rewarded. A typical problem started with a *Search* period where the animal performed a series of incorrect search trials (INC) until the discovery of the correct target (first correct trial, CO1). Then a *Repetition* period was imposed where the animal could repeat the same choice during a varying number of trials (between 3 and 11 trials) to reduce anticipation of the end of problems. At the end of repetition, a signal to change (SC; a red flashing circle of 8 cm in diameter at the center of screen) indicated the beginning of a new problem, that is, that the correct target location would change with a 90% probability.

Each trial was organized as follows: a central target (lever) is presented, which is referred to as trial start (ST); the animal then touches the lever to trigger the onset of a central white square which serves as the fixation point (FP). After an ensuing delay period of about 1.8 s (during which the monkey is required to maintain fixation on the FP), 4 visual target items (disks of 5 mm in diameter) are presented and the FP is extinguished. The monkey then has to make a saccade towards the selected target. After the monkey has fixated on the selected target for 390 ms, all the targets turn white (go signal), indicating that the monkey can touch the chosen target. Targets turn gray at touch for 600 ms and then switch off. At offset, a juice reward is delivered after a correct touch. In the case of an incorrect choice, no reward is given, and on the next trial the animal can continue searching for the correct target. A trial is aborted in the case of a premature touch or a break in eye fixation.

### Behavioral Data

Performance in search and repetition periods was measured using the average number of trials performed until discovery of the correct target (including first correct trial) and the number of trials performed to repeat the correct response 3 times, respectively. Different types of trials are defined in a problem. During search the successive trials were labeled by their order of occurrence (indices: 1, 2, 3,…, until the first correct trial). Correct trials were labeled CO1, CO2,… and CO*n*. Arm reaction times and movement times were measured on each trial. Starting and ending event codes defined each trial.

Series of problems are grouped in sessions. A session corresponds to one recording file that contains data acquired for several hours (during behavioral sessions) to several tens of minutes (during neurophysiological recordings corresponding to one site and depth).

### Electrophysiological Recordings

Monkeys were implanted with a head-restraining device, and a magnetic resonance imaging-guided craniotomy was performed to access the prefrontal cortex. A recording chamber was implanted with its center placed at stereotaxic anterior level A+31. Neuronal activity was recorded using epoxy-coated tungsten electrodes. Recording sites labeled dACC covered an area extending over ∼6 mm (anterior to posterior), in the dorsal bank and fundus of the anterior part of the cingulate sulcus, at stereotaxic levels superior to *A*+30 (Fig. 1*B*). This region is at the rostral level of the midcingulate cortex as defined by Vogt et al. (2005). Recording sites in LPFC were located mostly on the posterior third of the principal sulcus.

### Data Analyses

All analyses were performed using Matlab (The Mathworks, Natick, MA).

### Theoretical Model for Model-Based Analysis

We compared the ability of several different computational models to fit trial-by-trial choices made by the animals. The aim was to select the best model to analyze neural data. The models tested (see list below) were designed to evaluate which among several computational mechanisms were crucial to reproduce monkey behavior in this task. The mechanisms are:

Elimination of non-rewarded targets tested by the animal during the search period. This mechanism could be modeled in many different ways, for example, using Bayesian models or reinforcement learning models. In order to keep our results comparable with, and interpretable within, the framework used by previous similar studies (e.g., Matsumoto et al. 2007; Seo and Lee 2009; Kennerley and Walton 2011), we used reinforcement learning models (which, in this task, would operate with high learning rates, that is, close to 1), while noting that this would be equivalent to models performing logical elimination of non-rewarded targets or models using a Bayesian framework for elimination. This mechanism is included in Models 1–10 in the list below.

Progressive forgetting that a target has already been tested. This mechanism is included in Models 2–7 and 9–10.

Reset after the SC. This would represent information about the task structure and is included in Models 3–12. Among these models, some (i.e., Models 4, 6–10) also tend not to choose the previously rewarded target (called “shift” mechanism), and some (i.e., Models 5–10) also include spatial biases for the first target choice within a problem (called “bias” mechanism).

Change in the level of control from search to repetition period (after the first correct trial). This would represent other information about the task structure and is included in Models 9 and 10 (i.e., GQLSB2β and SBnoA2β).

List of tested models:

1. Model QL (Q-learning)

We first tested a classical Q-learning (QL) algorithm which implements action valuation based on standard reinforcement learning mechanisms (Sutton and Barto 1998). Since the task involves 4 possible targets on the touch screen (upper-left: 1, upper-right: 2, lower-right: 3, lower-left: 4; Fig. 1*C*), the model maintained 4 action values (*Q*_{1}, *Q*_{2}, *Q*_{3}, and *Q*_{4}, corresponding to the values associated with choosing targets 1, 2, 3, and 4, respectively).

At each trial, the probability of choosing target *a* was computed by a Boltzmann softmax rule for action selection:

$$P(a|t) = \frac{e^{\beta Q_a(t)}}{\sum_{b=1}^{4} e^{\beta Q_b(t)}} \quad (1)$$

where the inverse temperature meta-parameter *β* (*β* > 0) regulates the exploration level. A small *β* leads to very similar probabilities for all targets (flat probability distribution) and thus to exploratory behavior. A large *β* increases the contrast between the highest value and the others (peaked probability distribution), and thus produces exploitative behavior.

At the end of the trial, after choosing target *a_{i}*, the corresponding value is compared with the presence/absence of reward so as to compute a reward prediction error (RPE) (Schultz et al. 1997):

$$\delta(t) = r(t) - Q_{a_i}(t) \quad (2)$$

where *r*(*t*) is the reward function, modeled as being equal to 1 at the end of the trial in the case of success, and −1 in the case of failure. The reward prediction error signal *δ*(*t*) is then used to update the value associated with the chosen target:

$$Q_{a_i}(t+1) = Q_{a_i}(t) + \alpha \, \delta(t) \quad (3)$$

where *α* is the learning rate. Thus the QL model employs 2 free meta-parameters: *α* and *β*.
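As an illustration, the QL model's choice rule and value update can be sketched in a few lines (a minimal NumPy version of the standard softmax and Q-learning equations; variable names are ours):

```python
import numpy as np

def softmax_policy(q_values, beta):
    """Boltzmann softmax: a higher beta yields a more peaked (exploitative)
    distribution, a lower beta a flatter (exploratory) one."""
    z = beta * (q_values - np.max(q_values))  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def q_update(q_values, action, reward, alpha):
    """One Q-learning step: RPE delta = r - Q(a), then Q(a) <- Q(a) + alpha * delta."""
    delta = reward - q_values[action]         # reward prediction error
    updated = q_values.copy()
    updated[action] += alpha * delta          # update the chosen target's value only
    return updated, delta
```

With *β* = 0.1 the 4 choice probabilities are nearly equal, whereas with *β* = 35 virtually all probability mass falls on the highest-valued target.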

2. Model Generalized Q-learning (GQL)

We also tested a generalized version of Q-learning (GQL) (Barraclough et al. 2004; Ito and Doya 2009) which includes a forgetting mechanism by also updating the value associated with each non-chosen target *b* according to the following equation:

$$Q_b(t+1) = Q_b(t) + \kappa \, (Q_0 - Q_b(t)) \quad (4)$$

where *κ* is a third meta-parameter called the forgetting rate $(0 \leq \kappa \leq 1)$, and *Q*_{0} is the initial *Q*-value.
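A sketch of one common form of such a forgetting rule, assumed here to decay each unchosen target's value toward the initial value *Q*_{0} at rate *κ* (the exact parameterization in the original model may differ):

```python
import numpy as np

def forget_unchosen(q_values, chosen, kappa, q0=0.0):
    """Decay every unchosen target's value toward the initial value q0
    (an assumed form of the forgetting rule).
    kappa = 0 leaves values untouched; kappa = 1 resets them to q0."""
    updated = q_values.copy()
    for b in range(len(updated)):
        if b != chosen:
            updated[b] += kappa * (q0 - updated[b])
    return updated
```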

3. Model GQLnoSnoB (GQL with reset of *Q*-values at each new problem; no shift, no bias)

Since animals are over-trained on the PS task, they tend to learn the task structure: the presentation of the SC on the screen is sufficient to let them anticipate that a new problem will start and that most probably the correct target will change. In contrast, the 2 above-mentioned reinforcement learning models tend to repeat previously rewarded choices. We thus tested an extension of these models where the values associated with each target are reset to [0 0 0 0] at the beginning of each new problem (Model GQLnoSnoB).

4. Model GQLSnoB (GQL with reset including shift in previously rewarded target; no bias)

We also tested a version of the latter model where, in addition, the value associated with the previously rewarded target has a probability *P*_{S} of being reset to 0 at the beginning of the problem, *P*_{S} being the animal's average probability of shifting from the previously rewarded target as measured from the previous session $(0.85 \leq P_S \leq 0.95)$ (Fig. 2*A*, middle). This model including the shifting mechanism is called GQLSnoB and has 3 free meta-parameters.

5. Model GQLBnoS (GQL with reset based on spatial biases; no shift)

In the fifth tested model (Model GQLBnoS), instead of using such a shifting mechanism, target *Q*-values are reset to values determined by the animal's spatial biases measured during search periods of the previous session; for instance, if during the previous session the animal started 50% of search periods by choosing Target 1, 25% by choosing Target 2, 15% by choosing Target 3, and the rest of the time by choosing Target 4, target values were reset to [*θ*_{1}; *θ*_{2}; *θ*_{3}; (1 − *θ*_{1} − *θ*_{2} − *θ*_{3})], where *θ*_{1} = 0.5, *θ*_{2} = 0.25 and *θ*_{3} = 0.15, at each new search of the next session. In this manner, *Q*-values are reset using a rough estimate of choice variance during the previous session. These 3 spatial bias parameters are not considered as free meta-parameters since they were always determined based on the previous behavioral session, because they were found to be stable across sessions for each monkey (Fig. 2*A*, right).

6. Model GQLSB (GQL with reset including shift in previously rewarded target and spatial biases)

We also tested a model which combines both the shifting mechanism and the spatial biases (Model GQLSB) and thus has 3 free meta-parameters.

7. Model SBnoA (Shift and Bias but the learning rate *α* is fixed to 1)

Since the reward schedule is deterministic (i.e., choice of the correct target provides reward with probability 1), a single correct trial is sufficient for the monkey to memorize which target is rewarded in a given problem. We thus tested a version of the previous model where elimination of non-rewarded targets is done with a learning rate *α* fixed to 1 (that is, no degree of freedom in the learning rate, in contrast with Model GQLSB). This meta-parameter is usually set to a low value (i.e., close to 0) in the Reinforcement Learning framework to enable progressive learning of reward contingencies (Sutton and Barto 1998). With *α* set to 1, Model SBnoA systematically performs sharp changes of *Q*-values after each outcome, a process which could be closer to working memory mechanisms in the prefrontal cortex (Collins and Frank 2012). All other meta-parameters are the same as in GQLSB, including the forgetting mechanism (equation 4), which is considered not to be specific to Reinforcement Learning but also valid for Working Memory (Collins and Frank 2012). Model SBnoA has 2 free meta-parameters.

8. Model SBnoF (Shift and Bias but no *α* and no Forgetting)

To verify that the forgetting mechanism was necessary, we tested a model where both *α* and *κ* are set to 1. This model has thus only 1 meta-parameter: *β*.

9. Model GQLSB2β (with distinct exploration meta-parameters during search and repetition trials: respectively *β*_{S} and *β*_{R})

To test the hypothesis that monkey behavior in the PS Task can be best explained by 2 distinct control levels during search and repetition periods, instead of using a single meta-parameter *β* for all trials, we used 2 distinct meta-parameters *β*_{S} and *β*_{R} so that the model used *β*_{S} in equation (1) during search trials and *β*_{R} in equation (1) during repetition trials. We tested these distinct search and repetition *β*_{S} and *β*_{R} meta-parameters in Model GQLSB2β which thus has 4 free meta-parameters compared with 3 in Model GQLSB.

10. Model SBnoA2β (with distinct exploration meta-parameters during search and repetition trials: respectively *β*_{S} and *β*_{R})

Similarly to the previous model, we tested a version of Model SBnoA which includes 2 distinct *β*_{S} and *β*_{R} meta-parameters for the search and repetition periods. Model SBnoA2β thus has 3 free meta-parameters.

11. and 12. Control models: ClockS (Clockwise search + repetition of correct target); RandS (Random search + repetition of correct target)

We finally tested 2 control models to assess the contribution of the value-updating mechanism used in the previous models for the elimination of non-rewarded targets (i.e., equation 3, with *α* used as a free meta-parameter in Model GQLSB or set to 1 in Model SBnoA). Model *ClockS* replaces this mechanism by performing systematic clockwise searches, starting from the animal's favorite target (as measured in the spatial bias) instead of choosing targets based on their values, and repeats the choice of the rewarded target once it finds it. Model RandS performs random searches and repeats the choice of the rewarded target once it finds it.

### Theoretical Model Optimization

To compare the ability of the models to fit the monkeys' behavior during the task, we proceeded in 3 steps. (1) We first separated the behavioral data into 2 datasets, so as to optimize the models on the Optimization dataset (Opt) and then perform an out-of-sample test of these models on the Test dataset (Test). (2) For each model, we then estimated the meta-parameter set which maximized the log-likelihood (LL) of the monkeys' trial-by-trial choices in the Optimization dataset given the model. (3) We finally compared the scores obtained by the models with different criteria: maximum LL and percentage of the monkeys' choices predicted (%) on the Opt and Test datasets, BIC, AIC, and the log of the posterior probability of models given the data and given priors over the meta-parameters (LPP).

1. Separation of optimization (Opt) and test (Test) datasets

We used a cross-validation method by optimizing the models' meta-parameters on 4 behavioral sessions of the PS task (2 per monkey, concatenated into a single block of trials per monkey in order to optimize a single meta-parameter set per animal; 4031 trials), and then testing these models out of sample with the same meta-parameters on 49 other sessions (57 336 trials). The out-of-sample test was performed to assess the models' generalization ability and to validate which model is best without model-complexity issues.

2. Meta-parameter estimation

The aim here was to find, for each model *M*, the set of meta-parameters *θ* which maximized the LL of the sequence of monkey choices in the Optimization dataset *D* given *M* and *θ*:

$$LL(D|M,\theta) = \sum_{t} \log P(c_t|M,\theta)$$

where *c*_{t} denotes the monkey's choice at trial *t*.

We searched for each model's LL_{opt} and *θ*_{opt} on the Optimization dataset with 2 different methods:

We first sampled a million different meta-parameter sets (drawn from prior distributions over meta-parameters such that *α*, *κ* are in [0;1], and *β*, *β*_{S}, *β*_{R} are in −10 log([0;1])). We stored the LL_{opt} score obtained for each model and the corresponding meta-parameter set *θ*_{opt}.

We then performed another meta-parameter search through a gradient-descent method using the *fminsearch* function in Matlab launched at multiple starting points: we started the function from all possible combinations of meta-parameters in *α*, *κ* in {0.1;0.5;0.9}, *β*, *β*_{S}, *β*_{R} in {1;5;35}. If this method gave a better LL score for a given model, we stored it as well as the corresponding meta-parameter set. Otherwise, we kept the best LL score and the corresponding meta-parameter set obtained with the sampling method for this model.
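The two-stage search can be sketched as follows (a Python sketch with SciPy's Nelder–Mead simplex standing in for Matlab's *fminsearch*; `nll` is a placeholder for a model's negative log-likelihood function, and the sample count is scaled down):

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def fit_metaparameters(nll, n_samples=10_000, seed=None):
    """Minimize nll(theta), theta = (alpha, kappa, beta), in 2 stages:
    (1) random sampling from the priors described in the text;
    (2) Nelder-Mead simplex started from a grid of initial points."""
    rng = np.random.default_rng(seed)
    best_theta, best_score = None, np.inf
    # Stage 1: alpha, kappa ~ U[0,1]; beta drawn as -10*log(U[0,1]).
    for _ in range(n_samples):
        theta = np.array([rng.uniform(), rng.uniform(),
                          -10.0 * np.log(rng.uniform())])
        score = nll(theta)
        if score < best_score:
            best_theta, best_score = theta, score
    # Stage 2: local search from all combinations of starting points.
    for start in itertools.product([0.1, 0.5, 0.9], [0.1, 0.5, 0.9], [1.0, 5.0, 35.0]):
        res = minimize(nll, np.array(start), method="Nelder-Mead")
        if res.fun < best_score:
            best_theta, best_score = res.x, res.fun
    return best_theta, best_score
```

Keeping the sampling stage before the local searches ensures that a good region found by chance is not lost when all simplex starts converge to poorer local minima.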

3. Model comparison

In order to compare the ability of the different models to accurately fit monkeys' behavior in the task, we used different criteria. As typically done in the literature, we first used the maximized LL obtained for each model on the Optimization dataset (LL_{opt}) to compute the Bayesian information criterion (BIC_{opt}) and Akaike information criterion (AIC_{opt}). We also looked at the percentage of trials of the Optimization dataset where each model accurately predicts monkeys' choice (%_{opt}). We performed likelihood ratio tests to compare nested models (e.g., Model SBnoF and Model SBnoA).
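For reference, the 2 complexity-penalized criteria follow their standard definitions (lower is better for both):

```python
import numpy as np

def bic(log_likelihood, n_params, n_trials):
    """Bayesian information criterion: k*ln(n) - 2*LL."""
    return n_params * np.log(n_trials) - 2.0 * log_likelihood

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2*k - 2*LL."""
    return 2.0 * n_params - 2.0 * log_likelihood
```

At equal fit quality, the model with fewer free meta-parameters obtains the lower (better) score under both criteria.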

To test models' generalization ability and to validate which model is best without complexity issues, we additionally compared models' LL on the Test dataset given the meta-parameters estimated on the Optimization dataset (LL_{test}), as well as models' percentage of trials of the Test dataset where the model accurately predicts monkeys' choice given the meta-parameters estimated on the Optimization dataset (%_{test}).

Finally, because comparing the maximal likelihood each model assigns to the data can result in overfitting, we also computed an estimate of the log of the posterior probability over models on the Optimization dataset (LPP_{opt}), estimated with the meta-parameter sampling method previously performed (Daw 2011). To do so, we hypothesized a uniform prior distribution over models *P*(*M*); we also considered a prior distribution for the meta-parameters given the models *P*(*θ|M*), which was the distribution from which the meta-parameters were drawn during sampling. With this choice of priors and meta-parameter sampling, LPP_{opt} can be written as:

$$LPP_{opt} = \log \left( \frac{1}{N} \sum_{i=1}^{N} e^{LL(D|M,\theta_i)} \right)$$

where *N* is the number of samples drawn for each model. To avoid numerical issues in Matlab when computing the exponential of large numbers, LPP_{opt} was computed in practice as:

$$LPP_{opt} = LL_{max} + \log \left( \frac{1}{N} \sum_{i=1}^{N} e^{LL(D|M,\theta_i) - LL_{max}} \right)$$

where $LL_{max} = \max_i LL(D|M,\theta_i)$.

Estimating the models' posterior probability given the data can be seen as equivalent to computing a "mean likelihood". It also has the advantage of penalizing both models that have a peaked posterior probability distribution (i.e., models whose likelihood is good at its maximum but decreases sharply as soon as the meta-parameters slightly change) and models that have a large number of free meta-parameters (Daw 2011).
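The practical computation, using the log-sum-exp trick to avoid overflowing the exponential, can be sketched as follows (up to the constant term contributed by the uniform model prior):

```python
import numpy as np

def lpp_from_samples(log_likelihoods):
    """log((1/N) * sum_i exp(LL_i)) computed stably: subtract the maximum
    LL before exponentiating, then add it back after the log."""
    ll = np.asarray(log_likelihoods, dtype=float)
    ll_max = ll.max()
    return ll_max + np.log(np.mean(np.exp(ll - ll_max)))
```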

### Neural Data Analyses

#### Activity Variation Between Search and Repetition

To analyze activity variations of individual neurons between the search period and the repetition period, we computed an index of activity variation for each cell:

$$\frac{A - B}{A + B}$$

where *A* is the cell's mean firing rate during the early–delay epoch ([start + 0.1 s; start + 1.1 s]) over all trials of the search period, and *B* is the cell's mean firing rate in the same epoch during all trials of the repetition period.

To measure significant increases or decreases of activity in a given group of neurons, we considered the distribution of the neurons' activity variation indices. An activity variation was considered significant when the distribution had a mean significantly different from 0 (one-sample *t*-test) and a median significantly different from zero (Wilcoxon signed-rank test for zero median). We then employed a Kruskal–Wallis test to compare the distributions of activity during search and repetition, corrected for multiple comparisons between different groups of neurons (Bonferroni correction).
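As an illustrative sketch of this procedure (the $(a-b)/(a+b)$ contrast form of the index is assumed here; SciPy's `wilcoxon` implements the one-sample signed-rank test for zero median):

```python
import numpy as np
from scipy import stats

def activity_variation_index(a, b):
    """Contrast between search (a) and repetition (b) mean firing rates
    (assumed (a - b)/(a + b) form)."""
    return (a - b) / (a + b)

def group_variation_significant(indices, alpha=0.05):
    """Significant when both the mean (one-sample t-test) and the median
    (signed-rank test) of the index distribution differ from zero."""
    t_p = stats.ttest_1samp(indices, 0.0).pvalue
    w_p = stats.wilcoxon(indices).pvalue
    return bool(t_p < alpha and w_p < alpha)
```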

#### Choice Selectivity

To empirically measure variations in choice selectivity of individual neurons, we analyzed neural activities using a specific measure of spatial selectivity (Procyk and Goldman-Rakic 2006). The activity of a neuron was classified as choice selective when this activity was significantly modulated by the identity/location of the target chosen by the animal (one-way ANOVA, *P* < 0.05). The target preference of a neuron was determined by ranking the average activity measured in the early–delay epoch ([start + 0.1 s; start + 1.1 s]) when this activity was significantly modulated by the target choice. We used for each unit the average firing rate ranked by values, herein named "preference" (*a*, *b*, *c*, *d*, where *a* is the preferred and *d* the least preferred target). The ranking was first used for population data and structure comparisons. For each cell, the activity was normalized to the maximum and minimum of activity measured in the repetition period (with normalized activity = [activity − min]/[max − min]).

Second, to study changes in choice selectivity (tuning) throughout trials during the task, we used for each unit the average firing rate ranked by values (*a*, *b*, *c*, *d*). We then calculated the norm of a preference vector using the method of Procyk and Goldman-Rakic (2006), which is equivalent to computing the Euclidean distance within a factor of $2$: we used an arbitrary arrangement of the ranked rates in a square matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ to calculate the vector norm:

For each neuron, the norm was divided by the global mean activity of the neuron, to exclude the effect of firing rate from this measure: without this normalization, a cell A with a higher mean firing rate than a cell B would obtain a higher choice selectivity norm even when the 2 cells are equally choice selective.

The value of the preference vector norm was taken as reflecting the strength of choice coding of the cell. A norm equal to zero would reflect equal activity for the 4 target locations. This objective measure allows the extraction of a single value for each cell, which can be averaged across cells. Finally, to study variations in choice selectivity between search and repetition periods, we computed an index of choice selectivity variation for each cell:

$$\frac{C - D}{C + D}$$

where *C* is the cell's choice selectivity norm during search and *D* is the cell's choice selectivity norm during repetition.

To assess significant variations of choice selectivity between search and repetition in a given group of neurons (e.g., dACC or LPFC), we used a *t*-test to verify whether the mean was different from zero and a Wilcoxon signed-rank test to verify whether the median was different from zero; we then used a Kruskal–Wallis test to compare the distributions of choice selectivity during search and repetition, corrected for multiple comparisons between different groups of neurons (Bonferroni correction).

To assess whether variations of choice selectivity between search and repetition depended on the exploration level *β* measured in the animal's behavior by means of the model, we split sessions into 2 groups: those where *β* was smaller than the median of *β* values (i.e., 5), and those where *β* was larger than this median. Thus, in these analyses, repetition periods of a session with *β* < 5 were considered to reflect relative exploration, and repetition periods of a session with *β* > 5 were considered to reflect relative exploitation. We then performed 2-way ANOVAs (*β* × task phase) and used Tukey's HSD post hoc test to determine the direction of the significant changes in selectivity with changing exploration levels, tested at *P* = 0.05.

#### Model-Based Analysis of Single-Unit Data

To test whether single units encoded information related to model computations, we used the following model variables as regressors of trial-by-trial activity: the reward prediction error [*δ*], the action value [*Q*] associated with each target, and the outcome uncertainty [*U*]. The latter is a performance monitoring measure which assesses the entropy of the probability over the different possible outcomes (i.e., reward $r$ versus no reward $\bar{r}$) at the current trial *t* given the set $T$ of remaining targets: $U(t) = -P(r|T)\log(P(r|T)) - P(\bar{r}|T)\log(P(\bar{r}|T))$. At the beginning of a new problem, when there are 4 possible targets, *U* starts at a low value since there is a 75% chance of making an error. *U* increases trial after trial during the search period. It is maximal when there remain 2 possible targets, because there is then a 50% chance of making an error. *U* then drops after either the first rewarded trial or the third error trial—because the fourth target is necessarily the rewarded one—and remains at zero during the repetition period. We decided to use a regressor with this pattern of change because it is somewhat comparable to the description of changes in frontal activity previously observed during the PS task (Procyk et al. 2000; Procyk and Goldman-Rakic 2006).

We used *U* as the simplest possible parameter-free performance monitoring regressor for neural activity. This was done in order to test whether dACC and LPFC single-unit activity could reflect performance monitoring processes in addition to responding to feedback and tracking target values. We note, however, that the profile of *U* in this task would not differ from other performance monitoring measures such as the outcome history that we previously used in our computational model for dynamic control regulation in this task (Khamassi et al. 2011), or the vigilance level in the model of Dehaene et al. (1998), which uses error and correct signals to update a regulatory variable (increased after errors and decreased after correct trials). We come back to possible interpretations of neural correlates of *U* in the Discussion.
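Since the reward schedule is deterministic, *U* depends only on the number of remaining candidate targets; a minimal sketch:

```python
import math

def outcome_uncertainty(n_remaining):
    """Entropy (in nats) of the reward/no-reward outcome when one of
    n_remaining untested targets is correct, i.e., P(reward) = 1/n."""
    p = 1.0 / n_remaining
    if p == 1.0:
        return 0.0  # outcome certain: only one candidate target remains
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)
```

This reproduces the profile described above: *U* rises from 4 remaining targets to its maximum at 2, and falls to 0 once the outcome is certain.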

To investigate how neural activity was influenced by the action values [*Q*], the reward prediction errors [*δ*], and the outcome uncertainty [*U*], we performed a multiple regression analysis combined with a bootstrapping procedure, focusing our analyses on spike rates during a set of trial epochs (Fig. 1*C*): prestart (0.5 s before trial start); poststart (0.5 s after trial start); pretarget (0.5 s before target onset); post-target (0.5 s after target onset); the action epoch defined as pretouch (0.5 s before screen touch); prefeedback (0.5 s before feedback onset); early-feedback (0.5 s after feedback onset); late-feedback (1.0 s after the feedback period); intertrial interval (ITI; 1.5 s after feedback onset).

The spike rate *y*(*t*) during each of these intervals in trial *t* was analyzed using the following multiple linear regression model:

$$y(t) = \rho_0 + \sum_{k=1}^{4}\rho_k\,Q_k(t) + \rho_5\,\delta(t) + \rho_6\,U(t) + \varepsilon(t)$$

where $Q_k(t), (k\in\{1\ldots4\})$ are the action values associated with the 4 possible targets at time *t*, *δ*(*t*) is the reward prediction error, *U*(*t*) is the outcome uncertainty, and $\rho_i, (i\in\{1\ldots n\})$ are the regression coefficients.
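A minimal sketch of such an epoch-wise regression, assuming a design matrix with an intercept followed by the 4 *Q*-values, *δ*, and *U* (the exact column layout is an assumption, and the function name is ours):

```python
import numpy as np

def fit_epoch_regression(y, Q, delta, U):
    """Regress trial-by-trial spike rate y (n_trials,) on the 4 action
    values Q (n_trials, 4), the reward prediction error delta (n_trials,)
    and the outcome uncertainty U (n_trials,). Ordinary least squares;
    the intercept-first column layout is an assumption."""
    X = np.column_stack([np.ones_like(y), Q, delta, U])
    rho, *_ = np.linalg.lstsq(X, y, rcond=None)
    return rho  # rho[0]: intercept, rho[1:5]: Q-values, rho[5]: delta, rho[6]: U
```

In practice one such fit would be run per neuron and per trial epoch, with the significance of each coefficient assessed by the permutation procedure described below rather than by parametric statistics.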

*δ*, *Q*, and *U* were all updated once in each trial. *δ* was updated at the time of feedback, so that regression analyses during prefeedback epochs were done using *δ* from the previous trial, while analyses during postfeedback epochs used the updated *δ*. *Q* and *U* were updated at the end of the trial so that regression analyses in all trial epochs were done using the *Q*-values and *U* value of the current trial.

Note that the action value functions of successive trials are correlated, because they are updated iteratively, and this violates the independence assumption in the regression model. Therefore, the statistical significance for the regression coefficients in this model was determined by a permutation test. For this, we performed a shuffled permutation of the trials and recalculated the regression coefficients for the same regression model, using the same meta-parameters of the model obtained for the unshuffled trials. This shuffling procedure was repeated 1000 times (bootstrapping method), and the *P* value for a given independent variable was determined by the fraction of the shuffles in which the magnitude of the regression coefficient from the shuffled trials exceeded that of the original regression coefficient (Seo and Lee 2009), corrected for multiple comparisons with different model variables in different trial epochs (Bonferroni correction).
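The permutation procedure can be sketched as follows; this is a simplified illustration (variable names are ours, and the regression is reduced to ordinary least squares on a generic design matrix):

```python
import numpy as np

def permutation_pvalue(y, X, n_shuffles=1000, seed=0):
    """Permutation test for regression coefficients: shuffle the trial
    order of the activity vector y, refit the same regression on the
    design matrix X, and count how often the shuffled |coefficient|
    matches or exceeds the original one (cf. Seo and Lee 2009).
    Implementation details beyond the text are assumptions."""
    rng = np.random.default_rng(seed)
    rho0, *_ = np.linalg.lstsq(X, y, rcond=None)
    exceed = np.zeros_like(rho0)
    for _ in range(n_shuffles):
        y_shuffled = rng.permutation(y)
        rho, *_ = np.linalg.lstsq(X, y_shuffled, rcond=None)
        exceed += (np.abs(rho) >= np.abs(rho0))
    return exceed / n_shuffles  # one P value per regressor
```

The resulting *P* values would then be Bonferroni-corrected across model variables and trial epochs, as described above.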

To assess the quality of encoding of action value information by dACC and LPFC neurons, we also performed a multiple regression analysis on the activity of each neuron related to *Q*-values after excluding trials where the preferred target of the neuron was chosen by the monkey. This analysis was performed to test whether the activity of such neurons still encodes *Q*-values outside trials where the target is selected. Similarly, to evaluate the quality of reward prediction error encoding, we performed separate multiple regression analyses on correct trials only versus error trials only. This analysis was performed to test whether the activity of such neurons quantitatively discriminates between different amplitudes of positive reward prediction errors and between different amplitudes of negative reward prediction errors. In both cases, the significance level of the multiple regression analyses was determined with a bootstrap method and a Bonferroni correction for multiple comparisons.

Finally, to measure possible collinearity issues between model variables used as regressors of neural activity, we used Brian Lau's Collinearity Diagnostics Toolbox for Matlab (http://www.subcortex.net/research/code/collinearity-diagnostics-matlab-code; Lau 2014; date last accessed 20 May 2014). We extracted the variance inflation factors computed with the coefficient of determination obtained when each regressor was expressed as a function of the other regressors. We also computed the condition indexes (CONDIND) and variance decomposition factors (VDF) obtained in the same analysis. A strong collinearity between regressors was diagnosed when CONDIND ≥ 30 and more than 2 VDFs > 0.5. A moderate collinearity was diagnosed when CONDIND ≥ 10 and more than 2 VDFs > 0.5. CONDIND ≤ 10 indicated a weak collinearity.
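Two of these diagnostics can be sketched in a few lines; this is not the toolbox itself but an illustration of variance inflation factors and condition indexes (function name and column-scaling choice are assumptions):

```python
import numpy as np

def collinearity_diagnostics(X):
    """Variance inflation factors (VIF) and condition indexes for a
    design matrix X (n_trials, n_regressors). VIF_j = 1 / (1 - R2_j),
    where R2_j is obtained by regressing column j on the others.
    Condition indexes are ratios of the largest singular value of the
    column-scaled matrix to each singular value."""
    n, p = X.shape
    vif = []
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        vif.append(1.0 / (1.0 - r2))
    Xs = X / np.linalg.norm(X, axis=0)  # scale each column to unit norm
    s = np.linalg.svd(Xs, compute_uv=False)
    condind = s[0] / s
    return np.array(vif), condind
```

Near-duplicate regressors inflate both quantities sharply, which is what the CONDIND thresholds above are designed to flag.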

#### Principal Component Analysis

To determine the degree to which single-unit activity segregated or integrated information about model variables, we performed a principal component analysis (PCA) on the 3 correlation coefficients $\rho_i, (i\in\{4\ldots6\})$ obtained with the multiple regression analysis and relating neural activity with the 3 main model variables (reward prediction error *δ*, outcome uncertainty *U*, and the action value *Q _{k}* associated with the animal's preferred target *k*). For each trial epoch, we pooled the coefficients obtained for all neurons in correlation with these model variables. Each principal component being expressed as a linear combination of the vector of correlation coefficients of neuron activities with these 3 model variables, the contribution of different model variables to each component indicates the extent to which cell activity is explained by an integrated contribution of multiple model variables. For instance, if a PCA on cell activity in the early–delay period produces 3 principal components that each depend on a single, different model variable (e.g., PC1 = 0.95 *Q* + 0.01 *δ* + 0.04 *U*; PC2 = 0.1 *Q* + 0.8 *δ* + 0.1 *U*; PC3 = 0.05 *Q* + 0.05 *δ* + 0.9 *U*), then activity variations are best explained by separate influences from the information conveyed by the model variables. If in contrast the PCA produces principal components which strongly depend on multiple variables (e.g., PC1 = 0.5 *Q* + 0.49 *δ* + 0.01 *U*; PC2 = 0.4 *Q* + 0.1 *δ* + 0.5 *U*; PC3 = 0.2 *Q* + 0.4 *δ* + 0.4 *U*), then variations of the activities are best explained by an integrated influence of such information (see Supplementary Fig. 1 for illustration of different principal components resulting from artificially generated data showing different levels of integration between model variables).
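The analysis can be sketched as a PCA over the neurons × variables coefficient matrix; the centering convention and function name are assumptions not specified in the text:

```python
import numpy as np

def pca_on_coefficients(C):
    """PCA on a (n_neurons, 3) matrix of regression coefficients, with
    columns [Q_preferred, delta, U]. Returns the principal axes (one per
    row, i.e., the loadings of each model variable on each component)
    and the fraction of variance each component explains. Centering
    each column before the SVD is an assumed convention."""
    Cc = C - C.mean(axis=0)
    _, s, Vt = np.linalg.svd(Cc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Vt, explained  # Vt[i] = loadings of PC i+1 on [Q, delta, U]
```

Axis-aligned loadings (one dominant variable per component) would indicate segregated coding, whereas mixed loadings would indicate integrated coding, as in the examples above.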

We compared the normalized absolute values of the coefficients of the 3 principal components so that a coefficient close to 1 denotes a strong correlation while a coefficient close to 0 denotes no correlation. To quantify the integration of information about different model variables in single-unit activities, for each neuron *k*, we computed an entropy-like index (ELI) of sharpness of encoding of different model variables based on the distributions of regression coefficients between cell activities and model variables:

$$\mathrm{ELI}_k = -\sum_{i} c_i \log(c_i)$$

where *c _{i}* is the absolute value of the *z*-scored correlation strength *ρ* with model variable *i*, normalized so that $\sum_i c_i = 1$. A neuron with activity correlated with different model variables with similar strengths will have a high ELI; a neuron with activity highly correlated with only one model variable will have a low ELI. We compared the distributions of ELIs between dACC and LPFC in each trial epoch using a Kruskal–Wallis test.
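A minimal sketch of such an entropy-like index, under the assumption (ours) that the absolute strengths are normalized to sum to 1 before taking the entropy, which yields exactly the qualitative behavior described:

```python
import numpy as np

def entropy_like_index(c):
    """Entropy-like index (ELI) of encoding sharpness across model
    variables. c: absolute (z-scored) correlation strengths with
    [Q, delta, U]. Normalization to a probability distribution is an
    assumption; the source describes the index only qualitatively."""
    c = np.asarray(c, dtype=float)
    p = c / c.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())
```

The index is maximal (log 3 for 3 variables) when all strengths are equal and zero when a single variable accounts for all the correlation.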

Finally, we estimated the contribution of each model variable to neural activity variance in each epoch and compared it between dACC and LPFC. To do so, we first normalized the coefficients for each principal component in each epoch. These coefficients being associated with 3 model variables *Q*, *δ*, and *U*, this provided us with a contribution of each model variable to each principal component in each epoch. We then multiplied them by the contribution of each principal component to the global variance in neural activity in each epoch. The result constituted a normalized contribution of each model variable to neural activity variance in each epoch. We finally computed the ELI of these contributions. We compared the set of epoch-specific ELI between dACC and LPFC with a Kruskal–Wallis test.

#### Mutual Information

We measured the mutual information between monkey's choice at each trial and the firing rate of each individual recorded neuron during the early–delay epoch ([ST + 0.1 s; ST + 1.1 s]). The mutual information $I(S;R)$ was estimated by first computing a confusion matrix (Quian Quiroga and Panzeri 2009), relating at each trial *t*, the spike count from the unit activity in the early–delay epoch (as "predicting response" *R*) and the target chosen by the monkey (i.e., 4 targets as "predicted stimulus" *S*). Since neuronal activity was recorded during a finite number of trials, not all possible response outcomes of each neuron to each stimulus (target) have been sufficiently sampled. This is called the "limited sampling bias" which can be overcome by subtracting a correction term from the plug-in estimator of the mutual information (Panzeri et al. 2007). Thus we subtracted the Panzeri–Treves (PT) correction term (Treves and Panzeri 1995) from the estimated mutual information $I(S;R)$:

$$I_{\mathrm{corrected}} = I(S;R) - \frac{1}{2N\ln 2}\left[\sum_{s}(\bar{R}_s - 1) - (\bar{R} - 1)\right] \quad (15)$$

where *N* is the number of trials during which the unit activity was recorded, $\bar{R}$ is the number of relevant bins among the *M* possible values taken by the vector of spike counts and computed by the "bayescount" routine provided by Panzeri and Treves (1996), and $\bar{R}_s$ is the number of relevant responses to stimulus (target) *s*.
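The plug-in estimate and the PT correction can be sketched from a count matrix; here the "relevant bin" counts are approximated by the number of occupied bins (an assumption — the source uses the Bayesian "bayescount" estimate instead), and the function name is ours:

```python
import numpy as np

def mutual_information_pt(counts):
    """Plug-in mutual information (bits) between stimulus (rows) and
    binned response (columns) from a count/confusion matrix, minus a
    Panzeri-Treves-style bias term. Relevant-bin counts are taken as
    the number of nonzero bins, a simplification of 'bayescount'."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    pxy = counts / N
    px = pxy.sum(axis=1, keepdims=True)  # P(stimulus)
    py = pxy.sum(axis=0, keepdims=True)  # P(response)
    nz = pxy > 0
    I = np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))
    R_bar = np.count_nonzero(counts.sum(axis=0))   # occupied bins overall
    Rs_bar = np.count_nonzero(counts, axis=1)      # occupied bins per stimulus
    bias = (np.sum(Rs_bar - 1) - (R_bar - 1)) / (2 * N * np.log(2))
    return I - bias
```

A perfectly choice-selective response yields close to 2 bits for 4 equiprobable targets, while a choice-independent response yields close to 0 bits after correction.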

Since such a measurement of information is reliable only if the activity was recorded during a sufficient number of trials per stimulus presentation, we restricted this analysis to units that verified the following condition (Panzeri et al. 2007):

where $N_S$ is the minimum number of trials per stimulus (target).

Finally, to verify that such a condition was sufficiently restrictive to exclude artifactual effects, for each considered neuron we constructed 1000 pseudo response arrays by shuffling the order of trials at fixed target stimulus, and we recomputed each time the mutual information in the same manner (Panzeri et al. 2007). Then we verified that the average mutual information obtained with such shuffling procedure was close to the PT bias correction term computed with equation (15) (Panzeri and Treves 1996).

## Results

Previous studies have emphasized the role of LPFC in cognitive control and dACC in adjustment of action values based on measures of performance such as reward prediction errors, error-likelihood and outcome history. In addition, variations of activities in the 2 regions between exploration and exploitation suggest that both contribute to the regulation of the control level during exploration. Altogether neurophysiological data suggest particular relationships between dACC and LPFC, but their respective contribution during adaptation remains unclear and a computational approach to this issue appears highly relevant. We recently modeled such relationships using the meta-learning framework (Khamassi et al. 2011). The network model was simulated in the PS task (Quilodran et al. 2008) where monkeys have to search for the rewarded target in a set of 4 on a touch-screen, and have to repeat this rewarded choice for at least 3 trials before starting a new search period (Fig. 1*A*). In these simulations, variations of the model's control meta-parameter (i.e., inverse temperature *β*) produced variations of choice selectivity in simulated LPFC in the following manner: a decrease of choice selectivity (exploration) during search; an increase of choice selectivity (exploitation) during repetition. This resulted in a globally higher mean choice selectivity in simulated LPFC compared with simulated dACC, and in a covariation between choice selectivity and the inverse temperature in simulated LPFC but not in simulated dACC (Khamassi et al. 2011). This illustrates a prediction of computational models on the role of prefrontal cortex in exploration (McClure et al. 2006; Cohen et al. 2007; Krichmar 2008) which has not yet been tested experimentally.

### Characteristics of Behaviors

To assess the plausibility of such computational principles we first analyzed the animals' behavior in the PS task. During recordings, monkeys performed nearly optimal searches, that is, they rarely repeated incorrect trials (INC) and on average made errors in <5% of repetition trials. Although the animals' strategy for determining the correct target during search periods was highly efficient, the pattern of successive choices was not systematic. Analyses of series of choices during search periods revealed that monkeys used either clockwise (e.g., choosing target 1 then 2), counterclockwise, or crossing (going from one target to the opposite target in the display, e.g., from 1 to 3) strategies, with a slightly higher incidence of clockwise and counterclockwise strategies over crossing, and of clockwise over counterclockwise (percentages of clockwise, counterclockwise, crossing, and repeat transitions were 38%, 36%, 25%, 1% and 39%, 33%, 26%, 2% for each monkey respectively, measured over 9716 and 4986 transitions between 2 targets during search periods of 6986 and 3227 problems, respectively). Rather than being systematic or random, monkeys' search behavior appeared to be governed by more complex factors: shifting away from the previously rewarded target in response to the SC at the beginning of most new problems (Fig. 2*A*, middle); spatial biases, that is, more frequent selection of preferred targets in the first trial of search periods (Fig. 2*A*, right); and efficient adaptation to each choice error as argued above. This indicates a planned and controlled exploratory behavior during search periods. This is also reflected in an incremental change in reaction times during the search period, with gradual decreases after each error (Fig. 2*B*).
Moreover, reaction times shifted from search to repetition period after the first reward (CO1), suggesting a shift between 2 distinct behavioral modes or 2 levels of control (Monkey M: Wilcoxon Mann–Whitney *U* test, *P* < 0.001; Monkey P: *P* < 0.001; Fig. 2*B*).
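The transition labels above can be illustrated with a small classifier, assuming (our convention, not stated explicitly in the source) that the 4 targets are numbered so that consecutive numbers are clockwise neighbors and opposite numbers face each other:

```python
def classify_transition(prev: int, nxt: int) -> str:
    """Classify a search-period transition between targets numbered 1-4.
    Assumes consecutive numbers are clockwise neighbors on the display
    and opposite numbers face each other (illustrative convention)."""
    step = (nxt - prev) % 4
    return {0: "repeat", 1: "clockwise", 2: "crossing", 3: "counterclockwise"}[step]
```

Under this convention the examples in the text come out as expected: target 1 then 2 is clockwise, and target 1 then 3 is crossing.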

#### Model-Based Analyses

Behavioral analyses revealed that monkeys used nearly optimal strategies to solve the task, including shifts at problem changes, which are unlikely to be produced by simple reinforcement learning. In order to identify the different elements that took part in the monkeys' decisions and adaptation during the task, we compared the fit scores of several distinct models to trial-by-trial choices after estimating each model's free meta-parameters that maximize the LL separately for each monkey (see Materials and methods). We found that models performing either a random search or a clockwise search and then simply repeating the correct target could not properly reproduce monkeys' behavior during the task, even when the clockwise search systematically started at the monkey's preferred target according to its spatial biases (Models $RandS$ and $ClockS$; Table 1 and Fig. 2*D*). Moreover, the fact that monkeys most often shifted their choice at the beginning of each new problem in response to the SC (Fig. 2*A*, middle) prevented a simple reinforcement learning model (Q-learning) or even a generalized reinforcement learning model from reproducing the monkeys' behavior (respectively QL and GQL in Table 1). Indeed, these models have a strong tendency to choose the previously rewarded target without taking into account the SC signaling a new problem. Behavior was better reproduced with a combination of generalized reinforcement learning and a reset of target values at each new problem (shifting away from the previously rewarded target and taking into account the animal's spatial biases measured during the previous session; that is, Models GQLSB, GQLSB2β, SBnoA, SBnoA2β in Fig. 2*D* and Table 1). We tested control models without spatial biases, without problem shift, and with neither of them, to show that both were required to fit behavior (respectively GQLSnoB, GQLBnoS, and GQLnoSnoB in Table 1).
We also tested a model with spatial biases and shift but without progressive updating of target values nor forgetting—that is, $\alpha =1,\kappa =1$ (Model SBnoF, which is a restricted and nested version of Model SBnoA with 1 less meta-parameter) and found that it was not as good as SBnoA in fitting monkeys' behavior, as found with a likelihood ratio test at *P* = 0.05 with one degree of freedom.

| Models | r^{a} | RL^{b} | N_{P}^{c} | Opt −LL^{d} | Opt NL^{e} | Opt %^{f} | Opt −LPP^{g} | Opt BIC/2^{h} | Opt AIC/2^{i} | Test −LL^{d} | Test NL^{e} | Test %^{f} |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GQLSB2β | Y | Y | 4 | 3290 | 0.5921 | 83.47 | 3459 | 3360 | 3298 | 29732 | 0.5830 | 74.17 |
| SBnoA2β | Y | N | 3 | 3385 | 0.5831 | 84.13 | 3422 | 3438 | 3391 | 30901 | 0.5708 | 73.11 |
| GQLSB | Y | Y | 3 | 3355 | 0.5859 | 83.80 | 3502 | 3408 | 3361 | 29539 | 0.5850 | 73.43 |
| SBnoA | Y | N | 2 | 3454 | 0.5768 | 84.29 | 3480 | 3489 | 3458 | 30613 | 0.5738 | 72.59 |
| SBnoF | Y | N | 1 | 3586 | 0.5648 | 84.43 | 3604 | 3604 | 3588 | 32169 | 0.5578 | 71.61 |
| GQLBnoS | Y | Y | 3 | 3721 | 0.5528 | 78.59 | 3847 | 3773 | 3727 | 33274 | 0.5467 | 69.47 |
| GQLSnoB | Y | Y | 3 | 3712 | 0.5536 | 76.66 | 3843 | 3764 | 3718 | 31501 | 0.5646 | 70.12 |
| GQLnoSnoB | Y | Y | 3 | 4253 | 0.5079 | 69.14 | 4292 | 4305 | 4259 | 35376 | 0.5262 | 66.60 |
| GQL | N | Y | 3 | 5590 | 0.4104 | 65.10 | 5994 | 5643 | 5596 | 49282 | 0.4089 | 53.20 |
| QL | N | Y | 2 | 5960 | 0.3869 | 44.92 | 7755 | 5995 | 5964 | 59734 | 0.3382 | 48.78 |
| ClockS | Y | N | 2 | 5249 | 0.4333 | 70.92 | 5841 | 5284 | 5253 | 47504 | 0.4223 | 58.71 |
| RandS | Y | N | 1 | 4607 | 0.4800 | 69.43 | 4621 | 4624 | 4609 | 39488 | 0.4884 | 63.73 |


^{a}Resetting action values at the beginning of each new problem (yes or no).

^{b}Reinforcement learning (RL) mechanisms or not.

^{c}Number of free meta-parameters.

^{d}Negative Log Likelihood.

^{e}Normalized likelihood over all trials.

^{f}Percentage of trials where the model correctly predicted monkey choice.

^{g}Negative log of the posterior probability.

^{h}Bayesian information criterion.

^{i}Akaike information criterion.

Although Models GQLSB, GQLSB2β, SBnoA, SBnoA2β were significantly better than other tested models along all used criteria (maximum likelihood [Opt-LL], BIC score, AIC score, log of posterior probability [LPP], out-of-sample test [Test-LL] in Table 1), these 4 versions gave similar fit performance. In addition, the best model was not the same depending on the considered criterion: Model GQLSB2β was the best according to LL, BIC, and AIC scores, and second best according to LPP and Test-LL scores; Model SBnoA2β was the best according to LPP score; Model GQLSB was the best according to Test-LL score.

As a consequence, the present dataset does not allow us to decide whether a free meta-parameter *α* (i.e., learning rate) in Models GQLSB and GQLSB2β is necessary in this task, compared with versions of these models where *α* is fixed to 1 (Models SBnoA and SBnoA2β) (Fig. 2*D* and Table 1). This is due to the structure of the task—where a single correct trial is sufficient to know which is the correct target—which may be solved by sharp updates of working memory rather than by progressive reinforcement learning (although a small subset of the sessions were better fitted with $\alpha \in [0.3; 0.9]$ in Model GQLSB, thus revealing a continuum in the range of possible *α*s, see Supplementary Fig. 2). We come back to this issue in the Discussion.

Similarly, models that use distinct control levels during search and repetition (Models GQLSB2β and SBnoA2β) could not be distinguished from models using a single *β* meta-parameter (Models GQLSB and SBnoA), in particular on the basis of out-of-sample test scores (Table 1).

Nevertheless, model-based analyses of behavior in the PS task suggest complex adaptations possibly combining rapid updating mechanisms (i.e., *α* close to 1), forgetting mechanisms, and the use of information about the task structure (SC; first correct feedback signaling the beginning of repetition periods). Model GQLSB2β combines these different mechanisms in the most complete manner and moreover won the comparison against the other models according to 3 criteria out of 5. Consequently, in the following we will use Model GQLSB2β for model-based analyses of neurophysiological data and will systematically compare the results with analyses performed with Models GQLSB, SBnoA, and SBnoA2β to verify that they yield similar results.

In summary, the best fit was obtained with Models SBnoA, SBnoA2β, GQLSB, and GQLSB2β, which could predict over 80% of the choices made by the animal (Table 1). Figure 2*A* shows a sample of trials where Model SBnoA reproduces most monkey choices, and illustrates the sharper update of action values in Model SBnoA (with *α* = 1) compared with Model GQLSB (where the optimized *α* = 0.7). When freely simulated on 1000 problems of the PS task (that is, when the models learned from their own decisions rather than trying to fit the monkeys' decisions), the models made 38.23% clockwise search trials, 32.41% counterclockwise, 29.22% crossing and 0.15% repeat. Simulations of the same models without spatial biases produced, unlike monkeys, smaller differences between the percentages of clockwise, counterclockwise and crossing trials: 33.98% clockwise, 32.42% counterclockwise, 33.53% crossing and 0.07% repeat.

#### Distinct Control Levels Between Search and Repetition

To test whether behavioral adaptation could be described by a dynamic regulation of the *β* meta-parameter (i.e., inverse temperature) between search and repetition, we analyzed the values of the 2 distinct optimized free meta-parameters (*β*_{S} and *β*_{R}) in Models GQLSB2β and SBnoA2β (see Fig. 2*E*,*C* and Supplementary Fig. 2). The values of the optimized *β*_{S} and *β*_{R} meta-parameters obtained for a given monkey in a given session constituted a quantitative measure of the control level during that session. This level was non-linearly linked to the number of errors the animal made. For instance, a *β*_{R} of 3, 5, or 10 corresponded to ∼20%, 5%, and 0% errors, respectively, made by the animal during repetition periods (Fig. 2*C*).
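The role of the inverse temperature can be illustrated with a standard softmax over action values; the *Q* values below are illustrative, not the fitted ones, so the resulting error rates only qualitatively echo the numbers above:

```python
import numpy as np

def softmax_choice_probs(Q, beta):
    """Softmax over action values with inverse temperature beta: larger
    beta sharpens the contrast between competing action probabilities
    (exploitation), smaller beta flattens it (exploration)."""
    z = beta * (np.asarray(Q, dtype=float) - np.max(Q))  # subtract max for stability
    e = np.exp(z)
    return e / e.sum()

# During repetition the correct target has the highest value, so the
# error rate 1 - P(correct) falls monotonically as beta_R grows.
```

This is the sense in which a large fitted *β*_{R} corresponds to few repetition errors, and *β* = 0 to chance-level (25%) choice among 4 targets.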

Interestingly, the distributions of *β*_{S} and *β*_{R} obtained for each recording session showed dissociations between search and repetition periods in a large number of sessions. We found a unimodal distribution for the *β* meta-parameter during the search period (*β*_{S}), reflecting a consistent level of control in the animal's behavior from session to session. In contrast, we observed a bimodal distribution for the *β* meta-parameter during the repetition period (*β*_{R}; Fig. 2*E*). In Figure 2*E*, the peak on the right of the distribution (large *β*_{R}) corresponds to a subgroup of sessions where behavior shifted between different control levels from search to repetition periods. This shift in the level of control could be interpreted as a shift from exploratory to exploitative behavior, an attentional shift, or a change in working memory load, as we discuss further in the Discussion. Nevertheless, this is consistent with the hypothesis of a dynamic regulation of the inverse temperature *β* between search and repetition periods in this task (Khamassi et al. 2011, 2013). The bimodal distribution for *β*_{R} reflects the fact that during another subgroup of sessions (small *β*_{R}), the animal's behavior did not shift to a different control level during repetition and thus produced more errors. This bimodal distribution of the *β* meta-parameter made it possible to separate sessions into 2 groups and to compare dACC and LPFC activities (see below) during sessions where decisions displayed a shift and during sessions where no such clear shift occurred. Interestingly, the bimodal distribution of *β*_{R} does not crucially depend on the optimized learning rate *α*, since a similar bimodal distribution was obtained with Model SBnoA2β and since the optimized *β*_{S} and *β*_{R} values in the 2 models were highly correlated (*N* = 277; *β*_{S}: *r* = 0.9, *P* < 0.001; *β*_{R}: *r* = 0.96, *P* < 0.001; see Supplementary Fig. 2).

### Modulation of Information Coding

To evaluate whether a behavioral change between search and repetition was accompanied by changes in LPFC activity and choice selectivity, we analyzed a pool of 232 LPFC single units (see Fig. 1*B* for the anatomy) in animals performing the PS task, and compared the results with 579 dACC single-unit recordings which had been only partially used for investigating feedback-related activity (Quilodran et al. 2008). We report here a new study relying on comparative analyses of dACC and LPFC responses, the analysis of activities before the feedback (especially during the delay period), and the model-based analysis of these neurophysiological data. The results are summarized in Supplementary Table 1.

#### Average Activity Variations Between Search and Repetition

Previous studies revealed differential prefrontal fMRI activations between exploitation trials (where subjects chose the option with maximal value) and exploration trials (where subjects chose a nonoptimal option) (Daw et al. 2006). Here a global decrease in average activity level was also observed in the monkey LPFC from search to repetition. For early–delay activity, the average index of variation between search and repetition in LPFC was negative (mean: −0.05) and significantly different from zero (*t*-test on the mean, *P* < 0.001; Wilcoxon Mann–Whitney *U* test on the median, *P* < 0.001). The average index of activity variation in dACC was not different from zero (mean: −0.008; *t*-test, *P* > 0.35; Wilcoxon Mann–Whitney *U* test, *P* > 0.25). However, close observation revealed that the nonsignificant average activity variation in dACC was due to the existence of equivalent proportions of dACC cells showing activity increases or decreases from search to repetition, leading to a null average index of variation (Fig. 3*A*,*B*; 17% vs. 20% of cells, respectively). In contrast, more LPFC single units showed a decreased activity from search to repetition (18%) than an increase (8%), thus explaining the apparent global decrease of average LPFC activity during repetition. The difference in proportions between dACC and LPFC is significant (Pearson *χ*^{2} test, 2 df, *t* = 13.0, *P* < 0.01) and was also found when separating data for the 2 monkeys (see Supplementary Fig. 3). These changes in neural populations thus suggest that global nonlinear dynamical changes occur in dACC and LPFC between search and repetition, rather than a simple reduction or complete cessation of involvement during repetition.

#### Modulations of Choice Selectivity Between Search and Repetition

As shown in Figure 3*A*, a higher proportion of neurons showed a significant choice selectivity in LPFC (155/230, 67%) than in dACC (286/575, 50%; Pearson *χ*^{2} test, 1 df, *t* = 20.7, *P* < 0.001)—as measured by the vector norm in equation (10). Interestingly, the population average choice selectivity was higher in LPFC (0.80) than in dACC (0.70; Kruskal–Wallis test, *P* < 0.001; see Fig. 3*C*). When pooling all sessions together, this resulted in a significant increase in average choice selectivity in LPFC from search to repetition (mean variation: 0.04; Wilcoxon Mann–Whitney *U* test *P* < 0.01; *t*-test *P* < 0.01; Fig. 3*C*).

Strikingly, the significant increase in LPFC early–delay choice selectivity from search to repetition was found only during sessions where the model fit dissociated control levels in search and repetition (i.e., sessions with large *β*_{R} [*β*_{R} >5]; Kruskal–Wallis test, 1 df, *χ*^{2} = 6.45, *P* = 0.01; post hoc test with Bonferroni correction indicated that repetition >search). Such an effect was not found during sessions where the model reproducing the behavior remained at the same control level during repetition (i.e., sessions with small *β*_{R} [*β*_{R} < 5]; Kruskal–Wallis test, *P* >0.98) (Fig. 4, bottom).

Interestingly, choice selectivity in LPFC was significantly higher during repetition for sessions where *β*_{R} was large (mean choice selectivity = 0.91) than for sessions where *β*_{R} was small (mean choice selectivity = 0.73; Kruskal–Wallis test, 1 df, *χ*^{2} = 12.5, *P* < 0.001; post hoc test with Bonferroni correction; Fig. 4, bottom). Thus, LPFC early–delay choice selectivity clearly covaried with the level of control measured in the animal's behavior by means of the model.

There was also an increase in dACC early–delay choice selectivity between search and repetition consistent with variations of *β*, but only during sessions where the model capturing the animal's behavior made a strong shift in the control level (*β*_{R} > 5; mean variation = 0.035, Kruskal–Wallis test, 1 df, *χ*^{2} = 5.22, *P* < 0.05; post hoc test with Bonferroni correction indicated that repetition > search; Fig. 4, top). However, overall, dACC choice selectivity did not follow variations of the control level. Two-way ANOVAs either for (*β*_{S} × task phase) or for (*β*_{R} × task phase) revealed no main effect of *β* (*P* > 0.2), an effect of task period (*P* < 0.01), but no interaction (*P* > 0.5). Moreover, there was no significant difference in dACC choice selectivity during repetition between sessions with a large *β*_{R} (mean choice selectivity = 0.69) and sessions with a low one (mean choice selectivity = 0.75; Kruskal–Wallis test, 1 df, *χ*^{2} = 3.11, *P* > 0.05).

At the population level, increases in early–delay mean choice selectivity from search to repetition were due both to an increase of single unit selectivity, and to the emergence in repetition of selective units that were not significantly so in search (Fig. 3*A*). Importantly, the proportion of LPFC early–delay choice selective neurons during repetition periods of sessions where *β*_{R} was small (55%) was significantly smaller than the proportion of such LPFC neurons during sessions where *β*_{R} was large (72%; Pearson *χ*^{2} test, 1 df, *t* = 7.19, *P* < 0.01). In contrast, there was no difference in proportion of dACC early–delay choice selective neurons during repetition between sessions where *β*_{R} was small (38%) and sessions where *β*_{R} was large (35%; Pearson *χ*^{2} test, 1 df, *t* = 0.39, *P* >0.5; Fig. 4*B*). These analyses thus show a significant difference between dACC and LPFC neural activity properties. LPFC mean choice selectivity as well as LPFC proportion of choice selective cells varied between search and repetition in accordance with the control level measured in the behavior by means of the computational model, while such effect was much weaker in dACC. These results are robust since they could also be obtained with Model SBnoA2β (see Supplementary Fig. 4A). Data separated for the 2 monkeys also reflected the contrast between the 2 structures (see Supplementary Fig. 4B).

#### Mutual Information Between Neural Activity and Target Choice

Generally, computational models of the dACC–LPFC system make the assumption that LPFC is central for the decision output. LPFC activity should thus be more tightly related to the animal's choice than dACC activity. Here, in 63 LPFC neurons recorded during a sufficient number of presentations of each target choice (see Materials and methods), the average mutual information—corrected for sampling bias—was more than twice as high (*I*_{LPFC} = 0.10 bit) as in 85 dACC cells (*I*_{ACC} = 0.04 bit; Kruskal–Wallis test, *P* < 0.001) (Fig. 3*D*). This effect appeared to be the result of the activity of a small subset of LPFC neurons—in both monkeys (see Supplementary Fig. 3D)—with a high mutual information with choice. To verify that the applied restriction on the number of sampling trials was accurate, we constructed 1000 shuffled pseudo response arrays for each single unit and measured the average mutual information obtained with this shuffling procedure. For the 63 LPFC and 85 dACC selected neurons, the difference between the averaged shuffled information and the bias correction term was very small (mean = 0.01 bit), while it was high in non-selected neurons (mean = 0.08 bit). Thus the difference in estimated information between dACC and LPFC was not due to a limited sampling bias in the restricted number of analyzed neurons. We can conclude that, in agreement with computational models of the dACC–LPFC system, neural recordings show a stronger link between LPFC activity and choice than between dACC activity and choice.

### Neural Activity Correlated with Model Variables

Following model-based analyses of behavior, we tested whether single unit activity in LPFC and dACC differentially reflected information similar to variables in Model GQLSB2β, by using the time series of these variables as regressors in a general linear model of single-unit activity (multiple regression analysis with a bootstrapping control—see Materials and methods) (Fig. 6). In dACC and LPFC, respectively, 397/579 (68.6%) cells and 145/232 (62.5%) cells showed a correlation with at least one of the model's variables in at least one of the behavioral epochs: prestart, delay, pretarget, post-target, pretouch, prefeedback, early-feedback, late-feedback, and inter-trial interval (ITI). More precisely, we found a larger proportion of cells in LPFC than in dACC correlated with at least one model variable in the post-target epoch (Fig. 6*E*; Pearson *χ*^{2} test, *T* = 3.89, *P* < 0.05), and a larger proportion of cells in dACC than in LPFC correlated with at least one model variable in the early-feedback epoch (Pearson *χ*^{2} test, *T* = 7.90, *P* < 0.01). Differences in proportions of LPFC and dACC neurons correlated with different model variables during pre- or postfeedback epochs were also observed for the 2 monkeys separately (see Supplementary Fig. 6), and when the model-based analysis was done with Models GQLSB, SBnoA or SBnoA2β (see Supplementary Fig. 5). Collinearity diagnostics between model variables revealed weak collinearity in 306/308 recording sessions, moderate collinearity in 1 session and strong collinearity in 1 session (see Supplementary Fig. 9), thus excluding the possibility that these results are an artifact of collinearity between model variables.
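The logic of this model-based regression can be sketched as follows (a hypothetical minimal version, assuming one firing-rate value per trial and epoch; the actual analysis and its bootstrap control are described in Materials and methods):

```python
import numpy as np

def regress_activity(firing, design):
    """OLS regression of trial firing rates on model-variable regressors.
    Returns coefficients, intercept first."""
    X = np.column_stack([np.ones(len(firing)), design])
    beta, *_ = np.linalg.lstsq(X, firing, rcond=None)
    return beta

def bootstrap_pvalues(firing, design, n_boot=1000, seed=0):
    """Null distribution of coefficients obtained by permuting the trial
    order of firing rates (breaking any link with the regressors);
    returns a two-sided p-value per regressor."""
    rng = np.random.default_rng(seed)
    obs = regress_activity(firing, design)[1:]
    null = np.array([regress_activity(rng.permutation(firing), design)[1:]
                     for _ in range(n_boot)])
    return np.mean(np.abs(null) >= np.abs(obs), axis=0)
```

A cell would then count as correlated with a model variable in a given epoch when the corresponding p-value survives the chosen correction.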

Figure 5*A* shows an example of dACC post-target activity negatively correlated with the action value associated with choosing Target #4 (Fig. 5*A*, top). The raster plot and peristimulus histogram for this activity show a lower firing rate in trials where the animal chose Target #4 than in trials where he chose one of the other targets (Fig. 5*A*, middle). Plotting the trial-by-trial evolution of the post-target firing rate of the neuron reveals sharp variations following action value updates, distinct from the time series of the other model variables *δ* and *U* (Fig. 5*A*, bottom). The firing rate dropped below baseline during trials where Target #4 was chosen. Strikingly, the firing rate sharply increased above baseline in trials following non-rewarded choices of Target #4. Thus this single unit not only responded when the animal selected the associated target but also kept track of the stored value associated with that target. Figure 5*B* shows a LPFC unit whose activity in the post-target epoch is positively correlated with the action value associated with choosing Target #2. The raster plot illustrates a higher firing rate for trials where Target #2 was chosen (gray histogram and raster, Fig. 5*B*, middle). As in the previous example, the trial-by-trial evolution of the post-target firing rate reveals sharp variations from trial to trial (Fig. 5*B*, bottom), consistent with the sharp changes of action values in the model that best described behavioral adaptation in this task (Fig. 2*A*).

We found 126/145 (87%) LPFC and 227/397 (57%) dACC *Q*-value encoding cells; the proportion was significantly greater in LPFC (Pearson *χ*^{2} test, 1 df, *T* = 41.30, *P* < 0.001; Fig. 6*A*). We next asked whether the activity of these cells carried *Q*-value information only during trials where the neuron's preferred target was selected by the monkey, or also during other trials. To do so, we performed a new multiple regression analysis on the activity of each cell after excluding trials where the cell's preferred target was chosen. The activity of 18% (23/126) of LPFC and 13% (29/227) of dACC *Q*-value encoding cells was still significantly correlated with a *Q* value in the same epoch after excluding trials where the cell's preferred target was selected by the animal (multiple regression analysis with Bonferroni correction). Importantly, the difference in proportion of *Q* cells between LPFC and dACC remained significant after restricting to *Q* cells showing a significant correlation when trials with their preferred target were excluded (LPFC: 23/145, 16%; dACC: 29/397, 7%; Pearson *χ*^{2} test, 1 df, *T* = 8.97, *P* < 0.01).

Given the deterministic nature of the task, and thus the limited sampling of options, a question remains of whether these neurons really encode *Q* values or whether they participate in action selection. The control analysis above, excluding trials with each cell's preferred target, showed that at least a certain proportion of these cells carried information about action values outside trials where the corresponding action was selected. But how much information about choice do these neurons carry, and is there a quantitative difference between LPFC and dACC? Interestingly, 43% (54/126) of LPFC *Q* cells had high mutual information with the monkey's choice (*I* > 0.1) whereas only 33% (75/227) of dACC *Q* cells met this criterion. The difference in proportion was marginally significant (Pearson *χ*^{2} proportion test, 1 df, *T* = 3.37, *P* = 0.07). Moreover, LPFC *Q* cell activity contained more information about the monkey's choice (mean *I* = 0.12) than dACC *Q* cell activity (mean *I* = 0.09; Kruskal–Wallis test, 1 df, *χ*^{2} = 3.88, *P* < 0.05; post hoc test with Bonferroni correction found LPFC-*Q* > dACC-*Q*) and more than LPFC non-*Q* cells (average = 0.09; Kruskal–Wallis test, *χ*^{2} = 6.65, 1 df, *P* < 0.01; post hoc test with Bonferroni correction found LPFC-*Q* > LPFC-non-*Q*). dACC *Q* cell activity did not contain more information about the monkey's choice than that of LPFC non-*Q* cells (Kruskal–Wallis test, 1 df, *χ*^{2} = 1.57, *P* > 0.05). Although the observed differences in *Q*-encoding between dACC and LPFC are weak, these results are in line with the hypothesized dACC role in action value encoding and with the transfer of such information to LPFC for action selection—the LPFC would encode a probability distribution over possible actions.
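The last statement can be illustrated with the softmax decision function used in this family of models, in which the inverse temperature *β* sets the contrast between competing action probabilities (a generic sketch, not the exact model code):

```python
import numpy as np

def softmax_policy(q_values, beta):
    """Action probabilities from action values; beta (inverse temperature)
    controls the contrast between competing actions."""
    q = np.asarray(q_values, dtype=float)
    z = beta * (q - q.max())   # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

A low *β* flattens the distribution (exploration: all targets nearly equiprobable), while a high *β* concentrates probability on the highest-valued target (exploitation).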

#### Feedback-Related Activities in dACC and LPFC

A large proportion of neurons had activity correlated with *δ* during postfeedback epochs (Fig. 6, referred to as *δ* cells; see examples of such cells during late-feedback and inter-trial interval in Fig. 7*A*,*B*; raster plots and correlations with variable *δ* can be found in Supplementary Fig. 7 for the first cell and in Fig. 9*A* for the second cell). Significantly more cells correlated with *δ* in the dACC than in the LPFC: 252/397 (63%) versus 69/145 (48%; Pearson *χ*^{2} test, 1 df, *T* = 11.10, *P* < 0.001; Fig. 6*B*,*C*), which confirms previous comparisons (Kennerley and Wallis 2009). Consistent with the high learning rate suitable for the task (due to its deterministic reward schedule), the information about the reward prediction error *δ* from previous trials vanished quickly both in LPFC and dACC compared with other protocols (Seo and Lee 2007). Few dACC cells (31/285, 10.9%) and LPFC cells (9/116, 7.8%) retained a trace of *δ* from the previous trial in any of the prefeedback epochs (Fig. 6*B*,*C*). No significant difference was found between dACC and LPFC proportions (Pearson *χ*^{2} test, *T* = 0.89, *P* > 0.3). Interestingly, only a few LPFC *δ* cells (13/69, 18.8%) showed a positive correlation (*δ*^{+} cells, i.e., neurons responding to unexpected correct feedback; Fig. 6*B*). The great majority of *δ* cells in LPFC had negative correlations (56/69, 81.2%), that is, displayed increased activity after errors (*δ*^{−} cells; Fig. 6*C*). In comparison, dACC had a higher proportion of *δ*^{+} cells (101/252 *δ*^{+} cells, 40.1%, and 151/202 *δ*^{−} cells, 74.8%; see an example of such a cell in Fig. 7*E*; raster and correlation plots are shown in Supplementary Fig. 8). The difference in proportion of *δ*^{+} cells between LPFC and dACC was significant (Pearson *χ*^{2} test, 1 df, *T* = 10.67, *P* < 0.01). Thus LPFC activity is much more reactive to negative feedback, whereas dACC responds to both positive and negative feedback.
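As a reminder of how the regressor *δ* is generated, a minimal value-update step of the kind used in these models can be sketched as follows (parameter values are illustrative, and the paper's models include additional terms such as forgetting of unchosen values):

```python
import numpy as np

def q_update(q, action, reward, alpha=1.0):
    """One Rescorla-Wagner/Q-learning step for the chosen target.
    With alpha = 1 (as in Models SBnoA/SBnoA2beta) the stored value jumps
    directly to the last outcome, matching the deterministic task."""
    q = q.copy()
    delta = reward - q[action]   # reward prediction error
    q[action] += alpha * delta
    return q, delta
```

A *δ*^{+} cell would show activity increasing with positive values of `delta` (unexpected reward), a *δ*^{−} cell with negative values (unexpected error).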

Previous studies have reported quantitative discrimination of positive reward prediction errors in dACC unit activity (Matsumoto et al. 2007; Kennerley and Walton 2011). dACC feedback-related activity might also represent categorical information (i.e., correct, choice error, execution error) rather than quantitative reward prediction errors (Quilodran et al. 2008; see Discussion). The present model-based analysis confirms this and extends it to LPFC feedback-related activity: only very few cells were still correlated with *δ* when analyzing correct and incorrect trials separately. 10/159 (6.3%) dACC and 2/57 (3.5%) LPFC *δ*^{−} cells were still significantly correlated with *δ* when considering incorrect trials only (multiple regression analysis with bootstrap). These proportions were not significantly different (Pearson *χ*^{2} test, *T* = 0.62, *P* > 0.4). Figure 7*A*,*B* illustrates examples of dACC and LPFC neurons which respond to errors without significantly distinguishing between different amplitudes of modeled negative reward prediction errors. 23/101 (22.8%) dACC and 2/13 (15.4%) LPFC *δ*^{+} cells were still significantly correlated with *δ* on correct trials only. These proportions were not significantly different (Pearson *χ*^{2} test, *T* = 0.37, *P* > 0.5). Figure 7*E* illustrates the activity of such a cell. In summary, the most striking result regarding feedback-related activity was the differential properties of dACC and LPFC in coding positive and negative outcomes, LPFC activity being clearly biased toward responding after negative outcomes.

#### Correlates of Outcome Uncertainty

Hypotheses on the neural bases of cognitive regulation have been largely inspired by the dynamics of activity variations in dACC and LPFC during behavioral adaptations (Kerns et al. 2004; Brown and Braver 2005). The dACC is thought to monitor variations in the history of reinforcements (Seo and Lee 2007, 2008) and in error-likelihood (Brown and Braver 2005) in order to adjust behavior accordingly. We therefore looked for correlations between single unit activities and the outcome uncertainty *U* (which progressively increases after elimination of possible targets during search and drops to zero after the first correct trial; see Materials and methods). We observed both positive and negative correlations between dACC neural activity and *U* (*U* cells): 71.8% were positive correlations—higher firing rate during search periods—and 28.2% were negative correlations—higher firing rate during repetition. These proportions differ from the expected 50–50% split (*χ*^{2} goodness of fit—one sample test, 1 df, *χ*^{2} = 39.32, *P* < 0.001). The population activity of these *U*-correlated units showed gradual trial-by-trial changes during search, and sharp variations from search to repetition, after the first correct feedback of the problem (see examples of such cells during the poststart epoch in Fig. 7*C*,*D*; see raster and correlation plots in Supplementary Fig. 7B,C). These patterns of activity were in the opposite direction to changes in reaction times (Fig. 2*B*). They belonged to a larger group of cells that globally discriminated between search and repetition (see a different profile of such neurons in the post-target epoch in Fig. 7*F*; see raster and correlation plots in Supplementary Fig. 8B). Neural data revealed that *U* cells were more frequent in dACC (206/397, 52%) than in LPFC (48/145, 33%; Pearson *χ*^{2} test, *T* = 15.05, *P* < 0.001; Fig. 6*D*).
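One formulation consistent with the described dynamics of *U*—rising as targets are eliminated during search and dropping to zero once the correct target is known—is the variance of the predicted binary outcome. This is a hypothetical reconstruction for illustration only; the exact definition of *U* is given in Materials and methods:

```python
def outcome_uncertainty(n_remaining):
    """Bernoulli variance p(1-p) of the next outcome when n_remaining
    targets are still plausible, with p = 1/n_remaining the chance that
    the next choice is correct. Hypothetical formulation, not the
    paper's exact definition of U."""
    p = 1.0 / n_remaining
    return p * (1.0 - p)
```

Under this formulation, *U* grows from 0.19 (4 candidate targets) to 0.25 (2 candidates) during search, and collapses to 0 in repetition, when the correct target is known with certainty.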
Importantly, Figure 6 shows that, during trials, *U* was decoded from dACC activity mostly just before and after feedback occurrence. By contrast, *U* was better decoded during the delay (i.e., pretarget epoch) in LPFC. These different dynamics reinforce the idea of an intimate link in dACC between *U* updating and the information provided by feedback for performance monitoring and, in contrast, of an involvement of LPFC in incorporating *U* into the decision function.

#### Multiplexed Reinforcement-Related Information

We found that both dACC and LPFC single units multiplexed information about different model variables, with LPFC activity reflecting more integration of information than dACC activity. First, in LPFC the great majority of *U* cells (81%, 39/48) were also correlated with one of the model action values, while this was true for only 52% (107/206) of dACC *U* cells (Pearson *χ*^{2} test, 1 df, *T* = 13.68, *P* < 0.001). Stronger integration was also reflected in higher correlation strengths with multiple variables of the model, as found by a PCA on regression coefficients for all dACC and LPFC neurons (Fig. 8). The first principal component (PC1) obtained with dACC neurons corresponds in all trial epochs to activity variations mainly related to the outcome uncertainty *U* and reveals weak links with *Q* and *δ* (Fig. 8*A*). In contrast, the first 2 components (PC1 and PC2) obtained with LPFC neurons were both expressed as a combination of *Q* and *U* during prefeedback epochs (Fig. 8*A*). The PCA also revealed a strong change in the principal components between pre- and postfeedback epochs, both in dACC and LPFC and reliably in the 2 monkeys (Fig. 8*A*), consistent with the postfeedback activity changes and correlations with model variables reported in the previous analyses.

To quantify differences in multiplexing at the single-unit level, we computed an ELI quantifying the sharpness of encoding of different model variables, based on the distributions of correlation strengths between individual cell activities and model variables (see Materials and methods): for example, a neuron whose activity correlates with different model variables with similar strengths will have a high ELI, whereas a neuron whose activity is highly correlated with only one model variable will have a low ELI (see illustrations of the ELI values obtained with artificial data for these cases in Supplementary Fig. 1). We found a higher ELI in LPFC neurons than in dACC neurons in the pretouch and prefeedback epochs (Kruskal–Wallis test, *P* < 0.05), and the opposite effect (i.e., dACC > LPFC) in the early-feedback epoch (Kruskal–Wallis test, *P* < 0.05; Fig. 8*B*). These pre- and postfeedback variations in ELI may reflect different processes: action selection and value updating, respectively. Overall, these results reveal higher information integration in LPFC before the feedback, and higher integration in dACC after the feedback.
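An entropy-based index with exactly these properties can be sketched as follows (a hypothetical reconstruction from the description above; the exact ELI definition is in Materials and methods):

```python
import numpy as np

def encoding_entropy_index(coefs):
    """Entropy (bits) of normalized absolute regression strengths across
    model variables: high when a cell correlates with several variables
    with similar strength (multiplexing), low when one variable dominates."""
    w = np.abs(np.asarray(coefs, dtype=float))
    p = w / w.sum()                 # normalize strengths to a distribution
    nz = p > 0
    return float(-(p[nz] * np.log2(p[nz])).sum())
```

For 4 model variables the index ranges from 0 (one variable carries all the correlation strength) to 2 bits (all four contribute equally).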

We then measured the contribution of each model variable to each principal component in each epoch, and combined it with the contribution of each principal component to the global variance in neural activity in that epoch. From this we deduced a normalized contribution of each model variable to neural activity variance in each epoch (see Materials and methods). Strikingly, in dACC the model variable *U* dominated (contribution >50%) in all prefeedback epochs, while the contribution of *δ* started increasing in the early-feedback epoch (Fig. 8*C*). In contrast, in LPFC the model variables *Q* and *U* had nearly equal contributions to variance during prefeedback epochs, while the contribution of *δ* started increasing in the late-feedback epoch, thus later than in dACC. The global entropy in the normalized contributions of model variables to neural activity variance was marginally higher in LPFC than in dACC (Kruskal–Wallis test, *P* < 0.06) when analyzed with Model GQLSB2β's variables. These properties of the PCA analyses also held with Model SBnoA2β (see Supplementary Fig. 10), with which the entropy effect was even stronger (Kruskal–Wallis test, *P* < 0.01; see Supplementary Fig. 10C), thus confirming the higher information integration in LPFC than in dACC.
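One way to compute such a normalized contribution can be sketched as follows (a plausible reconstruction, not the exact procedure of Materials and methods): squared PCA loadings of each variable are weighted by each component's share of explained variance.

```python
import numpy as np

def variable_contributions(coef_matrix):
    """Normalized contribution of each model variable to population
    activity variance, from a (cells x variables) matrix of regression
    coefficients: PCA of the coefficient matrix, then each variable's
    squared loading on each component weighted by that component's
    share of variance."""
    X = coef_matrix - coef_matrix.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)          # PCA via eigendecomposition
    share = evals / evals.sum()                 # variance explained per PC
    contrib = (evecs**2 * share).sum(axis=1)    # weight loadings by share
    return contrib / contrib.sum()
```

The entropy of the resulting contribution vector (as in the ELI above, but at the population level) then measures how evenly the model variables share the neural variance in a given epoch.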

Finally, single unit activity could encode different information at different moments in time, corresponding to dynamic coding. More than half of LPFC *δ* cells (55%, 38/69)—that is, neurons responding to feedback—showed an increase in choice selectivity at the beginning of each new trial in repetition, thus reflecting information about the subsequent choice (see a single cell example in Fig. 9*A*, and population activity in Fig. 9*C*). In contrast, only 33% (84/252) of dACC *δ* cells showed such an effect. The difference in proportion between LPFC and dACC was statistically significant (Pearson *χ*^{2} test, 1 df, *T* = 10.86, *P* < 0.001; Fig. 9*B*). Thus, while dACC postfeedback activity may mostly be dedicated to feedback monitoring, LPFC activity in response to feedback might reflect the onset of the decision-making process triggered by the outcome.

## Discussion

Interaction between performance monitoring and cognitive control hypothetically relies on interactions between dACC and LPFC (e.g., Cohen et al. 2004). Here we described how the functional link between the 2 areas might contribute to the regulation of decisions.

In summary, we found that LPFC early–delay activity was more tightly related to monkeys' behavior than dACC activity, displaying higher mutual information with animals' choices than dACC, supporting LPFC's role in action selection. Also, the high choice selectivity in LPFC covaried with the control level measured from behavior: decreased choice selectivity during the search period, putatively promoting exploration; increased choice selectivity during the repetition period, putatively promoting exploitation. In contrast, this effect was not consistent in dACC. dACC activity correlated with various model variables, keeping track of pertinent information concerning the animal's performance. A calculation of outcome uncertainty (*U*) correlated with activity changes between exploration and exploitation mostly in dACC, and dominated the contribution to neural activity variance in prefeedback epochs. Moreover, dACC postfeedback activity appeared earlier than in LPFC and represented positive and negative outcomes with similar proportions while LPFC postfeedback activity mostly tracked negative outcomes.

Reinforcement-related (*Q* and *δ*) and task monitoring-related (*U*) information was multiplexed both in dACC and LPFC, but with higher integration of information before the feedback in LPFC and after the feedback in dACC. LPFC unit activity responding to feedback was also choice selective during early–delay, possibly contributing to decision-making, while dACC feedback-related activity—possibly categorizing feedback per se—showed less significant choice selectivity variations. Taken together, these elements suggest that reinforcement-based information and performance monitoring in dACC might participate in regulating decision functions in LPFC.

### Mixed Information and Coordination Between Areas

Correlations with variables related to reinforcement and actions were found in both structures, in accordance with previous studies showing redundancy in information content, although with some quantitative biases (Seo and Lee 2008; Luk and Wallis 2009). However, compared with LPFC, dACC neuronal activity was more selective for the outcome uncertainty that could be used to regulate exploration (Fig. 8). The PCA analysis showed that multiplexing of reinforcement-related information is stronger in LPFC activity, suggesting that this structure receives and integrates this information. Under this hypothesis, dACC would influence LPFC computations by modulating an action selection process. Such an interaction has been interpreted as a motivational or energizing function (from dACC) acting on selection mechanisms (in LPFC) (Kouneiher et al. 2009). More specifically, our results support a recently proposed model in which dACC monitors task-relevant signals to compute action values and keeps track of the agent's performance as necessary for adjusting behavioral meta-parameters (Khamassi et al. 2011, 2013). In this model, values are transmitted to the LPFC, which selects the action to perform. But the (stochastic) selection process is regulated online based on dACC's computations, enabling dynamic variations of the control level.

This view preserves the schematic regulatory loop by which performance monitoring acts on cognitive control as proposed by others (Botvinick et al. 2001; Cohen et al. 2004). We further suggest a functional structure that reconciles data related to regulatory mechanisms, reinforcement learning, and cognitive control. In particular we point to the potential role of dACC in using reinforcement-related information (such as reward prediction error), relayed through the reward system (Satoh et al. 2003; Enomoto et al. 2011), to regulate global tendencies (formalized by meta-parameters) of adaptation. Interestingly, human dACC (i.e., midcingulate cortex) activation covaries with volatility or variance in rewards and could thereby also participate in regulating learning rates for social or reward-guided behaviors (Behrens et al. 2007, 2009). Kolling et al. (2012) have recently found that dACC encodes the average value of the foraging environment. This suggests a general involvement of dACC in translating results of performance monitoring and task monitoring into a regulatory level.

The fact that dACC activity correlated with changes in modeled meta-parameters suggests a general function in the global setting of behavioral strategies. It has been proposed that dACC can be regarded as a filter involved in orienting motor or behavioral commands (Holroyd and Coles 2002), that it regulates action decisions (Domenech and Dreher 2010), and that it is part of a core network instantiating task-sets (Dosenbach et al. 2006). Interestingly, dACC neural activity encodes specific events that are behaviorally relevant in the context of a task, events that—like the SC in our task—can contribute to triggering selected adaptive mechanisms (Amiez et al. 2005; Quilodran et al. 2008). In line with this, Alexander and Brown recently proposed that dACC signals unexpected nonoccurrences of predicted outcomes, that is, negative surprise signals, which in their model consist of context-specific predictions and evaluations (Alexander and Brown 2011). Their model elegantly explains a large amount of reported dACC postfeedback activity. But dACC signals related to positive surprise (Matsumoto et al. 2007; Quilodran et al. 2008), and to other behaviorally salient events (Amiez et al. 2005), suggest an even more general role in processing information useful to guide selected behavioral adaptations.

### Exploration

Following a standard reinforcement learning framework, exploratory behavior was here associated with low *β* values, which flatten the probability distribution over competing actions in models and simulations (Khamassi et al. 2011). Although the precise molecular and cellular mechanisms underlying shifts between exploration and exploitation are not yet known, accumulating evidence suggests that differential levels of activation of D1 and D2 dopamine receptors in the prefrontal cortex may produce distinct states of activity: a first state allowing multiple network representations nearly simultaneously, thus permitting “an exploration of the input space”; and a second state where the influence of weak inputs on PFC networks is shut off so as to stabilize one or a limited set of representations, which would then have complete control over PFC output and thus promote exploitation (Durstewitz and Seamans 2008). The consistent variations of LPFC choice selectivity between search and repetition periods suggest that such a mechanism could also underlie exploration during behavioral adaptation.

However, this should not be interpreted as an assumption that monkeys' behavior is purely random during search periods of the task (see Model-based analysis of behavior). In fact, animals often display structured and organized exploratory behaviors, as also revealed by our behavioral analyses. For instance, when facing a new open arena, rodents display sequential stages of exploration: first remaining around the nest position, then moving along walls, and finally visiting the center of the arena (Fonio et al. 2009). Non-human primates also use exploration strategies, such as optimized search trajectories adapted to the search space configuration (De Lillo et al. 1997), trajectories that can evolve based on reinforcement history over repeated exposure to the same environment (Desrochers et al. 2010). In large-scale ecological environments, search strategies are best described by correlated random walks or Lévy walks and are modulated by various environmental parameters (Bartumeus et al. 2005).

One possible interpretation of our results is that decreases of choice selectivity in LPFC during search could reduce the amount of information about choice and thereby release biases in the influence on downstream structures such as the basal ganglia. In this way, efferent structures could express their own exploratory decisions. Consistent with this, it has recently been suggested that variations of tonic dopamine in the basal ganglia could also affect the exploration–exploitation trade-off in decision-making (Humphries et al. 2012).

The prefrontal cortex might also contribute to the regulation of exploration based on current uncertainty (Daw et al. 2006; Frank et al. 2009). Uncertainty-based control could bias decisions toward actions that provide highly variable quantities of reward, so as to gain novel information and reduce uncertainty. In our task, outcome uncertainty variations—progressive increase during search and drop to zero during repetition—can be confounded with other similar performance monitoring measures such as the feedback history (Khamassi et al. 2011) or variations of attentional level. Nevertheless, they covaried with the animal's reaction times and were mostly encoded by dACC neurons, revealing a possible relevance of this information for behavioral control in our task. It should be noted that outcome uncertainty is distinct from action uncertainty, which in our task would be confounded with other task monitoring variables such as conflict (Botvinick et al. 2001) and error-likelihood (Brown and Braver 2005). All of these gradually and monotonically decrease over a typical problem of the PS task and remain low during repetition. We found neurons with such an activity profile (e.g., Fig. 7*F*), however in about half the proportion of *U* cells. More work is required to understand whether these different task monitoring measures are distributed and coordinated within the dACC–LPFC system.

### Reinforcement Learning or Working Memory?

It has recently been suggested that model-based investigations of adaptive mechanisms often mix and confound reinforcement learning mechanisms and working memory updating (Collins and Frank 2012). In particular, rapid improvements in behavioral performance during decision-making tasks can be best explained by gating mechanisms in computational models of the prefrontal cortex rather than by the slow adaptation usually associated with dopamine-dependent plasticity in the basal ganglia. In the present study, the fact that Models SBnoA and SBnoA2β (with a high learning rate *α* fixed to 1) and Models GQLSB and GQLSB2β (where *α* is a free meta-parameter between 0 and 1) produced indistinguishable fitting scores on monkey behavior suggests that behavior in this task might fall into such a case. Under this interpretation, rapid behavioral adaptations would rely on gating appropriate flows of information between dACC and LPFC. In fact, the increase of LPFC activity mostly after negative and not positive outcomes, and the interaction with spatial selectivity, might reflect the gating of working memory or planning processes at the time of adaptation, rather than direct outcome-related responses. An alternative hypothesis that cannot be excluded is that in this type of deterministic task animals still partly rely on reinforcement learning mechanisms, but progressively learn to employ a high learning rate during the long pretraining phase. The fact that a group of behavioral sessions was better fitted with *α* between 0.3 and 0.9 when *α* was not fixed to 1 (i.e., in Model GQLSB; see Supplementary Fig. 2C) reveals a continuum in the range of optimized *α* values, which could be the result of a progressive but incomplete increase of the learning rate during pretraining. Such an adaptation in rate might also have contributed to the weak quantitative coding of reward prediction errors.
Further investigations will be required to answer this question, in particular by precisely characterizing monkey behavioral performance during the pretraining phase and the associated changes in information coding in prefrontal cortical regions.

### Network Regulation and Decisions in LPFC

We reported new data on the possible functional link between LPFC and dACC. However, we have no evaluation of putative dynamical and direct interactions between neurons of the 2 regions. Functional coordination of local field potentials between LPFC and dACC has been described, but evidence for direct interactions is scarce (Rothe et al. 2011). The schematized modulatory function from dACC performance monitoring onto the LPFC decision process could in fact be indirect. For instance, it has been proposed that norepinephrine instantiates gain (excitability) variations in LPFC, and that this mechanism is regulated by dACC afferents to the locus coeruleus (Aston-Jones and Cohen 2005; Cohen et al. 2007). Average activity variations in dACC and LPFC observed in our recordings could be a consequence of such activity gain changes. Gain modulation and biased competition are 2 mechanisms by which attentional signals can operate (Wang 2010). Increased working memory load, higher cognitive control, and attentional selection are concepts widely used to interpret prefrontal activity modulations dependent on task requirements (Miller and Cohen 2001; Leung et al. 2002; Kerns et al. 2004). Note that these concepts are closely related and have similar operational definitions (Barkley 2001; Miller and Cohen 2001; Cohen et al. 2004).

Recently, Kaping et al. (2011) have shown that spatial attentional and reward valuation signals are observed in different subdivisions of the fronto-cingulate region. Correlates of spatial attention selectivity were found in both dACC and LPFC, together with correlates of valuation, and independently of action plans. These signals would contribute to top-down attentional control of information (Kaping et al. 2011). Here we also verified that values were coded independently of choices, by showing significant correlations with *Q*-values even after excluding trials in which the neuron's preferred target was selected.

The present study revealed 2 effects of task periods on frontal activity that would reflect variations in control and decision: an increased average firing rate and changes in the recruited neural populations during exploration in both dACC and LPFC, and increased spatial selectivity in LPFC during repetition. The latter argues against a reduction of control implemented by LPFC during repetition. This suggests that transitions between exploration and repetition involve a complex interplay between global unselective regulations and refined selection functions, and that qualitative changes in control occur between search and repetition.

Finally, studies in rodents suggest that adaptive changes in behavioral strategies are also accompanied by global dynamical state transitions of prefrontal activity (Durstewitz et al. 2010). Our analyses showed that for both LPFC and dACC the neural populations participating in exploratory versus exploitative periods of the task differ significantly. We have also previously shown that the oscillatory coordination between the 2 areas changes from one period to the other (Rothe et al. 2011). Hence, a dynamical-systems perspective might be necessary to explain cognitive flexibility and its neurobiological substrate with more precision.

## Supplementary Material

Supplementary material can be found at: http://www.cercor.oxfordjournals.org/.

## Funding

This work was supported by the Agence Nationale de la Recherche ANR LU2 and EXENET, Région Rhône-Alpes projet Cible, and by the labex CORTEX ANR-11-LABX-0042 for E.P.; EU FP7 Project Organic (ICT 231267) for P.F.D.; Facultad de Medicina Universidad de Valparaíso (MECESUP UVA-106) and by Fondation pour la Recherche Médicale for R.Q.; ANR (Amorces and Comprendre) for P.F.D. and M.K..

## Notes

The authors thank Jacques Droulez, Mark D. Humphries, Henry Kennedy, Olivier Sigaud, and Charlie R.E. Wilson for comments on an early version of the manuscript, and Francesco P. Battaglia and Erika Cerasti for useful discussions. They also would like to thank anonymous reviewers for thorough comments and questions which helped drastically improve the manuscript. *Conflict of Interest:* None declared.

## References
