Although economic theories based on utility maximization account for a range of choice behaviors, utilities must be estimated through experience. Dynamics of this learning process may account for certain discrepancies between the predictions of economic theories and real choice behaviors of humans and other animals. To understand the neural mechanisms responsible for such adaptive decision making, we trained rhesus monkeys to play a simulated matching pennies game. Small but systematic deviations of the animal's behavior from the optimal strategy were consistent with the predictions of reinforcement learning theory. In addition, individual neurons in the dorsolateral prefrontal cortex (DLPFC) encoded 3 different types of signals that can potentially influence the animal's future choices. First, activity modulated by the animal's previous choices might provide the eligibility trace that can be used to attribute a particular outcome to its causative action. Second, activity related to the animal's rewards in the previous trials might be used to compute an average reward rate. Finally, activity of some neurons was modulated by the computer's choices in the previous trials and may reflect the process of updating the value functions. These results suggest that the DLPFC might be an important node in the cortical network of decision making.
To make decisions optimally, animals must be able to predict the outcomes of their actions efficiently and choose the action that produces the most desirable outcome. For optimal decision making, therefore, the animal needs to know the mapping between its actions and outcomes for various states of its environment. In reality, however, the properties of the animal's environment change almost constantly and therefore are seldom fully known. To improve their decision-making strategies adaptively, animals therefore need to continually update their estimates of the outcomes expected from their actions. Reinforcement learning theory provides a formal description of this process (Sutton and Barto 1998).
In reinforcement learning, the animal's estimates for the sum of all future rewards are referred to as value functions. Value functions can be used to predict the reward expected at each time step, and the discrepancy between the predicted reward and actual reward, referred to as the reward prediction error, can be used to update the value functions (Sutton and Barto 1998). Reward prediction errors are encoded by midbrain dopamine neurons (Schultz 2006). In general, however, how specific computations of reinforcement learning algorithms, such as temporal integration of reward prediction errors, are implemented in different brain areas is still not well known (Lee 2006; Daw and Doya 2006).
The prefrontal cortex has long been recognized for its contribution to working memory (Goldman-Rakic 1995), and much research has focused on how information held in working memory can be used flexibly to guide the animal's behavior (Miller and Cohen 2001). However, the prefrontal cortex may also play an important role in reinforcement learning. For example, signals related to the value functions and the choice outcomes have been identified in the prefrontal cortex (Daw and Doya 2006). Previously, we showed that during computer-simulated competitive games (von Neumann and Morgenstern 1944), monkeys might approximate optimal decision-making strategies using reinforcement learning algorithms (Lee et al. 2004, 2005). We have also found that during the same task, individual neurons in the dorsolateral prefrontal cortex (DLPFC) encode signals related to the animal's choice and its outcome in the previous trial (Barraclough et al. 2004). In the present study, we investigated whether and how the signals related to the animal's choice and its outcome are maintained across multiple trials in the DLPFC.
Materials and Methods
Five rhesus monkeys (4 males and 1 female; body weight = 5–12 kg) were used. The animal's eye movements were monitored at a sampling rate of 250 or 500 Hz with either a scleral eye coil (Riverbend Instrument, Birmingham, AL) or a high-speed video-based eye tracker (ET 49, Thomas Recording, Giessen, Germany). Once the animal was completely trained for the behavioral tasks, a recording chamber was attached over the DLPFC. In 3 animals, a second recording chamber was implanted over the parietal cortex, and activity was often recorded simultaneously from the 2 chambers. All the procedures used in this study conformed to the National Institutes of Health guidelines and were approved by the University of Rochester Committee on Animal Research.
Monkeys were trained to perform an oculomotor free-choice task that simulated a 2-player zero-sum game, known as the matching pennies task (Fig. 1). A trial began when the animal fixated a small yellow square in the center of a computer screen. Following a 0.5-s fore period, 2 green disks were presented along the horizontal meridian, and the central fixation target was extinguished after a 0.5-s delay period. The animal was then required to shift its gaze toward one of the peripheral targets and maintain its fixation during a 0.5-s hold period. At the end of this hold period, a red feedback ring appeared around the target chosen by the computer. The animal was rewarded only when it chose the same target as the computer and successfully maintained its fixation on the chosen target during the 0.5-s (0.2 s for some neurons, n = 133) feedback period following the feedback onset.
As described previously (Barraclough et al. 2004; Lee et al. 2004), the computer was programmed to exploit statistical biases in the animal's choice behavior. At the beginning of each trial, the computer made a prediction about the animal's choice by applying a set of statistical tests based on the animal's entire choice and reward history during a given recording session. First, the probability that the animal would choose a particular target as well as a set of conditional probabilities that the animal would choose a particular target given the animal's choices in the preceding n trials (n = 1–4) were estimated. In addition, the conditional probabilities that the animal would choose a particular target given its choices and rewards in the preceding n trials (n = 1–4) were also estimated. Second, for each of these 9 probabilities, the computer tested the null hypothesis that the animal had chosen the 2 targets randomly with equal probabilities and independently of its choices and their outcomes in the previous trials. When this null hypothesis was not rejected for any probability, the computer selected each target with a 0.5 probability. Otherwise, the computer biased its selection according to the conditional probability with the largest deviation from 0.5 that was statistically significant (binomial test, P < 0.05). For example, if the animal chose the rightward target significantly more frequently, with a 0.75 probability, following a rewarded trial in which the animal selected the leftward target, and if this conditional probability deviated more from 0.5 than any other conditional probabilities, then the computer chose the leftward target with a 0.75 probability.
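The opponent algorithm described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the 0/1 coding of targets (0 = left, 1 = right), and the exact two-sided form of the binomial test are all hypothetical choices made here. The 9 tests are the unconditional choice probability, the 4 probabilities conditioned on the preceding n choices, and the 4 conditioned on the preceding n choices and rewards (n = 1–4).

```python
import math
import random


def binom_two_sided_p(k, n):
    """Exact two-sided binomial p-value for k successes in n trials under p = 0.5."""
    if n == 0:
        return 1.0
    pmf = [math.comb(n, i) / (1 << n) for i in range(n + 1)]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))


def predict_right_probability(choices, rewards, max_lag=4, alpha=0.05):
    """Estimate P(animal chooses right on the next trial).

    choices, rewards: equal-length lists of 0/1 values for the session so far.
    Returns the estimate with the largest significant deviation from 0.5,
    or 0.5 when no test rejects the null hypothesis.
    """
    T = len(choices)
    paired = list(zip(choices, rewards))
    # Lag 0 is the unconditional test; lags 1-4 condition on recent history.
    histories = [(choices, n) for n in range(0, max_lag + 1)]
    histories += [(paired, n) for n in range(1, max_lag + 1)]
    best_p, best_dev = 0.5, 0.0
    for seq, n in histories:
        if T < n:
            continue
        current = tuple(seq[T - n:T])  # context preceding the upcoming trial
        following = [choices[t] for t in range(n, T)
                     if tuple(seq[t - n:t]) == current]
        total = len(following)
        if total == 0:
            continue
        rights = sum(following)
        p_hat = rights / total
        dev = abs(p_hat - 0.5)
        if dev > best_dev and binom_two_sided_p(rights, total) < alpha:
            best_p, best_dev = p_hat, dev
    return best_p


def computer_choice(choices, rewards, rng=random):
    """Pick the computer's target (0 = left, 1 = right).

    The animal is rewarded for matching, so the computer picks the target
    the animal is predicted NOT to choose, with the predicted probability.
    """
    p_animal_right = predict_right_probability(choices, rewards)
    return 1 if rng.random() < 1.0 - p_animal_right else 0
```

With an unbiased animal every test fails to reject the null hypothesis and the computer plays each target with probability 0.5; an animal that always chooses right is predicted with probability 1 and is therefore never matched.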
At the beginning of each recording session, the animal performed 130 trials of a visual search task, which was identical to the matching pennies task described above, except that one of the 2 peripheral targets was red. The position of the red target was chosen pseudorandomly for each trial. The animal was never rewarded when it selected the red target. In addition, to match the overall reward probability for the 2 tasks, the animal was rewarded with a 0.5 probability when it selected the green target in the search task.
Single-neuron activity was recorded extracellularly in the DLPFC, using a 5-channel multielectrode recording system (Thomas Recording, Giessen, Germany). The placement of the recording chamber was guided by magnetic resonance images, and this was confirmed in 2 animals by metal pins inserted in known anatomical locations at the end of the experiments. In addition, the frontal eye field (FEF) was localized in all animals as sites in which eye movements were evoked by electrical stimulations with currents <50 μA during active fixation of a visual target (Goldberg et al. 1986). All the neurons described in this study were anterior to the FEF.
Reinforcement Learning Model
In reinforcement learning models, the value function for choosing target x is updated according to the reward prediction error (Sutton and Barto 1998) as follows:
In other words, the process of updating the value function can be described by a first-order autoregressive model with an exogenous input. To indicate this explicitly and be consistent with the notation in our previous study (Barraclough et al. 2004; Lee et al. 2004), the above equation was reparameterized as
As a result, the probability of choosing the rightward target increased gradually as its value function increased and as the value function for the leftward target decreased. It should be noted that the above equation does not include the inverse temperature parameter, so the magnitudes of Δrew and Δunrew determine how deterministically the animal's choice is influenced by the outcomes of its previous choices (Lee et al. 2004). All model parameters (α, Δrew, Δunrew) were estimated separately for each recording session using a maximum likelihood procedure (Pawitan 2001; Lee et al. 2004) by taking the best parameters obtained from 5 independent searches performed using the initial parameters randomly chosen in the interval of [0 1] for α and Δrew and [−1 0] for Δunrew. The maximum likelihood procedure was implemented using the fminsearch function in Matlab 7.0 (Mathworks Inc., Natick, MA). Thus, the parameters were not restricted to any particular interval.
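The model described in this section can be illustrated with a short likelihood function. The sketch below assumes that the reparameterized update has the form V(x) ← αV(x) + Δ(x), where Δ(x) is Δrew or Δunrew for the chosen target (depending on the outcome) and 0 for the unchosen target, and that the choice probability is a logistic function of the difference between the two value functions (a softmax with no inverse temperature). These forms, the zero initial values, and all names in the code are assumptions made for illustration.

```python
import math


def choice_log_likelihood(choices, rewards, alpha, d_rew, d_unrew):
    """Log-likelihood of a choice sequence under the assumed model.

    choices: list of 0 (left) / 1 (right); rewards: list of 0/1.
    alpha: decay factor; d_rew, d_unrew: increments applied to the value
    function of the chosen target after rewarded / unrewarded trials.
    """
    v = [0.0, 0.0]  # value functions for left and right (assumed to start at 0)
    ll = 0.0
    for c, r in zip(choices, rewards):
        # Logistic (softmax) choice rule without an inverse temperature.
        p_right = 1.0 / (1.0 + math.exp(-(v[1] - v[0])))
        p_right = min(max(p_right, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        ll += math.log(p_right if c == 1 else 1.0 - p_right)
        # Both value functions decay; only the chosen one receives an increment.
        for x in (0, 1):
            delta = (d_rew if r else d_unrew) if x == c else 0.0
            v[x] = alpha * v[x] + delta
    return ll
```

The negative of this function could be handed to an unconstrained simplex optimizer, restarted from several random initial parameter values, in the spirit of the fitting procedure described above.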
Time-Series Analyses of Neural Data
A series of time-series models were applied to determine whether the activity of a given neuron was influenced by the choices of the animal and computer and by the rewards in the current and previous trials. For these analyses, spikes were counted for successive 0.5-s bins defined relative to the time of target onset or feedback onset. In the present study, we focused on 3 different time bins corresponding to the fore period, delay period, and feedback period. Variability in the activity of cortical neurons is often temporally correlated (Lee et al. 1998; Bair et al. 2001). Therefore, to distinguish the effects of different behavioral variables in the previous trials on neural activity from the temporal correlation in neural activity resulting from slow changes in the neuron's intrinsic excitability, the spike counts were detrended separately for each bin by taking the residuals from a linear regression model that included the trial number as the independent variable. We also modeled the temporal correlation in neural activity using a first-order autoregressive moving-average model with exogenous inputs (ARMAX; Ljung 1999). Because this model includes a first-order moving-average term and a first-order autoregressive term, it is commonly referred to as ARMAX(1,1). In this model, the detrended spike count in a particular bin of trial t, y(t), is given by the following:
To test whether signals related to the animal's choices and rewards in the previous trial are influenced by the type of decisions made by the animal, the neural activity during the search trials (n = 130) and the first 260 trials of the matching pennies were analyzed using the following regression analysis. The neurons examined for <260 trials in the matching pennies task were excluded from this analysis.
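The detrending step and the structure of an ARMAX(1,1) model can be sketched as follows. The detrending function takes residuals from a least-squares regression of spike count on trial number, as described above. The second function uses the generic textbook form y(t) = a·y(t−1) + b·u(t) + e(t) + c·e(t−1); the exact regressors and parameterization used in the study are not reproduced here, so the function names, the argument layout, and the zero initial conditions are assumptions.

```python
def detrend(spike_counts):
    """Residuals from a least-squares regression of spike count on trial number."""
    n = len(spike_counts)
    trials = range(1, n + 1)
    mean_t = sum(trials) / n
    mean_y = sum(spike_counts) / n
    sxx = sum((t - mean_t) ** 2 for t in trials)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(trials, spike_counts))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_t
    return [y - (intercept + slope * t) for t, y in zip(trials, spike_counts)]


def armax11_innovations(y, u, a, b, c):
    """One-step prediction errors e(t) of a generic ARMAX(1,1) model.

    y: detrended spike counts; u: one list of exogenous inputs per trial;
    a: autoregressive coefficient; b: input weights; c: moving-average
    coefficient. y(0) and e(0) before the first trial are taken as 0.
    """
    e_prev, y_prev = 0.0, 0.0
    innovations = []
    for y_t, u_t in zip(y, u):
        predicted = a * y_prev + sum(bi * ui for bi, ui in zip(b, u_t)) + c * e_prev
        e_t = y_t - predicted
        innovations.append(e_t)
        e_prev, y_prev = e_t, y_t
    return innovations
```

Setting a = c = 0 reduces the second function to an ordinary regression on the exogenous inputs, which corresponds to the ARMAX(0,0) case.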
Choice Behavior during the Matching Pennies Task
The behavioral data were collected from a total of 81 742 trials in 140 recording sessions (Table 1). These data were analyzed by fitting a reinforcement learning model (see Materials and Methods). The parameters of the model were estimated separately for each recording session. Overall, the decay factors were relatively large and skewed toward one, indicating that the effects of the previous choice outcomes were integrated over multiple trials (Fig. 2A). In addition, in approximately two-thirds of the sessions (94/140 sessions), the value functions increased and decreased for the target chosen by the animal, when it was rewarded and unrewarded, respectively (Fig. 2B).
Signals Related to Previous Choices and Outcomes
Single-unit activity was recorded from 322 neurons in the DLPFC during the matching pennies task (Table 1). Each neuron was tested for 130 trials during the search task, and at least 128 trials during the matching pennies task. The average number of trials tested during the matching pennies task was 584 (Table 1). Many of these neurons modulated their activity according to the animal's previous choices and their outcomes. In some neurons, the previous choices of the computer opponent also influenced their activity. However, the time course of the activity related to these different behavioral events varied substantially across different neurons (Figs 3 and 4). For example, for the neuron illustrated in Figure 3, the time courses and strengths of signals related to the animal's choice, the computer's choice, and reward were relatively similar. This neuron increased its activity around the time of eye movements, when the animal chose the rightward target (Fig. 3, top, trial lag = 0). In addition, the activity of the same neuron increased during the fore period and delay period when the animal had selected the rightward target in the previous trial (Fig. 3, top, trial lag = 1) and when the animal had been rewarded in the previous trial (Fig. 3, bottom, trial lag = 1). During the delay period, this neuron also increased its activity when the computer opponent had selected the rightward target in the previous trial (Fig. 3, middle, trial lag = 1). On the other hand, the neuron shown in Figure 4 modulated its activity mostly according to the recent reward history of the animal. When the animal was rewarded in a particular trial, its immediate effect was to increase the neuron's activity (Fig. 4, bottom, trial lag = 0). However, the activity of this neuron was reduced when the animal was rewarded in the previous 2 trials (Fig. 4, bottom, trial lag = 1 and 2). 
Thus, the activity of this neuron during the feedback period was enhanced when the animal was rewarded after one or more unrewarded trials.
How the activity of neurons in the DLPFC was influenced by different behavioral events was quantified using a regression model, which is referred to as ARMAX(0,0), in the present study. The results from this analysis showed that the animal's choice, the computer's choice, and the reward in the previous trial significantly influenced the activity in 36.0%, 19.3%, and 37.9% of the DLPFC neurons during the fore period, respectively, and in 40.1%, 18.3%, and 33.2% of the neurons during the delay period (Fig. 5). The fractions of DLPFC neurons that significantly modulated their activity during the fore period in a given trial according to the choice made by the animal and reward 2 trials before were both 11.2%. The fraction of neurons that displayed significant modulations in their activity according to the choice made by the computer opponent 2 trials before was 7.8% during the fore period. During the feedback period, the activity of DLPFC neurons was frequently affected by the animal's choice, the computer's choice and reward in the same trial but was also affected by the animal's choice and reward in the 3 previous trials (Fig. 5).
Task-Specific Choice Signals
To test whether activity changes related to the animal's previous choices and their outcomes were specific to the matching pennies task, we analyzed the activity of 284 neurons in which the data were collected from at least 260 trials during the matching pennies task in addition to 130 trials during the search task. To exclude the possibility that seemingly task-specific activity might be due to random nonstationary changes in neural activity, we included a set of control variables to determine whether significant changes also occurred during the 2 successive blocks of trials in the matching pennies task (see Materials and Methods). We found that the overall percentages of neurons that modulated their activity according to the previous choices of the animal and the computer opponent or the outcomes of the animal's choices were similar for the search task and the matching pennies task (data not shown). Nevertheless, many neurons in the DLPFC encoded the signals related to the animal's choices in the current and previous trials differently for the 2 tasks (Fig. 6). For example, during the delay period, 23.9% of the neurons modulated their activity according to the animal's choice differently for the search task and the matching pennies task (Fig. 6, top, trial lag = 0). This is not surprising because the animal received explicit instruction about its eye movement only in the search task. By contrast, only 3.5% of the neurons displayed similar changes during the delay period between the 2 successive blocks of trials in the matching pennies task. This difference was statistically significant (χ² test, P < 10⁻¹⁰). During the delay period, 15.5% of the neurons also modulated their activity according to the animal's choice in the previous trial differently for the 2 tasks, whereas only 5.3% of the neurons displayed similar changes between the 2 blocks of trials in the matching pennies task (χ² test, P < 10⁻⁴). 
By contrast, large task-specific changes in neural activity related to the choice of the computer opponent or reward were seen only during the feedback period of the same trial (Fig. 6, middle and bottom), suggesting that the outcomes of the animal's previous choices similarly influenced the neural activity in the DLPFC for the 2 tasks.
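The comparison of proportions reported above can be illustrated with a Pearson χ² test on a 2 × 2 table. The sketch below makes several assumptions: the counts (68 and 10 of 284 neurons, and 44 and 15 of 284 neurons) are reconstructed by rounding the reported percentages, the two proportions are treated as independent samples, and no continuity correction is applied. The function name is hypothetical, and the study's exact test procedure may have differed.

```python
import math


def chi2_two_proportions(k1, n1, k2, n2):
    """Pearson chi-square test (df = 1) comparing k1/n1 with k2/n2.

    Returns (chi2, p). For one degree of freedom the upper-tail probability
    is erfc(sqrt(chi2 / 2)).
    """
    a, b, c, d = k1, n1 - k1, k2, n2 - k2
    n = n1 + n2
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Counts inferred from the reported percentages of 284 neurons (assumed).
chi2_current, p_current = chi2_two_proportions(68, 284, 10, 284)    # 23.9% vs 3.5%
chi2_previous, p_previous = chi2_two_proportions(44, 284, 15, 284)  # 15.5% vs 5.3%
```

Under these assumptions both p-values fall below the thresholds quoted in the text.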
Comparison of ARMAX and State-Space Models
The regression model described above included each of 3 different behavioral events in 3 previous trials. This postulates that information about each of these distinct events is stored in the brain separately, and their effects are combined to determine the activity of individual neurons in a given trial. Alternatively, neural activity in a given trial might be determined by the state of the brain that undergoes certain dynamic changes on a trial-by-trial basis under the influence of certain behavioral events. To test this possibility, we applied a state-space model, commonly known as the Kalman filter model, to estimate the state in each trial and used this state information to predict the activity of each neuron (see Materials and Methods). For comparison, 3 other time-series models were fit to the data, namely, a first-order autoregressive model, a first-order moving-average model, and a first-order autoregressive moving-average model. All these models included the same exogenous input variables used in the state-space model. According to the AIC, the state-space model was selected as the best model most frequently regardless of the epochs examined (Fig. 7). For some neurons, the autoregressive model or the autoregressive moving-average model performed better than the state-space model. The moving-average model was never chosen as the best model.
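Model selection by AIC, as used in this comparison, can be sketched as follows. The example assumes Gaussian residuals, for which the AIC reduces (up to an additive constant) to n·ln(RSS/n) + 2k, where RSS is the residual sum of squares and k the number of fitted parameters. The function names and the dictionary layout are hypothetical, and the likelihood actually used for each model in the study is not reproduced here.

```python
import math


def gaussian_aic(residuals, n_params):
    """AIC (up to an additive constant) for a model with Gaussian residuals.

    Requires a nonzero residual sum of squares.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_params


def select_by_aic(candidates):
    """Return (best model name, scores) given name -> (residuals, n_params)."""
    scores = {name: gaussian_aic(res, k) for name, (res, k) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

A model with more parameters is selected only when its improvement in fit outweighs the penalty of 2 per additional parameter; with identical residuals, the model with fewer parameters wins.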
Using a decision-making task that simulated a simple competitive interaction with another decision maker, we found that monkeys tend to seek an optimal decision-making strategy according to a reinforcement learning algorithm (Lee et al. 2004; Corrado et al. 2005; Lau and Glimcher 2005; Lee et al. 2005; Samejima et al. 2005). The decay factors in the reinforcement learning model were relatively large, suggesting that the outcomes of multiple trials in the past were temporally integrated and influenced the animal's choice in a given trial. Consistent with this behavioral finding, a significant number of neurons in the DLPFC also modulated their activity according to the animal's choices and their outcomes in multiple trials. The fact that the model based on a state space accounted for the neural data more parsimoniously compared with other time-series models suggests that the signals related to the animal's choices and their outcomes might be temporally integrated in the form of a state variable in the DLPFC. Therefore, the DLPFC might be an important node in the cortical network that is responsible for monitoring the outcomes of previous choices and using that information to update the animal's decision-making strategies dynamically. However, the exact mechanism by which these signals are used to update the value functions or decision-making strategies is not known. Different types of signals identified in the present study, such as the animal's choices and rewards, might contribute to the following aspects of adaptive decision making.
First, in reinforcement learning theory, signals related to the decision maker's previous actions are referred to as the eligibility trace. Such signals can link a reward delivered at a particular time step to an action that caused it, when these 2 events are temporally separated (Sutton and Barto 1998). An eligibility trace was not incorporated into the reinforcement learning algorithm we applied to model the animal's choice behavior because during the matching pennies task the outcome of a particular action was revealed immediately. Nevertheless, the neural signals related to the eligibility trace might be utilized in more complex tasks involving multistage decision making (Saito et al. 2005; Averbeck et al. 2006; Sohn and Lee 2006). Second, signals related to the rewards in the previous trials might be used to compute an average rate of reward. It has been reported that neurons in the orbitofrontal cortex also encode signals related to rewards in the previous trials (Sugrue et al. 2004). During the process of decision making, information about the average reward rate might be utilized in several ways. For example, in a class of reinforcement learning algorithms, referred to as average reward reinforcement learning, the average reward rate is used as a criterion for optimal decision making (Mahadevan 1996). In addition, choices of humans and other animals may be influenced by the same outcome differently, depending on whether it is considered as a gain or loss (Tinklepaugh 1928; Crespi 1942; Zeaman 1949; Kahneman and Tversky 1979; Flaherty 1982). Therefore, signals related to reward rate may influence the process of decision making by providing a frame of reference (Helson 1948). Information about the average reward rate may also play a role in setting the optimal level of threshold used to terminate the process of evidence accumulation during perceptual decision making (Simen et al. 2006) or switching between exploitation and exploration (Aston-Jones and Cohen 2005). Finally, neurons in the DLPFC encoded signals related to the previous choices of the computer opponent, although less often than those related to the animal's previous choices and rewards. During the matching pennies game, the animal was rewarded only when it chose the same target as the computer opponent, so signals related to the computer's previous choices might directly contribute to the process of computing the value functions for alternative choices.
Signals related to the animal's choices, their outcomes, and the previous choices of the opponent were sometimes multiplexed in a single neuron in the DLPFC. In addition, when these same variables were used as exogenous inputs, the one-dimensional state-space model often provided a parsimonious description of activity in the DLPFC. This raises the possibility that in some neurons, the process of integration might be applied after signals related to multiple variables are combined. Whether these different types of signals are then demultiplexed and utilized for different purposes by separate groups of downstream neurons is not known. In addition, single-cell and network mechanisms for integrating these signals in the prefrontal cortex are not well understood. It has been shown that a recurrent network combined with a reward-dependent stochastic Hebbian learning rule can reproduce the choice behavior observed in monkeys during the matching pennies game (Soltani and Wang 2006; Soltani et al. 2006). However, mechanisms for temporally integrating signals related to these multiple events need to be further investigated in future studies.
We are grateful to Lindsay Carr and John Swan-Stone for their technical assistance. This study was supported by a grant from the National Institute of Mental Health (MH073246).
Conflict of Interest: None declared.