Abstract

Although economic theories based on utility maximization account for a range of choice behaviors, utilities must be estimated through experience. Dynamics of this learning process may account for certain discrepancies between the predictions of economic theories and real choice behaviors of humans and other animals. To understand the neural mechanisms responsible for such adaptive decision making, we trained rhesus monkeys to play a simulated matching pennies game. Small but systematic deviations of the animal's behavior from the optimal strategy were consistent with the predictions of reinforcement learning theory. In addition, individual neurons in the dorsolateral prefrontal cortex (DLPFC) encoded 3 different types of signals that can potentially influence the animal's future choices. First, activity modulated by the animal's previous choices might provide the eligibility trace that can be used to attribute a particular outcome to its causative action. Second, activity related to the animal's rewards in the previous trials might be used to compute an average reward rate. Finally, activity of some neurons was modulated by the computer's choices in the previous trials and may reflect the process of updating the value functions. These results suggest that the DLPFC might be an important node in the cortical network of decision making.

Introduction

To make decisions optimally, animals must be able to predict the outcomes of their actions efficiently and choose the action that produces the most desirable outcome. Optimal decision making therefore requires knowledge of the mapping between actions and outcomes for the various states of the environment. In reality, however, the properties of the animal's environment change almost constantly and are seldom fully known. To improve their decision-making strategies adaptively, animals therefore need to continually update their estimates of the outcomes expected from their actions. Reinforcement learning theory provides a formal description of this process (Sutton and Barto 1998).

In reinforcement learning, the animal's estimates for the sum of all future rewards are referred to as value functions. Value functions can be used to predict the reward expected at each time step, and the discrepancy between the predicted reward and actual reward, referred to as the reward prediction error, can be used to update the value functions (Sutton and Barto 1998). Reward prediction errors are encoded by midbrain dopamine neurons (Schultz 2006). In general, however, how specific computations of reinforcement learning algorithms, such as temporal integration of reward prediction errors, are implemented in different brain areas is still not well known (Lee 2006; Daw and Doya 2006).

The prefrontal cortex has long been recognized for its contribution to working memory (Goldman-Rakic 1995), and much research has focused on how information held in working memory can be used flexibly to guide the animal's behavior (Miller and Cohen 2001). However, the prefrontal cortex may also play an important role in reinforcement learning. For example, signals related to the value functions and the choice outcomes have been identified in the prefrontal cortex (Daw and Doya 2006). Previously, we showed that during computer-simulated competitive games (von Neumann and Morgenstern 1944), monkeys might approximate optimal decision-making strategies using reinforcement learning algorithms (Lee et al. 2004, 2005). We have also found that during the same task, individual neurons in the dorsolateral prefrontal cortex (DLPFC) encode signals related to the animal's choice and its outcome in the previous trial (Barraclough et al. 2004). In the present study, we investigated whether and how the signals related to the animal's choice and its outcome are maintained across multiple trials in the DLPFC.

Materials and Methods

Animal Preparations

Five rhesus monkeys (4 males and 1 female; body weight = 5–12 kg) were used. The animal's eye movements were monitored at a sampling rate of 250 or 500 Hz with either a scleral eye coil (Riverbend Instrument, Birmingham, AL) or a high-speed video-based eye tracker (ET 49, Thomas Recording, Giessen, Germany). Once the animal was fully trained on the behavioral tasks, a recording chamber was attached over the DLPFC. In 3 animals, a second recording chamber was implanted over the parietal cortex, and activity was often recorded simultaneously from the 2 chambers. All the procedures used in this study conformed to the National Institutes of Health guidelines and were approved by the University of Rochester Committee on Animal Research.

Behavioral Tasks

Monkeys were trained to perform an oculomotor free-choice task that simulated a 2-player zero-sum game, known as the matching pennies task (Fig. 1). A trial began when the animal fixated a small yellow square in the center of a computer screen. Following a 0.5-s fore period, 2 green disks were presented along the horizontal meridian, and the central fixation target was extinguished after a 0.5-s delay period. The animal was then required to shift its gaze toward one of the peripheral targets and maintain its fixation during a 0.5-s hold period. At the end of this hold period, a red feedback ring appeared around the target chosen by the computer. The animal was rewarded only when it chose the same target as the computer and successfully maintained its fixation on the chosen target during the 0.5-s (0.2 s for some neurons, n = 133) feedback period following the feedback onset.

Figure 1.

Visual stimuli and payoff matrix (inset) for the matching pennies game. The duration of fore period and delay period was 0.5 s, and the animal was required to shift its gaze toward one of the peripheral targets within 1 s after the central target was extinguished and hold its fixation for 0.5 s (Sacc/Fix). The duration of the feedback ring was 0.2 or 0.5 s.

As described previously (Barraclough et al. 2004; Lee et al. 2004), the computer was programmed to exploit statistical biases in the animal's choice behavior. At the beginning of each trial, the computer made a prediction about the animal's choice by applying a set of statistical tests based on the animal's entire choice and reward history during a given recording session. First, the probability that the animal would choose a particular target as well as a set of conditional probabilities that the animal would choose a particular target given the animal's choices in the preceding n trials (n = 1–4) were estimated. In addition, the conditional probabilities that the animal would choose a particular target given its choices and rewards in the preceding n trials (n = 1–4) were also estimated. Second, for each of these 9 probabilities, the computer tested the null hypothesis that the animal had chosen the 2 targets randomly with equal probabilities and independently of its choices and their outcomes in the previous trials. When this null hypothesis was not rejected for any probability, the computer selected each target with a 0.5 probability. Otherwise, the computer biased its selection according to the conditional probability with the largest deviation from 0.5 that was statistically significant (binomial test, P < 0.05). For example, if the animal chose the rightward target significantly more frequently, with a 0.75 probability, following a rewarded trial in which the animal selected the leftward target, and if this conditional probability deviated more from 0.5 than any other conditional probabilities, then the computer chose the leftward target with a 0.75 probability.
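
To make the opponent's decision rule concrete, the following is a minimal sketch of this logic (in Python; the function name, the way the conditional counts are tallied, and the use of SciPy's binomtest are illustrative assumptions rather than the original task-control implementation):

```python
import numpy as np
from scipy.stats import binomtest  # SciPy >= 1.7

def opponent_choice(choices, rewards, alpha=0.05, rng=None):
    """Sketch of the computer opponent. choices/rewards: 0/1 histories so far.
    Returns 0 (leftward) or 1 (rightward) as the computer's selection."""
    rng = rng or np.random.default_rng()
    choices = np.asarray(choices)
    rewards = np.asarray(rewards)

    # The 9 tested probabilities: the overall right-choice probability, plus the
    # right-choice probability conditioned on the last n choices (n = 1-4) or on
    # the last n choice-reward pairs (n = 1-4), evaluated for the history that
    # has just occurred.
    tests = []
    if len(choices) > 0:
        tests.append((int(choices.sum()), len(choices)))
    for n_back in range(1, 5):
        for use_reward in (False, True):
            counts = {}
            for t in range(n_back, len(choices)):
                hist = tuple(choices[t - n_back:t])
                if use_reward:
                    hist += tuple(rewards[t - n_back:t])
                k, n = counts.get(hist, (0, 0))
                counts[hist] = (k + int(choices[t]), n + 1)
            if len(choices) < n_back:
                continue
            current = tuple(choices[-n_back:])
            if use_reward:
                current += tuple(rewards[-n_back:])
            if current in counts:
                tests.append(counts[current])

    # Keep the statistically significant probability that deviates most from 0.5.
    best_p, best_dev = 0.5, 0.0
    for k, n in tests:
        p_hat = k / n
        if binomtest(k, n, 0.5).pvalue < alpha and abs(p_hat - 0.5) > best_dev:
            best_p, best_dev = p_hat, abs(p_hat - 0.5)

    # best_p is the predicted probability that the animal chooses right, so the
    # computer chooses right with probability 1 - best_p (the animal is rewarded
    # only when the two choices match).
    return int(rng.random() < 1.0 - best_p)
```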

At the beginning of each recording session, the animal performed 130 trials of a visual search task, which was identical to the matching pennies task described above, except that one of the 2 peripheral targets was red. The position of the red target was chosen pseudorandomly for each trial. The animal was never rewarded when it selected the red target. In addition, to match the overall reward probability for the 2 tasks, the animal was rewarded with a 0.5 probability when it selected the green target in the search task.

Neurophysiological Recording

Single-neuron activity was recorded extracellularly in the DLPFC, using a 5-channel multielectrode recording system (Thomas Recording, Giessen, Germany). The placement of the recording chamber was guided by magnetic resonance images, and this was confirmed in 2 animals by metal pins inserted in known anatomical locations at the end of the experiments. In addition, the frontal eye field (FEF) was localized in all animals as sites in which eye movements were evoked by electrical stimulation with currents <50 μA during active fixation of a visual target (Goldberg et al. 1986). All the neurons described in this study were anterior to the FEF.

Reinforcement Learning Model

In reinforcement learning models, the value function for choosing target x is updated according to the reward prediction error (Sutton and Barto 1998) as follows: 

Vt+1(x) = Vt(x) + β [rt − Vt(x)],
where Vt(x) denotes the value function for target x in trial t, rt the reward received by the animal in trial t, and β the step-size parameter. This can be rearranged as 
Vt+1(x) = (1 − β) Vt(x) + β rt.

In other words, the process of updating the value function can be described by a first-order autoregressive model with an exogenous input. To indicate this explicitly and be consistent with the notation in our previous study (Barraclough et al. 2004; Lee et al. 2004), the above equation was reparameterized as 

Vt+1(x) = α Vt(x) + Δt(x),
where the decay factor α = (1 − β) and the exogenous input Δt(x) = βrt. A large decay factor indicates that the outcome of the animal's choice in a given trial would influence the animal's choices across a relatively large number of trials. We assumed that Δt(x) = Δrew if the animal was rewarded at trial t, and Δt(x) = Δunrew otherwise. Thus, Δrew and Δunrew reflect how the value function for the target chosen by the animal is influenced by the outcome of the animal's choice. The signs of these parameters indicate whether the animal would be more likely to choose the same target in the future trials. For example, positive Δrew and negative Δunrew correspond to the so-called win-stay and lose-switch strategies, respectively. The probability that the animal would choose the rightward target in trial t, Pt(R), was then determined by the softmax transformation as follows: 
Pt(R) = exp{Vt(R)} / [exp{Vt(R)} + exp{Vt(L)}].

As a result, the probability of choosing the rightward target increased gradually as its value function increased and as the value function for the leftward target decreased. It should be noted that the above equation does not include an inverse temperature parameter, so the magnitudes of Δrew and Δunrew determine how deterministically the animal's choice is influenced by the outcomes of its previous choices (Lee et al. 2004). All model parameters (α, Δrew, Δunrew) were estimated separately for each recording session using a maximum likelihood procedure (Pawitan 2001; Lee et al. 2004), by taking the best parameters from 5 independent searches started from initial parameters chosen randomly in the interval [0, 1] for α and Δrew and [−1, 0] for Δunrew. The maximum likelihood procedure was implemented using the fminsearch function in Matlab 7.0 (Mathworks Inc., Natick, MA). Thus, the parameters were not restricted to any particular interval.
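
As an illustration of how these 3 parameters can be estimated, here is a minimal sketch of the likelihood computation and fitting procedure (in Python, with scipy.optimize.fmin as a stand-in for the Matlab fminsearch routine used in the original analysis; the assumption that the value function of the unchosen target simply decays by α, that is, Δt(x) = 0 for that target, is implied but not stated explicitly above):

```python
import numpy as np
from scipy.optimize import fmin  # Nelder-Mead simplex, analogous to Matlab's fminsearch

def negative_log_likelihood(params, choices, rewards):
    """Negative log likelihood of one session under the reparameterized model.
    params = (alpha, d_rew, d_unrew); choices and rewards are 0/1 sequences."""
    alpha, d_rew, d_unrew = params
    v = np.zeros(2)            # value functions for the left (0) and right (1) targets
    nll = 0.0
    for c, r in zip(choices, rewards):
        # softmax choice probability without an inverse-temperature parameter
        p_right = 1.0 / (1.0 + np.exp(-(v[1] - v[0])))
        p_choice = p_right if c == 1 else 1.0 - p_right
        nll -= np.log(max(p_choice, 1e-10))
        # all value functions decay by alpha; the chosen target is then
        # incremented by d_rew or d_unrew (assumed: no increment for the
        # unchosen target, i.e., its Delta is zero)
        v *= alpha
        v[c] += d_rew if r == 1 else d_unrew
    return nll

def fit_session(choices, rewards, n_starts=5, rng=None):
    """Best of n_starts unconstrained simplex searches from random initial values."""
    rng = rng or np.random.default_rng()
    best_params, best_nll = None, np.inf
    for _ in range(n_starts):
        x0 = [rng.uniform(0, 1), rng.uniform(0, 1), rng.uniform(-1, 0)]
        xopt, fopt, *_ = fmin(negative_log_likelihood, x0, args=(choices, rewards),
                              full_output=True, disp=False)
        if fopt < best_nll:
            best_params, best_nll = xopt, fopt
    return best_params, best_nll
```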

Time-Series Analyses of Neural Data

A series of time-series models was applied to determine whether the activity of a given neuron was influenced by the choices of the animal and the computer and by the rewards in the current and previous trials. For these analyses, spikes were counted in successive 0.5-s bins defined relative to the time of target onset or feedback onset. In the present study, we focused on 3 different time bins corresponding to the fore period, delay period, and feedback period. Variability in the activity of cortical neurons is often temporally correlated (Lee et al. 1998; Bair et al. 2001). Therefore, in order to distinguish the effects of different behavioral variables in the previous trials on neural activity from the temporal correlation resulting from slow changes in the neuron's intrinsic excitability, the spike counts were detrended separately for each bin by taking the residuals from a linear regression model that included the trial number as the independent variable. We also modeled the temporal correlation in neural activity using a first-order autoregressive moving-average model with exogenous inputs (ARMAX; Ljung 1999). Because this model included a first-order autoregressive term and a first-order moving-average term, it is commonly referred to as ARMAX(1,1). In this model, the detrended spike count in a particular bin of trial t, y(t), is given by the following:

y(t) = A y(t−1) + B [u(t) u(t−1) u(t−2) u(t−3)]′ + e(t) + C e(t−1),
where u(t) is a row vector consisting of 3 binary variables corresponding to the animal's choice (0 and 1 for leftward and rightward choices, respectively), the computer's choice (coded in the same way as the animal's choice), and the reward (0 and 1 for unrewarded and rewarded trials, respectively) in trial t; A (1 × 1), B (1 × 12), and C (1 × 1) are the vectors of coefficients, and e(t) is the error term. As special cases of this ARMAX model, we also considered 1) a model without any autoregressive or moving-average terms (A = 0 and C = 0), ARMAX(0,0); 2) a model with only the first-order autoregressive term (C = 0), ARMAX(1,0); and 3) a model with only the first-order moving-average term (A = 0), ARMAX(0,1), in addition to 4) the full ARMAX(1,1) model described above. For ARMAX(0,0), which is equivalent to the standard multiple linear regression model, statistical significance for each coefficient in B was determined with a t-test (P < 0.05).
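
The detrending step and the ARMAX(0,0) special case, an ordinary multiple regression on the choice and reward variables of the current and 3 previous trials, might be sketched as follows (a Python sketch using statsmodels; the variable names and the handling of the first few trials, for which lagged events are undefined, are assumptions):

```python
import numpy as np
import statsmodels.api as sm

def detrend_counts(counts):
    """Residuals from regressing the spike counts of one bin on trial number."""
    trial_number = np.arange(len(counts))
    return sm.OLS(counts, sm.add_constant(trial_number)).fit().resid

def armax00_regression(counts, animal_choice, computer_choice, reward, n_lags=3):
    """ARMAX(0,0): detrended counts regressed on the choice and reward variables
    of the current and the previous n_lags trials (12 regressors for n_lags = 3)."""
    y = detrend_counts(counts)
    events = np.column_stack([animal_choice, computer_choice, reward]).astype(float)
    columns, names = [], []
    for lag in range(n_lags + 1):
        # row t of the shifted matrix holds the events of trial t - lag
        columns.append(np.roll(events, lag, axis=0))
        names += [f"{v} (lag {lag})" for v in ("animal choice", "computer choice", "reward")]
    X = sm.add_constant(np.hstack(columns))
    # discard the first n_lags trials, for which the lagged events are undefined
    res = sm.OLS(y[n_lags:], X[n_lags:]).fit()
    significant = res.pvalues[1:] < 0.05     # t-test on each coefficient in B
    return res, dict(zip(names, significant))
```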

We also applied the state-space model to the same data. The state-space model consists of the following transition and observation equations: 

x(t+1) = F x(t) + G u(t)′ + K e(t)
y(t) = H x(t) + J u(t)′ + e(t),
where x(t) is the one-dimensional state variable, e(t) is the error term, and F (1 × 1), G (1 × 3), K (1 × 1), H (1 × 1), and J (1 × 3) are the vectors of coefficients. In the present study, we only considered a one-dimensional state space. Therefore, this model assumes that the effects of the behavioral events in the previous trials on the neural activity are mediated by the state variable, which follows a first-order autoregressive process. If this assumption is true, then the above state-space model would account for the data more parsimoniously, namely, with fewer parameters, than the other time-series models. The performance of each model was evaluated with Akaike's information criterion (AIC), given by
AIC = −2 ln L + 2N,
where L denotes the likelihood of the model, computed under the assumption of a Gaussian distribution for the error term, and N the number of model parameters.
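
The following is a minimal sketch of how the one-dimensional state-space model generates one-step prediction errors and how the Gaussian AIC can be used to compare the candidate models (assuming the innovations form written above; whether the error variance was counted among the N parameters is not specified here, so it is left to the caller):

```python
import numpy as np

def state_space_prediction_errors(y, u, F, G, K, H, J, x0=0.0):
    """One-step prediction errors e(t) of the innovations-form model
        x(t+1) = F x(t) + G u(t)' + K e(t)
        y(t)   = H x(t) + J u(t)' + e(t)
    y: detrended spike counts; u: (n_trials x 3) matrix of the choice and
    reward variables; F, K, H are scalars and G, J are length-3 vectors."""
    G, J = np.asarray(G, dtype=float), np.asarray(J, dtype=float)
    x, errors = float(x0), []
    for t in range(len(y)):
        e = y[t] - (H * x + J @ u[t])      # innovation (one-step prediction error)
        errors.append(e)
        x = F * x + G @ u[t] + K * e       # state update
    return np.array(errors)

def gaussian_aic(prediction_errors, n_params):
    """AIC = -2 ln L + 2 N, with L the Gaussian likelihood of the prediction
    errors (maximum likelihood estimate of the error variance)."""
    e = np.asarray(prediction_errors, dtype=float)
    sigma2 = np.mean(e ** 2)
    log_lik = -0.5 * len(e) * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return -2.0 * log_lik + 2.0 * n_params

def best_model(candidates):
    """candidates: {model name: (prediction_errors, n_params)}; returns the
    name of the model with the smallest AIC."""
    return min(candidates, key=lambda name: gaussian_aic(*candidates[name]))
```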

To test whether signals related to the animal's choices and rewards in the previous trial were influenced by the type of decision made by the animal, the neural activity during the search trials (n = 130) and the first 260 trials of the matching pennies task was analyzed using the following regression model. Neurons examined for fewer than 260 trials in the matching pennies task were excluded from this analysis.

y(t) = BAll u(t)′ + BTask v(t)′ + BBlock w(t)′ + e(t),
where y(t) is the detrended spike count at trial t, u(t) is a vector consisting of the animal's choice, the computer's choice, and the reward in trial t, v(t) = u(t) for the trials in the matching pennies task (i.e., t = 131–390) and 0 otherwise, and w(t) = u(t) for the second block of 130 trials in the matching pennies task (i.e., t = 261–390) and 0 otherwise. BAll, BTask, and BBlock are the vectors of regression coefficients. Thus, BAll reflects the overall strength of signals related to various behavioral events, whereas BTask reflects the extent to which the neural activity related to the same behavioral events differs for the 2 tasks. By comparing the activity during the second block of 130 trials during the matching pennies task to the activity in the preceding trials, BBlock provides an estimate of nonstationarity in neural activity related to choices and rewards. For the search task, the computer's choice was defined as the correct (green) and incorrect (red) targets in rewarded and unrewarded trials, respectively.
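
The design matrix for this analysis might be constructed as in the following sketch (Python/statsmodels; only the current-trial regressors are shown, and lagged copies of u, v, and w would be appended in the same way to obtain the trial-lag results in Fig. 6):

```python
import numpy as np
import statsmodels.api as sm

def task_block_regression(counts, events, n_search=130, n_block=130):
    """Overall (B_All), task-specific (B_Task), and block-specific (B_Block) effects.

    counts: detrended spike counts for 130 search trials followed by 260
            matching pennies trials
    events: (390 x 3) matrix with the animal's choice, the computer's choice,
            and the reward in each trial
    """
    u = np.asarray(events, dtype=float)
    v = u.copy()
    v[:n_search] = 0.0                  # v(t) = u(t) only for matching pennies trials
    w = u.copy()
    w[:n_search + n_block] = 0.0        # w(t) = u(t) only for the 2nd matching pennies block
    X = sm.add_constant(np.hstack([u, v, w]))
    res = sm.OLS(counts, X).fit()
    b_all, b_task, b_block = np.split(res.params[1:], 3)
    return res, b_all, b_task, b_block
```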

Results

Choice Behavior during the Matching Pennies Task

The behavioral data were collected from a total of 81 742 trials in 140 recording sessions (Table 1). These data were analyzed by fitting a reinforcement learning model (see Materials and Methods). The parameters of the model were estimated separately for each recording session. Overall, the decay factors were relatively large and skewed toward one, indicating that the effects of the previous choice outcomes were integrated over multiple trials (Fig. 2A). In addition, in approximately two-thirds of the sessions (94/140 sessions), the value functions increased and decreased for the target chosen by the animal, when it was rewarded and unrewarded, respectively (Fig. 2B).

Figure 2.

Parameters of the reinforcement learning model applied to choice behaviors in the matching pennies task. (A) Distribution of decay factors. (B) Scatter plot for the incremental changes in the value function applied after rewarded (abscissa) and unrewarded (ordinate) trials.

Table 1

The number of recording sessions, trials, and neurons in each animal

Monkeys   Sessions   Trials    Trials/session   Neurons
          21         18 003    857.3            45
          44         25 884    588.3            88
          10         3395      339.5            21
          41         26 702    651.3            91
          24         7758      323.3            77
All       140        81 742    583.9            322

Signals Related to Previous Choices and Outcomes

Single-unit activity was recorded from 322 neurons in the DLPFC during the matching pennies task (Table 1). Each neuron was tested for 130 trials during the search task, and at least 128 trials during the matching pennies task. The average number of trials tested during the matching pennies task was 584 (Table 1). Many of these neurons modulated their activity according to the animal's previous choices and their outcomes. In some neurons, the previous choices of the computer opponent also influenced their activity. However, the time course of the activity related to these different behavioral events varied substantially across different neurons (Figs 3 and 4). For example, for the neuron illustrated in Figure 3, the time courses and strengths of signals related to the animal's choice, the computer's choice, and reward were relatively similar. This neuron increased its activity around the time of eye movements, when the animal chose the rightward target (Fig. 3, top, trial lag = 0). In addition, the activity of the same neuron increased during the fore period and delay period when the animal had selected the rightward target in the previous trial (Fig. 3, top, trial lag = 1) and when the animal had been rewarded in the previous trial (Fig. 3, bottom, trial lag = 1). During the delay period, this neuron also increased its activity when the computer opponent had selected the rightward target in the previous trial (Fig. 3, middle, trial lag = 1). On the other hand, the neuron shown in Figure 4 modulated its activity mostly according to the recent reward history of the animal. When the animal was rewarded in a particular trial, its immediate effect was to increase the neuron's activity (Fig. 4, bottom, trial lag = 0). However, the activity of this neuron was reduced when the animal was rewarded in the previous 2 trials (Fig. 4, bottom, trial lag = 1 and 2). Thus, the activity of this neuron during the feedback period was enhanced when the animal was rewarded after one or more unrewarded trials.

Figure 3.

An example neuron in the DLPFC that modulated its activity according to the animal's choice, computer's choice, and reward in the previous trial (trial lag = 1). Each panel shows a pair of spike density functions estimated separately for trials sorted by the animal's choice, the computer's choice, or reward in the current trial (trial lag = 0) or previous trials (trial lag = 1–3). Neural activity is aligned according to the target onset (left plots) or feedback onset (right plots). In the top 2 rows, the black and blue lines correspond to the leftward and rightward choices, whereas in the bottom row, they correspond to the unrewarded and rewarded trials, respectively. Dotted vertical lines correspond to the time when the animal fixated the central target or the onset time of feedback ring. Circles are the standardized regression coefficients from a linear regression model, and filled circles indicate that the effect was statistically significant (t-test, P < 0.05).

Figure 4.

Another example neuron in the DLPFC that modulated its activity according to the animal's choice and reward in the previous trial. In this neuron, the effect of reward was maintained in multiple trials. Same format as in Figure 3.

How the activity of neurons in the DLPFC was influenced by different behavioral events was quantified using a regression model, referred to as ARMAX(0,0) in the present study. The results from this analysis showed that the animal's choice, the computer's choice, and the reward in the previous trial significantly influenced the activity in 36.0%, 19.3%, and 37.9% of the DLPFC neurons during the fore period, respectively, and in 40.1%, 18.3%, and 33.2% of the neurons during the delay period (Fig. 5). The fractions of DLPFC neurons whose fore-period activity in a given trial was significantly modulated by the animal's choice 2 trials earlier and by the reward 2 trials earlier were both 11.2%, and the fraction significantly modulated by the choice of the computer opponent 2 trials earlier was 7.8%. During the feedback period, the activity of DLPFC neurons was frequently affected by the animal's choice, the computer's choice, and the reward in the same trial, but it was also affected by the animal's choice and reward in the 3 previous trials (Fig. 5).

Figure 5.

Time course of neural signals related to choice and reward. Histograms show the fractions of neurons that displayed significant modulations in their activity according to the animal's choice (top), the computer's choice (middle), and the choice outcome (reward, bottom) during various time bins in the current (trial lag = 0) and previous (trial lag = 1, 2, and 3) trials. The asterisks indicate that the proportion of neurons is significantly higher than the P value (0.05) used in the regression analysis according to a binomial test (P < 0.05).

Task-Specific Choice Signals

To test whether activity changes related to the animal's previous choices and their outcomes were specific to the matching pennies task, we analyzed the activity of 284 neurons in which the data were collected from at least 260 trials during the matching pennies task in addition to 130 trials during the search task. In order to exclude the possibility that seemingly task-specific activity might be due to random nonstationary changes in neural activity, we included a set of control variables to determine whether significant changes also occurred between the 2 successive blocks of trials in the matching pennies task (see Materials and Methods). We found that the overall percentages of neurons that modulated their activity according to the previous choices of the animal and the computer opponent or the outcomes of the animal's choices were similar for the search task and the matching pennies task (data not shown). Nevertheless, many neurons in the DLPFC encoded the signals related to the animal's choices in the current and previous trials differently for the 2 tasks (Fig. 6). For example, during the delay period, 23.9% of the neurons modulated their activity according to the animal's choice differently for the search task and the matching pennies task (Fig. 6, top, trial lag = 0). This is not surprising because the animal received explicit instruction about its eye movements only in the search task. By contrast, only 3.5% of the neurons displayed similar changes during the delay period between the 2 successive blocks of trials in the matching pennies task. This difference was statistically significant (χ2 test, P < 10−10). During the delay period, 15.5% of the neurons also modulated their activity according to the animal's choice in the previous trial differently for the 2 tasks, whereas only 5.3% of the neurons displayed similar changes between the 2 blocks of trials in the matching pennies task (χ2 test, P < 10−4). By contrast, large task-specific changes in neural activity related to the choice of the computer opponent or reward were seen only during the feedback period of the same trial (Fig. 6, middle and bottom), suggesting that the outcomes of the animal's previous choices influenced the neural activity in the DLPFC similarly for the 2 tasks.

Figure 6.

Task-specific modulation of neural signals related to choice and reward. Histograms labeled “All” show the fractions of neurons that displayed overall modulations in their activity according to the animal's choice, the choice of the computer, and reward in the current trial (trial lag = 0) or in the previous trials (trial lag = 1 or 2) regardless of the task. Histograms labeled “Task” show the fractions of neurons in which the effects of behavioral events on neural activity differed significantly for the trials in the search task and matching pennies task. Finally, histograms labeled “Block” show the fractions of neurons that displayed nonstationary changes in the signals related to choices and rewards between the 2 successive blocks of 130 trials in the matching pennies task.

Comparison of ARMAX and State-Space Models

The regression model described above included each of 3 different behavioral events in 3 previous trials. This postulates that information about each of these distinct events is stored in the brain separately, and their effects are combined to determine the activity of individual neurons in a given trial. Alternatively, neural activity in a given trial might be determined by the state of the brain that undergoes certain dynamic changes on a trial-by-trial basis under the influence of certain behavioral events. To test this possibility, we applied a state-space model, commonly known as the Kalman filter model, to estimate the state in each trial and used this state information to predict the activity of each neuron (see Materials and Methods). For comparison, 3 other time-series models were fit to the data, namely, a first-order autoregressive model, a first-order moving-average model, and a first-order autoregressive moving-average model. All these models included the same exogenous input variables used in the state-space model. According to the AIC, the state-space model was selected as the best model most frequently regardless of the epochs examined (Fig. 7). For some neurons, the autoregressive model or the autoregressive moving-average model performed better than the state-space model. The moving-average model was never chosen as the best model.

Figure 7.

Fraction of neurons in the DLPFC for which a particular time-series model was chosen as the best model. SS(1), first-order state-space model; (0, 0), regression model without autoregressive or moving-average terms; (1, 0), first-order autoregressive model; (0, 1), first-order moving-average model; and (1, 1), first-order autoregressive moving-average model.

Discussion

Using a decision-making task that simulated a simple competitive interaction with another decision maker, we found that monkeys tend to approach the optimal decision-making strategy in a manner consistent with a reinforcement learning algorithm (Lee et al. 2004; Corrado et al. 2005; Lau and Glimcher 2005; Lee et al. 2005; Samejima et al. 2005). The decay factors in the reinforcement learning model were relatively large, suggesting that the outcomes of multiple trials in the past were temporally integrated and influenced the animal's choice in a given trial. Consistent with this behavioral finding, a significant number of neurons in the DLPFC also modulated their activity according to the animal's choices and their outcomes in multiple trials. The fact that the state-space model accounted for the neural data more parsimoniously than the other time-series models suggests that the signals related to the animal's choices and their outcomes might be temporally integrated in the form of a state variable in the DLPFC. Therefore, the DLPFC might be an important node in the cortical network that is responsible for monitoring the outcomes of previous choices and using that information to update the animal's decision-making strategies dynamically. However, the exact mechanism by which these signals are used to update the value functions or decision-making strategies is not known. The different types of signals identified in the present study, such as those related to the animal's choices and rewards, might contribute to the following aspects of adaptive decision making.

First, in reinforcement learning theory, signals related to the decision maker's previous actions are referred to as the eligibility trace. Such signals can link a reward delivered at a particular time step to the action that caused it, when these 2 events are temporally separated (Sutton and Barto 1998). An eligibility trace was not incorporated into the reinforcement learning algorithm we applied to model the animal's choice behavior, because during the matching pennies task the outcome of a particular action was revealed immediately. Nevertheless, the neural signals related to the eligibility trace might be utilized in more complex tasks involving multistage decision making (Saito et al. 2005; Averbeck et al. 2006; Sohn and Lee 2006). Second, signals related to the rewards in the previous trials might be used to compute an average rate of reward. It has been reported that neurons in the orbitofrontal cortex also encode signals related to rewards in the previous trials (Sugrue et al. 2004). During the process of decision making, information about the average reward rate might be utilized in several ways. For example, in a class of reinforcement learning algorithms referred to as average reward reinforcement learning, the average reward rate is used as a criterion for optimal decision making (Mahadevan 1996). In addition, choices of humans and other animals may be influenced by the same outcome differently, depending on whether it is considered a gain or a loss (Tinklepaugh 1928; Crespi 1942; Zeaman 1949; Kahneman and Tversky 1979; Flaherty 1982). Therefore, signals related to the reward rate may influence the process of decision making by providing a frame of reference (Helson 1948). Information about the average reward rate may also play a role in setting the optimal threshold used to terminate evidence accumulation during perceptual decision making (Simen et al. 2006) or in switching between exploitation and exploration (Aston-Jones and Cohen 2005). Finally, neurons in the DLPFC encoded signals related to the previous choices of the computer opponent, although less often than those related to the animal's previous choices and rewards. During the matching pennies game, the animal was rewarded only when it chose the same target as the computer opponent, so signals related to the computer's previous choices might directly contribute to the process of computing the value functions for alternative choices.

Signals related to the animal's choices, their outcomes, and the previous choices of the opponent were sometimes multiplexed in a single neuron in the DLPFC. In addition, when these same variables were used as exogenous inputs, the one-dimensional state-space model often provided a parsimonious description of activity in the DLPFC. This raises the possibility that in some neurons, the process of integration might be applied after signals related to multiple variables are combined. Whether these different types of signals are then demultiplexed and utilized for different purposes by separate groups of downstream neurons is not known. In addition, single-cell and network mechanisms for integrating these signals in the prefrontal cortex are not well understood. It has been shown that a recurrent network combined with a reward-dependent stochastic Hebbian learning rule can reproduce the choice behavior observed in monkeys during the matching pennies game (Soltani and Wang 2006; Soltani et al. 2006). However, mechanisms for temporally integrating signals related to these multiple events need to be further investigated in future studies.

We are grateful to Lindsay Carr and John Swan-Stone for their technical assistance. This study was supported by a grant from the National Institute of Mental Health (MH073246).

Conflict of Interest: None declared.

References

Aston-Jones G, Cohen JD. 2005. An integrative theory of locus coeruleus-norepinephrine function: adaptive gain and optimal performance. Annu Rev Neurosci. 28:403–450.

Averbeck BB, Sohn J-W, Lee D. 2006. Activity in prefrontal cortex during dynamic selection of action sequences. Nat Neurosci. 9:276–282.

Bair W, Zohary E, Newsome WT. 2001. Correlated firing in macaque visual area MT: time scales and relationship to behavior. J Neurosci. 21:1676–1697.

Barraclough DJ, Conroy ML, Lee D. 2004. Prefrontal cortex and decision making in a mixed-strategy game. Nat Neurosci. 7:404–410.

Corrado GS, Sugrue LP, Seung HS, Newsome WT. 2005. Linear-nonlinear-Poisson models of primate choice dynamics. J Exp Anal Behav. 84:581–617.

Crespi LP. 1942. Quantitative variation of incentive and performance in the white rat. Am J Psychol. 55:467–517.

Daw ND, Doya K. 2006. The computational neurobiology of learning and reward. Curr Opin Neurobiol. 16:199–204.

Flaherty CF. 1982. Incentive contrast: a review of behavioral changes following shifts in reward. Anim Learn Behav. 10:409–440.

Goldberg ME, Bushnell MC, Bruce CJ. 1986. The effect of attentive fixation on eye movements evoked by electrical stimulation of the frontal eye fields. Exp Brain Res. 61:579–584.

Goldman-Rakic PS. 1995. Cellular basis of working memory. Neuron. 14:477–485.

Helson H. 1948. Adaptation-level as a basis for a quantitative theory of frames of reference. Psychol Rev. 55:297–313.

Kahneman D, Tversky A. 1979. Prospect theory: an analysis of decision under risk. Econometrica. 47:263–291.

Lau B, Glimcher PW. 2005. Dynamic response-by-response models of matching behavior in rhesus monkeys. J Exp Anal Behav. 84:555–579.

Lee D. 2006. Neural basis of quasi-rational decision making. Curr Opin Neurobiol. 16:191–198.

Lee D, Conroy ML, McGreevy BP, Barraclough DJ. 2004. Reinforcement learning and decision making in monkeys during a competitive game. Cogn Brain Res. 22:45–58.

Lee D, McGreevy BP, Barraclough DJ. 2005. Learning and decision making in monkeys during a rock-paper-scissors game. Cogn Brain Res. 25:416–430.

Lee D, Port NL, Kruse W, Georgopoulos AP. 1998. Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. J Neurosci. 18:1161–1170.

Ljung L. 1999. System identification: theory for the user. Upper Saddle River (NJ): Prentice-Hall Inc.

Mahadevan S. 1996. Average reward reinforcement learning: foundations, algorithms, and empirical results. Mach Learn. 22:159–195.

Miller EK, Cohen JD. 2001. An integrative theory of prefrontal cortex function. Annu Rev Neurosci. 24:167–202.

Pawitan Y. 2001. In all likelihood: statistical modelling and inference using likelihood. Oxford: Oxford University Press.

Saito N, Mushiake H, Sakamoto K, Itoyama Y, Tanji J. 2005. Representation of immediate and final behavioral goals in the monkey prefrontal cortex during an instructed delay period. Cereb Cortex. 15:1535–1546.

Samejima K, Ueda Y, Doya K, Kimura M. 2005. Representation of action-specific reward values in the striatum. Science. 310:1337–1340.

Schultz W. 2006. Behavioral theories and the neurophysiology of reward. Annu Rev Psychol. 57:87–115.

Simen P, Cohen JD, Holmes P. 2006. Rapid decision threshold modulation by reward rate in a neural network. Neural Netw. 19:1013–1026.

Sohn J-W, Lee D. 2006. Effects of reward expectancy on sequential eye movements in monkeys. Neural Netw. 19:1181–1191.

Soltani A, Lee D, Wang X-J. 2006. Neural mechanism for stochastic behaviour during a competitive game. Neural Netw. 19:1075–1090.

Soltani A, Wang X-J. 2006. A biophysically based neural model of matching law behavior: melioration by stochastic synapses. J Neurosci. 26:3731–3744.

Sugrue LP, Corrado GS, Newsome WT. 2004. Neural correlates of value in orbitofrontal cortex of the rhesus monkey. Program No. 671.8. 2004 Abstract Viewer/Itinerary Planner [Internet]. Washington (DC): Society for Neuroscience.

Sutton RS, Barto AG. 1998. Reinforcement learning: an introduction. Cambridge (MA): MIT Press.

Tinklepaugh OL. 1928. An experimental study of representative factors in monkeys. J Comp Psychol. 8:197–236.

von Neumann J, Morgenstern O. 1944. Theory of games and economic behavior. Princeton (NJ): Princeton University Press.

Zeaman D. 1949. Response latency as a function of the amount of reinforcement. J Exp Psychol. 39:466–483.