The orbital frontal cortex appears to be involved in learning the rules of goal-directed behavior necessary to perform the correct actions based on perception to accomplish different tasks. The activity of orbitofrontal neurons changes dependent upon the specific task or goal involved, but the functional role of this activity in performance of specific tasks has not been fully determined. Here we present a model of prefrontal cortex function using networks of integrate-and-fire neurons arranged in minicolumns. This network model forms associations between representations of sensory input and motor actions, and uses these associations to guide goal-directed behavior. The selection of goal-directed actions involves convergence of the spread of activity from the goal representation with the spread of activity from the current state. This spiking network model provides a biological implementation of the action selection process used in reinforcement learning theory. The spiking activity shows properties similar to recordings of orbitofrontal neurons during task performance.
The orbitofrontal cortex plays an important role in goal-directed behavior (Wallis et al., 2001). Lesions of the orbitofrontal cortex impair the ability of animals to learn which stimuli are associated with reward (Bechara et al., 1994, 1997; Frey and Petrides, 1997; Miller and Cohen, 2001; Pears et al., 2003; Izquierdo and Murray, 2004). Recordings from orbitofrontal cortex neurons demonstrate that spiking activity in response to sensory stimuli changes dependent upon the association of a stimulus with a reward in humans (Rolls, 1999), non-human primates (Thorpe et al., 1983; Schultz et al., 2000; Wallis and Miller, 2003) and rats (Mulder et al., 2003; Schoenbaum and Eichenbaum, 1995a,b; Schoenbaum et al., 2003). The orbitofrontal cortex appears to be particularly important when the generation of specific actions depends upon the context of particular sensory stimuli (Miller and Cohen, 2001). Here we focus on behavior directed toward a specific goal; we do not yet deal with decisions about the relative value of different goals (Balleine and Dickinson, 1998; Tremblay and Schultz, 1999).
Here we present a computational model that is applicable to multiple regions of the prefrontal cortex (PFC), demonstrating how populations of spiking neurons could mediate goal-directed behavior. In particular, we demonstrate how representations of specific motor actions can be used for goal-directed behavior in multiple different circumstances, dependent upon the context of specific sensory stimuli. This modeling effectively simulates the behavior and pattern of activity of orbitofrontal cortex neurons described in an experiment by Schultz et al. (2000) — neurons that show response to sensory stimuli, to reward and to expectation of reward. This task involves the differential generation of Go versus NoGo responses to randomly presented visual cues. Recordings demonstrated that some neurons in the orbitofrontal cortex do indeed fire selectively for the transition from one specific state to another. Schultz et al. (2000) identified these neurons, labeling them as selective for the instruction that initiates a specific trial, as well as predictive for a specific action.
Previous models of frontal cortex function have used neurons with sigmoid input–output functions which represent firing of populations of neurons (Cohen and Servan-Schreiber, 1992; O'Reilly and Munakata, 2000). In order to model the patterns of spiking activity more directly during behavioral tasks, we use integrate-and-fire neurons (Stein, 1967; Gerstner, 2002; Gerstner and Kistler, 2002) with Hebbian spike-timing-dependent synaptic plasticity (STDP) (Levy and Steward, 1983). Integrate-and-fire neurons simulate the membrane potential response to the build-up of synaptic input over time and emit a spike when the potential crosses threshold. The model shows how integrate-and-fire neurons can perform the functions described in equations for a circuit model of the PFC (Hasselmo, 2005). The structure of the model was motivated by anatomical evidence suggesting the organization of neural circuits into minicolumns (Lund et al., 1993), cell assemblies of highly interconnected neurons found in the PFC. In our model, different minicolumns responded to both sensory input and motor actions, consistent with evidence (Fuster, 1973, 2000; Fuster et al., 1982; Funahashi et al., 1989; Quintana and Fuster, 1992) that activity in the PFC represents two types of perception: (i) the perception of past sensory stimuli available due to short-term buffers and current sensory stimuli; and (ii) the proprioceptive sensation and prediction of motor actions. The organization into minicolumns was motivated by evidence for strong excitatory and inhibitory connectivity within local circuits of cortical neurons (Mountcastle, 1997; Lübke and von der Malsburg, 2004). The rapid strengthening of associations between sensory states, motor actions and reward is motivated by studies showing rapid changes in functional interactions between populations of prefrontal neurons during learning (Thorpe et al., 1983; Schoenbaum et al., 2000; Mulder et al., 2003).
The structure of this model closely resembles features of reinforcement learning (Sutton and Barto, 1981; Schultz et al., 1997; Sutton and Barto, 1998), so we will commonly refer to sensory information from the environment as ‘state’. We will refer to motor output as ‘actions’ and to the desired goal as ‘reward’. However, this model does not focus on the temporal difference learning rule (Sutton, 1988), a rule that uses the difference between successive outputs as error measure. Instead it focuses on mechanisms of action selection associated with specific sensory states and reward. This demonstrates how integrate-and-fire neurons can perform the circuit mechanism of action selection proposed in a more abstract model of the PFC (Hasselmo, 2005).
In the following sections we simulate the proposed mechanism of the prefrontal minicolumn circuitry and apply that to the delayed Go/NoGo task with its reward protocol for different stimuli. We focus on explaining selective neuronal activity, as recorded by Schultz et al., with our model.
Materials and Methods
This model focused on replicating neuronal activity and behavior in the experiments by Schultz et al. (2000). In these experiments, an initial visual stimulus indicates one of three possible trials (Fig. 1A): (i) rewarded movement stimulus (Srm), whereby reward is given if the monkey presses a key; (ii) rewarded non-movement stimulus (Srnm), whereby reward is given if the monkey chooses not to press the key; (iii) unrewarded movement stimulus (Surm), whereby the reward is not given but the key press is still required. Unless the movement is performed in the Surm trial, another unrewarded Surm trial follows. The decision to move or not to move followed a delay of 2 s, when a trigger signal was given, which was identical in each trial. Schultz et al. found that orbitofrontal neurons that showed task-related activity fired selectively. Some responded with increased firing rates to a specific instruction cue, some responded with increased firing rates predictive of Go/NoGo choice according to the expectation of reward, and some responded with increased firing rates to reward received.
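The trial structure described above can be summarized in a few lines of Python. This is only an illustrative sketch of the protocol, not the simulation code used for the model. The names `TRIALS`, `DELAY_S` and `outcome` are hypothetical, and treating an incorrect response in a rewarded trial as simply unrewarded is an assumption that the task description does not spell out.

```python
# Each instruction stimulus maps to (required action, whether reward follows).
TRIALS = {
    "Srm": ("Go", True),     # rewarded movement
    "Srnm": ("NoGo", True),  # rewarded non-movement
    "Surm": ("Go", False),   # unrewarded movement
}
DELAY_S = 2.0  # delay between instruction and trigger signal, in seconds


def outcome(stimulus, action):
    """Evaluate one trial of the delayed Go/NoGo task.

    Assumption: an incorrect response never yields reward.
    """
    required, rewarded = TRIALS[stimulus]
    correct = action == required
    return {
        "correct": correct,
        "reward": correct and rewarded,
        # Another Surm trial follows unless the movement was performed.
        "repeat_surm": stimulus == "Surm" and not correct,
    }
```

For example, `outcome("Surm", "NoGo")` reports that the unrewarded movement trial has to be repeated, while `outcome("Srnm", "NoGo")` reports a rewarded correct response.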
We propose that goal-directed behavior is learned by associating states and actions that are separately represented by the population of neurons of individual minicolumns. A state is indicated by the perception of specific sensory stimuli or the perception of reward received, while an action is indicated by proprioceptive input about motor activity. According to our hypothesis, the initial states Srm, Srnm and Surm, as well as the Reward state, are represented by activity in individual minicolumns in the PFC, while activity in a further two minicolumns represents action selections Go (move to press a key) or NoGo. During learning of goal-directed behavior, STDP strengthens connections within and between minicolumns so that state and action representations are associated. Because activity that corresponds to consecutive states and actions may appear at arbitrary time intervals, a short-term buffer based on persistent spiking due to after-depolarization (ADP) of membrane potential (Andrade, 1991; Klink and Alonso, 1997b) is used to enable encoding with STDP (Lisman and Idiart, 1995; Jensen et al., 1996; Koene et al., 2003).
We propose that the retrieval of goal-directed behavior depends on the spread of activity through strengthened connections from a minicolumn that represents the reward state and from the specific state minicolumn activated by current input. Consistent with this hypothesis, experimental evidence indicates that retrieval in the PFC produces goal-directed activity that is initiated by the desire for a goal (Schultz, 1998; Schultz and Dickinson, 2000; Miller and Cohen, 2001). In our model, the spread of activity from the representation of current state is gated by the spread from a desired goal. When the gated spread produces output from the minicolumn that represents the current state, the correct next action is selected. Hence, the convergence of activity from a current state representation and from a goal representation governs goal-directed behavioral responses.
Given the representation of states and actions, the transition from one state to another state via a specific action can be encoded uniquely if there is specific neural activity that occurs only for that action and only when the action is initiated in a particular state. This requirement leads to the presupposition that a functional minicolumn contains populations of input neurons and populations of output neurons that form connections with other minicolumns, and that the neurons in those populations are connected in a structured manner to other minicolumns (in this simulation to exactly one). Since the combination of activity at a specific input neuron and a specific output neuron of an action minicolumn represents the transition from a preceding state to a following state, that information gives the model the Markov property (Sutton and Barto, 1998). With this property, one-step dynamics enable us to predict the next state and expected reward for a specific action.
We developed simulations of the Schultz et al. task with Catacomb2 (Cannon et al., 2003) that replicated the actions of an agent (monkey) within an environment, as well as integrate-and-fire neuron dynamics in PFC. With our approach (which we call ‘design-based’ modeling), data from a simulated operant task protocol was linked with simulated neuronal circuitry for sensory processing and functions of the PFC (see Fig. 1B). Further details of the neurophysiology were modeled explicitly where needed for specific functional requirements, such as the after-depolarization experienced by specific neuron populations that may enable persistent firing.
The integrate-and-fire neurons in our model of PFC minicolumns have a resting and reset potential of −60 mV and an exponential decay time constant of 10 ms. The firing threshold is −50 mV and action potentials have a duration of 1 ms, followed by a 2 ms refractory period and subsequent strong after-hyperpolarization with reversal potential −90 mV and exponential decay time constant 30 ms. We used dual-exponential functions for the responses of synaptic conductances. Unless the description of a specific synaptic connection indicates otherwise, the time constant for the rise of the dual-exponential response function was 2 ms and the time constant for the fall was 4 ms. Excitatory synaptic connections had a reversal potential of 0 mV and inhibitory synaptic connections had a reversal potential of −70 mV.
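As an illustration of these parameters, the following Python sketch integrates a single conductance-based integrate-and-fire neuron with forward Euler steps. It is a minimal sketch rather than the Catacomb2 implementation. The leak conductance `G_LEAK` and the after-hyperpolarization increment `G_AHP_STEP` are not given in the text and are assumed values, so the conductances quoted elsewhere in the paper should not be expected to reproduce the same sub- or suprathreshold behavior here. Holding the membrane at the reset potential for the spike duration plus the refractory period is also a simplification.

```python
import math

# Values taken from the text (mV, ms)
E_REST = -60.0      # resting and reset potential
V_THRESH = -50.0    # firing threshold
TAU_M = 10.0        # membrane decay time constant
E_EXC = 0.0         # excitatory reversal potential
E_AHP = -90.0       # after-hyperpolarization reversal potential
TAU_AHP = 30.0      # after-hyperpolarization decay time constant
T_SPIKE = 1.0       # action potential duration
T_REFRACT = 2.0     # refractory period
TAU_RISE = 2.0      # default synaptic rise time constant
TAU_FALL = 4.0      # default synaptic fall time constant

# Assumed values (not stated in the text), in nS
G_LEAK = 10.0
G_AHP_STEP = 5.0


def dual_exp(t, tau_rise=TAU_RISE, tau_fall=TAU_FALL):
    """Dual-exponential response, normalized so that its peak equals 1."""
    if t < 0.0:
        return 0.0
    t_peak = (tau_rise * tau_fall / (tau_fall - tau_rise)) * math.log(tau_fall / tau_rise)
    norm = math.exp(-t_peak / tau_fall) - math.exp(-t_peak / tau_rise)
    return (math.exp(-t / tau_fall) - math.exp(-t / tau_rise)) / norm


def simulate(g_peak, input_times, t_end=100.0, dt=0.05):
    """Return spike times (ms) for excitatory inputs of peak conductance g_peak (nS)."""
    v = E_REST
    g_ahp = 0.0
    blocked_until = -1.0
    spikes = []
    for i in range(int(round(t_end / dt))):
        t = i * dt
        g_syn = g_peak * sum(dual_exp(t - t0) for t0 in input_times)
        g_ahp -= dt * g_ahp / TAU_AHP           # AHP conductance decays exponentially
        if t < blocked_until:                   # spike plus refractory period
            v = E_REST
            continue
        dv = (-(v - E_REST)
              - (g_syn / G_LEAK) * (v - E_EXC)
              - (g_ahp / G_LEAK) * (v - E_AHP)) / TAU_M
        v += dt * dv
        if v >= V_THRESH:
            spikes.append(t)
            v = E_REST
            g_ahp += G_AHP_STEP
            blocked_until = t + T_SPIKE + T_REFRACT
    return spikes
```

Calling `simulate(20.0, [10.0])` delivers one strong excitatory input at 10 ms, whereas `simulate(0.0, [10.0])` leaves the neuron at rest.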
In the simulation of the operant task environment, stimuli produced by visual cues and reward, as well as proprioceptive sensation of motor activity, are conveyed as spike trains (top of Fig. 2) that are produced by specific neurons [signal pathway (a) in Fig. 1B]. The simulation of perceptual processing circuitry receives those spike trains and transforms them into reliable sequences of state–action spike pairs (bottom of Fig. 2). Every time that a spike train corresponding to a new state or a new motor action is detected, a pair of spikes is generated that represents the most recent state and the most recent action. The individual spike times of a state–action spike pair are separated by several cycles of theta rhythm to ensure that persistent spiking of the most recent two spike inputs to the short-term buffer occurs over a sufficient duration to achieve strong associative connections through STDP. To simplify the readability of the graphs, an identity matrix is used for input connections to the set of PFC minicolumns instead of a learned mapping [signal pathway (b) in Fig. 1B]. Motor action in the operant task is driven by the output of prefrontal minicolumns [signal pathway (c) in Fig. 1B]. In this manner, the seven trials shown in Figure 2 are simulated during encoding so that all relevant rules are learned in the network of prefrontal minicolumns.
Specific Neuron Populations within Prefrontal Minicolumns Achieve the Gating of the Forward Spread of Activity by Spread from the Goal
Retrieval and encoding of associations between prefrontal minicolumns that represent states and actions are assumed to take place in opposite phase intervals of rhythmic modulation at 8 Hz (Hasselmo et al., 2002) that represents theta rhythm found in the PFC and hippocampus (Manns et al., 2000). This enables both to occur at any time during a task. The modulation supports different dynamics in the two modes. We will therefore discuss the distinct functions of encoding and retrieval separately, even though they alternate continuously during a simulated task. The modulating rhythm also serves to ensure that activity in different simulated brain regions is properly synchronized, as described in our previous work (Koene et al., 2003). The plot of membrane potential for the buffer neuron abuf (Rew) in Figure 6B provides an example of the modulation by theta rhythm and clearly demonstrates rhythmic changes at 125 ms intervals.
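The alternation of the two modes can be made concrete with a short Python sketch. An 8 Hz rhythm has a period of 1000/8 = 125 ms, which matches the interval seen in the membrane potential plot. The function name `theta_mode`, the equal split of the cycle into two halves, and the choice of which half serves encoding are assumptions made for illustration only; the text states only that the two modes occupy opposite phase intervals.

```python
THETA_HZ = 8.0
PERIOD_MS = 1000.0 / THETA_HZ  # 125 ms per theta cycle


def theta_mode(t_ms, encoding_first=True):
    """Return the mode active at time t_ms.

    Assumption: each mode occupies one half of the theta cycle.
    """
    phase = (t_ms % PERIOD_MS) / PERIOD_MS   # position in the cycle, 0 <= phase < 1
    in_first_half = phase < 0.5
    return "encoding" if in_first_half == encoding_first else "retrieval"
```

With the default setting, `theta_mode(10.0)` and `theta_mode(135.0)` fall in the same phase interval of consecutive cycles.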
As shown in Figure 3, we distinguish five populations of pyramidal neurons in each presupposed functional minicolumn of PFC: a, gi, go, ci and co. Of these, each a neuron connects exclusively to other neurons within the same minicolumn and plays an important role during encoding of associations between minicolumns. These a neurons represent neurons that receive thalamic input in layer IV of PFC. The neurons of a population labeled go experience suprathreshold depolarization during encoding in response to input from a (with a fixed conductance of 5.2 nS and time constants 1 ms for the rise and 2 ms for the fall of the synaptic response), but during retrieval go is inhibited by an interneuron network that is driven by a. A spike in a during encoding also provides subthreshold depolarization to all neurons of a population labeled gi (with a fixed conductance of 1.0 nS and time constants 12 ms for the rise and 20 ms for the fall of the synaptic response).
The output of each neuron in the go population projects to one of the other minicolumns in the PFC network. In the gi population, each neuron receives one connection from a go neuron located in another minicolumn. Synaptic weights are modifiable on these connections between different minicolumns and are the elements of a matrix Wg. When strengthened, the Wg connection can fire a unit gi if the presynaptic unit go is active. Such a connection indicates that a rule was learned that expresses the knowledge that activity in the minicolumn containing the postsynaptic neuron gi preceded activity in the minicolumn of the connected go neuron.
Similarly, each neuron of a population co makes one connection to a neuron in a ci population of another minicolumn, so that activity in the co population can target any one of the other minicolumns specifically. Again, the synaptic strengths of such connections are modifiable and make up elements in a matrix Wc. Unlike the effect of synaptic weights in Wg, postsynaptic depolarization due to input through a connection with the maximum strength in Wc is subthreshold, so that spiking in ci remains dependent on additional input. The additional input to neurons in ci, which can elevate their membrane potential over threshold, is supplied by one-to-one connections (an identity matrix) from neurons in go (with a conductance of 2.5 nS and time constants 1 ms for the rise and 2 ms for the fall of the synaptic response). The activity of go therefore fulfills a gating role with regard to spike propagation to ci.
Within a minicolumn, every neuron in gi connects to every neuron in go through modifiable synapses with weights in Wig, while every neuron in ci connects to every neuron in co through modifiable synapses with weights in Wic. The maximum depolarization caused by a connection encoded in Wig is suprathreshold, while depolarization caused by strengthened connections in Wic is limited to subthreshold values. Additional depolarization is provided to co by one-to-one connections from neurons in gi (with a conductance of 2.5 nS and time constants 1 ms for the rise and 2 ms for the fall of the synaptic response). This provides a gating function for decisions about which action is selected based on convergence. The fan-out of connections within a minicolumn between gi and go and between ci and co enables the encoding of multiple routes between minicolumns. The following sections will first describe the retrieval process and then describe encoding.
Retrieving Behavioral Rules in the PFC
Miller and Cohen propose that the top-down processing in which behavior is guided by internal states or intentions (cognitive control) stems from the active maintenance of patterns of activity in PFC that represent goals and the means to achieve them. They suggest that these patterns provide a bias that guides activity affecting behavior (a gating function), and they support their theory with a review of neurobiological, neuroimaging and computational studies (Miller and Cohen, 2001).
In our simulation, associations that form known rules are encoded in PFC. A desire for reward then elicits a spread of activity from the minicolumn representing that reward state (see dashed lines in Fig. 3a and left arrows in Fig. 3b). The neurons of the go population within that Reward minicolumn spike simultaneously in response to rhythmic input at an 8 Hz theta frequency. Those spikes propagate along connections with strengthened synaptic weights in Wg and produce a spike in the targeted gi neurons of minicolumns that immediately preceded the Reward minicolumn in a known rule. Within such a preceding minicolumn (a minicolumn that represents an action) a spike elicited at a neuron in the gi population fans out across strengthened connections to neurons in the go population of that minicolumn. Through those connections with strengthened synaptic weights in Wig, suprathreshold depolarization is elicited at the target go neuron. This same process is repeated in other consecutive minicolumns to spread activity through the gi and go populations of consecutive action and state minicolumns. As the spread branches out, it follows multiple reverse paths through connections that associate states and actions. Once the spread of activity reaches the minicolumn that represents the current state, the convergence of current state and goal spread allows selection of action. In addition, spikes in go neurons are inhibited (‘end-stopping’) by the synchronous activity of interneurons (with time constants 1 ms for the rise and 10 ms for the fall of the synaptic response of the input) elicited by input that identifies the current state.
The selection of action is indicated by an interaction of the goal spread with current state. The input that identifies the current state also targets the neurons in the co population of the same current state minicolumn. The excitatory input produces a subthreshold depolarization of co neurons. In addition to this input, the spiking of neurons in the co population is gated by population gi activity in the same minicolumn due to the spread of activity from the goal. Those co neurons that receive additional depolarization from spiking neurons in the gi population fire.
The present simulation uses only the first step of the forward spread to determine output that controls goal-directed behavior in the task, so the forward gating only has an effect on the co of the minicolumn representing current state. The output of neurons in the co populations of state minicolumns that target action minicolumns is connected to the motor circuitry of the simulation. A spike in co thereby drives motor output of the corresponding action (thick black arrow in Fig. 3a). A spike in co also causes spiking in interneurons that provide lateral inhibition to the remaining neurons in co, so that a clear winner-takes-all behavioral response is obtained.
For other applications, the minicolumn model also enables a forward spread of activity for known associations encoded in the PFC (see dotted lines in Fig. 3a and right arrows in Fig. 3b). The spikes that propagate through connections with strengthened synaptic weights in Wc cause subthreshold depolarization of a ci neuron in the associated action minicolumns. Again, forward spread of activity is gated by the spread from the goal, since a neuron in the ci population needs additional depolarization from a corresponding neuron in the go population to fire. The spike of a ci neuron fans out through connections with strengthened synaptic weights in Wic to co neurons that are gated by the dependence on activity in gi neurons in the same minicolumn.
Figure 3a includes an example of rule retrieval in a rewarded move trial. Neurons that spike as activity spreads are represented by gray circles. The example points out the importance of neuron populations gi, go, ci and co, in which individual neurons make connections with other minicolumns. As shown in Figure 3a, desire for reward causes all neurons in the go population of the Reward minicolumn to fire. The activity then spreads to associated minicolumns, including Go, NoGo and all sensory input minicolumns. In the same trial, when the Srm stimulus is perceived, the co population of the Srm minicolumn is depolarized. In the Srm minicolumn, the specific depolarized co neuron that corresponds with a spiking neuron of the gi population fires, so that activity spreads forward along a route from minicolumn Srm to minicolumn Go. The firing of the co neuron is used to generate the Go response. An analogous approach would be to use the spikes of a ci neuron in the Go minicolumns to generate the Go response. During this process, the go population of the Srm minicolumn is inhibited (end-stopping). Figure 3a shows that the spread of activity from the goal is stopped there.
In the example, spreading activity from the Reward minicolumn involves two different known paths that include the Go minicolumn. One path retrieves the associated items Reward–Go–Srm, the other retrieves the associated items Reward–Go–Surm and a separate path through NoGo retrieves Reward–NoGo–Srnm. [The retrieval of rules resembles the sequence of transitions in a finite state machine (Harel, 1987) and the recurrent connections that lead to two visits of the Go minicolumn in trials initiated by the Surm stimulus are reminiscent of connectionist Elman networks (Elman, 1990, 1991).] Since the spread of activity through different known paths elicits spikes at separate gi neurons, they do not interfere with each other. And since the neurons in ci and co populations also maintain separate connections with other minicolumns, the activity in gi correctly allows the gated forward spread to propagate only on a path from a state receiving current input. Thus, the structure of our model allows mapping through the same action from different states. While retrieval activity spreads forward along known paths to reward, those spikes elicited in the co population of the current state minicolumn that target action minicolumns also trigger the output of PFC. In Figure 3a, the spike propagation through the connection from minicolumn Srm to minicolumn Go is therefore marked as a thick black arrow. This output generates the correct ‘Go’ response, thereby guiding successful goal-directed behavior.
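The retrieval process described in this section can be sketched at an abstract level in Python. This is a simplified illustration, not the spiking simulation: minicolumns are treated as graph nodes, each learned association is a (preceding, following) pair, and the per-neuron routes through gi, go, ci and co are collapsed into a set of active links. The names `LEARNED`, `goal_spread` and `select_action` are hypothetical, and the list of associations is assumed from the paths named in the example above.

```python
from collections import deque

# Assumed learned associations: (preceding minicolumn, following minicolumn)
LEARNED = [
    ("Srm", "Go"), ("Go", "Reward"),
    ("Srnm", "NoGo"), ("NoGo", "Reward"),
    ("Surm", "Go"),
]


def goal_spread(learned, goal, current_state):
    """Spread activity backward from the goal through strengthened links.

    Returns the links whose gi neuron (in the preceding minicolumn) spiked.
    """
    active_gi = set()
    visited = {goal}
    frontier = deque([goal])
    while frontier:
        following = frontier.popleft()
        if following == current_state:
            continue  # end-stopping: go of the current state minicolumn is inhibited
        for pre, post in learned:
            if post == following:
                active_gi.add((pre, post))
                if pre not in visited:
                    visited.add(pre)
                    frontier.append(pre)
    return active_gi


def select_action(learned, goal, current_state):
    """A co neuron fires only if state input and the gi gate coincide."""
    active_gi = goal_spread(learned, goal, current_state)
    winners = [post for pre, post in learned
               if pre == current_state and (pre, post) in active_gi]
    # Simplification: the first gated output stands in for winner-takes-all.
    return winners[0] if winners else None
```

Under these assumptions, `select_action(LEARNED, "Reward", "Srm")` yields the Go action and `select_action(LEARNED, "Reward", "Srnm")` yields NoGo, while a state with no learned association produces no output.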
Encoding Behavioral Rules in the PFC
The above section described retrieval. This section describes encoding. During encoding, the neuron labeled a in the model of a minicolumn fires when input that matches the item represented by the minicolumn is received. For example, when an input spike indicates that a rewarded-move stimulus, Srm, is detected, that input causes neuron a(Srm) to spike. Here, it is assumed that stimuli activate minicolumn n after minicolumn n − 1. Encoding is achieved by STDP (Levy and Steward, 1983; Markram et al., 1997; Bi and Poo, 1998) that corresponds to the long-term potentiation (LTP) of synaptic responses (Bliss and Lømo, 1973; Bliss and Collingridge, 1993). The four steps described below take place sequentially in each encoding cycle.
Reverse Associations between Minicolumns are Encoded in Weight Matrix Wg at Synapses from go(n) onto gi(n − 1)
A short-term memory (STM) buffer maintains spiking that corresponds with the two most recent inputs to the network of minicolumns. During this reactivation in encoding phases of PFC minicolumns, a(n) spikes less than 20 ms after a(n − 1). As shown in Figure 4a, the neuron a(n − 1) provides subthreshold depolarization to all the neurons of the gi population in minicolumn n − 1. And all neurons in the go population in minicolumn n receive suprathreshold depolarization through synapses from a(n). As the neurons in go(n) spike, that neuron in the gi population of minicolumn n − 1 which is connected to a neuron in go(n) receives subthreshold depolarization, due to the initial value of synaptic strengths in weight matrix Wg. The neuron in gi(n − 1) that receives input from both a(n − 1) and go(n) spikes a few milliseconds later than the presynaptic neuron in go(n), so that STDP is elicited. Thus, the amplitude of the corresponding synaptic response is increased in Wg. After several repetitions in the STM buffer, encoding establishes a suprathreshold connection between go(n) and gi(n − 1) (Fig. 4a).
Forward Associations between Minicolumns are Encoded in Weight Matrix Wc at Synapses from co(n − 1) onto ci(n)
Rhythmic input modulates the membrane potential of neurons in co. During the encoding phase, the rhythmic depolarization of neurons in co(n − 1) is such that excitatory input through one-to-one connections from gi(n − 1) in the same minicolumn causes postsynaptic spiking. The spiking in gi(n − 1) that is described in the encoding step above therefore drives spiking in co(n − 1), as shown in Figure 4b. The neurons in ci(n) receive subthreshold (gating) depolarization through one-to-one input from neurons in go(n). In the presence of rhythmic depolarization as above and given small initial values in Wc, the neuron in ci(n) that is connected to a neuron in the co population of minicolumn n − 1 spikes due to the combined subthreshold inputs from both go(n) and co(n − 1). Again, STDP is elicited, since the postsynaptic neuron in ci(n) spikes a few milliseconds after it receives input from the presynaptic neuron in co(n − 1). After repetition, a subthreshold connection is established between co(n − 1) and ci(n), which propagates spikes if input is received from the corresponding neuron in the gating go(n) population, even when rhythmic depolarization is absent in retrieval phases.
Rules that Associate Preceding with Possible Ensuing Activity are Encoded within a Minicolumn by the Weight Matrix Wic at Synapses from ci(n − 1) onto co(n − 1)
During encoding, the activity of the ci population is driven by an STM buffer that maintains the activity of ci populations of the two most recently active minicolumns. [The buffer holds two items so that the buffered activity ci(n) can replace ci(n − 1) as the memory of preceding activity in ci when the next association with minicolumn n + 1 is encoded.] As Figure 4c shows, neurons in ci(n − 1) spike several milliseconds before spiking of neurons in co(n − 1) is driven by corresponding spikes in population gi(n − 1) (with a synaptic conductance of 6.0 nS), as described above. STDP is elicited and repetition increases synaptic strengths in Wic from initial values near zero to subthreshold amplitudes.
Associations that Enable the Spread of Activity from the Representation of a Goal are Encoded by the Weight Matrix Wig at Synapses from gi(n − 1) onto go(n − 1) within a Minicolumn
During encoding, spiking in a subpopulation of go that is identified as
Short-term Memory Based on Persistent Spiking Enables Spike Timing Dependent Potentiation to Encode Associations
As described, encoding in our model of the PFC depends on STDP in Wg, Wc, Wig and Wic, and on the buffered activity of populations a and ci. A Hebbian model of STDP that is based on the long-term potentiation observed at many synapses requires multiple instances in which presynaptic spiking precedes postsynaptic spiking by <40 ms (Levy and Steward, 1983; Markram et al., 1997; Bi and Poo, 1998), while input to the PFC may arrive with arbitrarily large time intervals. As mentioned previously, we therefore presuppose that firing patterns may be reactivated in a persistent manner by intrinsic neuronal mechanisms, such as after-depolarization (ADP) of membrane potential (Fig. 6A), caused by calcium sensitive cation currents that are induced by muscarinic receptor activation (Andrade, 1991; Klink and Alonso, 1997a). We also presuppose that a common brain rhythm may produce oscillatory modulation in different regions that provides synchronization of activity. The reactivation of firing patterns by ADP in one population of neurons at specific phases of the brain rhythm can thereby reliably provide input to other populations in the PFC where STDP can occur in an encoding mode (Fig. 6B). Using rhythmic modulation and ADP, we provide short-term memory (STM) in a manner similar to the STM model first proposed by Lisman and Idiart (1995) and Jensen and Lisman (1996). Recurrent inhibition within such a buffer separates the reactivation of sequential items to maintain their order. The STM may reside in the PFC or may be provided by input from the entorhinal cortex.
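The dependence of encoding on repeated, correctly ordered spike pairs can be illustrated with a potentiation-only update rule in Python. This is a sketch under stated assumptions, not the rule implemented in the simulations: only the 40 ms window comes from the text, while the exponential shape of the window, the time constant `TAU_MS`, the step size `STEP` and the normalized ceiling `W_MAX` are hypothetical choices.

```python
import math

WINDOW_MS = 40.0  # pre must precede post by less than this (from the text)
TAU_MS = 20.0     # assumed decay of the potentiation window
STEP = 0.25       # assumed fraction of the remaining range gained per pairing
W_MAX = 1.0       # assumed normalized maximum synaptic strength


def stdp_update(w, t_pre, t_post):
    """Potentiate w when the presynaptic spike precedes the postsynaptic spike."""
    dt = t_post - t_pre
    if 0.0 < dt < WINDOW_MS:
        w += STEP * math.exp(-dt / TAU_MS) * (W_MAX - w)
    return min(w, W_MAX)


def encode(w, spike_pairs):
    """Apply the update for each (t_pre, t_post) pair, e.g. one per theta cycle."""
    for t_pre, t_post in spike_pairs:
        w = stdp_update(w, t_pre, t_post)
    return w
```

Pairs that are replayed by the buffer once per 125 ms cycle with a 5 ms delay, for example `[(k * 125.0, k * 125.0 + 5.0) for k in range(10)]`, drive the weight toward its ceiling, whereas a reversed pair or a pair separated by more than the window leaves the weight unchanged.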
The membrane potentials of three neurons of an STM buffer are plotted in Figure 6B. In the hippocampus, regular activity originating in the septum (Brazhnik and Fox, 1999) is believed to cause 8 Hz oscillations of the membrane potential by modulating the GABAergic inhibition of pyramidal cells via networks of interneurons (Alonso et al., 1987; Stewart and Fox, 1990). A similar mechanism appears to cause theta rhythm oscillations in limbic cortices due to rhythmic activity of basal forebrain neurons (Manns et al., 2000). Those oscillations define two functional phases of the buffer neurons. We call the phase interval of greatest rhythmic depolarization the reactivation phase of STM and the remaining interval the input phase of STM. The plots show that spiking produced by afferent activity during the input phase of the buffer is reactivated by the ADP during subsequent reactivation phases. The duration of the rise of the ADP matches the period of oscillation. This means that the ADP of the earliest neuron to spike in one cycle allows that neuron to reach threshold first in the following cycle. The order of spikes is maintained during reactivation in STM. As spikes caused by the buffer occur in pre- and postsynaptic neurons of modifiable connections in the PFC, an asymmetric function of spike-timing dependent potentiation takes into account the order of spikes. This ensures that STDP is elicited in specific connections so that a direction of causality is inferred during rule learning. Furthermore, the separation of consecutive spikes is maintained in STM by recurrent inhibition that is caused by the activation of an interneuronal network (Bragin et al., 1995) each time a buffer neuron spikes.
In the absence of input, the contents of an STM buffer decay gradually, due to noise and a slow afterhyperpolarization (AHP). But when a full buffer receives new input, such as when rule learning involves a long sequence of states and actions, the earliest item in the buffer needs to retire so that the new item can be maintained. The item replacement must also avoid changing the order of items. To achieve this, we propose that the appearance of a new item leads to inhibition at a specific phase of the rhythmic oscillation (see dashed box in Fig. 6C). Inhibition at that specific phase suppresses the reactivation of the first item (Koene et al., 2003) until its ADP has subsided, as shown in Figure 6C. The new item, represented by action potentials in the plot of the membrane potential of the third cell, assumes the last position in the sequence of reactivation.
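At the level of buffer contents, the replacement scheme behaves like a first-in-first-out queue. The following sketch abstracts away the ADP, the oscillation and the phase-specific inhibition, and only shows the resulting bookkeeping: when a full buffer receives a new item, the earliest item retires and the remaining order is preserved. The class name and the capacity of three items are assumptions made for this illustration.

```python
from collections import deque


class StmBuffer:
    """Abstract first-in-first-out view of a short-term buffer (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def add(self, item):
        """Store a new item; return the retired item if the buffer was full."""
        retired = None
        if len(self.items) == self.capacity:
            # Corresponds to suppressing reactivation of the first item.
            retired = self.items.popleft()
        # The new item takes the last position in the reactivation sequence.
        self.items.append(item)
        return retired

    def reactivation_order(self):
        """Order in which items would be reactivated within one cycle."""
        return list(self.items)


buffer = StmBuffer(capacity=3)
for item in ["A", "B", "C"]:
    buffer.add(item)
retired = buffer.add("D")  # buffer is full, so "A" retires
```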
Each neuron in an STM buffer projects output to a corresponding target neuron in a or ci. Current and preceding activity are therefore available for encoding, as shown in Figure 7 for the membrane potential of a neurons throughout the network. The activity in a corresponds to current and preceding input, as pairs of state and action spikes are received in PFC during the seven simulated encoding trials of rule learning (Fig. 2).
The network described above effectively encoded the different rules of the task and showed effective behavioral performance when tested with different stimuli, generating a Go response to Srm, a NoGo response to Srnm and a Go response to Surm stimuli. This behavior was guided by spiking activity that matches the data obtained by Schultz et al. (2000).
In the seven training trials (Fig. 2), the necessary associations for stimulus gated selection of action were encoded with strengthening of connections using STDP at synapses in Wg, Wc, Wig and Wic. Six trials were used to test performance with all possible initial stimuli. For these trials, the spike trains that represent the sensation of the initial stimulus were provided as input and the model-generated motor commands that lead to behavioral responses and the sensation of reward received were observed. The network showed the correct behavior in the task. The correct action followed each initial state during tests of task performance. Inspection of individual neuronal responses reveals that the three main types of responses observed by Schultz et al. were also found in the present simulations: (i) neurons that respond selectively to a trial-specific initial stimulus; (ii) neurons that respond prior to reward in a specific trial and may indicate a chosen course of action; and (iii) neurons that respond selectively to predicted and obtained reward. In addition to these, several more specialized responses were observed, providing predictions of the model.
During performance of the operant task, a desire for reward begins at the onset of every trial in the form of regular suprathreshold input to all neurons of the go population of the minicolumn that represents the goal. When trial input stimuli appear in different trials they are maintained as persistent spikes of buffer neurons that cause the spiking of a(Srm), a(Srnm) and a(Surm) in Figure 8. These input stimuli also provide subthreshold input to the co population of the minicolumn that represents the current state. Converging with the spread of activity from the goal minicolumn, spiking co neurons drive goal-directed behavior, resulting in the generation of output which in turn causes proprioceptive feedback of the correct action in each sequence in Figure 8, as well as the perception of reward received.
Activity Underlying Selective Responses in the Model
Membrane potentials of those neurons within a minicolumn that are involved in the choice of action demonstrate the decision process that is based on a forward spread of activity that is gated by the spread of activity from the goal. This is shown in Figure 9, in which membrane potentials of relevant a, gi and co neurons in the minicolumn that represent the Surm instruction state are plotted during an interval within an Surm trial (the convergence looks the same for the Srm example in Fig. 3). The plots show that neurons in the co population of that minicolumn experience subthreshold depolarization due to current state input from a. This contribution is joined by converging input from a specific neuron in the gi population that spikes due to the spread of activity from the minicolumn that represents the goal (dashed arrows in Fig. 3). When the inputs converge a neuron of the co population fires (bottom of Fig. 9). Activity in co was gated by activity in gi, and recurrent inhibition assured that only the first spike in co led to a behavioral response. The chosen behavior was determined by the minicolumn that was targeted by that spike, in this example a Go motor command for the simulated task environment.
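The logic of this decision process can be sketched at the level of minicolumns rather than spikes. The sketch below assumes that the learned associations can be written as (state, action, next state) triples, spreads activity backward from the goal through those triples, and returns the action at the point where the spread meets the current state. The association list is our reading of the task as described in the text, and the function name is hypothetical; the spiking model itself realizes this convergence through the gi and co populations.

```python
from collections import deque

# Assumed learned associations: (state, action, next state).
ASSOCIATIONS = [
    ("Srm", "Go", "Reward"),
    ("Srnm", "NoGo", "Reward"),
    ("Surm", "Go", "Srm"),   # an unrewarded move leads on to a rewarded trial
    ("Surm", "Go", "Srnm"),
]


def select_action(current_state, goal="Reward", associations=ASSOCIATIONS):
    """Spread backward from the goal; return the action where it meets the state."""
    reached = {goal}
    frontier = deque([goal])
    while frontier:
        target = frontier.popleft()
        for state, action, next_state in associations:
            if next_state != target:
                continue
            if state == current_state:
                # Convergence of goal spread and current state input.
                # Returning here also stops the spread (end-stopping).
                return action
            if state not in reached:
                reached.add(state)
                frontier.append(state)
    return None  # no learned path connects this state to the goal
```

Under these assumptions, `select_action("Srm")` yields a Go response, `select_action("Srnm")` a NoGo response and `select_action("Surm")` a Go response, in line with the behavior reported for the test trials.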
For the six test trials, the spike trains that represent the sensation of the initial stimulus, motor commands that lead to behavioral responses and the sensation of reward received are shown in Figure 10. The spike trains show that Srm stimuli were followed by Go responses and reward, Srnm was followed by NoGo responses and reward, and Go action responses followed Surm stimuli and led to subsequent rewarded trials. The network can perform correctly regardless of the order of presented test stimuli.
Schultz et al. plotted the recorded spikes of three orbitofrontal neurons during many rewarded move (Srm) and unrewarded move (Surm) trials. We compare our simulation results with those of the experiment by Schultz et al. by displaying results for the three main categories of neuronal responses described by Schultz et al. side by side in Figure 11. These plots show spikes in individual trials (short vertical lines) aligned to specific parts of the task.
As in the Schultz et al. results, our results showed that individual neurons activate specifically when one of the three cue stimuli is perceived. In our model, this is caused by the current state response of the a population (Fig. 11A,D). We also found individual neurons that activate for a chosen behavioral response. This activity results when neurons of the co population in the current state minicolumn receive gating activity from gi neurons due to the spread of activity from the goal minicolumn (Fig. 11B,E). We also found neurons that activate specifically when reward is received. This activity is caused by the current state activation of the a neuron in the goal minicolumn in our model (Fig. 11C,F).
As in the Schultz et al. data, there is spiking in Figure 11E during Srm and Surm trials, but the spike rate is higher during the Go action in Srm. Both the data and the output of our model show a quantitative difference in the amount of firing between Srm and Surm trials before reward is received. In our model, this is explained because co(Srm→Go) is activated in encoding phases in both trials when a(Go) is maintained by the STM buffer, since strengthened connections from go(Go→Srm) to gi(Srm←Go) propagate the activity. Additionally, co(Srm→Go) is activated specifically in the Srm trial when the goal spread causes spiking in the gating gi(Srm←Go) neuron, while current state input depolarizes the co(Srm) population. The appearance of similar activity at the trigger time during URM trials in Figure 11B suggests that the activity is not merely background noise and supports the possible explanation provided by our model.
A smaller temporal overlap of activity similar to that in the Schultz et al. results is achieved if the intervals between instruction stimulus, action trigger and reward delivery are increased in the model to match the data, for a trial length of 6–8 s instead of 1500 ms in the simulation. The shorter intervals in the model significantly reduced the time needed to compute each simulation run without affecting resulting behavior.
Some Neurons in the PFC are Active in Multiple Behaviors
In addition to the results above, we found that some neurons in the simulation activate selectively for a specific phase of two different trials. As shown in Figure 12A, the a(Go) neuron in the minicolumn that represents a movement response spikes in rewarded movement and unrewarded movement trials. Similarly, the a(Rew) neuron in the minicolumn that represents the perception of reward spikes in rewarded movement and rewarded non-movement trials.
In Figure 12B, we show that specific neurons in the gi and go populations of minicolumns that are involved in the retrieval of associations with a goal generated a spike in every trial of that specific task. The neurons that activate throughout each trial correspond to those involved in the learned associations for the spread of activity from the goal during retrieval, as shown in Figure 3. Thus, even neurons with very extensive response properties are important for performance of this task. Activity of the a population in the current state produces end-stopping of activity through the go population in the same minicolumn. Therefore, the onset of a rewarded move (RM) trial produces end-stopping at go(Srm) cells, but, due to the associations from Reward to Srnm via NoGo and from Srnm to Surm via Go, a neuron in gi(Surm) also spikes during that trial. Similarly, gi(Surm) spikes during rewarded non-movement trials due to the alternate path for the spread of retrieval activity from the goal via the Srm minicolumn. Thus, we predict a correlation of neuronal firing during Surm and Srm trials (strong Go involvement in both), and a lesser correlation of neuronal firing during Surm and Srnm trials, as shown in rows 1 and 3 of Figure 12B.
Activity in Figure 12C demonstrates the end-stopping function proposed in the minicolumn model. During rewarded movement trials, the neuron
Schultz et al. point out that some neurons activated less selectively, namely in a manner that was selective for the instruction cue regardless of trial type and expected reward. Similarly, our simulation shows that a neuron of the ci(Srm→Go) population in the Go minicolumn that receives input from the Srm minicolumn exhibits retrieval spikes in both Srm and Surm trials during instruction activity in the Srm or Surm minicolumns. Those retrieval spikes disappear once the Go minicolumn receives proprioceptive input about a key press movement in the environment and spikes begin to occur in the encoding phase of the theta-modulated network. This produces a 180° phase shift of firing at the time of the movement generation. The Go minicolumn ci(Surm→Go) neuron that receives input from the Surm minicolumn exhibits the same transition of spiking from the retrieval to the encoding phase, but its retrieval spiking is more selective and appears only during an Surm trial, since no sequence exists that involves the Surm minicolumn in other trials.
Schultz et al. provide a quantitative assessment of the trial and phase selective responses recorded. Of 505 neural responses identified at recording sites, 188 exhibited task related activity: 99 responses showed selective activity at the instruction phase of trials. Of those, 63 reflected the type of reinforcer or trial (38 active during RM, RNM or both trial types, 22 active only during URM trials and three active during RM and URM trials). Fifty-one responses showed selective activity at the trial phase preceding reward (41 during both RM and RNM trials, six during RM or RNM trials and four during URM trials). Sixty-seven responses showed selective activity at the reinforcer delivery phase of trials (62 during both RM and RNM trials, two during only RM trials and three during URM trials).
Before these numbers are compared with the model, some caveats should be raised. The sample sizes, in terms of both the number of sites recorded by Schultz et al. and the number of neurons simulated in the model, are too small to allow statistical comparison. Also, the number of selective model responses in a specific category depends on the arbitrary number of neurons chosen as a cell assembly within a population of neurons in each minicolumn. When the model was minimized so that individual functions of the minicolumn were performed by the smallest number of neurons, the following quantitative assessment of responses was obtained.
In the simulation, the neural circuitry of the model prefrontal minicolumns consisted of 328 neurons (excluding neurons that form short-term buffers and circuitry to process prefrontal input and output). From those neurons, 169 task related responses were recorded: 37 responses showed selective activity at the instruction phase of trials. Of those, 34 reflected the type of reinforcer or trial (21 active during RM or RNM trials, 10 active only during URM trials and three active during RM and URM trials). Seventy-five responses showed selective activity at the trial phase preceding reward (40 during RM or RNM or both trial types, 11 during only URM trials, 17 during RM and URM trials and seven unselective for trial type). Fifty-seven responses showed selective activity at the reinforcer delivery phase of trials (20 during both RM and RNM trials, 14 during only RM trials, 21 during RNM trials and two unselective for trial type).
These results support a correlation during the instruction phase between RM and RNM trials seen in both data and model. The absence of a correlation between URM and RNM during the trial phase preceding reward is also consistent with the data. The number of responses for both RM and URM trials is somewhat higher than in the data, as is the response activity for only RNM trials. Both differences may reflect a difference in the model or merely statistical variability.
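As a purely descriptive aid, the instruction-phase counts quoted above can be converted to proportions. The sketch below uses only the numbers given in the text for responses that reflected the type of reinforcer or trial (63 in the data, 34 in the model). It assumes that the data and model categories are directly comparable, which the text does not guarantee, and it is not a statistical test, in keeping with the caveats raised above.

```python
# Instruction-phase responses that reflected reinforcer or trial type.
INSTRUCTION_COUNTS = {
    "data": {"RM and/or RNM": 38, "URM only": 22, "RM and URM": 3},
    "model": {"RM and/or RNM": 21, "URM only": 10, "RM and URM": 3},
}


def proportions(counts):
    """Convert category counts to fractions of their total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


for source, counts in INSTRUCTION_COUNTS.items():
    shares = proportions(counts)
    summary = ", ".join(f"{c}: {p:.2f}" for c, p in shares.items())
    print(f"{source} (n={sum(counts.values())}): {summary}")
```

In both columns the RM and/or RNM category accounts for roughly three fifths of these responses, which is the correspondence referred to in the text.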
Our model replicates goal-directed behavior in a visual discrimination task based on a hypothesis about the functional connectivity of PFC circuits (Hasselmo, 2005). Behavioral responses and reward associations to visual cues are encoded in synaptic strengths between neuronal networks representing cortical minicolumns. The goal-directed behavior is retrieved by means of a converging spread of activity from a representation of desired reward and the spread of activity from the current state. Our results specifically replicate the qualitative findings by Schultz et al. (2000) in terms of individual neuronal responses, while suggesting a possible neural mechanism for learning and retrieval. We use the model to propose explanations for the selective responses of individual neurons in orbitofrontal cortex during goal-directed behavior.
The model provides a framework for the context/stimulus dependent change in action selection, as proposed by Miller and Cohen (2001). In particular, it provides a spiking neuron implementation of context effects similar to those of Cohen and Servan-Schreiber (1992). We show how populations of spiking neurons could interact to allow selection of specific actions based on the context of specific sensory input (states) and the desire for reward. Because activity in a specific minicolumn (Fuster, 2000) that represents such a state or action may play a role in different contexts that require its association with different state–action-state transitions, we presuppose separate populations of neurons within a minicolumn for input from and output to other minicolumns (Hasselmo, 2005). For example, the Go and Reward minicolumns in the experimental task fulfill such multiple roles, as shown in Figures 3 and 12A.
We show what functional role the individual neurons in these populations could play in the performance of the task by replicating essential features of the Schultz et al. experiment. We used similar learning and retrieval protocols and replicated individual neuronal responses that are selective for a specific state in a specific trial (see Fig. 11). These selective responses may be understood in the context of a neuron's function in the minicolumn model.
In addition to these explanations, the model generates predictions for this task about what other types of responses should appear in the PFC, including neuronal responses which would look rather complex and might therefore not normally be classified. One set of complex responses is shown in Figure 12B. The model predicts that some neurons will spike throughout all trials of a goal-directed task, not just for a specific state, due to the spreading activity from a goal representation. And if encoding and retrieval alternate continuously as modeled, then such responses that are indicative of spreading activity should be recorded during stages of novel learning as well as task performance.
Our results also suggest that end-stopping implemented in the retrieval function of the model may be detected as shown in Figure 12C. Evidence that supports possible end-stopping of spreading activity is provided by the termination of recorded spikes in Schultz et al. (2000), where neuronal activity that is selective for Srm or Srnm instruction stimuli and for action preceding reward terminates as soon as reward is received.
Predictions of the model suggest experiments that test the validity of two of its central tenets: convergence of activity through representations that may be associated in multiple ways (Sutton and Barto, 1981) and the need for a short-term buffer.
The structure of the model uses a progressive backward spread of activity from the goal. This suggests an experiment that could test this feature, in which associations are formed sequentially between states and actions leading to a particular goal. Imagine an operant task, in which specific sequences of lever presses result in reward. For example, pressing levers in the sequence A–B–C should result in reward. If the levers are pressed randomly, eventually the correct sequence will occur, in a learning paradigm analogous to the one used in experiments by Terrace et al. (2003). In the model, this will initially lead to an association between the final action ‘press C’ and reward (note that this action involves being at a specific state — in front of lever C — and generating the action ‘press’). Upon further accidental production of the sequence, it will lead to association of ‘press B, then press C’ with reward, and finally ‘press A, press B, press C’ with reward. The activity of the gi and go neurons in the model would initially show activity only for reward, then would show a persistent increase when the association is first formed with ‘press C’, followed by increases in separate populations when the association is formed for ‘press B’ and finally for ‘press A’. Thus, the overall population of neurons firing during the task would show a progressive increase as the specific sequence is learned.
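The predicted progression can be sketched as follows. The sketch assumes, as a simplification, that each accidental production of the full correct sequence extends the chain of learned associations by exactly one step backward from reward. The function name is hypothetical, and the one-step-per-success rule is our assumption rather than a property established by the simulations.

```python
def learn_sequence(correct_sequence, n_successes):
    """Return the actions associated with reward after n successful trials.

    Assumed rule (illustrative): every successful trial adds one further
    association, working backward from the action that immediately
    precedes reward.
    """
    learned = []
    for _ in range(n_successes):
        if len(learned) < len(correct_sequence):
            # Index of the next action further back from reward.
            index = len(correct_sequence) - len(learned) - 1
            learned.insert(0, correct_sequence[index])
    return learned


sequence = ["press A", "press B", "press C"]
for n in range(1, 4):
    print(n, learn_sequence(sequence, n))
```

The length of the returned list grows with each success, mirroring the predicted progressive increase in the population of gi and go neurons that fire during the task.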
During encoding, our model depends on the function of STM buffers, and data by Andrade show sustained currents that may support such a function in the PFC (Andrade, 1991). However, those buffers need not reside in the PFC. A plausible alternative source of buffered perceptual spike patterns is in the entorhinal cortex, in which neurons that exhibit intrinsic persistent spiking have been found (Klink and Alonso, 1997b). In either case, it is possible that STM function may be disrupted without impairing decision making for known tasks. The function of short-term buffers may be blocked by pharmacological agents. For example, the muscarinic antagonist scopolamine will block the ADP which provides one mechanism for sustained spiking of cortical neurons (Andrade, 1991; Klink and Alonso, 1997b; Fransen et al., 2002). Without working short-term buffers in the PFC, the model predicts correct retrieval function for learned tasks, but an impaired ability or an inability to learn new tasks. This may underlie the impairment of task rule shifting seen with cholinergic lesions (McGaughy et al., 2004; J. McGaughy et al., unpublished data). Cholinergic blockade does cause impairment of short-term delayed matching function (Bartus and Johnson, 1976; Penetar and McDonough, 1977).
Critical Variables of the Simulation
The successful results obtained with the simulation depend on several critical variables. Within the model of a prefrontal minicolumn, a specific set of connections must have conductances that lead to subthreshold excitation of postsynaptic neurons and another set must have conductances that lead to suprathreshold excitation and therefore drive spiking in postsynaptic neurons. The set of subthreshold connections consists of the connections from a to gi and the connections from go to ci. The set of suprathreshold connections consists of the connections from a to go, from gi to co, and from ci to go (as shown in Fig. 4). For goal-directed prefrontal output, it is necessary that current state input to a co neuron population does not achieve spiking, except at those neurons that also receive gating input from neurons activated in the gi population by the spread from the goal representation. Synapses at modifiable connections Wg, Wig, Wc and Wic are initialized with small subthreshold conductances. There is no need to adjust the learning rate during encoding, since a specific maximum conductance is achieved in strengthened connections. That maximum is set to provide suprathreshold excitation through the goal-spread connections Wg and Wig, and subthreshold excitation through Wc and Wic (where the spiking of neurons in ci is gated by go, and the spiking of neurons in co is gated by gi during retrieval). The excitation of a neuron in ci by individual input from go or through Wc and the excitation of a neuron in co by individual input from gi or through Wic is insufficient to elicit a spike. When two subthreshold inputs combine at a neuron in ci (one from go and one through Wc), or when two subthreshold inputs combine at a neuron in co (one from gi and one through Wic), then a spike is elicited.
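The requirement that two subthreshold inputs must combine to elicit a spike can be illustrated with a minimal leaky integrator. The sketch assumes a 1 ms time step, an exponential leak and instantaneous input steps; the input size, threshold and time constant are hypothetical values chosen only so that a single input stays below threshold while two nearly coincident inputs cross it. They are not the conductances used in the simulations.

```python
import math


def co_neuron_spikes(gate_times, state_times, w=6.0, threshold=10.0,
                     tau_ms=5.0, t_end_ms=100):
    """Return spike times (ms) of a leaky integrator with two input streams.

    gate_times: arrival times of gating input (e.g. from gi).
    state_times: arrival times of current state input (e.g. through Wic).
    All parameter values are illustrative assumptions.
    """
    gate, state = set(gate_times), set(state_times)
    v = 0.0
    spikes = []
    for t in range(t_end_ms):
        v *= math.exp(-1.0 / tau_ms)  # leak over one 1 ms step
        if t in gate:
            v += w
        if t in state:
            v += w
        if v >= threshold:
            spikes.append(t)
            v = 0.0  # reset after the spike
    return spikes


alone = co_neuron_spikes(gate_times=[20], state_times=[60])     # inputs far apart
together = co_neuron_spikes(gate_times=[20], state_times=[21])  # inputs converge
```

With these values the neuron stays silent when the two inputs arrive 40 ms apart and fires when they arrive within 1 ms of each other, which is the gating behavior required of the ci and co populations.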
Another critical variable is the modulation of specific connection strengths in the minicolumn model by theta input (Hasselmo et al., 2002). Theta modulation allows
Lastly, critical variables are involved in the timing of short-term buffers (Lisman and Idiart, 1995; Jensen et al., 1996; Koene et al., 2003). A working buffer requires that the rise time of ADP matches the period of a theta cycle (Fransen et al., 2002) and that recurrent inhibition separates consecutive spikes sufficiently to retain their order, but within a time interval that enables STDP between neurons that spike in response to the buffer output. For the first-in-first-out replacement of spikes maintained in a buffer, inhibitory input presented due to the combination of new input to the buffer and the last spike in the buffer must cause hyperpolarization at the phase of first spike reactivation (see Fig. 6C). Theta oscillations achieve the necessary synchronization of reactivation cycles in the STM buffers and encoding and retrieval phases in the minicolumns.
Correspondence of Simulation Results and Data
As mentioned in the results, the present study does not attempt to attribute meaning to the quantitative assessment of numbers of responses that belong to any specific category of responses that are selective for a trial type and a phase of that trial. For a quantitative comparison of that sort, an experimental study would have to record from a larger sample of neurons and the simulation would have to include a rationale for the number of cells in assemblies that correspond to each functional unit of the prefrontal model.
The model effectively matches the data in many ways, in addition to successfully learning the goal-directed behavior for the visual discrimination task. Our results show that the simulations replicate trial and phase-of-trial selective activity in individual neurons. A direct comparison between the selective activity recorded by Schultz et al. and that produced in the simulation (Fig. 11) demonstrates the correspondence between the two sets of results. Both the Schultz et al. data and our simulation results show individual neurons that are selective for the presentation of a visual cue, the period preceding potential reward in which a decision for motor action may be made, or the receipt of reward. That selectivity is specific to a particular trial type: rewarded movement, rewarded non-movement or unrewarded movement.
Significantly, both the data and the simulation results show that selectivity for exactly one specific trial type (RM, RNM or URM) was typical of responses that showed selective activity during the instruction phase of a trial, and atypical for responses that showed selective activity during a later phase of a trial. This correspondence supports the idea that those minicolumns that represent specific actions or rewards may be associated with multiple trial types. Another significant feature of the model is the absence of neurons that respond in both RNM and URM trials, which also corresponds with the data.
Some properties of neuronal responses in the model are important for function, but may not be tested by the analysis procedures of the experiment. In particular, the analysis of experimental data did not specifically search for neurons which turned on continuously during task performance without showing specificity, and did not search for neurons which terminated activity at a specific time. The model produced background spiking activity that appears unselective for trial and phase throughout the task in 38 neurons. For the purpose of response categorization, this background spiking rate was subtracted to identify selective spike trains in those responses. The cells with this background activity are those that are involved in the spread of activity from the goal through associated minicolumns. Note that many such cells may have been deemed not task related by Schultz et al., while they clearly perform an important function in the model. One indication of such background activity in the report by Schultz et al. comes in the form of neurons with task specific activity that appeared prior to the instruction stimulus. Schultz et al. evaluated activity in 188 out of 505 neurons. As specified in Tremblay and Schultz (2000), they did find 14 neurons that activated unselectively for all familiar instruction types in the task. Yet Schultz et al. evaluated neurons that activated selectively for one or two phases of specific task trials, since responses demonstrating activity throughout a trial may have been discarded by the one-tailed Wilcoxon test of the evaluation software that they used to assess task related activity.
The simulation results identified significant periods of inactivity in addition to the detection of selective activity. Some of the cells with background spiking throughout the trials of the task exhibit periods of inactivity that correspond directly with their involvement in the retrieval of a known association that determines goal-directed behavior in a specific trial. At such a simulated cell, inhibition (end-stopping) of the spread of activity from the goal representation causes the period of inactivity. Schultz et al. did not report a specific evaluation of the times at which the activity of some neurons ends, while other responses with rhythmic background activity during the same trial continue. Schultz et al. mention neurons that remain active throughout the instruction-trigger delay, but do not quantify the number of such cases. Cases reported in the data in which neural activity within a trial turns off immediately at the onset of a following phase may be indicative of end-stopping.
The simulation results show some differences compared to the data obtained by Schultz et al. One that is immediately apparent is the precise and reproducible nature of specific intervals of spiking and of silence for each neuron in the model. This is a feature caused by the absence of noise in the simulated physiological functions.
A greater proportion of the responses recorded by Schultz et al. showed selective activity prior to reward or during reward in both RM and RNM trials than in only one of those two trial types. The proportions were reversed in the results obtained with the model, where more neurons responded to only one of the two trial types, but these differences may not be meaningful due to the sample size issue outlined above.
The model responses contained a larger proportion of cells that respond selectively during both RM and URM trials than that reported by Schultz et al. In the trial phase preceding the reinforcer, this was a category not reported by Schultz et al. and a prediction of the model that further experiments with recordings at a greater number of sites may verify.
Relation to Other Physiological Studies
This study shows how neuronal responses that guide behavior could reflect a conjunction of forward spread (stimulus dependent spread) and backward spread from goal (goal-dependent spread). The latter relates to responses obtained by Thorpe et al. (1983), where the change in reward contingency demonstrates evidence for reward dependent response. The Schultz et al. experiments replicated here were an extension of the work by Thorpe and Rolls, who recorded single unit activity of orbitofrontal neurons in primates during a Go/NoGo operant task. In that task, monkeys learned to associate reward or an aversive outcome with movement following a specific stimulus. The meaning of a stimulus was reversed during this task. Thorpe and Rolls showed that most neurons responded selectively to specific stimuli and that the responses were also selective to whether the stimulus indicated reward in a specific trial. Simulation of Thorpe et al. using our model would require changes in reward contingency in the task, and the use of some mechanism of long-term depression in the model to replicate decrease in response to previously rewarded stimuli.
Tetrode recordings by Jung et al. (1998) showed that the correlation of activity in neurons in the PFC does not map directly to sensory information such as location in spatial tasks. Rather, the activity correlates with behavioral requirements that are task specific, as shown with other simulations of a virtual rat in spatial tasks (Hasselmo, 2005). The present experimental results also relate to response data obtained by Schoenbaum et al. (1998), where changes in reward contingency were also shown to influence neuronal responses in rats. These responses were recorded in brain areas that communicate with orbitofrontal cortex through reciprocal connections, such as the basolateral amygdala, which may provide feedback of an error function to avoid an aversive outcome.
In order to encode the specific components of a task and to encode predictive relationships by associating those components, the connections between neurons in networks of minicolumns and connections with the areas that provide input and receive output must be easily modifiable. Experimental evidence has been found for a rapid change in functional connectivity in terms of modifications of the strength of connections in orbitofrontal cortex and between orbitofrontal cortex and related areas such as the basolateral amygdala (Schoenbaum et al., 2000; Mulder et al., 2003). In those experiments, observed changes in odor selectivity were closely matched by changes in correlated firing activity during initial learning that led to accurate performance on a discrimination problem.
Relation to Reinforcement Learning Theory: a Biological Implementation of Reinforcement Learning
Rules that govern successful behavior are discovered by learning how a specific action taken in one circumstance is followed by another circumstance. In other words, a causal effect is inferred from the results of a possible action that is explored while in a perceived state. In machine learning, algorithms for this are known as reinforcement learning (Sutton and Barto, 1998). In reinforcement learning, goals are explicit and formally represented by a reward value. The reinforcement learning framework has also been related to cognitive neural processes (Barto, 1995a,b; Montague et al., 1996; Schultz et al., 1997).
Reinforcement learning defines a state signal as any information that is available about the environment at a given time, which may be pre-processed sensory input and may include some memory of preceding states. The state signal has what is known as the Markov property if it contains a representation of all the information about current and preceding states and actions that is relevant to future decisions (White, 1969; Ross, 1983; Bertsekas, 1995). A state signal with the Markov property may be evaluated independently of the states and actions that precede it.
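In the textbook notation of Sutton and Barto (1998, ch. 3.5), the Markov property can be stated as an equality of conditional probabilities. The symbols s_t, a_t and r_t (state, action and reward at time t) are standard reinforcement learning notation and are assumed here for illustration; they are not symbols used in the model itself.

```latex
\Pr\{ s_{t+1} = s',\ r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0 \}
  \;=\;
\Pr\{ s_{t+1} = s',\ r_{t+1} = r \mid s_t, a_t \}
```

When this equality holds for all s', r and all histories, the next state and reward can be predicted from the current state and action alone.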
Reinforcement learning algorithms do not provide instruction about correct actions. Instead, an action is given a value by learning its consequences. Within this framework, a range of different algorithms can be used to learn these values. A popular algorithm for reinforcement learning is temporal difference (TD) learning (Sutton, 1988), which is related to models of conditioning (Konorski, 1948; Rescorla and Wagner, 1972). This algorithm learns from raw experience by updating predictive associations using a reward value at the time of update.
TD learning is useful because it requires no prior information about the probabilities of transitions between states in an environment; these are discovered through exploration. In addition, TD learning methods with Hebbian mechanisms (Hancock et al., 1991; Montague et al., 1993; Montague and Sejnowski, 1994; Rao and Sejnowski, 2001) have been proposed for the canonical circuit of neocortex (Douglas et al., 1989; Artola et al., 1990). One approach to TD learning, known as SARSA (state–action reward state–action), is notable for learning the value of actions in transitions between state–action pairs instead of the value of a state in transitions from state to state (Sutton, 1996; Sutton and Barto, 1998, ch. 7.5). The learning method in this paper assumes state–action pairs, as in the SARSA approach, although it is not derived from SARSA or TD learning.
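For readers unfamiliar with SARSA, the following is a minimal sketch of the standard tabular update from the reinforcement learning literature. It is not the learning rule of the present model, which is not derived from SARSA or TD learning. The function name, the state and action labels, and the parameter values alpha (learning rate) and gamma (discount factor) are assumptions chosen only for illustration.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9):
    """Apply one tabular SARSA update in place and return the TD error.

    q maps (state, action) pairs to estimated action values.
    alpha and gamma are illustrative defaults, not values from the paper.
    """
    # TD error: reward plus discounted value of the next state-action pair,
    # minus the current estimate for the pair just taken.
    td_error = reward + gamma * q[(next_state, next_action)] - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


# Hypothetical single transition: pressing a lever at a cue yields reward.
q_values = defaultdict(float)
error = sarsa_update(q_values, "cue", "press", 1.0, "reward", "none")
```

Because the table is indexed by state–action pairs rather than by states alone, the update assigns value to the action taken in a given state, which is the feature the present model shares with SARSA.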
The present model focuses on selection of actions on the basis of action value. It does not require the use of TD learning to create the action value function, because the constrained nature of training ensured that it learned effective action value functions. Further modification will be needed to allow effective learning with random generation of actions during exploration, using a mechanism analogous to TD learning (Hasselmo, 2005). The model nevertheless provides a neural implementation of the action selection process in the reinforcement learning framework that does not depend on lookup tables.
In the model, encoding of behavioral rules requires that PFC contains unique representations of specific states and actions. Fuster (2000) presented evidence that activity in the PFC is representative of two types of perception, one that correlates with the sensory state evoked by past and current stimuli and one related to proprioceptive sensation and prediction of motor actions.
Given the representation of states and actions, the transition from one state to another state via a specific action can be encoded uniquely if there is specific neural activity that occurs only for that action and only when the action is initiated in a particular state. This requirement leads to the presupposition that a functional minicolumn contains populations of input neurons and populations of output neurons that form connections with other minicolumns, and that the neurons in those populations are connected in a structured manner to other minicolumns (in this simulation, to exactly one). The internal weight matrices of an action minicolumn, Wig and Wic, act as second-order conditional transition matrices from one state to another. A functionally similar pattern of connectivity could be learned by self-organization. Since the combination of activity at a specific input neuron and a specific output neuron of an action minicolumn represents the transition from a preceding state to a following state, that information gives the model the Markov property (e.g. Sutton and Barto, 1998, ch. 3.5). This property means that one-step dynamics enable us to predict the next state and expected reward for a specific action. Our model therefore provides a means of extending principles of reinforcement learning to biological circuits and the spiking responses of neurons.
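The logic of selecting an action where the spread from the goal converges with the spread from the current state can be illustrated with a deliberately abstract, non-spiking sketch. It replaces the minicolumn network with a dictionary of learned state–action transitions and replaces spreading activity with a breadth-first search. The function names, the toy task and its state and action labels are all hypothetical and are not taken from the simulations.

```python
from collections import deque


def goal_spread(transitions, goal):
    """Spread backward from the goal; return steps-to-goal for each reached state.

    transitions maps (state, action) -> next_state, standing in for the
    associations that the model stores in action minicolumns.
    """
    # Reverse map: next_state -> states from which it can be reached.
    predecessors = {}
    for (state, _action), next_state in transitions.items():
        predecessors.setdefault(next_state, []).append(state)

    distance = {goal: 0}
    queue = deque([goal])
    while queue:
        current = queue.popleft()
        for previous in predecessors.get(current, []):
            if previous not in distance:
                distance[previous] = distance[current] + 1
                queue.append(previous)
    return distance


def select_action(transitions, state, goal):
    """Pick the action whose one-step forward spread meets the goal spread.

    Returns None if no action from `state` leads toward the goal.
    Ties are broken alphabetically by action name.
    """
    distance = goal_spread(transitions, goal)
    candidates = [
        (distance[next_state], action)
        for (s, action), next_state in transitions.items()
        if s == state and next_state in distance
    ]
    return min(candidates)[1] if candidates else None


# Hypothetical toy task.
toy_transitions = {
    ("start", "left"): "dead_end",
    ("start", "right"): "cue",
    ("cue", "press"): "reward",
    ("cue", "wait"): "start",
}
chosen = select_action(toy_transitions, "start", "reward")
```

In this sketch only transitions reached by both the backward spread and the one-step forward spread are eligible, which mirrors the gating role of convergence in the model; no table of action values is consulted.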
Relation to Anatomical Data on Minicolumns
The successive neuronal layers in a canonical circuit of the neocortex, as described by Douglas et al. (1989), can be represented by the individual networks at the branch nodes of a hierarchical network (Felleman and Van Essen, 1991). Categorizing the parts of our model in such a hierarchy, the motor output (by populations ci and co) corresponds to the activity of the infragranular layer of the neocortex. Since sensory input is received in layer IV, its function may correspond to that of the neurons designated a. The supragranular layer has extensive long-range excitatory connections with other regions, so it can perform the function of populations gi and go in our minicolumn model. This function, which achieves the convergence of goal spread with current state input, depends on the lateral connectivity within the neocortex. In studies of the visual cortex, lateral connectivity has been assigned a necessary role in the interpretation of input and its translation into a complex hierarchical model (Kawato et al., 1993; Dayan and Hinton, 1996). The generation of visual receptive fields that are tuned to recognize different orientations (Somers et al., 1995; Yishai et al., 1995) was related to this proposed role.
Lateral connectivity in the prefrontal region of neocortex includes short- and long-range excitatory connections, as well as short-range inhibitory connections (Barbas and Pandya, 1989; Barbas, 2000). The result is a patchy lateral layout of cells that are highly interconnected within a column of cortical layers, the so-called neocortical minicolumn. It has been shown that strong local connectivity in a minicolumn can sustain activity during delayed-response tasks such as long-term goal-directed behavior, for which a subject must be able to maintain information regarding a stimulus (Gutkin et al., 2000; Wood and Grafman, 2003).
Local circuits that may exhibit the function of the proposed minicolumns were identified in the lateral connectivity of the PFC, and Constantinidis and Goldman-Rakic (2002) showed that the activity of interneurons within such ensembles is strongly correlated. The correlated firing does not extend to distant areas or modules, and the activity of spatially proximate excitatory cells is less correlated than that of interneurons. In fact, spiking of different pyramidal cells responsible for the long-range propagation of activity is largely independent. Lund et al. (1993) proposed means by which such local circuits may arise during development. Analogous connectivity was described for the middle temporal visual area (Maunsell and Van Essen, 1983), and a model for similar local circuit development was proposed by Grossberg and Williamson (2001) for visual cortex areas V1 and V2. While our model resembles the interaction of feedback and feedforward used in Grossberg and Williamson (2001), the visual models focus on top-down spread mediating global feature detection rather than reward contingencies. Our model more closely resembles the proposal by Mumford (1992) for bottom-up and top-down interactions.
If goal-directed behavior is to emerge in the PFC, its neuroanatomy must support activity that interprets sensory and proprioceptive motor input, and it must enable subsequent output that affects behavior. Previous surveys of the neuronal architecture of neocortex show that dual pathways between cortical areas could implement the necessary pathways for the analysis of input and the synthesis of output that guides behavior (Mumford, 1991, 1992, 1994). In the framework presented here, neuronal populations that correspond to cells in layer IV of neocortex are identified as input neurons for bottom-up cortical processing. Their ability to analyze input is represented by consequent activity of input neurons in a specific minicolumn. The associative connections between minicolumns lead to a synthesis of activity that represents goal-directed output.
While the model is intended to be applicable to the function of prefrontal minicolumns in general and not specific to orbitofrontal cortex, the encoding of reward found in orbitofrontal cortex for the Schultz et al. task led to a minicolumn representation of ‘reward state’. In other (e.g. spatial) tasks where multiple routes can achieve a goal, a specific reward value may be encoded by differential strengthening of associations between reward and specific goal-directed strategies.
When a task includes multiple goals or strategies with different reward values, a mechanism must exist to select one goal over another and to direct behavior accordingly. The recruitment of distinct regions of orbitofrontal cortex has been observed during incentive judgements and goal selection. Lateral orbitofrontal activity has been observed selectively when a task required that responses to alternative desirable items be suppressed (Arana et al., 2003). As implemented in the present model, gating by the spread of activity from one goal would compete with gating by the spread from another goal at neuronal populations where goal spread and forward spread from the current state converge. Successful neuronal firing suppresses the selection of other possibilities through recurrent inhibition.
The CATACOMB simulations described here and information about CATACOMB are available on our Computational Neurophysiology website at http://askja.bu.edu. Supported by NIH R01 grants DA16454, MH60013 and MH61492 to M.E.H. and by Conte Center Grant MH60450, as part of the NSF/NIH Collaborative Research in Computational Neuroscience Program.