Abstract
Working memory (WM) and reinforcement learning (RL) both influence decision-making, but how they interact to affect behaviour remains unclear. We assessed whether RL is influenced by the format of visual stimuli held in WM, either as separate feature-based representations or as unified, object-based representations. In a pre-registered paradigm, participants learned stimulus-action combinations that provided reward through 80% probabilistic feedback. In parallel, participants retained the RL stimulus in WM and were asked to recall this stimulus after each RL choice. Crucially, the format of representation probed in WM was manipulated, with blocks encouraging either separate features or bound objects to be remembered. Incentivising a feature-based WM representation facilitated feature-based learning, as shown by an improved choice strategy. This reveals a role for WM in providing sustained internal representations that are harnessed by RL, providing a framework by which these two cognitive processes cooperate.
Introduction
Reinforcement learning (RL) is a key process through which biological and artificial agents incrementally learn actions by interacting with the environment1. Credit assignment to the relevant attribute is a key aspect of successful learning for humans and artificial agents alike2. For instance, consider sending an email to a collaborator. Unexpectedly, you get a very quick reply, and you would like to reinforce the action that led to such a quick response. You might infer that it was the time you sent it, or the formatting, or the subject line. Which strategy you end up reinforcing might depend on what is currently on your mind. Humans flexibly assign credit to different features of the world, depending on what they pay attention to. This process has been behaviourally tested using the Wisconsin Card Sorting Test3, where subjects learn to sort multi-featured cards, inferring the relevant features based on reward feedback. A major open question for this process is how the brain determines the set of states on which reinforcement operates. Although the role of attention in learning in multidimensional environments has been well accounted for4, here we propose and test a new framework in which working memory (WM) plays a central role in providing a representational structure utilised by RL (Fig. 1).
WM is a system that provides a temporary buffer to manipulate small amounts of information at a time5. Previous work suggests that it can be used as a short-term store for stimulus-action pairs6,7. Consequently, both WM and RL may be independently deployed to guide rewarded behaviour8. WM has been shown to be more useful for small numbers of associations where reward is deterministic. As it is limited in capacity, WM is not optimised to perform learning with high numbers of associations, or when the reward is probabilistic. In such instances, RL is deployed6. This points to a task-dependent trade-off between WM and RL9. Alternatively, an account of WM contributing to augmented expectations of reward in the RL system has also been reported10. However, we propose that WM may also contribute to RL by holding structured information about currently relevant distinctions in the environment, which might provide the template that determines which states contextualise RL. Crucially, WM encodes information in a flexible and goal-directed form, which potentially lends great power to RL’s requirement to assign values to relevant features. Such characteristics of WM may facilitate the selection of the relevant representational format. Here, we assessed whether this selection is influenced by the representational format of the stimulus information held in WM. This would indicate that RL and WM collaborate, rather than compete, in assigning value to actions in different states.
An important aspect of the representational format in WM is whether visual features of an object are bound together to form an integrated representation11,12,13, or are maintained separately as flat sets and bound indirectly by a shared location14,15,16,17,18. WM is a system that prioritises task-relevant information, and representational states may be dynamic, holding both feature-based and object-based representations19,20. Furthermore, tasks can be designed to bias subjects towards maintaining either features or bound objects21. As representational formats in WM may either be beneficial or detrimental depending on the context, WM representations must be dynamically aligned to task demands to optimise goal-directed outcomes.
In this pre-registered study, we leveraged the idea that WM representations can be flexibly shifted to investigate whether the representational format of WM could impact learning. We manipulated whether features (colour or shape) independently defined the stimulus-action rules or whether objects (colour-shape combinations) defined the stimulus-action rules.
In other words, the attribute to which credit should be assigned was either an individual feature or a bound object. We also biased participants towards holding the RL stimulus in WM as separate features in half of the blocks by probing the recall of single features (either a colour or a shape). In the other half, participants were probed on whole objects (coloured shapes). This paradigm therefore aimed to causally bias representations towards either features or objects and to assess the impact of this bias on each rule type.
Critically, the object and feature WM probes had exactly matched WM capacity requirements. We hypothesised that if biasing the format of WM representations altered learning outcomes, this would suggest that the RL process uses representations held in WM. Specifically, we predicted that if representations in WM are used in RL, a representational structure that matched the RL rule (feature representation for feature-based rules, and object representation for object-based rules) would improve learning. Overall, we evaluated whether learning-independent shifts in representations alter learning behaviour.
Proposed framework by which RL utilises representations defined in WM. WM stores representational templates as feature-based or object-based representations. These representations are used in RL. The size of the dark grey box indicates the size of the learned stimulus-action weights. Here, learnt weights are shown for the case where colour defines the rule, with blue associated with left and red associated with right. For example, blue states with right actions are shown with higher weighting than blue states with left actions.
Results
Participants completed 4 blocks of 128 trials each. Each trial consisted of Stage 1—RL and Stage 2—WM (Fig. 2). In Stage 1, a binary choice (left or right) was made in response to a centrally presented coloured shape. In Stage 2, the memory of the stimulus encountered in Stage 1 was probed. Some blocks biased WM representations towards feature representations by probing the memory of only the shape feature or the colour feature. In the other blocks, representations were biased towards object representations by probing the memory of coloured shapes. We evaluated how learning strategy changed with this WM bias (see Methods). We hypothesised that when the representation bias in WM was for features, feature rules in Stage 1—RL would be easier to learn. Alternatively, when the representation bias in WM was for objects, object rules would be easier to learn.
Experimental paradigm designed to assess impact of WM representational states on RL. (A) Stage 1: Participants completed an RL task with stimuli involving colour and shape features. Half of the blocks had feature-based rules whilst the other half had object-based rules. In feature-based rule blocks, participants had to make action decisions (left or right) depending upon either shape or colour. In object-based blocks, the rules were based on conjunctions of features. The RL rule defining the stimulus action mappings changed every 32 trials. Reward was 80% probabilistic. (B) Stage 2: After a 1.8s delay, participants recalled Stage 1 stimuli by selecting one of two items. In half of the blocks, they were probed with a single feature (just shape or just colour, top box), whilst in the other half, participants were probed using objects that contained both feature dimensions (bottom box). The memory demands were therefore equal in these two types of blocks. We quantified the impact of the Stage 2 WM representation on Stage 1 RL choice behaviour.
RL and WM accuracy
We first evaluated RL accuracy depending on rule type (feature or object rule) and WM probe type (feature or object probe). Both rule types were successfully learned over trials (Fig. 3A). There was higher average RL accuracy for feature rules than object rules (β = 0.33, z = − 8.43, p < .001, odds ratio: 1.39, Fig. 3B), as well as better WM performance for feature memory probes than object memory probes (β = 0.20, z = 4.36, p < .001, odds ratio: 1.22, Fig. 3B). This is expected because object rules require credit assignment to both colour and shape features, whereas in feature rules, only one of the features requires credit assignment. The interaction between the RL rule and the WM probe was not significant (β = − 0.10, z = − 1.51, p = .13, odds ratio = 0.90). Thus, contrary to our predictions, matching the memory probe to the type of learning rule did not increase RL accuracy, challenging the initial hypothesis that matching the probe to the learning rule would enhance RL performance.
RL accuracy based on RL rule and WM probe. (A) Average learning performance for each rule type on each trial. Each mini block consisted of 32 trials. Reward was 80% probabilistic, yielding a performance of approximately 0.8. The shaded area represents the 95% confidence interval. (B) Accuracy in the RL stage split by RL rule and by the type of WM probe in the WM stage, revealing distinguishable patterns in RL accuracy based on rule and probe types. Individual lines represent individual participants. Error bars are standard error of the mean. For both plots, the dotted line at 0.5 represents chance performance.
Magnitude of reinforcement
To examine whether WM representations are used as a substrate of RL, we investigated whether the effect of a reward on subsequent choice depends on the representational format defined in WM. We assessed the magnitude of reinforcement by quantifying p(stay), the probability of repeating an action (staying) when a reward was given, relative to when no reward was given (see “Methods” for details). This provided a metric to assess whether previous actions were reinforced by the previous reward. As expected, there was a strong effect of how many features changed from the previous trial on the likelihood of staying or switching choice based on the feedback provided in the previous trial (β = − 1.07, z = − 26.66, p < .001, odds ratio = 0.34, Fig. 4 left panel vs. middle panel vs. right panel). We therefore split the trials depending on how many features changed from the previous trial. For trials where two features changed from the previous trial, which is the critical condition that discriminates learning by objects vs. features, we found a significant effect of RL rule on p(stay) (the probability of repeating a choice) (β = 0.74, z = 9.50, p < .01, odds ratio = 2.10) (Fig. 4 right panel). We also observed that the WM probe affected p(stay) (β = − 0.274, z = − 2.90, p < .01, odds ratio = 0.76). Crucially, a significant interaction effect between WM probe and RL rule was found (β = 0.36, z = 2.71, p < .01, odds ratio = 1.43). Post-hoc comparisons revealed that when the WM probe was feature-based, p(stay) was closer to a win-stay, lose-switch strategy for feature rules (z = 2.91, p = .02, odds ratio = 1.31). In practice this meant that, when the rule was of the type “red = left, blue = right” or “square = left, diamond = right”, participants repeated their choice when the stimulus completely changed after reward versus no reward, but this learning was significantly stronger in blocks where the stimulus was probed as individual features in WM, compared to when it was probed as a whole object. However, an object-based representation did not enhance the probability of repeating an action according to a win-stay, lose-switch strategy for object rules (z = − 0.99, p = .76, odds ratio = 0.91). The observed improvement in choice strategy for feature RL rules when paired with feature-based WM probes suggests that participants adapt reinforcement strategies more efficiently by aligning relevant feature representations in WM with task-specific RL rules. In other words, in the win-stay, lose-switch analysis, there was evidence that the WM probe (which was irrelevant to the learning task) biased reinforcement learning strategy.
Probability of repeating a choice—P(stay), by number of feature changes during RL trials. Left panel: When rewarded on RL trial t and no features change on RL trial t + 1, win-stay, lose switch behaviour is to maintain the previous choice, irrespective of the rule type. Middle panel: When rewarded on RL trial t, and one feature changes on RL trial t + 1, win-stay, lose switch behaviour varies with rule types. For an object rule, the strategy is to always switch, whereas for a feature rule, the decision depends on which feature changed. For instance, if the relevant feature changes, staying might be beneficial, while if the irrelevant feature changes, switching might be more advantageous. Right panel: Win-stay, lose switch behaviour for two feature changes differs based on the RL rule. For object rules, individuals should stay, while for feature rules, individuals should switch. Optimal win-stay, lose switch strategy is represented by dotted lines. Individual lines represent individual participants. Error bars are standard errors of the mean. The text on the right describes the first-order strategy corresponding to the dotted lines.
Exploratory analysis
We explored whether the effects of WM representational formats were different at the start of the block after a rule change compared to later in the block. This was primarily motivated by the idea that WM representations could be more actively recruited at early stages after a rule reversal, where RL stimuli might be more robustly represented in WM. Contrary to our expectations, we found that in the first 16 trials, feature representations did not facilitate feature learning (z = 1.06, p = .72, odds ratio = 1.14) nor did object representation facilitate object rule learning (z = − 1.76, p = .29, odds ratio = 0.78). The reported effects of feature representations facilitating feature learning were driven by learning strategies adopted in the last 16 trials (z = 2.79, p = .027, odds ratio = 1.52). As per the previous main results, no effect was found for the object representations facilitating object rules, even in the last 16 trials (z = 0.40, p = .98, odds ratio = 1.06).
We also assessed whether the type of feature being probed would affect p(stay). Specifically, for feature rules, where we could assess the congruency between the rule and the dimension probed, we assessed whether congruency improved p(stay) with respect to a win-stay, lose-switch strategy. For example, if the rule was based on colour, and the WM probe on that given trial also tested memory for colour, would learning be stronger? No significant benefit of congruency was observed on p(stay) (β = 0.027, z = 0.048, p = .57, odds ratio = 1.03) when restricting the analysis to feature rules and dividing trials into congruent and incongruent. Further post-hoc tests did not show a benefit of congruent trials compared to incongruent trials (t(54) = 0.013, p = .99, d = 0.001). This indicates that it was not simply the visual match between the probed dimension and the rule that optimised choice behaviour, but rather the representational state.
Discussion
We proposed that working memory (WM) acts as a flexible store that represents aspects of visual input in a way that is relevant for reinforcement learning (RL). We expected that the format of WM would bias RL strategies across varying rule types (Fig. 1). Our pre-registered experimental paradigm, interleaving a WM task within an RL task, showed that WM representations influence RL strategies. Crucially, instead of probing WM directly at the time of retrieval, we indirectly examined the use of WM contents in RL, which we argue interrogates the encoding and maintenance period and its use as a substrate for learning. This highlights how internal representations in WM might be harnessed in the RL process. Although all trials in this study included both an RL and a WM stage, it is likely that these findings extend to tasks where WM of the RL stimuli is not probed. Such representations may be defined by a prior for what the suitable representation is in that context22.
Whilst we expected that the WM probe type (feature or object) matched to the RL rule (feature or object rule) would enhance RL accuracy, our findings showed no benefit in overall performance accuracy. Accuracy is measured relative to a fictional ‘correct’ choice in a probabilistic task, which is inaccessible to the participant, and is therefore an impure measure of RL. We therefore assessed the trial-wise probability of repeating an action after receiving a reward. This metric, p(stay), varied with the type of WM probe. Participants adopted improved reinforcement strategies when relevant feature representations in WM were aligned with feature-specific RL rules. This is consistent with the hypothesis that when WM represented the stimuli as features, feature rules were easier to learn, compared to when WM represented the stimuli as objects. However, this was not the case in the opposite direction: object-based WM representations did not facilitate learning of object-specific RL rules.
One explanation for this asymmetry is that the preferred WM format for RL is features, which the feature WM probe accentuated. It can be speculated that object binding requires more effort, reducing the capacity for WM representations to influence RL. Simple feature-based rule learning likely relies on a more explicit WM process compared to feature-integration learning23. This observation aligns with our findings that feature rules demonstrate higher accuracy than object rules. Feature-based rule learning may be equivalent to a lower WM load for rule learning, making it more susceptible to WM influences. In contrast, object rules may require greater WM capacity, as they necessitate attention to multiple features and are likely driven more by reward prediction errors. This interpretation is consistent with previous RL and WM studies6,9, which indicate that WM plays a more significant role at lower set sizes compared to higher set sizes. Alternatively, our WM bias may not have been strong enough. The cost of holding two features remains lower than the generally accepted limits of WM capacity, and therefore the bias towards representing the stimulus as a bound object may not have been successful. It is important in the future to move beyond lab-based studies using simple, static visual input with minimal ecological validity, and to use tasks with richer and more complex environments with higher WM load24. Introducing greater complexity and uncertainty and reducing the precision of momentary internal representations could preferentially favour object representations. Finally, WM is not a unitary system25, and the same could also be said for RL. Thus, it is possible that the dovetailing of these two systems may not be complete, with only certain aspects of WM accessible for RL. For example, recent neural models of WM have distinguished sensory representations from “conjunctive” representations that enable this information to be bound into unified objects26. Thus, a possible explanation for why only feature-based encoding influenced RL is that it is easier for the sensory component of WM representations to impact RL. In support of this explanation, an fMRI study found that post-reward activation of sensory regions was influenced by attention, whereby attending to faces increased BOLD signal in the fusiform face area after a reward27. This study also found that these BOLD signal changes in sensory regions were influenced by connectivity with reward-related regions such as the ventral striatum. Cumulatively, this suggests that the sensory regions supporting memory and RL may be inextricably linked, a coupling that may be stronger for single-feature representations.
Our exploratory analysis revealed that probing one feature dimension of the stimulus, e.g. its colour, did not bias learning towards that particular feature. Thus, simply attending to a feature did not facilitate RL, which speaks to the role of representational format in shaping RL. This may be because this attentional selection occurs after the reinforcement takes place. In contrast, the representational format (i.e. object vs. features) remained constant for a whole block and could therefore bias both encoding and reinforcement. It remains an open question whether the representational bias relates only to how the stimulus is encoded into WM, or whether it can be subsequently affected by changes in WM representation, e.g. at the time of reinforcement.
The effect on RL strategy cannot be explained simply in terms of attentional set, as only the format of the information was biased by the WM probe. Attentional set has traditionally been studied to assess how humans and animals select which dimensions of the stimulus are used (e.g. colour vs. shape) for learning28,29. The Wisconsin Card Sorting Task (WCST) requires individuals to sort cards based on changing rules without explicit instructions and has been used to study set-shifting and cognitive flexibility3. Rule learning in this context requires correct credit assignment on the defined set of features. Similarly, in the Intra-dimensional/Extra-dimensional shift test (ID/ED)30, participants are required to shift attention between different dimensions of a stimulus. Extra-dimensional shifts require attention to a previously irrelevant dimension, whereas intra-dimensional shifts require attention to the same dimension. Like the WCST, the ID/ED task requires rule learning based on feedback, and shifting the state on which decisions are based is critical. Importantly, WM and set-shifting have been reported to act cooperatively in overlapping regions of the prefrontal cortex31. In the present study, the aspects of the stimulus that were relevant remained unaffected by the WM bias. Furthermore, we showed that congruent trials, where the feature dimension probed in the WM stage matched the learning rule, did not alter learning strategy compared to incongruent trials. Therefore, the observed effects of biasing WM representations are attributed to a change in representation, rather than to selecting or filtering the features, as has been done in studies investigating attentional set.
Finally, we have framed the effect of WM in terms of influencing RL, which suggests that value updates are applied only to the representations held in WM. For example, if items are held as features, then RL updates the state-action values, which represent the expected future reward for taking an action in a given state, for feature states, i.e. States = {square, circle, red, blue}; whereas if items are held as objects, then RL updates the values of actions for object states, i.e. States = {red square, red circle, blue square, blue circle} (Fig. 1). An alternative possibility is that, in all cases, RL updates some or all of these values in a way that does not depend on WM. Then, at the time of decision, the values associated with certain actions given a state are combined and weighted according to the contents of WM. In other words, WM may affect value integration at the time of decision-making, rather than affecting learning directly. A similar dissociation has been proposed for model-based learning vs. decision-making32. The current study cannot rule out this possibility and future studies are required to directly assess this.
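To make this distinction concrete, the sketch below (in R, with illustrative names; it is not a model fitted in this study) shows how the same rewarded choice would credit different state-action values depending on whether the stimulus is represented as separate features or as a bound object.

```r
# Illustrative sketch (not a fitted model): a simple delta-rule update applied
# to two alternative state spaces for the same stimulus.
update_q <- function(Q, state, action, reward, alpha = 0.1) {
  Q[state, action] <- Q[state, action] + alpha * (reward - Q[state, action])
  Q
}

actions <- c("left", "right")

# Feature-based states: values stored per feature (4 feature states x 2 actions)
Q_feature <- matrix(0, nrow = 4, ncol = 2,
                    dimnames = list(c("red", "blue", "square", "diamond"), actions))

# Object-based states: values stored per conjunction (4 objects x 2 actions)
Q_object <- matrix(0, nrow = 4, ncol = 2,
                   dimnames = list(c("red square", "red diamond",
                                     "blue square", "blue diamond"), actions))

# A rewarded "right" response to a red square credits different states
# depending on the representational format assumed to be held in WM.
Q_feature <- update_q(Q_feature, "red",        "right", reward = 1)
Q_feature <- update_q(Q_feature, "square",     "right", reward = 1)
Q_object  <- update_q(Q_object,  "red square", "right", reward = 1)
```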
Our results underscore the interplay between WM representations and RL strategies. This study of the interaction of WM and RL provides important insight into how executive control is deployed for WM. Such interaction has been useful in characterising functional decline in schizophrenia33,34. Could there be cases of intact WM and intact RL, but dysfunction in their interaction? Future work assessing decline in WM and RL is likely to shed light on behavioural changes in neurological and psychiatric conditions.
Overall, our results underscore the interplay between WM representations and RL strategies: states for RL depend on the manner in which WM contents are represented. RL is a computational formalism that does not necessarily make mechanistic commitments. We provide initial evidence for an underlying mechanism by which biological RL can be so flexible and context-dependent. These findings prompt further exploration into the dynamic nature of WM in decision-making and the utilisation of internal representations defined in WM in other cognitive processes.
Materials and methods
Participants
Data from 55 participants (mean age = 28.76, SD = 4.66; mean years of education = 15.53, SD = 3.40; 26 males, 29 females) were used for this study. The sample size was predetermined and pre-registered to achieve 0.90 power at a significance level of 0.05, to detect an effect size d' of 0.40. This effect size was based on pilot data using a similar paradigm documented in the preregistration at https://osf.io/k7zjd (Shibata et al., 2023). Participants were recruited on a rolling basis until 55 individuals met the following three criteria: (1) completed all experiment trials, (2) achieved an average asymptotic RL performance of 60% during the last 10 trials before a rule reversal (i.e. a change in rules) and (3) responded to at least 75% of the WM trials correctly. A total of 84 people were initially recruited to the study and received monetary compensation. Thirty-one were excluded from the analysis as they did not meet the RL learning criterion. Three of those also had not attained the WM accuracy cut-off. No participant who performed above the RL performance threshold had a WM accuracy below 75%. All participants were fluent English speakers and reported no history of neurological or psychiatric illness. Informed consent was obtained from all participants through an online questionnaire. The protocol was approved by the ethics committee of the University of Oxford (IRAS ID: 248379; Ethics Approval Reference: 18/SC/0448). All research was performed in accordance with relevant research guidelines.
Experimental tasks
Apparatus
The task was programmed using PsychoPy (V2022.2.4) and hosted on Pavlovia. Participants were recruited on the Prolific platform. Participants completed the task on their personal computer devices (participation through mobile and tablet devices was disabled) and were instructed to position themselves at arm’s length from their screen for the duration of the study.
Procedure
Participants completed a total of 512 trials, distributed across 4 blocks of 128 trials each. Each trial consisted of two distinct stages:
- Stage 1—RL was a binary choice task which involved making a right or left click depending on the centrally presented coloured shape to receive a reward. Stimulus-response mappings, i.e. the rule defining the rewarded action corresponding to a particular stimulus, changed throughout the experiment.
- Stage 2—WM was a probe for the memory of the stimulus they had just encountered in Stage 1. A blocked design allowed the biasing of WM representations into feature representations or object representations. We evaluated the impact of this bias on learning behaviour in Stage 1—RL.
Prior to starting the main experiment, participants completed three practice blocks of 10 trials of each stage. The RL rule was provided for the practice blocks. Only individuals who achieved a performance accuracy of over 22/30 (73.33%) in both Stage 1 and Stage 2 trials during the practice were able to proceed to the main experiment. This was implemented to verify task compliance and comprehension prior to starting the experiment. Participants were instructed to answer each trial as accurately and as quickly as possible. All participants were remunerated for their time.
Stage 1 (Fig. 2A)
The first part of the trial was a 2-choice RL task. Participants were presented with a stimulus appearing at the centre of the screen. This stimulus was made up of a combination of two features: colour (red or blue) and shape (square or diamond), making four possible stimuli: red square, red diamond, blue square, blue diamond. Participants were tasked with responding to the centrally presented stimulus with a mouse click on a grey circle either to the left or to the right of the stimulus, to win arbitrary rewards in the form of points in a bank. A running total of points won was presented at the top of the screen throughout the task. No time limit was imposed. Visual feedback (‘Win!’ or ‘Lose!’) was presented immediately. A win resulted in an arbitrary reward of + 5 points, whereas no points were awarded for losing.
The rewarded action was determined by an underlying RL rule that changed every 32 trials. As there were 128 trials per block, three rule reversals occurred within a single block. In half of the blocks, RL rules were based on individual features (feature rules), where either the colour or the shape of the object was associated with reward. The other half of the blocks had rules based on objects (object rules), where the conjunction of colour and shape was associated with reward. The object-based rules never overlapped with feature-based rules, such that object-based rules were always crossed, i.e. blue diamond, red square = left and red diamond, blue square = right, OR red diamond, blue square = left and blue diamond, red square = right. Reward was 80% probabilistic, whereby for every 5 correct choices, only 4 were rewarded. In pilot data, we found that this level of uncertainty allowed participants to learn over multiple trials, with an accuracy of around 70–80%, which is optimal for detecting changes in learning.
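As an illustration of the rule structure and reward contingency just described, the following R sketch defines example feature and crossed object rules and the 80% probabilistic feedback. The function and variable names are ours, not taken from the experiment code, and the treatment of incorrect choices (unrewarded here) is an assumption.

```r
# Illustrative sketch of the Stage 1 rule structure and 80% probabilistic reward
# (names are ours; incorrect choices are assumed to earn nothing).
stimuli <- expand.grid(colour = c("red", "blue"),
                       shape  = c("square", "diamond"),
                       stringsAsFactors = FALSE)

# Feature rule: only colour determines the correct action
feature_rule <- function(colour, shape) ifelse(colour == "red", "left", "right")

# Object rule: crossed conjunctions, e.g. red square / blue diamond -> left
object_rule <- function(colour, shape) {
  ifelse((colour == "red" & shape == "square") |
         (colour == "blue" & shape == "diamond"), "left", "right")
}

# 80% probabilistic feedback for correct choices
give_reward <- function(choice, correct_action, p_reward = 0.8) {
  if (choice == correct_action) rbinom(1, 1, p_reward) else 0
}

give_reward("left", object_rule("red", "square"))  # rewarded with probability 0.8
```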
Stage 2 (Fig. 2B)
The second part of each trial consisted of a memory recall of the stimulus from Stage 1, after a 1.8 s delay. Participants were asked to select which of two stimuli, presented above and below the screen centre, corresponded to that observed in Stage 1. No time limit was imposed. In half of the blocks, participants were probed on one of the two features (either colourless shapes or shapeless colours). Whether the shape or the colour was probed was interleaved within a single block, so that participants had to encode both features into working memory. In the other half of the blocks, participants were probed on objects containing both feature dimensions: coloured shapes. The lure object could differ from the stimulus in either one or both features. Deterministic feedback was provided after every choice, with an arbitrary reward of + 1 point for a correct response and no reward for an incorrect response.
Quantifying RL
We investigated reinforcement strategies in Stage 1 by examining the first-order strategy: the tendency to repeat an action, as a function of what was seen on the previous trial and whether the participant was rewarded. These simple heuristics allow us to quantify the type of representation subjects use for RL. We assessed whether the previous action was reinforced by the previous reward, i.e. the tendency to apply a “win-stay” or “lose-switch” strategy. To quantify this staying or switching behaviour, we calculated p(Stay), the probability of repeating an action (staying) when a reward was given, relative to when no reward was given, as the following equation:

P(Stay) = P(action_t = action_t−1 | reward at t−1) − P(action_t = action_t−1 | no reward at t−1),

where t = trial number. A positive P(Stay) value indicates a higher likelihood of repeating an action after a reward than after no reward, while a negative P(Stay) value suggests a greater probability of switching action.
Different choice strategies exist depending on the number of feature changes from one RL trial to the next. These strategies, illustrated in Fig. 4, inform the likelihood of repeating an action (p(stay)) under varying conditions.
- When successive trials are identical (no feature changes), there should be a strong positive reinforcement to repeat an action for both feature and object rules.
- If one feature from the previous trial is repeated, this leads to a neutral impact on action repetition for feature rules, but for object rules, it should prompt a switch in action.
- In cases where both features change from the previous trial, for feature rules the win-stay, lose-switch strategy is to switch the action, but for object rules, the strategy would be to stay on the previously taken action.
We specifically assess choice behaviour on trials with changes in both features as this trial type exhibits the most pronounced differences in p(stay) behaviour between the two rule types, which is a key point of interest in our analysis.
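A minimal sketch of this computation, assuming a trial-level data frame with illustrative column names (subject, block, trial, choice, reward, colour, shape, rl_rule, wm_probe), is shown below; it is not the analysis code from the study.

```r
# Illustrative computation of p(stay): for each condition, the probability of
# repeating the previous action after a reward minus after no reward
# (column names are ours).
library(dplyr)

compute_pstay <- function(trials) {
  trials %>%
    arrange(subject, block, trial) %>%
    group_by(subject, block) %>%
    mutate(stay        = as.integer(choice == lag(choice)),
           prev_reward = lag(reward),
           n_changed   = (colour != lag(colour)) + (shape != lag(shape))) %>%
    ungroup() %>%
    filter(!is.na(stay)) %>%
    group_by(subject, rl_rule, wm_probe, n_changed) %>%
    summarise(p_stay = mean(stay[prev_reward == 1]) - mean(stay[prev_reward == 0]),
              .groups = "drop")
}
```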
Data pre-processing
All trials with an RT exceeding 3 SD of the mean were considered outliers and removed from the analysis. Additionally, incorrect trials in the WM stage were removed, as this indicated that the RL stimulus was not attended to (3.3% of otherwise retained trials). The steps of trial and participant exclusions are reported in the pre-registration. We additionally removed trials in which both feature dimensions (colour and shape) were probed, as these trials contain more information in the WM stage and do not act as a valid comparison.
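A sketch of these exclusion steps, under the assumption that the RT cut-off is computed per participant and applied to RTs above the mean (column names are illustrative):

```r
# Illustrative pre-processing (column names are ours): remove RT outliers and
# trials with incorrect WM responses.
library(dplyr)

preprocess <- function(trials) {
  trials %>%
    group_by(subject) %>%
    filter(rt <= mean(rt) + 3 * sd(rt)) %>%  # exclude RTs more than 3 SD above the mean
    ungroup() %>%
    filter(wm_correct == 1)                  # WM errors imply the RL stimulus was not attended
}
```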
Statistics
Data was analysed using R (Rstudio 2022.12.0 + 353) and Matlab R2020b. A logistic mixed-effects regression model was fitted using the ‘glmer’ function from the ‘lme4’ package in R. The model aimed to evaluate the impact of RL Rule (feature or object) and WM Probe (feature or object) on binary RL choice outcomes (correct/incorrect). Specifically, the model investigated the individual contributions of RL Rule and WM Probe as well as their potential interaction effect on performance. Random effects were introduced for individual subjects to accommodate variability between conditions within each subject, accounting for within-subject correlations in the data. The model equation was formulated as follows: Accuracy (correct/incorrect) ~ 1 + RL Rule * WM Probe + (1 | Subject). We also ran an additional mixed-effects model on p(stay) when two features changed from the previous trial: P(Stay) (stay/switch) ~ 1 + RL Rule * WM Probe + (1 | Subject). We used the Tukey method for post-hoc comparisons. For between-subject exploratory analysis, Pearson correlations were used to test for correlation between variables. An alpha of 0.05 was used to report statistical significance and Greenhouse-Geisser correction was applied to degrees of freedom to correct for non-sphericity where appropriate.
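For reference, the two models described above can be specified with lme4 as follows (data-frame and column names are illustrative, not the study's own scripts):

```r
# lme4 specification of the two logistic mixed-effects models described above
# (data frames and column names are illustrative).
library(lme4)

# Accuracy model: RL rule x WM probe with a random intercept per subject
m_acc <- glmer(rl_correct ~ 1 + rl_rule * wm_probe + (1 | subject),
               data = rl_trials, family = binomial)

# Stay/switch model, restricted to trials where both features changed
m_stay <- glmer(stay ~ 1 + rl_rule * wm_probe + (1 | subject),
                data = subset(stay_trials, n_changed == 2), family = binomial)

summary(m_acc)
```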
Data availability
The code and datasets analysed in the current study are available from the corresponding author on reasonable request.
References
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (The MIT Press, 2018).
Niv, Y. Learning task-state representations. Nat. Neurosci. 22, 1544–1553 (2019).
Grant, D. A. & Berg, E. A behavioral analysis of degree of reinforcement and ease of shifting to new responses in a Weigl-type card-sorting problem. J. Exp. Psychol. 38, 404–411 (1948).
Niv, Y. et al. Reinforcement learning in Multidimensional environments relies on attention mechanisms. J. Neurosci. 35, 8145–8157 (2015).
Cowan, N. The magical number 4 in short-term memory: a reconsideration of mental storage capacity. Behav. Brain Sci. 24, 87–114 (2001).
Collins, A. G. E. The Tortoise and the Hare: interactions between reinforcement learning and Working Memory. J. Cogn. Neurosci. 30, 1422–1432 (2018).
Collins, A. G. E. & Frank, M. J. How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. Eur. J. Neurosci. 35, 1024–1035 (2012).
Yoo, A. H. & Collins, A. G. E. How working memory and reinforcement learning are intertwined: a cognitive, neural, and computational perspective. J. Cogn. Neurosci. 34, 551–568 (2022).
Rac-Lubashevsky, R., Cremer, A., Collins, A. G. E., Frank, M. J. & Schwabe, L. Neural index of reinforcement learning predicts improved stimulus–response Retention under High Working Memory load. J. Neurosci. 43, 3131–3143 (2023).
Collins, A. G. E. & Frank, M. J. Within- and across-trial dynamics of human EEG reveal cooperative interplay between reinforcement learning and working memory. Proc. Natl. Acad. Sci. 115, 2502–2507 (2018).
Luck, S. J. & Vogel, E. K. The capacity of visual working memory for features and conjunctions. Nature. 390, 279–281 (1997).
Vogel, E. K., Woodman, G. F. & Luck, S. J. Storage of features, conjunctions, and objects in visual working memory. J. Exp. Psychol. Hum. Percept. Perform. 27, 92–114 (2001).
Brady, T. F., Konkle, T. & Alvarez, G. A. A review of visual memory capacity: beyond individual items and toward structured representations. J. Vis. 11, 4–4 (2011).
Bays, P. M., Wu, E. Y. & Husain, M. Storage and binding of object features in visual working memory. Neuropsychologia. 49, 1622–1631 (2011).
Fougnie, D. & Alvarez, G. A. Object features fail independently in visual working memory: evidence for a probabilistic feature-store model. J. Vis. 11, 3–3 (2011).
Fougnie, D., Cormiea, S. M. & Alvarez, G. A. Object-based benefits without object-based representations. J. Exp. Psychol. Gen. 142, 621–626 (2013).
Wheeler, M. E. & Treisman, A. M. Binding in short-term visual memory. J. Exp. Psychol. Gen. 131, 48–64 (2002).
Schneegans, S. & Bays, P. M. Neural Architecture for feature binding in visual Working Memory. J. Neurosci. 37, 3913–3925 (2017).
Vergauwe, E. & Cowan, N. Working memory units are all in your head: factors that influence whether features or objects are the favored units. J. Exp. Psychol. Learn. Mem. Cogn. 41, 1404–1416 (2015).
Geigerman, S., Verhaeghen, P. & Cerella, J. To bind or not to bind, that’s the wrong question: features and objects coexist in visual short-term memory. Acta Psychol. (Amst). 167, 45–51 (2016).
Cao, R. & Deouell, L. Y. Binding in Visual Working Memory Is Task-Dependent. https://doi.org/10.1101/2023.11.01.565116 (2023).
Bays, P. M., Schneegans, S., Ma, W. J. & Brady, T. F. Representation and computation in visual working memory. Nat. Hum. Behav. 8, 1016–1034 (2024).
Ashby, F. G. & Maddox, W. T. Human category learning. Annu. Rev. Psychol. 56, 149–178 (2005).
Draschkow, D., Kallmayer, M. & Nobre, A. C. When Natural Behavior engages Working Memory. Curr. Biol. 31, 869–874e5 (2021).
Baddeley, A. Working memory theories, models, and controversies. Annu. Rev. Psychol. 63, 1–29 (2012).
Manohar, S. G., Zokaei, N., Fallon, S. J., Vogels, T. P. & Husain, M. Neural mechanisms of attending to items in working memory. Neurosci. Biobehav. Rev. 101, 1–12 (2019).
Schiffer, A. M., Muller, T., Yeung, N. & Waszak, F. Reward activates stimulus-specific and Task-Dependent Representations in Visual Association Cortices. J. Neurosci. 34, 15610–15620 (2014).
Owen, A. M., Roberts, A. C., Hodges, J. R. & Robbins, T. W. Contrasting mechanisms of impaired attentional set-shifting in patients with frontal lobe damage or Parkinson’s disease. Brain. 116, 1159–1175 (1993).
Barceló, F., Muñoz-Céspedes, J. M., Pozo, M. A. & Rubia, F. J. Attentional set shifting modulates the target P3b response in the Wisconsin card sorting test. Neuropsychologia. 38, 1342–1355 (2000).
Slamecka, N. J. A methodological analysis of shift paradigms in human discrimination learning. Psychol. Bull. 69, 423–438 (1968).
Konishi, S. Contribution of working memory to transient activation in human inferior Prefrontal cortex during performance of the Wisconsin Card sorting test. Cereb. Cortex. 9, 745–753 (1999).
Doody, M., Van Swieten, M. M. H. & Manohar, S. G. Model-based learning retrospectively updates model-free values. Sci. Rep. 12, 2358 (2022).
Collins, A. G. E., Brown, J. K., Gold, J. M., Waltz, J. A. & Frank, M. J. Working memory contributions to reinforcement learning impairments in Schizophrenia. J. Neurosci. 34, 13747–13756 (2014).
Collins, A. G. E., Albrecht, M. A., Waltz, J. A., Gold, J. M. & Frank, M. J. Interactions among working memory, reinforcement learning, and effort in Value-based choice: a new paradigm and selective deficits in Schizophrenia. Biol. Psychiatry. 82, 431–439 (2017).
Funding
This work was funded by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC) grant and the MRC clinician scientist fellowship [MR/P00878/X] to S.G.M; a Wellcome Trust Principal Research Fellowship to M.H; the Berrow Foundation to K.S; and the ESRC ES/P000649/1 and New College 1379 Old Members Scholarship to V.K.
Author information
Contributions
K.S. and S.G.M. designed, interpreted and pre-registered the study. K.S. acquired data and wrote the initial draft of the manuscript. V.K. created the online version of the paradigm. K.S. and V.K. analysed the data with the help of S.G.M. S.G.M. and S.J.F. were involved in the initial conceptualisation of the idea. S.G.M., M.H. and S.J.F. provided critical feedback on the manuscript. All authors contributed to the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Shibata, K., Klar, V., Fallon, S.J. et al. Working memory as a representational template for reinforcement learning. Sci Rep 14, 27660 (2024). https://doi.org/10.1038/s41598-024-79119-2






