Post-learning replay of hippocampal-striatal activity is biased by reward-prediction signals

Roscow, Emma L.; Howe, Timothy; Lepora, Nathan F.; Jones, Matthew W.

doi:10.1038/s41467-025-65354-2

Download PDF

Article
Open access
Published: 24 November 2025

Post-learning replay of hippocampal-striatal activity is biased by reward-prediction signals

Nature Communications volume 16, Article number: 10394 (2025) Cite this article

6785 Accesses
1 Citations
16 Altmetric
Metrics details

Subjects

Abstract

Neural activity encoding recent experiences is replayed during sleep and rest to promote consolidation of memories. However, precisely which features of experience influence replay prioritisation to optimise adaptive behaviour remains unclear. Here, we trained adult male rats on a novel maze-based reinforcement learning task designed to dissociate reward outcomes from reward-prediction errors. Four variations of a reinforcement learning model were fitted to the rats’ behaviour over multiple days. Behaviour was best predicted by a model incorporating replay biased by reward-prediction error, compared to the same model with no replay, random replay or reward-biased replay. Neural population recordings from the hippocampus and ventral striatum of rats trained on the task evidenced preferential reactivation of reward-prediction and reward-prediction error signals during post-task rest. These insights disentangle the influences of salience on replay, suggesting that reinforcement learning is tuned by post-learning replay biased by reward-prediction error, not by reward per se. This work therefore provides a behavioural and theoretical toolkit with which to measure and interpret the neural mechanisms linking replay and reinforcement learning.

Predictive coding of reward in the hippocampus

Article 14 January 2026

Trait anxiety is associated with reduced reward-related replay at rest

Article Open access 26 August 2025

The behavioral signature of stepwise learning strategy in male rats and its neural correlate in the basal forebrain

Article Open access 21 July 2023

Introduction

Good decisions typically rely on past experience to guide future behaviour. Actions which have previously produced beneficial outcomes in a similar context can be reinforced to adapt behaviour for maximising benefit. The ability for brain activity to drive synaptic plasticity, establishing functional networks encoding and implementing task-relevant information and actions, is central to this learning. These functional networks are refined during sleep and rest, when many neurons switch to a so-called offline state in which they replay activity encoding previous or anticipated experiences rather than current events or behaviours^1,2,3,4. This offline replay, found across cortical, limbic and basal ganglia regions, has been suggested to play roles in decision-making⁵, emotional processing⁶, generalising across episodes⁷ and reinforcement learning⁸.

Studies in which replay has been manipulated provide strong evidence for its contributions to memory consolidation. For example, artificially enhancing replay by presenting odours or sounds during sleep, which had previously been paired with object locations or visual stimuli, leads to better subsequent recall of the paired stimuli^9,10,11,12. Disrupting replay events, meanwhile, impairs subsequent spatial memory^13,14,15,16.

An examination of how replay aids these cognitive processes requires assessment of which activity is replayed with greatest strength or frequency. Activity which is associated with experiences of reward^17,18,19,20 or fear^21,22, or with recent, repeated and/or novel experiences^23,24, is replayed preferentially. This suggests a replay bias towards the most salient experiences to be processed, consolidated or incorporated into an internal model of the world. However, these salient experiences could also be interpreted as those with the highest prediction error, i.e. the most unexpected and therefore informative experiences for updating internal models and for reinforcement learning. Tasks which involve learning the locations of rewards often conflate reward with reward-prediction error (RPE), leaving open the possibility that apparent replay biases towards reward actually reflect biases towards RPE.

Here we combine behaviour, reinforcement learning and electrophysiology to explore the hypothesis that RPE, rather than solely reward or salience, bias replay. We used variations of a reinforcement learning model, Q-learning, to estimate the value of actions encoded in the striatum during a reinforcement learning task, and varied the amount and type of replay in the model to predict behaviour. Reinforcement learning relies on inputs from hippocampus to ventral striatum^25,26,27,28, where representations of reward values differ following learning acquired over weeks compared to when acquired over minutes²⁹ and, correspondingly, reward-responsive cells are replayed preferentially in the ventral striatum¹⁸. We therefore propose that replay triggers value updates in the striatum, to enhance striatum-dependent reinforcement learning and moreover that activity encoding events that resulted in high RPE is preferentially replayed. To corroborate this, we also recorded single-unit activity simultaneously from the hippocampus and ventral striatum during learning of the same task, revealing signatures of inter-area reward prediction signals and intra-area reward-prediction-error signals being preferentially reactivated during post-task rest.

Q-learning³⁰ has been used successfully to model reinforcement learning, particularly in humans^31,32 but also in rodents^33,34,35. Q-learning models fit both behavioural outcomes and striatal activity, suggesting that they describe mechanisms of updating values in the striatum in response to RPEs which in turn guide behaviour^36,37,38,39. Temporal-difference-based RPEs, i.e. the difference between expected reward and actual reward which drives the update of Q values, closely resemble the dopaminergic input of ventral tegmental area (VTA) to the striatum^39,40,41, which modulates synaptic plasticity in the striatum⁴² and may provide a mechanism for the biological equivalent of Q-learning. Dyna-Q⁴³, a variant of Q-learning which incorporates offline temporal-difference updates, has been used to model replay in ways which produce learning qualitatively similar to animal reinforcement learning⁴⁴. RPE-biased replay has also been incorporated into machine learning algorithms and shown to enable much more efficient reinforcement learning, including for Atari games⁴⁵ and navigating a simulated environment⁴⁶ faster and with more success compared to replay without such a bias⁴⁷. These algorithms demonstrate the utility of prioritising replay by RPE, and provide a theoretical foundation for investigating RPE-biased replay in the hippocampal-striatal circuit.

We trained 6 rats on a stochastic reinforcement learning task which elicited both positive and negative RPE, and fitted Q-learning parameters to each rat’s behavioural data. We then included replay events between sessions, to simulate the effect of replay during sleep on reinforcement learning. Four replay policies were compared, prioritising state-action pairs to be updated according to different biases: random replay, replay proportional to expected reward, and two forms of RPE-biased replay. Random replay was included as a control, while reward-biased replay reflects the prevailing view of how replay is prioritised. Fitting the model parameters showed that the two RPE-biased replay policies increased the model’s predictive accuracy, while random and reward-biased replay did not. A separate cohort of 3 rats was trained on the same task while recordings were made in dorsal CA1 and ventral striatum. Pairs of CA1 and striatal neurons were reactivated within and between these regions during sharp-wave ripples in the post-task consolidation period. The most strongly reactivated cell pairs showed preferential firing during the approach towards a reward location with a high anticipated probability of reward, indicating replay of reward-prediction signals, not pure reward signals. Within the striatum, the most strongly reactivated pairs of striatal cells showed preferential firing following a less-expected reward, indicating replay of reward-prediction-error signals. This suggests that replay between sessions of a probabilistic reinforcement learning task in rats is biased by RPE and not solely by reward.

Results

Rats successfully learned a stochastic reinforcement learning task

Six rats were trained to forage for stochastic sucrose rewards on a three-armed maze, to assess their reinforcement learning on a task where reward outcome and RPE were dissociable. Each arm was assigned as either high probability, mid probability or low probability, which determined the protocol for reward delivery (Fig. 1a). This was designed so that, once rats gained enough experience of the task to correctly anticipate the reward probabilities, receipt of reward would elicit a low RPE, medium RPE and high RPE on each arm, respectively. For the first 15 daily training sessions, the high-probability arm delivered a reward on 75% of legitimate arm entries, the mid-probability arm on 50%, and the low-probability arm on 25%. A legitimate entry was one in which a different arm had been entered on the previous trial; entering the same arm twice in a row was illegitimate and did not result in a reward delivery. For sessions 16–20, the difference in reward probabilities for the high- and low-probability arms was amplified: reward was delivered on 87.5% and 12.5% legitimate entries, respectively. For sessions 21–22 the reward probabilities for the high- and low-probability arms were switched, such that the (formerly) high- and low-probability arms delivered reward on 12.5% and 87.5% of legitimate entries respectively. This set-up meant that receiving a reward in a low-probability arm would elicit a higher RPE than the same reward value in a high-probability arm, so reward outcome and RPE could be dissociated.

**Fig. 1: Behavioural performance on the task.**

Over 22 sessions, animals learned to distinguish between the high-, mid- and low-probability arms in their frequency of visits to each arm, indicating successful learning of the reward probabilities. Rats performed 45.1 ± 2.5 trials per session, eventually showing a significant preference for the high-probability arm and against the low-probability arm, evident by session 6 and stable by session 11. The six animals varied in the degree of their discrimination between the arms (Fig. 1b), but on average they distinguished between all arms on 14 out of 22 sessions (Fig. 1c; χ² tests, uncorrected), visiting the arms which delivered a higher probability of reward more often, particularly in later sessions. To minimise the possible confound of the maze orientation in the room, the arm probabilities were rotated between animals (for example, animals may have shown a confounding preference for the arm which was closest to the door of the recording room).

To quantify performance on the task, each trial was coded as optimal or suboptimal according to the animal’s choice given the arm most recently visited. Because no reward was given for re-entering the same arm consecutively, the optimal action choice following a visit to the mid- or low-probability arm was to visit the high-probability arm; the optimal action following the high-probability arm was the mid-probability arm. Over sessions, animals increased the proportion of trials on which they behaved optimally, achieving performance significantly above chance level of 33% from session 3 onwards (one-sided binomial tests, Bonferroni-corrected). Using a more conservative chance level of 50%, to account for rats’ natural tendency to alternate rather than repeat arms, they performed significantly above chance on 8 out of 22 sessions (Fig. 1d).

Reward probabilities were changed twice over the course of learning, triggering clear changes in behaviour. In the revaluation learning stage (sessions 16–20), the reward probabilities at each arm became more distinct: the high-probability arm delivering an 87.5% probability of reward compared to 75% in the initial learning stage, and the low-probability arm delivering a 12.5% probability of reward compared to 25% in the initial learning stage. This change offered a higher incentive-to-cost ratio and, correspondingly, preference for the high-probability arm over the low-probability arm increased compared to the previous five sessions (Fig. 1c; repeated-measures ANOVA, F(1) = 9.37, p = 0.005). As a result, the rate of optimal performance was also greater in the revaluation stage than during the last five sessions of the initial learning stage (Fig. 1d; repeated-measures ANOVA, F(1) = 13.2, p = 0.001).

The definition of optimal behaviour was the same in the initial and revaluation learning stages, because the arms did not change. However, optimal behaviour required a different behavioural policy in the reversal learning stage (sessions 21–22) when the high- and low-probability arms were switched. As expected, optimal performance correspondingly dipped when reward probabilities were reversed in sessions 21–22 as this new behavioural policy was learned: the frequency of optimal arm choices during the reversal learning stage fell to roughly the 50% chance level. These behavioural data demonstrate that reward probabilities successfully influenced learning and behaviour in the task, and that animals were capable of showing flexibility in response to changing reward. We therefore went on to test whether reinforcement learning algorithms were able to recapitulate rat behaviour and whether instantiating between-session (offline) replay of different task features improved model performance.

Q-learning modelled animal behaviour

We trained a Q-learning algorithm with no replay to generate probabilities of each action for each trial, based on Q values estimated from the animals’ previous experience (Fig. 2). Q-learning is a reinforcement learning algorithm in which an agent selects actions in its environment and observes the outcome, recording at each time step t its starting state s_t, selected action a_t, resulting reward r_t and resulting state s_t+1. The agent builds up a matrix Q of Q value estimates for every state-action pair:

$$\left[\begin{array}{cccc}{Q}_{{s}_{1},{a}_{1}}&{Q}_{{s}_{1},{a}_{2}}&\cdots &{Q}_{{s}_{1},{a}_{A}}\\ {Q}_{{s}_{2},{a}_{1}}&{Q}_{{s}_{2},{a}_{2}}&\cdots &{Q}_{{s}_{2},{a}_{A}}\\ \vdots &\vdots &\ddots &\vdots \\ {Q}_{{s}_{S},{a}_{1}}&{Q}_{{s}_{S},{a}_{2}}&\cdots &{Q}_{{s}_{S},{a}_{A}}\end{array}\right]$$

(1)

corresponding to the future discounted expected reward, i.e. the temporal difference between the current state and the reward state. These Q value estimates are used to guide actions to maximise reward. At each time step t, the Q value for the state-action pair observed is updated by:

$$Q({s}_{t},{a}_{t})\leftarrow (1-\alpha )\cdot Q \, ({s}_{t},{a}_{t})+\alpha \cdot ({r}_{t}+\gamma \cdot \max Q \, ({s}_{t+1},a))$$

(2)

where α ∈ (0, 1) is a learning rate parameter which determines the degree to which new information overrides old information, and γ ∈ (0, 1) is a discount parameter which determines the importance of long-term gains.

**Fig. 2: Example of model prediction for one trial, t = 100, in which rat H had most recently visited the high-probability arm (s = high) and chose the mid-probability arm (a = mid).**

In this task, entries into a chosen arm (and arrival at the goal location at the end of the arm) were modelled as actions, while the arm entered on the previous trial, on which reward probabilities were contingent, were modelled as states. Each trial, therefore, gave rise to one state-action transition out of nine possible state-action pairs.

For each trial, a matrix of Q values for all state-action pairs was updated based on experience and used to calculate predicted action probabilities, which were compared to the observed frequencies of state-action pairs to produce a vector of errors for the three available actions. An error score was calculated from the summed square of the error vector, weighted by the prevalence of the state. This produced a measure of how reliably the Q value estimates predicted behaviour (Fig. 2; see ‘Methods’).

Observed action frequency correlated well with predicted action probabilities (Fig. 3a), indicating a good baseline model for reinforcement learning. Predicted action probabilities were binned in 100 percentile-bins for each animal, and for each bin the average frequency of these actions occurring was compared to the average predicted probability, resulting in a strong correlation (R² = 0.87, p < 0.0001, linear mixed-effects model). While individual rats alternated between arms on 94–96% of trials, the Q-learning agents fitted to each rat’s behaviour alternated between arms on 92–95% of trials.

**Fig. 3: Goodness of fit of the optimised Q-learning parameters, with no replay.**

The error between predicted action probability and observed action frequency spanned a large range, which was greatest in the earlier training sessions and diminished towards 0 for later training sessions as Q values were learned (Fig. 3b; early trials in blue have larger errors).

Error scores spanned a different range for each rat (Fig. 3c), so all further analysis was performed on error scores normalised by the mean for each animal. On this measure, normalised error was similarly highest in early training sessions, when behaviour is least optimal and most unpredictable. Following this, error became consistently low for most sessions (Fig. 3d), confirming a consistent fit with behaviour which captured the learning process over multiple sessions and changes in reward probabilities.

As described in Methods, the error score was used as the cost function to optimise three parameters in the Q-learning algorithm for each animal: a learning rate α, a discount factor γ, and an exploration factor ϵ. The resulting optimised parameter values are shown in Table 1. A perturbation analysis was performed to verify that the Q-learning results were sufficiently insensitive to perturbations to the optimised parameter values. At the optimised values, the average normalised error over all trials was, by definition, 1. Perturbing these values by up to 50% in either direction increased the normalised error by less than 0.5 in most cases (Fig. 3e), indicating that error score was not overly sensitive to small changes in parameter values. This confirms that the optimised models converged to a stable minimum that robustly captures rats’ behaviour.

Table 1 Optimised parameter values for Q-learning algorithm trained on each animal’s behavioural data

Full size table

The model makes a simplifying assumption of stationary parameters throughout learning, which may deviate from biological reality⁴⁸ but prioritises interpretability of the fitted parameter values and prevents overfitting to an overly complex model.

In summary, the Q-learning algorithm proved able to recapitulate rat behaviour over the course of training and adaptation to new task conditions. The model was robust across a range of parameter values and established a sound basis on which to quantify the effects of simulating replay by updating Q values between sessions.

Adding RPE-biased replay to the Q-learning model improved prediction accuracy over reward-biased and random replay

Against the baseline of no-replay, a variant of the Q-learning algorithm with replay was trained on the same data, with a specified number of samples chosen from all the trials experienced so far to be replayed between each session. Q-learning parameters were optimised for a fixed (1 ≤ n ≤ 100) number of replay events between each session, for each replay policy. All trials experienced by the animal were stored in a memory buffer, and for each replay event a state-action pair was chosen according to the replay policy and a sample trial from this state-action pair was used to update its Q value (Fig. 4). The policies were defined as follows:

With a random replay policy, all state-action pairs that had been experienced were sampled at random.
With a reward-biased replay policy, state-action pairs were sampled in proportion to their Q values, so that state-action pairs at which rewards had been experienced most frequently would be replayed most.
With an RPE-prioritised replay policy, the state-action pair with the highest recent average RPE was sampled.
With an RPE-proportional replay policy, state-action pairs were sampled in proportion to their recent average RPE.

**Fig. 4: An example of 10 trials and how they are prioritised for replay according to the four replay policies.**

The latter two policies offered two variations on preferentially updating state-action value(s) which had generated the greatest errors, concentrating efforts on correcting the most inaccurate expectations of reward (Fig. 4).

Compared to the no-replay Q-learning baseline, only replay which prioritised the highest-RPE state-action pair produced a more reliable model of learning (Fig. 5a; purple; linear mixed-effects model, two-sided), which was statistically significant even with one sample replayed between sessions. RPE-proportional replay produced a model which was numerically better but did not reach statistical significance (Fig. 5a; orange), while replay that was random or biased by reward did not produce a more reliable model (Fig. 5a; blue and green). Replay of information encoded during trials associated with the most unexpected outcomes therefore significantly improved learning in the model, whereas replay of rewarded trials did not. This was true for all subjects: for 4 out of 6 rats the RPE-prioritised replay policy gave the lowest error, and for 2 out of 6 rats the RPE-proportional policy gave the lowest error (at 100 samples replayed for each policy).

**Fig. 5: Prediction accuracy of Q-learning model with four alternative replay policies.**

The superiority of the RPE-prioritised replay policy was not uniform over the whole training period, however. With 100 replayed samples, all replay policies showed some modest improvement over no-replay in early sessions (Fig. 5c), but this effect disappeared in the random and reward-biased policies after roughly the seventh session. Conversely, the superiority of RPE-prioritised replay persisted over the whole course of learning. In the no-replay baseline, error scores increased in sessions 17–20. This reproduces an increase in optimal behaviour in these sessions during the revaluation stage and reversal stage respectively, suggesting that the model failed to capture subtleties in the learning pattern at these points when animals were adapting their behaviour to changes in reward probabilities. As animals re-evaluated the state-action pairs in sessions 17–20 and adjusted their behaviour accordingly, replay by any policy was sufficient to overcome the increase in error scores seen in the baseline, so there was no increase at these sessions (Fig. 5c). This may reflect the faster learning enabled by replaying recently experienced trials. However, as animals reversed their behaviour in session 22, requiring a substantial update to Q values and a dramatic change in behaviour, increased random replay or reward-biased replay did not improve error scores. Figure S1 shows an example of how Q values were updated more rapidly with RPE-prioritised replay than random or reward-biased.

RPE-biased replay did not improve predictions when trained on shuffled data

Given the indication that replay might play different roles in different learning stages, it is important to control for the possibility that parameter values were optimised for the general statistics of rewards and actions in the task, rather than truly modelling the learning curve. Otherwise, the apparent superiority of RPE-biased replay may result from anomalous irregularities in the learning patterns and not true cognitive processes. Therefore, the same algorithms were trained on shuffled behavioural data in which the order of trials was randomly permuted 1000-fold. This preserved the average frequency of state-action pairs and their associated rewards, as well as the lengths of training sessions, but altered the learning curve including revaluation and reversal learning.

Overall, the errors for Q-learning with no replay were lower for shuffled data than real data, because shuffled behaviour was necessarily more consistent over time and therefore more predictable. Similarly to real data, error decreased sharply in early training sessions before reaching an asymptotic level (Fig. 6), because Q values in early training sessions were distorted by unrepresentative rewards as a result of a small sample size of trials experienced. Unlike real data, the approach to asymptotic error was smooth and nearly monotonic.

**Fig. 6: Prediction accuracy of Q-learning model with five alternative replay policies on shuffled data.**

Crucially, compared to the no-replay baseline, none of the replay policies improved error scores. This confirms that the improvement in error in the real data is a result of better predictions of the learning process, and not better convergence to general statistics in the task.

Replay-biased RPE was the best predictor for all state-action pairs

We next accounted for the skew in training data towards the state-action pairs that were chosen most frequently. The transition from the high-probability arm to the mid-probability arm and vice versa (as they were in the initial and revaluation learning stages) were the most commonly experienced state-action pairs, representing 42% of trials overall, and the error was weighted by the frequency of each state such that errors in the more common states contributed more to the overall error than errors in the less common states. We therefore confirmed that Q-learning with RPE-biased replay learned to correctly predict all actions and not just the more-frequently chosen actions to which the cost function was skewed.

Figure S3 shows the improvement in error scores for each replay policy over no-replay baseline, for each state-action pair separately. Despite the skew in training data, the RPE-biased replay policies outperformed random and reward-biased replay policies for every state-action pair, although the improvement was not identical in each case. Nevertheless, the broad conclusion can be reached that RPE-biased replay policies better predicted learning than either no-replay, random replay or reward-biased replay for all state-action pairs.

A subpopulation of ventral striatal units encodes reward information

RPE signals have been hypothesised to be generated by the hippocampus—striatal—VTA dopaminergic circuit, in which states are encoded by the hippocampus, reward predictions are generated in the ventral striatum, and RPE signals are computed by the VTA and broadcast back to the hippocampus and neocortex, potentiating synapses and offering a mechanism by which RPE might influence plasticity and learning^49,50,51,52. The results of the modelling suggest that replay between sessions is influenced by such RPE signals, and should be observable in the single-unit activity in this circuit during post-task rest.

To test this, a separate cohort of three rats was trained on the same task for 17–20 sessions each, and implanted with silicon probes in both dorsal CA1 and ventral striatum enabling recording of extracellular unit activity during learning and for pre- and post-task rest periods. Rats underwent 12–15 sessions of an initial learning stage with reward probabilities of 87.5%, 50% and 12.5% on high-, medium- and low-probability arms respectively, followed by 5 sessions of reversal learning stage in which the reward probability of the high and low arms was swapped. Rats reached a greater-than-chance rate of optimal arm selection by day 5. A total of 617 CA1 units and 1406 striatal units were recorded, after excluding those with low isolation distance, and those from sessions where video tracking data of the animal’s movement was unsuccessful.

Cells in the ventral striatum have previously been reported to encode many elements of behaviour, including upcoming action choice, predicted action outcome, current action, reward and RPE⁵³. To compare with previous studies, striatal cells were divided into reward-modulated and non-reward-modulated by combining all trials in a given session and assessing whether firing rate varied significantly in 250 ms bins from the period −1 to +1 s around arrival at the reward location, compared to control time bins. A subset of striatal units, 232 of 1406 (17%) of the total, or 12.7–29.8% per rat, were categorised as reward-modulated according to this metric, similar to values reported previously (e.g. ref. ⁵⁴).

Trials typically consisted of two self-initiated runs separated by an imposed 5-s delay period: first towards the central platform, and second from the central platform to the reward location (Fig. 7a). Population activity in both CA1 (Fig. 7b) and ventral striatum (Fig. 7c) increased on approach to the reward location more markedly than on the approach to the central platform, indicating that activity in both areas was modulated by anticipation or prediction of immediate reward, not simply reflecting running behaviour. This is consistent with previous findings of ramping increases in ventral striatal firing rate on the approach to expected reward⁵⁵.

**Fig. 7: Electrophysiological data from CA1 and ventral striatum (vStr).**

Significant reactivation of intra-region and inter-region unit pairs in post-task rest

Previous studies have found significant reactivation of correlated activity in spatial tasks during post-task rest, both within the ventral striatum and between hippocampus and ventral striatum^25,54,56,57. To confirm whether there was significant reactivation during post-task rest in these experiments, correlations between cell pairs were assessed during the TASK, PRE-task sharp-wave ripple periods and POST-task sharp-wave ripple periods to calculate the percentage of variance in POST correlations that could be explained by RUN correlations, controlling for PRE correlations. This approach was based on the explained variance (EV) metric also used by ref. ⁵⁴ for hippocampal-striatal cell pairs, and²¹ for other hippocampal-subcortical reactivation. Pooling across all 45 sessions from all rats, for pairs of CA1-CA1 cells there was an overall average EV of 0.24 and reverse explained variance (REV) of 0.17 (t(42) = 1.79, p = 0.0400, one-sided paired t-test). EV and REV values were 0.32 and 0.10 (t(23) = 6.33, p < 0.0001, one-sided paired t-test) for striatal-striatal cell pairs, and 0.09 and 0.04 (t(41) = 2.84, p = 0.0035, one-sided paired t-test) for CA1-striatal cell pairs (Fig. 7d). Therefore CA1-CA1, striatal-striatal and CA1-striatial cell pairs showed significantly larger EV values compared to REV values, indicating TASK-dependent patterns of coactivity during POST, i.e. reactivation, both within and between brain regions.

Reactivated cell pairs encode reward prediction

To interrogate the behavioural salience of the task-dependent reactivation implied by the EV analysis, we assessed the contributions of individual cell pairs and their behavioural correlates (see ‘Methods’).

We restricted the analysis to sessions in the initial learning stage when performance was significantly above 33% chance rate: at this level of performance, rats had acquired an association of higher reward probability or value to the high-probability arm than the medium-probability arm, which we refer to as reward prediction. CA1-striatal cell pairs were ranked according to their drop-one-cell-pair-out contribution to the session’s EV-REV reactivation metric (see ‘Methods’; c.f.²¹), and the cell pairs with contributions in the highest decile and firing rate correlations higher during POST than PRE were labelled as reactivated cell pairs. Cell pairs with contributions in the smallest decile were used as a control population. Among 163 cell pairs classified as reactivated, 52 (31.9%) comprised a reward-modulated striatal cell, compared to 50 out of 360 (13.9%) of non-reactivated cell-pairs, indicating a preference for reactivation of reward-related information between hippocampus and ventral striatum (χ²(1) = 23.2, p < 0.0001, χ² test), consistent with previous observations^18,54.

We used the times during the TASK period when these cell pairs were coactive to indicate the behavioural correlates of the reactivation: for each cell pair, the binwise minumum of their firing rates was calculated to create a measure of their coactivity (Fig. 7e–g). The z-scored coactivity averaged across medium-reward-expectation trials (both rewarded and unrewarded) showed a ramping up towards the point of arrival at the reward location that was stronger in the reactivated cell pairs than the control cell pairs (Fig. 8a). Z-scored coactivity averaged across high-reward-expectation trials showed a similar pattern, but with a higher peak just before arrival. A mixed-effects ANOVA comparing the peak coactivity for 163 reactivated versus 360 control cell pairs on high- versus medium-expectation arms showed a significant interaction effect between cell-pair type and trial type (F(1) = 12.6, p = 0.0004, two-sided; 8b). This effect was in addition to significantly greater coactivity of reactivated cell pairs for each trial type individually (F(753) > 2.4, p < 0.0001, post-hoc two-sided t-tests; Fig. 8b). A similar pattern was found for coactivity on rewarded trials only (Fig. S4). Thus, pairs of CA1 and ventral striatal cells displaying a higher degree of reactivation in post-task rest appear to be involved in encoding the anticipation of reward, and its expected probability, rather than reward outcome or error.

**Fig. 8: Coactivity of CA1-striatal cell pairs around the time of approach to reward location.**

We then performed the same analysis for within-striatum reactivation: pairs of striatal-striatal cells were divided into reactivated and non-reactivated according to their contribution to the overall EV-REV metric for within-striatum reactivation. On rewarded trials, the reactivated pairs’ z-scored coactivity showed a similar ramp up in anticipation of reward, plus a subsequent increase in coactivity in the 5 s following reward delivery on the medium-reward-expectation arm (i.e. corresponding to high, positive reward-prediction error) that was not present in the 5 s following reward delivery on the high-reward-expectation arm (i.e. corresponding to low, positive reward-prediction-error; Fig. 8c). This was confirmed by a mixed-effects ANOVA comparing the peak coactivity for reactivated versus control cell pairs in the 5 s following reward delivery on high- versus medium-expectation rewards, which shows a significant interaction effect between cell-pair type and trial type (F(1) = 8.6, p = 0.0035, two-sided; Fig. 8d). In contrast to pairs of CA1-ventral-striatal cells’ reactivation of reward-prediction signals, striatal-striatal cell pairs therefore showed preferential reactivation of RPE signals.

Discussion

We trained rats on a reinforcement learning task designed to dissociate reward outcome (presence or absence of reward) from reward prediction error (RPE; an unexpected reward or absence of reward) on each trial. Training variations of a Q-learning reinforcement learning model to predict behaviour on the task revealed that Q-learning with replay prioritised by RPE was the best predictor of learning. Consistent with this, we found that pairs of CA1-ventral striatal cells which are the most strongly reactivated during post-task rest encode reward prediction, ramping up to the point of reward delivery, while pairs of ventral striatal cells encode RPEs, being more strongly coactived following less certain reward.

Our first main result was that Q-learning can model rats’ learning of the stochastic reinforcement learning task, producing low reliability-errors when trained on rats’ behaviour and predicting the likelihood of actions on each trial. This is consistent with other studies showing that Q-learning can predict behaviour in a range of tasks in rodents monkeys and humans³⁴. Given this result, we then hypothesised that adding replay to the Q-learning model between sessions might better reflect learning and therefore better predict behaviour. However, a policy of replaying state-action pairs randomly did not produce lower errors overall, indicating a poor model of the cognitive processes underlying reinforcement learning. Similarly, biasing replay by sampling from state-action pairs which had produced the largest recent reward did not produce lower errors relative to no-replay.

In contrast, biasing replay by sampling from state-action pairs which had produced the largest recent RPE decreased reliability errors, demonstrating that the cognitive processes involved in the learning of this task are influenced by offline activity that takes place between sessions biased by RPE. This result did not hold when training data was shuffled, demonstrating that the influence of RPE is a feature of the learning process and not an epiphenomenon resulting from the general statistics of behaviour. Moreover, the result did hold for all state-action pairs, despite the overrepresentation in training data of those most frequently experienced. This gives credence to the notion that the Q-learning model with replay biased by RPE is a good overall model of state-action values held by the brain and offers a viable means to extend hippocampus-based models of replay’s contributions to spatial memory⁵⁸.

Performance on memory tasks has widely been found to improve following a period of sleep^59,60,61, associated with replay of activity which codes recent experiences during hippocampal sharp-wave ripples³. Associations between spatial location and reward or action values are encoded in the ventral striatum, which receives direct inputs from dorsal CA1 whose activation after learning is required to consolidate spatial memories^62,63. The modelling results predict post-task reactivation of such connectivity within the hippocampal-striatal network to induce long-term potentiation at the synapses active during replay. Accordingly, we found reactivation in hippocampal-striatal cell pairs, with an increase in cell-pair coactivation particularly for cell pairs whose coactivity was higher on the approach to high-probability rewards than medium-probability rewards. We also found reactivation in striatal-striatal cell pairs, with an increase in coactivation for pairs whose activity was higher following less-expected reward than more-expected reward. These represent a reward-prediction signal and reward-prediction-error signal, respectively, consistent with Q-learning, supporting the hypothesis that hippocampal replay modulates the midbrain circuit responsible for updating reward predictions and RPEs. The reactivated hippocampal-striatal cell pairs showed a ramping pattern on the approach to reward location, which has been shown to reflect a dopaminergic RPE signal. While various studies report projections from hippocampus to ventral striatum, there are no known projections from ventral striatum to hippocampus⁶⁴, which implies that this coactivation during learning and reactivation during post-task rest are both driven by the hippocampus, perhaps as part of a broader network incorporating other brain areas including VTA and prefrontal cortex. Being limited to these particular recording areas gives a narrow view of the possible physiological implementations of the modelling results, and cannot serve as direct tests of the competing hypotheses which could rely on unobserved parts of the circuit. We therefore propose that post-task replay underlies the RPE-biased offline updating of state-action values which influenced reinforcement learning in this task.

The apparent dual computational function of reactivation between and within brain areas likely reflects the distributed nature of reinforcement learning in the hippocampal-striatal-VTA circuit. Similar simultaneous but distinct replay patterns have been observed between the hippocampus and entorhinal cortex⁶⁵, and between hippocampus and prefrontal cortex⁶⁶. Further investigation of how hippocampal-hippocampal, hippocampal-striatal and striatal-striatal replay events are temporally or computationally related would be valuable for elucidating how offline activity influences learning processes. One interpretation of the electrophysiological results here is that hippocampal-striatal reactivation is biased by reward prediction to reinforce the learned Q values, while striatal-striatal reactivation is biased by RPEs to update the Q values. Another interpretation is that striatal-striatal reactivation follows the RPE-biased sample selection predicted by our modelling, while hippocampal-striatal reactivation follows a policy-biased (replaying the most likely upcoming paths)⁶⁷ or experience-biased (replaying the most frequently experienced paths)²⁴ sample selection.

The suggestion that hippocampal replay might be biased by RPEs differs from the commonly held view that replay is biased by reward itself^{4,19,20,68,69,70}. However, the studies on which this conclusion is based generally do not use tasks which explicitly dissociate reward from RPE, so these results in the literature are not inconsistent with our suggestion that RPE biases replay.

Despite the prevalence of the idea that reward biases replay, our alternative theory that RPE biases replay fits better with existing research into the roles of dopamine. Dopaminergic projections from the VTA to CA1 in the hippocampus have been found to modulate both replay during sleep following exposure to a novel environment, and subsequent memory performance in the same environment⁷¹. It is suggested that dopaminergic neuromodulation might tag synapses by upregulating plasticity-related proteins, causing long-lasting potentiation which allows the stabilisation of the memory trace during subsequent sleep and rest^72,73. Phasic dopaminergic inputs to the hippocampus are triggered not only in response to novelty, but also in the context of reward⁴⁹, offering a likely mechanism by which reward-related information might influence replay. Indeed, replay has been found in reward-related VTA cells^74,75, confirming the involvement of the full hippocampal-striatal-VTA loop in post-task reactivation.

Several studies have expressly linked replay to reward, ostensibly in contrast with our results, but often RPE is a confounding factor in these which cannot be discounted. In humans, high monetary reward (but not low monetary reward) is linked to sleep-dependent improvements in associative memory^76,77; in these human studies, RPE was not estimated but would presumably be higher overall in the high-reward than low-reward condition, conflating reward-dependent effects with RPE-dependent effects. In rodents, newly-rewarded behaviour has been associated with replay more than behaviour which had been rewarded in previous sessions¹⁹; the authors attributed this replay bias to novelty, but it is also consistent with increased RPE when new behaviours are rewarded for the first time. Moreover, following extended reinforcement of both behaviours, the replay bias for the newly-rewarded behaviour was eliminated. In a third study, results were more mixed: following an increase in reward magnitude at one end of a linear track, there was more replay associated with the larger-magnitude end than the unchanged-magnitude end, correlated with both reward and RPE⁶⁸. However, following an elimination of reward at one end, there was a reduction in replay following a reduction in reward despite the increase in RPE. This is more consistent with reward-biased than RPE-biased replay, although the authors noted a rebound effect when the eliminated reward was reinstated: greater replay was found at the reinstated-reward end than the unchanged-reward end, despite identical reward magnitudes. This leaves open the possibility of bias by positive over negative RPEs. A fourth study found more replay of large-reward-related activity than small-reward-related activity on a maze task¹⁶, but because reward was received on every trial analysed, any effects of reward magnitude are conflated with positive RPE.

Conversely, the specific case for RPE-biased replay is supported by findings that neural sensitivity to RPEs in humans predicts the amount of awake replay during a reinforcement learning task, and replay amount correlated with subsequent performance in a task requiring behavioural flexibility⁷⁸.

In addition to human and rodent studies, findings from the literature on machine learning show some consistency with our results. A number of machine learning studies have found that storing new information in memory buffers and sampling from it at regular intervals, similar to hippocampal replay, can speed up learning^47,79,80,81 and more so when replay is biased by prediction errors^82,83. RPE-biased replay may therefore represent an adaptive focus whereby resources are focused on areas of a cognitive model which needs updating^84,85,86.

We do not claim that this tells the whole story: RPE is highly unlikely to be the only factor that biases replay and the phenomenon is likely to be much more multifaceted than this model suggests. First, phasic dopamine signalling to hippocampus may encode other kinds of prediction errors or aspects of reward to which the VTA is sensitive^{87,88,89,90,91,92}, and bias replay by the same mechanism. Reward itself may bias replay, especially if positive RPEs influence replay more than negative RPEs; there is also evidence that novelty^93,94, the expectation of reward⁷⁰, frequency of experience⁹⁵ and strength of encoding⁹⁶ bias replay too. Furthermore, in addition to aiding reinforcement learning, replay has been associated with other memory-related functions including planning^5,97, processing of emotional memories⁹⁸, creative problem-solving⁹⁹ and generalising from episodic memories to abstractions^7,100, all of which are likely to necessitate some biasing of replay distinct from RPEs. In sum, while we fully expect replay to be more complex, we have focused on one facet with important neurobiological foundations.

Our model assumes that a cache of all experience is stored from which to be sampled, which is expensive and unrealistic at large scales. This may not be necessary if memory for individual trials is gradually forgotten and subsumed into cortical long-term memory, for example over the course of hours over which cell assembly activation decays¹⁰¹.

Finally, this model leaves open some questions. Although the role of post-task VTA activity in influencing future reward-related behaviour has been demonstrated previously^75,102, it remains unclear how this part of the hippocampal-striatal-VTA loop contributes to replay in this task. There is also an open question about possible diverging roles of replay during behaviour compared to prolonged rest and sleep. Here, we have considered replay between sessions, which is likely to take place at least partly during sleep; but replay during wake has also been shown to be necessary for learning¹⁵.

In summary, we found that a Q-learning-based reinforcement learning model which assumes offline updates between sessions is a better predictor of learning behaviour than one which does not assume offline updates. Specifically, this is true when updates are prioritised according to experiences that have recently elicited high RPEs, and not when they are prioritised according to reward or random recent experiences. Activity reflecting reward-prediction signals in the CA1-ventral-striatal network and RPEs in the striatal network is reactivated, demonstrating a mechanism by which state-action values across hippocampus and striatum may be updated offline. This finding offers a refined interpretation of how offline activity during rest and sleep might aid reinforcement learning, in terms of RPE rather than solely reward.

Methods

Behavioural task

All procedures were performed in accordance with the United Kingdom Animals (Scientific Procedures) Act 1986 and European Union Directive 2010/63/EU and were reviewed by the University of Bristol Animal Welfare and Ethical Review Board.

Six adult male Lister hooded rats in the first cohort (weighing 260-330g) and three adult male Lister hooded rats in the second cohort (weighing 300–430 g, Charles River Laboratories, UK) were individually housed with environmental enrichment, and food-restricted to no less than 85% of their pre-restriction body weight. Following habituation to the recording room, they were trained during the light part of a 12:12 light/dark cycle to forage on a 3-armed radial maze for liquid sucrose rewards in a dimly-lit room. The maze consisted of a raised central platform 25cm in diameter, with three arms (60 cm × 7 cm) protruding from it (Fig. 1a). Arms were separated from the central platform by inverted-guillotine pneumatic doors, which raised to block access to the arms, and fell below the maze floor to allow access. Turning zones (10 cm × 10 cm) with lick ports were positioned at the end of each arm, at which 20% sucrose solution rewards were delivered. Door movements and reward delivery were operated automatically according to the animal’s position, tracked using a webcam mounted above the maze, using custom MATLAB (The MathWorks) code. Following at least three days of habituation to the recording room and maze-operation sounds, each animal performed 17–22 once-daily training sessions, between 5 and 7 days per week, lasting 1 h each.

Trials began when a rat entered, or was placed by the experimenter on, the central platform with all doors closed. Doors opened following a 5-s delay period. When the animal reached the lick port, reward was probabilistically delivered or withheld, and doors to the other two arms were closed; the third door was closed when the animal re-entered the central platform to begin a new trial.

Each arm was assigned as either high probability, mid probability or low probability, which determined the protocol for reward delivery. These assignments remained fixed throughout training for each animal, but were counter-balanced between animals. The cohort of rats on which the behavioural model was fit underwent three learning stages with three sets of reward probabilities. In the initial learning stage, sessions 1–15, the high-probability arm delivered a reward on 6 out of 8 (75%) legitimate entries to the arm, the mid-probability arm on 4 out of 8 (50%), and the low-probability arm on 2 out of 8 (25%). A legitimate entry was one in which a different arm had been entered on the previous trial; entering the same arm twice in a row was incorrect and did not result in a reward delivery. In the revaluation stage, sessions 16–20, the reward probabilities for the high- and low-probability arms were amplified: reward was delivered on 7 out of 8 (87.5%) and 1 out of 8 (12.5%) legitimate entries respectively. In the reversal learning stage, sessions 21–22, the reward probabilities for the high- and low-probability arms were switched, such that the (formerly) high- and low- probability arms delivered reward on 1 out of 8 (12.5%) and 7 out of 8 (87.5%) of legitimate entries respectively.

The cohort of rats from which hippocampal and striatal activity was recorded underwent just one change in reward probabilities. In the first 12-15 sessions, the high-probability arm delivered a reward on 7 out of 8 (87.5%) legitimate entries to the arm, the mid-probability arm on 4 out of 8 (50%), and the low-probability arm on 1 out of 8 (12.5%). In the remaining 5 sessions, the reward probabilities for the high- and low-probability arms were switched.

For this cohort, training sessions were flanked by rest sessions in the home cage of ~2 h before and after training.

Q-learning

We trained several variations of a Q-learning algorithm on the behavioural data to predict choices of which arm would be entered on each trial. Q-learning is a reinforcement learning algorithm developed for Markov decision processes in which an agent selects actions in its environment and observes the outcome, recording at each time step t its starting state s_t, selected action a_t, resulting reward r_t and resulting state s_t+1. The agent builds up a matrix Q of Q value estimates for every state-action pair (1) corresponding to the future discounted expected reward, i.e. the temporal difference between the current state and the reward state. These Q value estimates are used to guide actions to maximise reward. At each time step t, the Q value for the state-action pair observed is updated by (2), where α ∈ (0, 1) is a learning rate parameter which determines the degree to which new information overrides old information, and γ ∈ (0, 1) is a discount parameter which determines the importance of long-term gains.

In this task, entries into a chosen arm (and arrival at the goal location at the end of the arm) were modelled as actions, while the arm entered on the previous trial, on which reward probabilities were contingent, were modelled as states. Each trial therefore gave rise to one state-action transition out of nine possible state-action pairs. Actions were selected according to probabilities p_a for each action a, determined by Q values and an exploration-exploitation parameter epsilon:

$${p}_{a}=\frac{{e}^{\epsilon {Q}_{s,a}}}{{\sum }_{a=1}^{3}{e}^{\epsilon {Q}_{s,a}}}$$

(3)

To reflect rats’ natural tendency to alternate between options, Q values were initialised before learning to:

$$\left[\begin{array}{ccc}0&0.7&0.7\\ 0.7&0&0.7\\ 0.7&0.7&0\end{array}\right]$$

(4)

Q-learning with replay

We used four variants of Q-learning in which additional nominal offline updates are performed between performed online trials, based on sequences already experienced, to boost learning. This has the effect of learning from several trials per actual trial of experience, and is similar to the Dyna-Q algorithm which has been shown to speed up learning compared to Q-learning alone¹⁰³ in a manner which may underlie the function of hippocampal replay⁴⁴. Generally, sequences are selected randomly from a memory buffer of recently-acquired experiences, without bias towards any trial or type of trial. Given the observed bias reported in the literature towards salient experiences, such as those rewarded or aversive, we modified Dyna-Q to perform updates only between sessions and to reflect hypothesised biases in four different ways.

Parameter-fitting

Parameter-fitting for Q-learning

First, a Q-learning algorithm (without replay) was trained, to obtain a baseline score against which various replay policies could be compared. Q values were stored for each state-action pair on the task, and updated according to each animal’s experience. A state s_t was defined as the arm visited on the previous trial t − 1, and an action a_t was defined as the arm chosen on the current trial t. Following each trial of an animal’s training, the Q value Q(s_t, a_t) was updated according to the reward received, r ∈ {0, 1} by equation (2), and Q values were transformed into a forecast probability of choosing each arm on the subsequent trial.

The learning rate α, discount factor γ, and exploration factor ϵ were free parameters that were tuned to each rat, using the following optimisation procedure. Here we used an error score adapted from the reliability component of ref. ¹⁰⁴ and generated based on the forecast probabilities of all trials, to quantify the consistency of the forecast probabilities with the animals’ behaviour. The mean observed frequency was calculated for each state-action pair, i.e. the proportion of trials on which a given action was chosen in a given state, and the error score R_t for a given trial t was calculated according to:

$${R}_{t}={n}_{{s}_{t}}\cdot \mathop{\sum }_{a=1}^{{n}_{a}}{({p}_{a}-{o}_{{s}_{t},a})}^{2}$$

(5)

where s_t is the animal’s state on trial t, ${n}_{{s}_{t}}$ is the number of trials on which the animal was in state s_t, n_a is the number of possible actions (3) p_a is the forecast probability for entering arm a, and o_s,a is the mean observed frequency of state-action pair s, a.

Parameter optimisation was performed using Bayesian adaptive direct search (BADS)¹⁰⁵, with the error score averaged over 25 runs with different seeds used as the objective function to reduce its stochasticity. Analyses were performed on the average error over 1000 runs with seeds separate from those used during parameter optimisation, using the resulting parameter values.

Parameter-fitting for Q-learning with replay

Against the baseline of no-replay, the same optimisation procedure was performed with increasing amounts of replay according to four replay policies. Following each session, a specified number of samples were chosen from all the trials experienced so far. How the samples were selected depended on the replay policy (detailed below); a probability P(s, a) was assigned to each state-action pair to determine which pair to sample from. From the chosen state-action pair, a sample trial was chosen according to the probability P(i) in which a recency parameter ensured that more recent trials were exponentially more likely to be chosen. Q values were then updated according to the state, action and reward of the sampled trial, in the same manner as so-called online Q value updates described in equation (2).

Each replay policy required the same three parameters to be optimised as in Q-learning without replay, plus additional parameters for recency and/or RPE-weighting. Table 2 shows the number of free parameters for each replay policy.

Table 2 Number of free parameters for each replay policy

Full size table

These were optimised according to the same procedure as for Q-learning with no replay, described above, for n = {1, 3, 5, 10, 15, 20, 30, 40, 50, 75, 100} replay events between each session, resulting in 11 sets of parameter values for each replay policy and each animal. Comparing this to plausible quantities of replay events in animals is not trivial, but studies in which discrete replay events are enumerated report 100–200 bursts of hippocampal activity that can be statistically related to prior experience, over the first 1 or 2 h after experience^16,106. Separately, reactivation of cell pairs has been found to decay to baseline well within that time period following exposure to familiar environments¹⁰¹, so the first 1–2 h is likely to be when most replay of recent experience in a familiar environment occurs.

Random replay

Random replay, biased by nothing but the recency of an action, was included as a control. For each replay event, a state-action pair was chosen at random out of all state-action pairs experienced so far:

$$P(s,a)=\frac{1}{{n}_{sa}}$$

(6)

where n_sa is the number of state-action pairs experienced (up to 9). The subset of trials experienced, i ∈ (1, I), which represented this state-action pair were ordered chronologically, and the probability P(i) of a trial i being replayed was determined according to a recency parameter φ:

$$P(i)=\frac{{i}^{\varphi }}{\mathop{\sum }_{j=1}^{I}{j}^{\varphi }}$$

(7)

Reward-biased replay

Reward-biased replay represents the predominant interpretation of how reward influences replay^69,107. For each replay event, a state-action pair s, a was chosen probabilistically in proportion to its Q value:

$$P(s,a)=\frac{Q(s,a)}{\mathop{\sum }_{s=1}^{{n}_{s}}\mathop{\sum }_{a=1}^{{n}_{a}}Q(s,a)}$$

(8)

The subset of trials experienced which represented the chosen state-action pair were ordered chronologically, and determined according to equation (7).

RPE-prioritised replay

RPE-prioritised replay represents the policy of replaying trials associated with the most surprising outcomes, i.e. where the difference between expectation (Q values) and experience (reward) was greatest. For each trial t, RPE was calculated as the difference δ between actual reward and expected reward:

$${\delta }_{t}=r+\gamma \cdot Q \, ({s}_{t+1},{a}^{{\prime} })-Q \, ({s}_{t},{a}_{t})$$

(9)

where ${a}^{{\prime} }$ is the action with the highest Q value in state s_t+1.

For every trial i ∈ (1, I) which was an example of a given state-action pair, its absolute value was weighted, determined by a parameter ψ raised to the power of its recency i:

$${\Delta }_{i}=| {\delta }_{i}| \cdot {\psi }^{i}$$

(10)

The weighted RPEs, Δ, were then averaged to produce an overall weighted-average RPE, ${\overline{\Delta }}_{s,a}$, for each state-action pair s, a, which was more heavily influenced by recent trials:

$${\overline{\Delta }}_{s,a}=\frac{\mathop{\sum }_{i=1}^{I}{\Delta }_{i}}{I}$$

(11)

The state-action pair with the highest ${\overline{\Delta }}_{s,a}$ was selected, and the subset of trials experienced which represented the chosen pair were ordered chronologically, and determined according to equation (7). Once replayed, the δ_t for the trial sampled was updated to reflect the RPE resulting from the replay event.

RPE-proportional replay

RPE-proportional replay is a variant of RPE-prioritised replay, in which state-action pairs are chosen in proportion to their weighted-average-RPE instead of choosing the pair with the highest weighted-average-RPE. The RPE was calculated according to eq. (11) and a state-action pair to be sampled from was chosen probabilistically according to:

$${p}_{s,a}=\frac{{\overline{\Delta }}_{s,a}}{\sum {\overline{\Delta }}_{s,a}}$$

(12)

The subset of trials experienced which represented the chosen state-action pair were ordered chronologically, and determined according to equation (7). Once replayed, the δ_t for the trial sampled was updated to reflect the RPE resulting from the replay event.

Shuffling procedure

As an additional control, the parameters were also optimised for shuffled data, in which trial order was randomly permuted 1000-fold. This preserved the large-scale information in the training data, such as the mean observed frequency and average rewards of state-action pairs and the number of trials in each session between replays, but disrupted the specific structure of how this information was acquired over time.

Electrophysiology

Three rats were implanted with a 9mm, 2-shank H2 silicon probe and a 9mm, 4-shank E silicon probe (Cambridge NeuroTech, UK), each with 64 recording sites, targeted at dorsal CA1 and ventral striatum, respectively. Probes were mounted on aluminium blocks (7.5 mm ×3.3 mm × 3.0 mm) and targeted at 2.1 mm lateral, 4mm posterior and 2.5 mm ventral to bregma (CA1) and 1.5 mm lateral, 1.7 mm anterior and 7 mm ventral to bregma (striatum), in the right hemisphere, based on the atlas of¹⁰⁸. Surgery was performed under isoflurane recovery anaesthesia in sterile conditions and probes cemented to the skull using Gentamycin-impregnated bone cement (dePuy CMW). A subcutaneous injection of the analgesic buprenorphine (0.05 mg/kg) was given post-surgery.

Extracellular recordings were made using an Open Ephys acquisition system at a sampling rate of 30 kHz, with two RHD2164 headstages, one with an integrated accelerometer. Recordings were referenced to a stainless steel screw implanted over the cerebellum. A red LED was attached to the implant, and the session was recorded by a ceiling-mounted webcam which allowed the rat’s movement to be tracked. Electrophysiological recordings and position tracking were synchronised post-hoc using a second LED which blinked at random intervals.

Raw data were automatically spike-sorted using Kilosort software¹⁰⁹ and manually curated using Phy (https://github.com/cortex-lab/phy). In brief, raw data were common-average referenced, high-pass filtered and whitened to remove correlated noise, before prototypical spikes were detected whenever the amplitude exceeded a given threshold. Detection and clustering of dimensionality-reduced spike waveforms were then optimised iteratively using a template-matching procedure. In the manual curation step, clusters were merged, accepted or rejected as noise by visual inspection, according to their inter-spike interval histograms, amplitude and spike waveform. Finally, clusters were restricted to those with an isolation distance of >15¹¹⁰.

Data analysis

Reward-related firing

Following⁵⁴, spike trains of ventral striatal cells were divided into 250 ms bins, centred around the time of arrival at reward location, and averaged across trials. A cell’s mean firing rate in each of the 8 bins from −1 to +1 s was compared to firing during 3 control bins using Wilcoxon’s signed rank test. Cells for which at least one bin was significantly different from all 3 control bins were classified as reward-responsive, using an alpha value of 0.05.

To analyse striatal cells’ encoding of reward expectation, binless spike trains equivalent to 50 ms bins¹¹¹ were z-scored with respect to the whole training session. Analysis was restricted to sessions in the initial learning stage in which performance was above chance and before reward probabilities changed. Cells which showed a peak firing rate in the 2-s period before arrival at reward location, before the reward outcome (reward or no reward) was known, exceeding 2 standard deviations were classified as encoding reward expectation. The same 2-s period was compared for arrival at the high-probability reward location and the mid-probability reward location, pooled across rats, using a paired t-test to test for differences in population-level firing.

Sharp-wave ripple detection

Sharp-wave ripples were detected using the SleepWalker toolbox in MATLAB (https://gitlab.com/ubartsch/sleepwalker). Hippocampal LFP was filtered at 120–250 Hz, and events were extracted when ripple power exceeded 3.5 standard deviations above the mean, and no more than 25 standard deviations. Events with a duration of 10–500 ms, an amplitude of 30–1000 μV, and separated by at least 30 ms were included as ripples.

Explained variance and reverse explained variance

To analyse ripple-related reactivation, sessions with at least 5 CA1 and 5 ventral striatal cells were included. The PRE and POST periods were restricted to concatenated windows of 200 ms from each ripple peak. Pearson’s correlation coefficients were calculated between binless spike trains equivalent to 50 ms bins in the PRE, TASK and POST periods separately and combined to create three correlation matrices. The similarity between PRE, TASK and POST was calculated by taking the correlation coefficient r between their correlation matrices⁹⁴:

$$EV={\left(\frac{\left.{r}_{TASK,POST}-{r}_{TASK,PRE}{r}_{POST,PRE}\right)}{\left.\sqrt{(1-{r}_{TASK,PRE}^{2})(1-{r}_{POST,PRE}^{2})}\right)}\right)}^{2}$$

(13)

giving a measure of the partial correlation between cell-pair coactivity during post-task ripples with that during the task, controlling for cell-pair coactivity during pre-task coactivity.

REV was calculated by exchanging r_PRE and r_POST in eq. (13).

Experience-dependent increases in cell-pair coactivity during sleep and rest

The contribution of each CA1-striatal or striatal-striatal cell pair to overall inter-region reactivation was measured by recalculating EV-REV with the cell pair removed and subtracted from the session’s overall EV-REV value. A threshold of the top decile within each session was used to classify candidate reactivated cell pairs (the analysis was also repeated for the top 5% and the top 20% with similar results). Mathematically, EV-REV can be driven by cell pairs whose correlation gets stronger from PRE to TASK and stays strong in POST, or whose correlation weakens from PRE to TASK and stays low in POST. The former could be said to carry or encode reactivated content, while the latter reflects more general network reorganisation without encoding task-relevant information. Therefore, from this top decile, only the cell pairs whose correlation increased from PRE to POST were included as reactivated cell pairs. These reactivated cell pairs were compared to the decile that had the lowest magnitude of contributions to EV-REV (i.e. closest to 0), reflecting cells pairs which did not encode reactivated content. (Similar results were obtained using the decile with the lowest signed contribution.)

Having established the reactivated and non-reactivated (baseline) cell pairs for each session, the reactivation content was identified by analysing when during the task the reactivated cell pairs were more coactive than the non-reactivated cell pairs. Coactivity was used for this measure for methodological consistency, because the (R)EV method depends on firing rate correlations between the cell pair: high EV-REV is driven by coherent fluctuations in firing rate (we ignore the possibility that synchronous decreases or pauses in firing rate might encode task-relevant information). To measure coactivity, the binless 50-ms spike trains for the two members of a cell pair were compared, and a pointwise minimum was taken between them such that if either cell had low or zero firing rate, the coactivity would be correspondingly low or zero. The coactivity was then z-scored with respect to the whole recording session to control for bias by the cells’ inherent firing rates.

Behavioural correlates of preferentially reactivated cell pairs

With the hypothesis that reactivated CA1-striatal or striatal-striatal cell pairs preferentially encoded reward prediction and/or error, coactivity was compared between reactivated and non-reactivated cell pairs and between their coactivity on high- and medium-probability arms on the approach to the reward location (CA1-striatal) or after rewarded outcome (striatal-striatal). A nested mixed-effects ANOVA was constructed with cell-pair type (reactivated or non-reactivated) and arm (high or medium) as fixed effects, cell-pair identity nested within rat identity as random effects and mean z-scored coactivity of a cell pair in the 2 s prior to arrival at the reward location (for CA1-striatal pairs) or 5 s after arrival at the reward location (for striatal-striatal pairs, on rewarded trials only) as the dependent variable. The interaction between the two fixed effects was the effect of interest, with post-hoc t-tests conducted to compare coactivity between reactivated and non-reactivated cell pairs on each arm separately.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Pre-processed data used in this study is available at https://github.com/EmmaRoscow/QlearningReplay¹¹². Source data for figures are provided with this paper. Source data are provided with this paper.

Code availability

All code used in this study is available at https://github.com/EmmaRoscow/QlearningReplay¹¹².

References

Yu, J. Y. et al. Distinct hippocampal-cortical memory representations for experiences associated with movement versus immobility. eLife 6, https://doi.org/10.7554/eLife.27621.001 (2017).
Foster, D. J. Replay comes of age. Annu. Rev. Neurosci. 40, 581–602 (2017).
Article PubMed CAS Google Scholar
Ólafsdóttir, H. F., Bush, D. & Barry, C. The role of hippocampal replay in memory and planning. Curr. Biol. 28, R37–R50 (2018).
Article PubMed PubMed Central Google Scholar
Sterpenich, V. et al. Reward biases spontaneous neural reactivation during sleep. Nat. Commun. 12, 4162 (2021).
Article ADS PubMed PubMed Central CAS Google Scholar
Pfeiffer, B. E. & Foster, D. J. Hippocampal place-cell sequences depict future paths to remembered goals. Nature 497, 74–9 (2013).
Article ADS PubMed PubMed Central CAS Google Scholar
Cairney, S. A., Durrant, S. J., Jackson, R. & Lewis, P. A. Sleep spindles provide indirect support to the consolidation of emotional encoding contexts. Neuropsychologia 63, 285–292 (2014).
Article PubMed Google Scholar
Lewis, P. A. & Durrant, S. J. Overlapping memory replay during sleep builds cognitive schemata. Trends Cogn. Sci. 15, 343–351 (2011).
Article PubMed Google Scholar
Dupret, D., O'neill, J., Pleydell-Bouverie, B. & Csicsvari, J. The reorganization and reactivation of hippocampal maps predict spatial memory performance. Nat. Neurosci. 13, 995–1002 (2010).
Rasch, B., Büchel, C., Gais, S. & Born, J. Odor cues during slow-wave sleep prompt declarative memory consolidation. Science 315, 1426–1429 (2007).
Article ADS PubMed CAS Google Scholar
Rudoy, J., Voss, J., Westerberg, C. & Paller, K. Strengthening individual memories by reactivating them during sleep. Science http://science.sciencemag.org/content/326/5956/1079.short (2009).
Antony, J. W., Gobel, E. W., O’Hare, J. K., Reber, P. J. & Paller, K. A. Cued memory reactivation during sleep influences skill learning. Nat. Neurosci. 15, 1114–1116 (2012).
Article PubMed PubMed Central CAS Google Scholar
Bendor, D. & Wilson, M. A. Biasing the content of hippocampal replay during sleep. Nat. Neurosci. 15, 1439–1444 (2012).
Article PubMed PubMed Central CAS Google Scholar
Girardeau, G., Benchenane, K., Wiener, S. I., Buzsáki, G. & Zugaro, M. B. Selective suppression of hippocampal ripples impairs spatial memory. Nat. Neurosci. 12, 1222–1223 (2009).
Article PubMed CAS Google Scholar
Ego-Stengel, V. & Wilson, M. A. Disruption of ripple-associated hippocampal activity during rest impairs spatial learning in the rat. Hippocampus 20, 1–10 (2009).
Article Google Scholar
Jadhav, S. P., Kemere, C., German, P. W. & Frank, L. M. Awake hippocampal sharp-wave ripples support spatial memory. Science 336, 1454–1458 (2012).
Article ADS PubMed PubMed Central CAS Google Scholar
Michon, F., Sun, J.-J. J., Kim, C. Y., Ciliberti, D. & Kloosterman, F. Post-learning hippocampal replay selectively reinforces spatial memory for highly rewarded locations. Curr. Biol. 29, 1436–1444.e5 (2019).
Article PubMed Google Scholar
Foster, D. J. & Wilson, M. A. Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature 440, 680–683 (2006).
Article ADS PubMed CAS Google Scholar
Lansink, C. S., Goltstein, P. M., Lankelma, J. V., McNaughton, B. L. & Pennartz, C. M. A. Hippocampus leads ventral striatum in replay of place-reward information. PLoS Biol. 7, e1000173 (2009).
Article PubMed PubMed Central Google Scholar
Singer, A. C. & Frank, L. M. Rewarded outcomes enhance reactivation of experience in the hippocampus. Neuron 64, 910–921 (2009).
Article PubMed PubMed Central CAS Google Scholar
Bhattarai, B., Lee, J. W. & Jung, M. W. Distinct effects of reward and navigation history on hippocampal forward and reverse replays. Proc. Natl. Acad. Sci. USA 117, 689–697 (2020).
Article ADS PubMed CAS Google Scholar
Girardeau, G., Inema, I. & Buzsáki, G. Reactivations of emotional memory in the hippocampus-amygdala system during sleep. Nat. Neurosci. 20, 1634–1642 (2017).
Article PubMed CAS Google Scholar
Wu, C. T., Haggerty, D., Kemere, C. & Ji, D. Hippocampal awake replay in fear memory retrieval. Nat. Neurosci. 20, 571–580 (2017).
Article PubMed PubMed Central CAS Google Scholar
Cheng, S. & Frank, L. M. New experiences enhance coordinated neural activity in the hippocampus. Neuron 57, 303–13 (2008).
Article PubMed PubMed Central CAS Google Scholar
Huelin Gorriz, M., Takigawa, M. & Bendor, D. The role of experience in prioritizing hippocampal replay. Nat. Commun. 14, 8157 (2023).
Article ADS PubMed PubMed Central CAS Google Scholar
Trouche, S. et al. A hippocampus-accumbens tripartite neuronal motif guides appetitive memory in space. Cell 176, 1393–1406.e16 (2019).
Article PubMed PubMed Central Google Scholar
Ito, R., Robbins, T. W., Pennartz, C. M. & Everitt, B. J. Functional interaction between the hippocampus and nucleus accumbens shell is necessary for the acquisition of appetitive spatial context conditioning. J. Neurosci. 28, 6950–6959 (2008).
Article PubMed PubMed Central CAS Google Scholar
Barnstedt, O., Mocellin, P. & Remy, S. A hippocampus-accumbens code guides goal-directed appetitive behavior. Nat. Commun. 15, 3196 (2024).
Article ADS PubMed PubMed Central CAS Google Scholar
Ibrahim, K. M. et al. Dorsal hippocampus to nucleus accumbens projections drive reinforcement via activation of accumbal dynorphin neurons. Nat. Commun. 15, 750 (2024).
Article ADS PubMed PubMed Central CAS Google Scholar
Wimmer, G. E., Li, J. K., Gorgolewski, K. J. & Poldrack, R. A. Reward learning over weeks versus minutes increases the neural representation of value in the human brain. J. Neurosci. 38, 7649–7666 (2018).
Article PubMed PubMed Central CAS Google Scholar
Watkins, C. J. Learning from Delayed Rewards. Ph. D. thesis, King’s College, University of Cambridge https://ci.nii.ac.jp/naid/10007782517/ (1989).
O’Doherty, J. P., Dayan, P., Friston, K., Critchley, H. & Dolan, R. J. Temporal difference models and reward-related learning in the human brain. Neuron 38, 329–337 (2003).
Article PubMed Google Scholar
Daw, N. D., Niv, Y. & Dayan, P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci. 8, 1704–1711 (2005).
Article PubMed CAS Google Scholar
Kim, H., Lee, D. & Jung, M. W. Signals for previous goal choice persist in the dorsomedial, but not dorsolateral striatum of rats. J. Neurosci. 33, 52–63 (2013).
Article PubMed PubMed Central Google Scholar
Ito, M. & Doya, K. Validation of decision-making models and analysis of decision variables in the rat basal ganglia. J. Neurosci. 29, 9861–9874 (2009).
Article PubMed PubMed Central CAS Google Scholar
Lindsey, J., Markowitz, J. E., Gillis, W. F., Datta, S. R. & Litwin-Kumar, A. Dynamics of striatal action selection and reinforcement learning. eLife 13, RP101747 (2025).
Day, J. J., Roitman, M. F., Wightman, R. M. & Carelli, R. M. Associative learning mediates dynamic shifts in dopamine signaling in the nucleus accumbens. Nat. Neurosci. 10, 1020–1028 (2007).
Article PubMed CAS Google Scholar
Morris, G., Schmidt, R. & Bergman, H. Striatal action-learning based on dopamine concentration. Exp. Brain Res. 200, 307–317 (2010).
Article PubMed Google Scholar
Pagnoni, G., Zink, C. F., Montague, P. R. & Berns, G. S. Activity in human ventral striatum locked to errors of reward prediction. Nat. Neurosci. 5, 97–98 (2002).
Article PubMed CAS Google Scholar
Roesch, M. R., Calu, D. J. & Schoenbaum, G. Dopamine neurons encode the better option in rats deciding between differently delayed or sized rewards. Nat. Neurosci. 10, 1615–1624 (2007).
Article PubMed PubMed Central CAS Google Scholar
McClure, S. M., Berns, G. S. & Montague, P. Temporal prediction errors in a passive learning task activate human striatum. Neuron 38, 339–346 (2003).
Article PubMed CAS Google Scholar
Schultz, W. Dopamine reward prediction error coding. Dialogues Clin. Neurosci. 18, 23–32 (2016).
Article PubMed PubMed Central Google Scholar
Calabresi, P., Picconi, B., Tozzi, A. & Di Filippo, M. Dopamine-mediated regulation of corticostriatal synaptic plasticity. Trends Neurosci. 30, 211–219 (2007).
Article PubMed CAS Google Scholar
Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Machine Learning Proceedings 1990 216–224 https://pdfs.semanticscholar.org/642d/b624b5b33a02a435ee1415d7c9f9cef36e1d.pdf?_ga=2.139378995.1242374671.1528715677-1919865530.1527610296 (2014).
Johnson, A. & Redish, A. D. Hippocampal replay contributes to within session learning in a temporal difference reinforcement learning model. Neural Netw. 18, 1163–1171 (2005).
Article PubMed Google Scholar
Andrychowicz, M. et al. Hindsight Experience Replay. Advances in Neural Information Processing Systems (NIPS) 5049–5059 http://papers.nips.cc/paper/7090-hindsight-experience-replay (2017).
Karimpanal, T. G. & Bouffanais, R. Experience replay using transition sequences. Front. Neurorobotics 12, 32 (2017).
Article Google Scholar
Roscow, E. L., Chua, R., Costa, R. P., Jones, M. W. & Lepora, N. Learning offline: memory replay in biological and artificial reinforcement learning. Trends Neurosci. 44, 808–821 (2021).
Article PubMed CAS Google Scholar
Coddington, L. T., Lindo, S. E. & Dudman, J. T. Mesolimbic dopamine adapts the rate of learning from action. Nature 614, 294–302 (2023).
Article ADS PubMed PubMed Central CAS Google Scholar
Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
Article PubMed CAS Google Scholar
Schultz, W. Updating dopamine reward signals. Curr. Opin. Neurobiol. 23, 229–238 (2013).
Article PubMed PubMed Central CAS Google Scholar
Glimcher, P. W. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proc. Natl Acad. Sci. USA 108, 15647–15654 (2011).
Article ADS PubMed PubMed Central CAS Google Scholar
Watabe-Uchida, M., Eshel, N. & Uchida, N. Neural circuitry of reward prediction error. Annu. Rev. Neurosci. 40, 373–394 (2017).
Article PubMed PubMed Central CAS Google Scholar
Pennartz, C., Ito, R., Verschure, P., Battaglia, F. & Robbins, T. The hippocampal–striatal axis in learning, prediction and goal-directed behavior. Trends Neurosci. 34, 548–559 (2011).
Article PubMed CAS Google Scholar
Lansink, C. S. et al. Preferential reactivation of motivationally relevant information in the ventral striatum. J. Neurosci. 28, 6372–6382 (2008).
Article PubMed PubMed Central CAS Google Scholar
Van Der Meer, M. A. & Redish, A. D. Theta phase precession in rat ventral striatum links place and reward information. J. Neurosci. 31, 2843–2854 (2011).
Article PubMed PubMed Central Google Scholar
Sjulson, L., Peyrache, A., Cumpelik, A., Cassataro, D. & Buzsáki, G. Cocaine place conditioning strengthens location-specific hippocampal coupling to the nucleus accumbens. Neuron 98, 926–934.e5 (2018).
Article PubMed PubMed Central Google Scholar
Sosa, M., Joo, H. R. & Frank, L. M. Dorsal and ventral hippocampus engage opposing networks in the nucleus accumbens. Neuron 105, 725–741 (2020).
Babichev, A., Morozov, D. & Dabaghian, Y. Replays of spatial memories suppress topological fluctuations in cognitive map. Netw. Neurosci. 3, 707–724 (2019).
Article PubMed PubMed Central Google Scholar
Stickgold, R. Sleep-dependent memory consolidation. Nature 437, 1272–1278 (2005).
Article ADS PubMed CAS Google Scholar
Marshall, L. & Born, J. The contribution of sleep to hippocampus-dependent memory consolidation. Trends Cogn. Sci. 11, 442–450 (2007).
Article PubMed Google Scholar
Diekelmann, S. & Born, J. The memory function of sleep. Nat. Rev. Neurosci. 11, 114–126 (2010).
Article PubMed CAS Google Scholar
Torromino, G. et al. Offline ventral subiculum-ventral striatum serial communication is required for spatial memory consolidation. Nat. Commun. 10, 5721 (2019).
Del Ferraro, G. et al. Finding influential nodes for integration in brain networks using optimal percolation theory. Nat. Commun. 9, 2274 (2018).
Pronier, É., Morici, J. F. & Girardeau, G. The role of the hippocampus in the consolidation of emotional memories during sleep. Trends Neurosci. 46, 912–925 (2023).
Article PubMed CAS Google Scholar
O’Neill, J., Boccara, C. N., Stella, F., Schönenberger, P. & Csicsvari, J. Superficial layers of the medial entorhinal cortex replay independently of the hippocampus. Science 355, 184–188 (2017).
Article ADS PubMed Google Scholar
Kaefer, K., Nardin, M., Blahna, K. & Csicsvari, J. Replay of behavioral sequences in the medial prefrontal cortex during rule switching. Neuron 106, 154–165.e6 (2020).
Article PubMed Google Scholar
Fischer, S. & Born, J. Anticipated reward enhances offline learning during sleep. J. Exp. Psychol. Learn. Mem. Cogn. 35, 1586 (2009).
Article PubMed Google Scholar
Ambrose, R. E., Pfeiffer, B. E. & Foster, D. J. Reverse replay of hippocampal place cells is uniquely modulated by changing reward. Neuron 91, 1124–1136 (2016).
Article PubMed PubMed Central CAS Google Scholar
Atherton, L. A., Dupret, D. & Mellor, J. R. Memory trace replay: the shaping of memory consolidation by neuromodulation. Trends Neurosci. 38, 560–570 (2015).
Article PubMed PubMed Central CAS Google Scholar
Gruber, M. J., Ritchey, M., Wang, S.-F., Doss, M. K. & Ranganath, C. Post-learning hippocampal dynamics promote preferential retention of rewarding events. Neuron 89, 1110–1120 (2016).
Article PubMed PubMed Central CAS Google Scholar
McNamara, C. G., Tejero-Cantero, Á., Trouche, S., Campo-Urriza, N. & Dupret, D. Dopaminergic neurons promote hippocampal reactivation and spatial memory persistence. Nat. Neurosci. 17, 1658–1660 (2014).
Article PubMed PubMed Central CAS Google Scholar
Frey, U. & Morris, R. G. Synaptic tagging: implications for late maintenance of hippocampal long-term potentiation. Trends Neurosci. 21, 181–188 (1998).
Article PubMed CAS Google Scholar
Redondo, R. L. & Morris, R. G. Making memories last: the synaptic tagging and capture hypothesis. Nat. Rev. Neurosci. 12, 17–30 (2011).
Article PubMed CAS Google Scholar
Gomperts, S. N. et al. VTA neurons coordinate with the hippocampal reactivation of spatial experience. eLife 4, 321–352 (2015).
Article Google Scholar
Valdés, J. L., McNaughton, B. L. & Fellous, J. M. Offline reactivation of experience-dependent neuronal firing patterns in the rat ventral tegmental area. J. Neurophysiol. 114, 1183–1195 (2015).
Article PubMed PubMed Central Google Scholar
Igloi, K., Gaggioni, G., Sterpenich, V. & Schwartz, S. A nap to recap or how reward regulates hippocampal-prefrontal memory networks during daytime sleep in humans. eLife 4, https://cdn.elifesciences.org/articles/07903/elife-07903-v2.pdf, https://elifesciences.org/articles/07903 (2015).
Studte, S., Bridger, E. & Mecklinger, A. Sleep spindles during a nap correlate with post sleep memory performance for highly rewarded word-pairs. Brain Lang. 167, 28–35 (2017).
Article PubMed Google Scholar
Momennejad, I. et al. Offline replay supports planning in human reinforcement learning. eLife 7, https://doi.org/10.7554/eLife.32548.001 (2018). https://elifesciences.org/articles/32548
Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 8, 293–321 (1992).
Article Google Scholar
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Article ADS PubMed CAS Google Scholar
Wittkuhn, L., Chien, S., Hall-McMaster, S. & Schuck, N. W. Replay in minds and machines. Neurosci. Biobehav. Rev. 129, 367–388 (2021).
Article PubMed Google Scholar
Cichosz, P. An analysis of experience replay in temporal difference learning. Cybern. Syst. 30, 341–363 (1999).
Article Google Scholar
Schaul, T., Quan, J., Antonoglou, I. & Silver, D. Prioritized experience replay. 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings http://arxiv.org/abs/1511.05952, https://arxiv.org/pdf/1511.05952.pdf (2016).
Mattar, M. G. & Daw, N. D. Prioritized memory access explains planning and hippocampal replay. Nat. Neurosci. 21, 1609–1617 (2018).
Article PubMed PubMed Central CAS Google Scholar
Antonov, G., Gagne, C., Eldar, E. & Dayan, P. Optimism and pessimism in optimised replay. PLoS Comput. Biol. 18, e1009634 (2022).
Sagiv, Y., Akam, T., Witten, I. B. & Daw, N. D. Prioritizing replay when future goals are unknown. bioRxiv https://doi.org/10.1101/2024.02.29.582822 (2024).
Keiflin, R., Pribut, H. J., Shah, N. B. & Janak, P. H. Ventral tegmental dopamine neurons participate in reward identity predictions. Curr. Biol. 29, 93–103.e3 (2019).
Article PubMed Google Scholar
Sharpe, M. J. et al. Dopamine transients do not act as model-free prediction errors during associative learning. Nat. Commun. 11, 106 (2020).
Batchelor, H. M. et al. Dopamine neurons respond to errors in the prediction of sensory features of expected rewards article dopamine neurons respond to errors in the prediction of sensory features of expected rewards. Neuron 95, 1395–1405.e3 (2017).
PubMed PubMed Central Google Scholar
Takahashi, Y. K. et al. Dopaminergic prediction errors in the ventral tegmental area reflect a multithreaded predictive model. Nat. Neurosci. 26, 830–839 (2023).
Article PubMed PubMed Central CAS Google Scholar
Costa, K. M., Raheja, N., Mirani, J., Sercander, C. & Schoenbaum, G. Striatal dopamine release reflects a domain-general prediction error. bioRxiv 2023–08 https://doi.org/10.1101/2023.08.19.553959 (2023).
Lee, R. S., Sagiv, Y., Engelhard, B., Witten, I. B. & Daw, N. D. A feature-specific prediction error model explains dopaminergic heterogeneity. Nat. Neurosci. 27, 1574–1586 (2024).
Article PubMed CAS Google Scholar
Hirase, H. et al. Firing rates of hippocampal neurons are preserved during subsequent sleep episodes and modified by novel awake experience. Proc. Natl. Acad. Sci. USA 98, 9386–90 (2001).
Article ADS PubMed PubMed Central CAS Google Scholar
Kudrimoti, H. S., Barnes, C. A. & McNaughton, B. L. Reactivation of hippocampal cell assemblies: effects of behavioral state, experience, and EEG dynamics. J. Neurosci. 19, 4090–4101 (1999).
Article PubMed PubMed Central CAS Google Scholar
Gupta, A. S., van der Meer, M. A., Touretzky, D. S. & Redish, A. D. Hippocampal replay is not a simple function of experience. Neuron 65, 695–705 (2010).
Article PubMed PubMed Central CAS Google Scholar
Schapiro, A. C., McDevitt, E. A., Rogers, T. T., Mednick, S. C. & Norman, K. A. Human hippocampal replay during rest prioritizes weakly learned information and predicts memory performance. Nat. Commun. 9, 3920 (2018).
Article ADS PubMed PubMed Central Google Scholar
Ólafsdóttir, H. F., Carpenter, F. & Barry, C. Task demands predict a dynamic switch in the content of awake hippocampal replay. Neuron 96, 925–935.e6 (2017).
Article PubMed PubMed Central Google Scholar
Genzel, L., Spoormaker, V., Konrad, B. & Dresler, M. The role of rapid eye movement sleep for amygdala-related memory processing. Neurobiol. Learn. Mem. 122, 110–121 (2015).
Article PubMed CAS Google Scholar
Lewis, P. A., Knoblich, G. & Poe, G. How memory replay in sleep boosts creative problem-solving. Trends Cogn. Sci. 22, 491–503 (2018).
Article PubMed PubMed Central Google Scholar
McDevitt, E. A., Zhang, J., MacKenzie, K. J., Fiser, J. & Mednick, S. C. The effect of interference, offline sleep, and wake on spatial statistical learning. Neurobiol. Learn. Mem. 193, 107650 (2022).
Article PubMed Google Scholar
Giri, B. et al. Hippocampal reactivation extends for several hours following novel experience. J. Neurosci. 39, 866–875 (2019).
Article PubMed PubMed Central CAS Google Scholar
Harris, J. J., Kollo, M., Erskine, A., Schaefer, A. & Burdakov, D. Natural VTA activity during NREM sleep influences future exploratory behavior. iScience 25, 104396 (2022).
Sutton, R. S. & Barto, A. G. Reinforcement Learning. (MIT Press, 2018).
Google Scholar
Murphy, A. H. A new vector partition of the probability score. J. Appl. Meteorol. 12, 595–600 (1973).
Article ADS Google Scholar
Acerbi, L. & Ma, W. J. Practical Bayesian optimization for model fitting with Bayesian adaptive direct search. Adv. Neural Inf. Process. Syst. 2017-Decem, 1837–1847 (2017).
Google Scholar
Ólafsdóttir, H. F., Carpenter, F. & Barry, C. Coordinated grid and place cell replay during rest. Nat. Neurosci. 19, 792–794 (2016).
Article PubMed Google Scholar
Carr, M. F., Jadhav, S. P. & Frank, L. M. Hippocampal replay in the awake state: a potential substrate for memory consolidation and retrieval. Nat. Neurosci. 14, 147–153 (2011).
Article PubMed PubMed Central CAS Google Scholar
Paxinos, G. & Watson, C. The Rat Brain in Stereotaxic Coordinates 4th edn. (Academic, 1996).
Google Scholar
Pachitariu, M., Steinmetz, N., Kadir, S., Carandini, M. & Harris, K. D. Kilosort: realtime spike-sorting for extracellular electrophysiology with hundreds of channels. bioRxiv 061481 http://biorxiv.org/lookup/doi/10.1101/061481 (2016).
Schmitzer-Torbert, N., Jackson, J., Henze, D., Harris, K. & Redish, A. D. Quantitative measures of cluster quality for use in extracellular recordings. Neuroscience 131, 1–11 (2005).
Article PubMed CAS Google Scholar
Kruskal, P. B., Stanis, J. J., McNaughton, B. L. & Thomas, P. J. A binless correlation measure reduces the variability of memory reactivation estimates. Stat. Med. 26, 3997–4008 (2007).
Article MathSciNet PubMed Google Scholar
Roscow, E. EmmaRoscow/QlearningReplay [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.17115788 (2025).

Download references

Acknowledgements

We are grateful to Aleksander Domanski and Andrew New for assistance with experimental set-up, and Luke Burguete for assistance with spike-sorting. E.L.R. received funding from a Wellcome Trust PhD scholarship (109070/Z/15/Z); M.W.J. received funding from a Wellcome Senior Research Fellowship in Basic Biomedical Science (202810/Z/16/Z).

Author information

These authors contributed equally: Nathan F. Lepora, Matthew W. Jones.

Authors and Affiliations

School of Physiology, Pharmacology & Neuroscience, University of Bristol, Bristol, UK
Emma L. Roscow, Timothy Howe & Matthew W. Jones
School of Engineering Mathematics and Technology, University of Bristol, Bristol, UK
Nathan F. Lepora

Authors

Emma L. Roscow
View author publications
Search author on:PubMed Google Scholar
Timothy Howe
View author publications
Search author on:PubMed Google Scholar
Nathan F. Lepora
View author publications
Search author on:PubMed Google Scholar
Matthew W. Jones
View author publications
Search author on:PubMed Google Scholar

Contributions

Conceptualisation, E.L.R., N.F.L. and M.W.J.; Methodology, E.L.R., N.F.L. and M.W.J.; Investigation, E.L.R.; Formal Analysis, E.L.R. and T.H.; Data Curation, E.L.R. and T.H.; Writing—Original Draft, E.L.R., N.F.L. and M.W.J.; Writing—Reviewing and Editing, E.L.R. and M.W.J.; Visualisation, E.L.R. and T.H.; Supervision, N.F.L. and M.W.J.; Funding Acquisition N.F.L. and M.W.J.

Corresponding author

Correspondence to Emma L. Roscow.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (download PDF )

Reporting Summary (download PDF )

Transparent Peer Review file (download PDF )

Source data

Source Data (download ZIP )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Roscow, E.L., Howe, T., Lepora, N.F. et al. Post-learning replay of hippocampal-striatal activity is biased by reward-prediction signals. Nat Commun 16, 10394 (2025). https://doi.org/10.1038/s41467-025-65354-2

Download citation

Received: 26 July 2019
Accepted: 10 October 2025
Published: 24 November 2025
Version of record: 24 November 2025
DOI: https://doi.org/10.1038/s41467-025-65354-2