Abstract
Deep reinforcement learning algorithms are known for their sample inefficiency, requiring extensive episodes to reach optimal performance. Episodic reinforcement learning algorithms aim to overcome this issue by using extended memory systems to leverage past experiences. However, these memory augmentations are often used as mere buffers, from which isolated events are resampled for offline learning (for example, replay). In this Article, we introduce Sequential Episodic Control (SEC), a hippocampal-inspired model that stores entire event sequences in their temporal order and employs a sequential bias in their retrieval to guide actions. We evaluate SEC across various benchmarks from the Animal-AI testbed, demonstrating its superior performance and sample efficiency compared to several state-of-the-art models, including Model-Free Episodic Control, Deep Q-Network and Episodic Reinforcement Learning with Associative Memory. Our experiments show that SEC achieves higher rewards and faster policy convergence in tasks requiring memory and decision-making. Additionally, we investigate the effects of memory constraints and forgetting mechanisms, revealing that prioritized forgetting enhances both performance and policy stability. Further, ablation studies demonstrate the critical role of the sequential memory component in SEC. Finally, we discuss how fast, sequential hippocampal-like episodic memory systems could support both habit formation and deliberation in artificial and biological systems.
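To make the mechanism concrete, the following is a minimal sketch of a sequence-based episodic memory with a sequential retrieval bias and prioritized forgetting. The state encoding, similarity measure, per-sequence value and eligibility trace used here are simplifying assumptions made for illustration; they are not the published SEC implementation (see the Code availability statement for the actual code).

```python
# Minimal sketch of a sequential episodic memory (illustrative assumptions only).
import numpy as np


class SequentialEpisodicMemory:
    """Stores whole episodes as ordered event sequences and retrieves actions
    with a bias towards continuing sequences that were recently retrieved."""

    def __init__(self, max_sequences=500, trace_decay=0.9):
        self.max_sequences = max_sequences
        self.trace_decay = trace_decay
        self.sequences = []   # {"events": [(state, action), ...], "value": float, "trace": float}
        self.current = []     # events of the ongoing episode, kept in temporal order

    def store_step(self, state, action):
        # Append the latest (state, action) event to the ongoing sequence.
        self.current.append((np.asarray(state, dtype=float), action))

    def end_episode(self, episode_return):
        # Commit the whole episode as one ordered sequence.
        if self.current:
            self.sequences.append({"events": self.current,
                                   "value": float(episode_return),
                                   "trace": 0.0})
            self.current = []
        # Prioritized forgetting: when capacity is exceeded, drop the
        # lowest-valued sequence rather than a random or oldest one.
        if len(self.sequences) > self.max_sequences:
            worst = min(range(len(self.sequences)),
                        key=lambda i: self.sequences[i]["value"])
            self.sequences.pop(worst)

    def retrieve_action(self, state):
        # Score every stored event by its similarity to the query state plus its
        # sequence's value and eligibility trace; reinforce the trace of the
        # winning sequence so its later events are favoured on the next call
        # (the sequential retrieval bias).
        state = np.asarray(state, dtype=float)
        best_score, best_action, best_seq = None, None, None
        for seq in self.sequences:
            seq["trace"] *= self.trace_decay          # traces fade between retrievals
            for stored_state, action in seq["events"]:
                score = (-np.linalg.norm(stored_state - state)
                         + seq["value"] + seq["trace"])
                if best_score is None or score > best_score:
                    best_score, best_action, best_seq = score, action, seq
        if best_seq is not None:
            best_seq["trace"] += 1.0                   # bias towards this sequence
        return best_action
```

In an agent loop, store_step would be called once per environment step, end_episode with the episode's return, and retrieve_action to select the next action; when retrieve_action returns None (empty memory), the agent would fall back to exploratory behaviour.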
Data availability
The data sets supporting the findings of this study are available via Zenodo at https://doi.org/10.5281/zenodo.11506323 (ref. 77).
Code availability
The implementation of the SEC model used in this study is available in the GitHub repository at https://github.com/IsmaelTito/SEC. The specific version used to generate the results is also archived on Zenodo at https://doi.org/10.5281/zenodo.14014111 (ref. 78).
References
Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).
Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).
Berner, C. et al. Dota 2 with large scale deep reinforcement learning. Preprint at http://arxiv.org/abs/1912.06680 (2019).
Lake, B. M., Ullman, T. D., Tenenbaum, J. B. & Gershman, S. J. Building machines that learn and think like people. Behav. Brain Sci. 40, 1–58 (2017).
Marcus, G. Deep learning: a critical appraisal. Preprint at http://arxiv.org/abs/1801.00631 (2018).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Baker, B. et al. Emergent tool use from multi-agent autocurricula. In International Conference on Learning Representations (ICLR, 2020).
Botvinick, M. et al. Reinforcement learning fast and slow. Trends Cogn. Sci. 23, 408–422 (2019).
Hansen, S., Pritzel, A., Sprechmann, P., Barreto, A. & Blundell, C. Fast deep reinforcement learning using online adjustments from the past. In Advances in Neural Information Processing Systems (eds Bengio, S. et al.) 10567–10577 (Curran Associates, 2018).
Zhu, G., Lin, Z., Yang, G. & Zhang, C. Episodic reinforcement learning with associative memory. In International Conference on Learning Representations (eds Zhu, G., Lin, Z., Yang, G. & Zhang, C.) 370–384 (Curran Associates, 2019).
Lin, Z., Zhao, T., Yang, G. & Zhang, L. Episodic memory deep Q-networks. In Proc. IJCAI International Joint Conference on Artificial Intelligence (ed. Lang, J.) 2433–2439 (IJCAI, 2018).
Lee, S. Y., Sungik, C. & Chung, S. Y. Sample-efficient deep reinforcement learning via episodic backward update. In Advances in Neural Information Processing Systems (eds Wallach, H. et al.) 2112–2121 (Curran Associates, 2019).
Blundell, C. et al. Model-free episodic control. Preprint at http://arxiv.org/abs/1606.04460 (2016).
Pritzel, A. et al. Neural episodic control. In Proc. 34th International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 2827–2836 (ACM, 2017).
Yalnizyan-Carson, A. & Richards, B. A. Forgetting enhances episodic control with structured memories. Front. Comput. Neurosci. 16, 757244 (2022).
Davidson, T. J., Kloosterman, F. & Wilson, M. A. Hippocampal replay of extended experience. Neuron 63, 497–507 (2009).
Voegtlin, T. & Verschure, P. F. What can robots tell us about brains? A synthetic approach towards the study of learning and problem solving. Rev. Neurosci. 10, 291–310 (1999).
Lisman, J. E. & Idiart, M. A. Storage of 7+/-2 short-term memories in oscillatory subcycles. Science 267, 1512–1515 (1995).
Jensen, O. & Lisman, J. E. Dual oscillations as the physiological basis for capacity limits. Behav. Brain Sci. 24, 126 (2001).
Ramani, D. A short survey on memory based reinforcement learning. Preprint at http://arxiv.org/abs/1904.06736 (2019).
Buzsáki, G. & Tingley, D. Space and time: the hippocampus as a sequence generator. Trends Cogn. Sci. 22, 853–869 (2018).
Lisman, J. & Redish, A. D. Prediction, sequences and the hippocampus. Philos. Trans. R. Soc. B 364, 1193–1201 (2009).
Verschure, P. F., Pennartz, C. M. & Pezzulo, G. The why, what, where, when and how of goal-directed choice: neuronal and computational principles. Philos. Trans. R. Soc. B 369, 20130483 (2014).
Merleau-Ponty, M. et al. The Primacy of Perception: And Other Essays on Phenomenological Psychology, the Philosophy of Art, History, and Politics (Northwestern Univ. Press, 1964).
Bornstein, A. M. & Norman, K. A. Reinstated episodic context guides sampling-based decisions for reward. Nat. Neurosci. 20, 997–1003 (2017).
Wimmer, G. E. & Shohamy, D. Preference by association: how memory mechanisms in the hippocampus bias decisions. Science 338, 270–273 (2012).
Wu, C. M., Schulz, E. & Gershman, S. J. Inference and search on graph-structured spaces. Comput. Brain Behav. 4, 125–147 (2021).
Johnson, A. & Redish, A. D. Neural ensembles in CA3 transiently encode paths forward of the animal at a decision point. J. Neurosci. 27, 12176–12189 (2007).
Ludvig, E. A., Madan, C. R. & Spetch, M. L. Priming memories of past wins induces risk seeking. J. Exp. Psychol. Gen. 144, 24 (2015).
Wang, S., Feng, S. F. & Bornstein, A. M. Mixing memory and desire: How memory reactivation supports deliberative decision-making. Wiley Interdiscip. Rev. Cogn. Sci. 13, e1581 (2022).
Gershman, S. J. & Daw, N. D. Reinforcement learning and episodic memory in humans and animals: an integrative framework. Annu. Rev. Psychol. 68, 101–128 (2017).
Santos-Pata, D. et al. Epistemic autonomy: self-supervised learning in the mammalian hippocampus. Trends Cogn. Sci. 25, 582–595 (2021).
Santos-Pata, D. et al. Entorhinal mismatch: a model of self-supervised learning in the hippocampus. iScience 24, 102364 (2021).
Amil, A. F., Freire, I. T. & Verschure, P. F. Discretization of continuous input spaces in the hippocampal autoencoder. Preprint at http://arxiv.org/abs/2405.14600 (2024).
Rennó-Costa, C., Lisman, J. E. & Verschure, P. F. The mechanism of rate remapping in the dentate gyrus. Neuron 68, 1051–1058 (2010).
Estefan, D. P. et al. Coordinated representational reinstatement in the human hippocampus and lateral temporal cortex during episodic memory retrieval. Nat. Commun. 10, 1–13 (2019).
de Almeida, L., Idiart, M. & Lisman, J. E. A second function of gamma frequency oscillations: an E%-max winner-take-all mechanism selects which cells fire. J. Neurosci. 29, 7497–7503 (2009).
Skaggs, W. E., McNaughton, B. L., Wilson, M. A. & Barnes, C. A. Theta phase precession in hippocampal neuronal populations and the compression of temporal sequences. Hippocampus 6, 149–172 (1996).
Redish, A. D. Vicarious trial and error. Nat. Rev. Neurosci. 17, 147–159 (2016).
Clayton, N. S. & Dickinson, A. Episodic-like memory during cache recovery by scrub jays. Nature 395, 272–274 (1998).
Foster, D. J. & Knierim, J. J. Sequence learning and the role of the hippocampus in rodent navigation. Curr. Opin. Neurobiol. 22, 294–300 (2012).
Mattar, M. G. & Daw, N. D. Prioritized memory access explains planning and hippocampal replay. Nat. Neurosci. 21, 1609–1617 (2018).
Eichenbaum, H. Memory: organization and control. Annu. Rev. Psychol. 68, 19–45 (2017).
Estefan, D. P. et al. Volitional learning promotes theta phase coding in the human hippocampus. Proc. Natl Acad. Sci. USA 118, e2021238118 (2021).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 2018).
Watkins, C. J. C. H. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992).
Kubie, J. L. & Fenton, A. A. Heading-vector navigation based on head-direction cells and path integration. Hippocampus 19, 456–479 (2009).
Mathews, Z. et al. Insect-like mapless navigation based on head direction cells and contextual learning using chemo-visual sensors. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems 2243–2250 (IEEE, 2009).
Amil, A. F. & Verschure, P. F. Supercritical dynamics at the edge-of-chaos underlies optimal decision-making. J. Phys. Complex. 2, 045017 (2021).
Verschure, P. F., Voegtlin, T. & Douglas, R. J. Environmentally mediated synergy between perception and behaviour in mobile robots. Nature 425, 620–624 (2003).
Vikbladh, O., Shohamy, D. & Daw, N. Episodic contributions to model-based reinforcement learning. In Annual Conference on Cognitive Computational Neuroscience (CCN, 2017).
Cazé, R., Khamassi, M., Aubin, L. & Girard, B. Hippocampal replays under the scrutiny of reinforcement learning models. J. Neurophysiol. 120, 2877–2896 (2018).
Gonzalez, C., Lerch, J. F. & Lebiere, C. Instance-based learning in dynamic decision making. Cogn. Sci. 27, 591–635 (2003).
Gonzalez, C. & Dutt, V. Instance-based learning: integrating sampling and repeated decisions from experience. Psychol. Rev. 118, 523 (2011).
Lengyel, M. & Dayan, P. Hippocampal contributions to control: the third way. In Proc. Advances in Neural Information Processing Systems (eds Platt, J. et al.) 889–896 (Curran Associates, 2008).
Freire, I. T., Moulin-Frier, C., Sanchez-Fibla, M., Arsiwalla, X. D. & Verschure, P. F. Modeling the formation of social conventions from embodied real-time interactions. PLoS ONE 15, e0234434 (2020).
Papoudakis, G., Christianos, F., Rahman, A. & Albrecht, S. V. Dealing with non-stationarity in multi-agent deep reinforcement learning. Preprint at http://arxiv.org/abs/1906.04737 (2019).
Freire, I. & Verschure, P. High-fidelity social learning via shared episodic memories can improve collaborative foraging. Paper presented at Intrinsically Motivated Open-Ended Learning Workshop@NeurIPS 2023 (2023).
Albrecht, S. V. & Stone, P. Autonomous agents modelling other agents: a comprehensive survey and open problems. Artif. Intell. 258, 66–95 (2018).
Freire, I. T., Arsiwalla, X. D., Puigbò, J.-Y. & Verschure, P. F. Limits of multi-agent predictive models in the formation of social conventions. In Proc. Artificial Intelligence Research and Development (eds Falomir, Z. et al.) 297–301 (IOS, 2018).
Freire, I. T., Puigbò, J.-Y., Arsiwalla, X. D. & Verschure, P. F. Modeling the opponent’s action using control-based reinforcement learning. In Proc. Conference on Biomimetic and Biohybrid Systems (eds Vouloutsi, V. et al.) 179–186 (Springer, 2018).
Freire, I. T., Arsiwalla, X. D., Puigbò, J.-Y. & Verschure, P. Modeling theory of mind in dyadic games using adaptive feedback control. Information 14, 441 (2023).
Kahali, S. et al. Distributed adaptive control for virtual cyborgs: a case study for personalized rehabilitation. In Proc. Conference on Biomimetic and Biohybrid Systems (eds Meder, F. et al.) 16–32 (Springer, 2023).
Freire, I. T., Guerrero-Rosado, O., Amil, A. F. & Verschure, P. F. Socially adaptive cognitive architecture for human-robot collaboration in industrial settings. Front. Robot. AI 11, 1248646 (2024).
Verschure, P. F. Distributed adaptive control: a theory of the mind, brain, body nexus. BICA 1, 55–72 (2012).
Rosado, O. G., Amil, A. F., Freire, I. T. & Verschure, P. F. Drive competition underlies effective allostatic orchestration. Front. Robot. AI 9, 1052998 (2022).
Daw, N. D. Are we of two minds? Nat. Neurosci. 21, 1497–1499 (2018).
Freire, I. T., Urikh, D., Arsiwalla, X. D. & Verschure, P. F. Machine morality: from harm-avoidance to human-robot cooperation. In Proc. Conference on Biomimetic and Biohybrid Systems (eds Vouloutsi, V. et al.) 116–127 (Springer, 2020).
Verschure, P. F. Synthetic consciousness: the distributed adaptive control perspective. Philos. Trans. R. Soc. B 371, 20150448 (2016).
Goode, T. D., Tanaka, K. Z., Sahay, A. & McHugh, T. J. An integrated index: engrams, place cells, and hippocampal memory. Neuron 107, 805–820 (2020).
Amil, A. F., Albesa-González, A. & Verschure, P. F. M. J. Theta oscillations optimize a speed-precision trade-off in phase coding neurons. PLoS Comput. Biol. 20, e1012628 (2024).
Tremblay, L. & Schultz, W. Relative reward preference in primate orbitofrontal cortex. Nature 398, 704–708 (1999).
Cromwell, H. C., Hassani, O. K. & Schultz, W. Relative reward processing in primate striatum. Exp. Brain Res. 162, 520–525 (2005).
Soldati, F., Burman, O. H., John, E. A., Pike, T. W. & Wilkinson, A. Long-term memory of relative reward values. Biol. Lett. 13, 20160853 (2017).
Beyret, B. et al. The Animal-AI environment: training and testing animal-like artificial cognition. Preprint at http://arxiv.org/abs/1909.07483 (2019).
Crosby, M., Beyret, B. & Halina, M. The Animal-AI Olympics. Nat. Mach. Intell. 1, 257 (2019).
Freire, I. T. Dataset for ‘Sequential memory improves sample and memory efficiency in episodic control’. Zenodo https://doi.org/10.5281/zenodo.11506323 (2024).
Freire, I. T. IsmaelTito/SEC: SEC v.1.0 release (v.1.0.0). Zenodo https://doi.org/10.5281/zenodo.14014111 (2024).
Acknowledgements
This study was funded by the Counterfactual Assessment and Valuation for Awareness Architecture (CAVAA) project (European Innovation Council’s Horizon programme, grant no. 101071178) awarded to P.F.M.J.V.
Author information
Authors and Affiliations
Contributions
Conceptualization: all authors. Methodology: I.T.F. and A.F.A. Software: I.T.F. and A.F.A. Data curation: I.T.F. Formal analysis: I.T.F. Resources: P.F.M.J.V. Writing—original draft: I.T.F. Writing—review and editing: all authors. Visualization: I.T.F. Supervision: P.F.M.J.V. Project administration: P.F.M.J.V. Funding acquisition: P.F.M.J.V. All authors have read and agreed to the published version of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Mehdi Khamassi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Performance increase of SEC over NSEC across different memory-limit conditions in the Double T-maze benchmark.
Units reported are the total mean performance of SEC over NSEC. Each column shows SEC's performance with a limited memory capacity of 125, 250, 500, and 1000 sequences, respectively. Each row shows the memory ratio between SEC and NSEC, ranging from 1/1 (equal memory limit) to 1/8 (SEC memory limit is 8 times smaller than NSEC).
Extended Data Fig. 2 Ablation studies of the Sequential Episodic Control (SEC) algorithm in the Double T-Maze task.
The left panel presents the single-mechanism ablations with average reward (top) and entropy (bottom). The right panel shows the double-mechanism ablations under the same metrics. In the single ablations, 'SEC-noDist' lacks the distance-to-goal component, 'SEC-noRR' lacks the relative reward component, and 'SEC-noGi' lacks the eligibility score component. In the double ablations, 'SEC-soloDist', 'SEC-soloRR', and 'SEC-soloGi' operate with only the distance-to-goal, relative reward, and eligibility score components, respectively. The full SEC model is included as a benchmark in both panels. These graphs demonstrate the comparative impact of individual and combined components of the SEC algorithm on its performance and decision-making uncertainty. Vertical bars represent the average episode around which the memory was filled. Average values were computed using a sliding window of 20 episodes. Error bars represent SE.
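To make the ablation conditions concrete, the sketch below shows one plausible way the three components could be combined into a single retrieval score, with each ablation implemented by switching its component off. The multiplicative weighting and the neutral value of 1 for disabled components are assumptions made for illustration, not the exact formulation used in SEC.

```python
# Illustrative sketch of combining the three ablated components into one
# retrieval score. The multiplicative form and the neutral value used for
# disabled components are assumptions, not the published SEC equations.
def retrieval_score(similarity, distance_to_goal, relative_reward, eligibility,
                    use_dist=True, use_rr=True, use_gi=True):
    """Score a stored event; a component excluded by an ablation is replaced by 1
    so it no longer modulates the score (e.g. use_dist=False mimics 'SEC-noDist',
    while use_rr=False and use_gi=False together mimic 'SEC-soloDist')."""
    dist_term = 1.0 / (1.0 + distance_to_goal) if use_dist else 1.0
    rr_term = relative_reward if use_rr else 1.0
    gi_term = eligibility if use_gi else 1.0
    return similarity * dist_term * rr_term * gi_term
```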
Supplementary information
Supplementary Information
Supplementary Tables 1–6.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Freire, I.T., Amil, A.F. & Verschure, P.F.M.J. Sequential memory improves sample and memory efficiency in episodic control. Nat Mach Intell 7, 43–55 (2025). https://doi.org/10.1038/s42256-024-00950-3
DOI: https://doi.org/10.1038/s42256-024-00950-3