
Multi-timescale reinforcement learning in the brain

Abstract

To thrive in complex environments, animals and artificial agents must learn to act adaptively to maximize fitness and rewards. Such adaptive behaviour can be learned through reinforcement learning1, a class of algorithms that has been successful at training artificial agents2,3,4,5 and at characterizing the firing of dopaminergic neurons in the midbrain6,7,8. In classical reinforcement learning, agents discount future rewards exponentially according to a single timescale, known as the discount factor. Here we explore the presence of multiple timescales in biological reinforcement learning. We first show that reinforcement agents learning at a multitude of timescales possess distinct computational benefits. Next, we report that dopaminergic neurons in mice performing two behavioural tasks encode reward prediction error with a diversity of discount time constants. Our model explains the heterogeneity of temporal discounting in both cue-evoked transient responses and slower timescale fluctuations known as dopamine ramps. Crucially, the measured discount factor of individual neurons is correlated across the two tasks, suggesting that it is a cell-specific property. Together, our results provide a new paradigm for understanding functional heterogeneity in dopaminergic neurons and a mechanistic basis for the empirical observation that humans and animals use non-exponential discounts in many situations9,10,11,12, and open new avenues for the design of more-efficient reinforcement learning algorithms.
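
The contrast between exponential and non-exponential discounting can be made concrete with a minimal numerical sketch (illustrative parameters only, not taken from the paper): if the discount rates k of a population of exponential discounters follow an exponential prior with rate λ, the population-average weight given to a reward delayed by t is exactly hyperbolic, 1/(1 + t/λ). This is one standard way a multitude of exponential timescales connects to the non-exponential discounts cited above.

```python
# Illustrative sketch (assumed parameters): averaging exponential discounts across
# a population of timescales yields an approximately hyperbolic discount curve.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 101)                  # delay to reward (arbitrary units)

lam = 1.0                                        # rate of the assumed prior over discount rates k
k = rng.exponential(scale=1.0 / lam, size=10_000)

mean_exponential = np.exp(-np.outer(k, t)).mean(axis=0)   # population-average discount
hyperbolic = 1.0 / (1.0 + t / lam)                        # analytic expectation E[exp(-k t)]

print(np.max(np.abs(mean_exponential - hyperbolic)))      # small Monte-Carlo error
```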


Fig. 1: Computational advantages of multi-timescale RL.
Fig. 2: Dopaminergic neurons exhibit a diversity of discount factors that enables decoding of reward delays.
Fig. 3: The diversity of discount factors across dopaminergic neurons explains qualitatively different ramping activity.
Fig. 4: Discount factors of single dopaminergic neurons are correlated across behavioural contexts.

Data availability

The raw electrophysiological data can be found on DANDI Archive (https://dandiarchive.org/dandiset/000251). The curated electrophysiological data can be found at https://doi.org/10.17632/tc43t3s7c5.1 (see ref. 72).

Code availability

The code used for simulations can be found on GitHub (https://github.com/pablotano8/multi_timescale_RL). The data analysis code for the electrophysiological experiments can be found at https://doi.org/10.17632/tc43t3s7c5.1 (see ref. 72).

References

  1. Sutton, R. S. & Barto, A. G. Reinforcement Learning 2nd edn (MIT Press, 2018).

  2. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).

  3. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

  4. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

  5. Wurman, P. R. et al. Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature 602, 223–228 (2022).

  6. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).

  7. Schultz, W. Neuronal reward and decision signals: from theories to data. Physiol. Rev. 95, 853–951 (2015).

  8. Cohen, J. Y., Haesler, S., Vong, L., Lowell, B. B. & Uchida, N. Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature 482, 85–88 (2012).

  9. Ainslie, G. Specious reward: a behavioral theory of impulsiveness and impulse control. Psychol. Bull. 82, 463–496 (1975).

  10. Frederick, S., Loewenstein, G. & O’Donoghue, T. Time discounting and time preference: a critical review. J. Econ. Lit. 40, 351–401 (2002).

  11. Laibson, D. Golden eggs and hyperbolic discounting. Q. J. Econ. 112, 443–478 (1997).

  12. Sozou, P. D. On hyperbolic discounting and uncertain hazard rates. Proc. R. Soc. Lond. B 265, 2015–2020 (1998).

  13. Botvinick, M. et al. Reinforcement learning, fast and slow. Trends Cogn. Sci. 23, 408–422 (2019).

  14. Redish, A. D. Addiction as a computational process gone awry. Science 306, 1944–1947 (2004).

  15. Lempert, K. M., Steinglass, J. E., Pinto, A., Kable, J. W. & Simpson, H. B. Can delay discounting deliver on the promise of RDoC? Psychol. Med. 49, 190–199 (2019).

  16. Sutton, R. S. et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems Vol. 2 761–768 (International Foundation for Autonomous Agents and Multiagent Systems, 2011); https://dl.acm.org/doi/10.5555/2031678.2031726

  17. Bellemare, M. G., Dabney, W. & Rowland, M. Distributional Reinforcement Learning (MIT Press, 2023).

  18. Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations (ICLR, 2017).

  19. Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G. & Larochelle, H. Hyperbolic discounting and learning over multiple horizons. Preprint at https://arxiv.org/abs/1902.06865 (2019).

  20. Dabney, W. et al. A distributional code for value in dopamine-based reinforcement learning. Nature 577, 671–675 (2020).

  21. Tano, P., Dayan, P. & Pouget, A. A local temporal difference code for distributional reinforcement learning. Adv. Neural Inf. Process. Syst. 1146, 13662–13673 (2020).

  22. Brunec, I. K. & Momennejad, I. Predictive representations in hippocampal and prefrontal hierarchies. J. Neurosci. 42, 299–312 (2022).

  23. Lowet, A. S., Zheng, Q., Matias, S., Drugowitsch, J. & Uchida, N. Distributional reinforcement learning in the brain. Trends Neurosci. 43, 980–997 (2020).

  24. Masset, P. & Gershman, S. J. in The Handbook of Dopamine (Handbook of Behavioral Neuroscience) Vol. 32 (eds Cragg, S. J. & Walton, M.) Ch. 24 (Academic Press, 2025).

  25. Buhusi, C. V. & Meck, W. H. What makes us tick? Functional and neural mechanisms of interval timing. Nat. Rev. Neurosci. 6, 755–765 (2005).

  26. Tsao, A., Yousefzadeh, S. A., Meck, W. H., Moser, M.-B. & Moser, E. I. The neural bases for timing of durations. Nat. Rev. Neurosci. 23, 646–665 (2022).

  27. Fiorillo, C. D., Newsome, W. T. & Schultz, W. The temporal precision of reward prediction in dopamine neurons. Nat. Neurosci. 11, 966–973 (2008).

  28. Mello, G. B. M., Soares, S. & Paton, J. J. A scalable population code for time in the striatum. Curr. Biol. 25, 1113–1122 (2015).

  29. Soares, S., Atallah, B. V. & Paton, J. J. Midbrain dopamine neurons control judgment of time. Science 354, 1273–1277 (2016).

  30. Enomoto, K., Matsumoto, N., Inokawa, H., Kimura, M. & Yamada, H. Topographic distinction in long-term value signals between presumed dopamine neurons and presumed striatal projection neurons in behaving monkeys. Sci. Rep. 10, 8912 (2020).

  31. Mohebi, A., Wei, W., Pelattini, L., Kim, K. & Berke, J. D. Dopamine transients follow a striatal gradient of reward time horizons. Nat. Neurosci. 27, 737–746 (2024).

  32. Kiebel, S. J., Daunizeau, J. & Friston, K. J. A hierarchy of time-scales and the brain. PLoS Comput. Biol. 4, e1000209 (2008).

  33. Kurth-Nelson, Z. & Redish, A. D. Temporal-difference reinforcement learning with distributed representations. PLoS ONE 4, e7362 (2009).

  34. Shankar, K. H. & Howard, M. W. A scale-invariant internal representation of time. Neural Comput. 24, 134–193 (2012).

  35. Tanaka, C. S. et al. Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nat. Neurosci. 7, 887–893 (2004).

  36. Sherstan, C., Dohare, S., MacGlashan, J., Günther, J. & Pilarski, P. M. Gamma-Nets: generalizing value estimation over timescale. In Proceedings of the AAAI Conference on Artificial Intelligence 34, 5717–5725 (2020).

  37. Momennejad, I. & Howard, M. W. Predicting the future with multi-scale successor representations. Preprint at bioRxiv https://doi.org/10.1101/449470 (2018).

  38. Reinke, C., Uchibe, E. & Doya, K. Average reward optimization with multiple discounting reinforcement learners. In Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science (eds Liu, D. et al.) 789–800 (Springer, 2017).

  39. Kobayashi, S. & Schultz, W. Influence of reward delays on responses of dopamine neurons. J. Neurosci. 28, 7837–7846 (2008).

  40. Schultz, W. Dopamine reward prediction-error signalling: a two-component response. Nat. Rev. Neurosci. 17, 183–195 (2016).

  41. Howe, M. W., Tierney, P. L., Sandberg, S. G., Phillips, P. E. M. & Graybiel, A. M. Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature 500, 575–579 (2013).

  42. Berke, J. D. What does dopamine mean? Nat. Neurosci. 21, 787–793 (2018).

  43. Gershman, S. J. Dopamine ramps are a consequence of reward prediction errors. Neural Comput. 26, 467–471 (2014).

  44. Kim, H. G. R. et al. A unified framework for dopamine signals across timescales. Cell 183, 1600–1616.e25 (2020).

  45. Mikhael, J. G., Kim, H. R., Uchida, N. & Gershman, S. J. The role of state uncertainty in the dynamics of dopamine. Curr. Biol. 32, 1077–1087.e9 (2022).

  46. Guru, A. et al. Ramping activity in midbrain dopamine neurons signifies the use of a cognitive map. Preprint at bioRxiv https://doi.org/10.1101/2020.05.21.108886 (2020).

  47. Doya, K. Reinforcement learning in continuous time and space. Neural Comput. 12, 219–245 (2000).

  48. Lee, R. S., Sagiv, Y., Engelhard, B., Witten, I. B. & Daw, N. D. A feature-specific prediction error model explains dopaminergic heterogeneity. Nat. Neurosci. 27, 1574–1586 (2024).

  49. Cruz, B. F. et al. Action suppression reveals opponent parallel control via striatal circuits. Nature 607, 521–526 (2022).

  50. Millidge, B., Song, Y., Lak, A., Walton, M. E. & Bogacz, R. Reward bases: a simple mechanism for adaptive acquisition of multiple reward types. PLoS Comput. Biol. 20, e1012580 (2024).

  51. Engelhard, B. et al. Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons. Nature 570, 509–513 (2019).

  52. Eshel, N., Tian, J., Bukwich, M. & Uchida, N. Dopamine neurons share common response function for reward prediction error. Nat. Neurosci. 19, 479–486 (2016).

  53. Cox, J. & Witten, I. B. Striatal circuits for reward learning and decision-making. Nat. Rev. Neurosci. 20, 482–494 (2019).

  54. Collins, A. L. & Saunders, B. T. Heterogeneity in striatal dopamine circuits: form and function in dynamic reward seeking. J. Neurosci. Res. https://doi.org/10.1002/jnr.24587 (2020).

  55. Gershman, S. J. et al. Explaining dopamine through prediction errors and beyond. Nat. Neurosci. 27, 1645–1655 (2024).

  56. Watabe-Uchida, M. & Uchida, N. Multiple dopamine systems: weal and woe of dopamine. Cold Spring Harb. Symp. Quant. Biol. 83, 83–95 (2018).

  57. Xu, Z., van Hasselt, H. P. & Silver, D. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems Vol. 31 (Curran Associates, 2018).

  58. Yoshida, N., Uchibe, E. & Doya, K. Reinforcement learning with state-dependent discount factor. In 2013 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL) https://ieeexplore.ieee.org/document/6652533 (IEEE, 2013).

  59. Doya, K. Metalearning and neuromodulation. Neural Netw. 15, 495–506 (2002).

  60. Tanaka, S. C. et al. Serotonin differentially regulates short- and long-term prediction of rewards in the ventral and dorsal striatum. PLoS ONE https://doi.org/10.1371/journal.pone.0001333 (2007).

  61. Kvitsiani, D. et al. Distinct behavioural and network correlates of two interneuron types in prefrontal cortex. Nature 498, 363–366 (2013).

  62. Sutton, R. S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).

  63. Oppenheim, A. V., Willsky, A. S. & Nawab, S. H. Signals and Systems (Pearson, 1996).

  64. Dayan, P. Improving generalisation for temporal difference learning: the successor representation. Neural Comput. 5, 613–624 (1993).

  65. Gershman, S. J. The successor representation: its computational logic and neural substrates. J. Neurosci. https://doi.org/10.1523/JNEUROSCI.0151-18.2018 (2018).

  66. Amit, R., Meir, R. & Ciosek, K. Discount factor as a regularizer in reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning 269–278 (PMLR, 2020).

  67. Badia, A. P. et al. Agent57: outperforming the Atari human benchmark. In Proceedings of the 37th International Conference on Machine Learning 507–517 (PMLR, 2020).

  68. Reinke, C. Time adaptive reinforcement learning. In ICLR 2020 Workshop: Beyond Tabula Rasa RL. Preprint at https://doi.org/10.48550/arXiv.2004.08600 (2020).

  69. Gershman, S. J. & Uchida, N. Believing in dopamine. Nat. Rev. Neurosci. 20, 703–714 (2019).

  70. Leone, F. C., Nelson, L. S. & Nottingham, R. B. The folded normal distribution. Technometrics 3, 543–550 (1961).

  71. Lindsey, J. & Litwin-Kumar, A. Action-modulated midbrain dopamine activity arises from distributed control policies. Adv. Neural Inf. Process. Syst. 35, 5535–5548 (2022).

  72. Masset, P. et al. Data and code for ‘Multi-timescale reinforcement learning in the brain’, V1. Mendeley Data https://doi.org/10.17632/tc43t3s7c5.1 (2025).

Acknowledgements

We thank S. J. Gershman and J. Mikhael for their contributions to the preceding studies; M. Watabe-Uchida for advice on task design; members of the Uchida and Pouget laboratories, including A. Lowet and M. Burrell, for discussions and comments; and W. Carvalho, G. Reddy and T. Ott for their comments on the manuscript. This work is supported by US National Institutes of Health (NIH) BRAIN Initiative grants (R01NS226753 and U19NS113201), NIH grant 5R01DC017311 to N.U., and a grant from the Swiss National Science Foundation (315230_197296) to A.P. This research was carried out in part thanks to funding from the Canada First Research Excellence Fund, awarded to P.M. through the Healthy Brains, Healthy Lives initiative at McGill University.

Author information

Authors and Affiliations

Authors

Contributions

P.M., P.T., A.P. and N.U. conceived the project. P.M., H.R.K., A.N.M. and N.U. designed the electrophysiology experiments. A.N.M. and H.R.K. performed the electrophysiology experiments and curated the data. P.T. performed the simulations with artificial agents. P.M. performed the analysis of electrophysiological data. P.M., P.T., A.P. and N.U. wrote the paper with input from H.R.K.

Corresponding authors

Correspondence to Paul Masset, Alexandre Pouget or Naoshige Uchida.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Kenji Doya and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Decoding simulations for multi-timescale vs. single-timescale agents.

(a-c). Experiment corresponding to Task 1 in Fig. 1. a, MDP with reward R at time tR. b, Diagram of the decoding experiment. In each episode, the reward magnitude and time are randomly sampled from discrete uniform distributions, which defines the MDP in a. Values are learned until near convergence using TD-learning. Values with different discount factors are learned independently. The learned values for the cue (s) are fed into a non-linear decoder which learns, across MDPs, to report the reward time. c, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (d-f). Experiment corresponding to Task 2 in Fig. 1. d, MDP with reward R at time tR. e, Diagram of the decoding experiment. In each episode, the reward magnitude and time are randomly sampled from discrete uniform distributions, which defines the MDP in d. Values are learned until near convergence using TD-learning. Values with different discount factors are learned independently. The learned values for the cue (s) are fed into a non-linear decoder which learns, across MDPs, to report the hyperbolic value of the cue. f, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (g-i). Experiment corresponding to Task 3 in Fig. 1. g, MDP with reward equal to 1 at time tR. h, Diagram of the decoding experiment. In each episode, the reward time and the number of TD iterations (N) are sampled from discrete uniform distributions. Values are learned by performing N TD-learning backups on the MDP. Values with different discount factors are learned independently. The learned values for the cue (s) are fed into a non-linear decoder which learns, across MDPs, to report the reward time. i, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (j-l). Experiment corresponding to Task 4 in Fig. 1. j, MDP with reward equal to 1 at time tR, and a noisy reward added to every state. k, Diagram of the decoding experiment. In each episode, the reward time is sampled from a discrete uniform distribution. Values are learned by performing a single iteration of TD-learning backwards through the MDP. Values with different discount factors are learned independently. The learned values for the cue (s) are fed into a non-linear decoder which learns, across MDPs, to report the true value of the cue after experiencing the trajectory an infinite number of times (that is, ignoring the random rewards). l, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (m-o). Experiment with two rewards. m, MDP with two rewards of magnitude R1 and R2 at times tR1 < tR2. n, Value estimates Vγ(s) are fed into a non-linear decoder which learns, across MDPs, to report both reward times. o, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (p-r). Experiment to determine the shortest branch when learning from incomplete information. p, MDP with two possible trajectories. In this example, the upwards trajectory is longer than the downwards trajectory. q, Diagram of the decoding experiment. In each episode, the length of the two branches D and the number of times that TD-backups are performed for each branch are randomly sampled from uniform discrete distributions. Then, TD-backups are performed for the two branches the corresponding number of times. After this, the learned values are fed into a decoder which is trained, across episodes, to report the shorter branch. r, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (s-w). Temporal estimates are available before convergence for multi-timescale agents. s, Two experiments, one with a short wait between the cue and reward (pink), and one with a longer wait (cyan). t, The identity of the cue with the higher value for a single-timescale agent (here γ = 0.9) depends on the number of times that the experiments have been experienced. When the longer trajectory has been experienced significantly more often than the short one, the single-timescale agent can incorrectly believe that it has a larger value. u, For a multi-timescale agent, the pattern of values learned across discount factors is only affected by a multiplicative factor that depends on the learning rate, the prior values and the asymmetric learning experience. The pattern therefore contains unique information about outcome time. v,w, When plotted as a function of the number of times that trajectories are experienced, the pattern of values across discount factors is only affected by a multiplicative factor. In other words, for the pink cue, the larger discount factors are closer together than they are to the smaller discount factor, and the opposite for the cyan cue. This pattern is maintained at every point along the x-axis, and is therefore independent of the asymmetric experience, enabling a downstream system to decode reward timing. Error bars are the standard deviations (s.d.) across 100 test episodes and 3 trained policy gradient (PG) networks.
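
A minimal sketch of the decoding idea behind panels a-c, under simplifying assumptions (tabular TD(0) on a chain MDP, two discount factors and a closed-form decode instead of a trained non-linear decoder): because the learned cue value is approximately R·γ^tR, the pattern of values across discount factors identifies the reward time independently of the reward magnitude.

```python
import numpy as np

def td_cue_values(t_R, R, gammas, alpha=0.1, episodes=3000):
    """Tabular TD(0) on a chain: cue at state 0, reward R received t_R steps later."""
    V = np.zeros((len(gammas), t_R + 2))        # states 0..t_R plus an absorbing terminal state
    for _ in range(episodes):
        for s in range(t_R + 1):
            r = R if s == t_R else 0.0          # reward on reaching state t_R
            for i, g in enumerate(gammas):
                V[i, s] += alpha * (r + g * V[i, s + 1] - V[i, s])
    return V[:, 0]                              # learned cue values, ~= R * g ** t_R

gammas = np.array([0.7, 0.9])
v = td_cue_values(t_R=6, R=2.5, gammas=gammas)

# The ratio of cue values across discount factors recovers the reward time regardless of R.
t_hat = np.log(v[0] / v[1]) / np.log(gammas[0] / gammas[1])
print(v, round(float(t_hat), 2))                # t_hat is close to 6
```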

Extended Data Fig. 2 The myopic learning bias.

a, Illustration of the myopic learning bias. Consider a scenario in which the upwards and downwards paths from s and s’ are experienced only once, such that at least one path in the small bifurcations is left unexplored. In state s’ (blue) the far future is more uncertain than the near future, and in state s (red) the near future is more uncertain than the far future. b, When experiencing the upwards and downwards paths only once, there are 4 possible scenarios depending on which path of the corresponding bifurcations is visited. When learning from limited information, the myopic (low γ) and far-sighted (high γ) estimates would make different decisions depending on whether the VUP estimate is larger, smaller or approximately equal to VDOWN (VUP ≈ VDOWN occurs when both estimates have similar magnitudes and some small variations in the precise magnitude of rewards, the prior values or the learning parameters could change whether VUP < VDOWN or VUP > VDOWN). In both s and s’, the correct decision is to follow the upwards path (the ‘correct’ decision is the decision made by a hypothetical RL agent that experiences all possible trajectories an infinite number of times). Below we show the approximate probability that the agent chooses the correct path if it follows the myopic estimates (low γ) or the far-sighted estimates (high γ). Illustration of a mouse in panel a and silhouette of a raptor in panels a,b were adapted from the NIAID NIH BIOART Source. Illustration of a block of cheese in panels a,b was adapted from SVG Repo under a CC0 1.0 licence. c, Task structure to evaluate the myopic learning bias when uncertainty arises due to stochastic rewards. The three dots collapse 5 transitions between black states. Black states give a small stochastic reward and orange states give a large deterministic reward. d, Accuracy at selecting the branch with the large deterministic reward under incomplete learning conditions. At state s (orange), agents with larger discount factors (far-sighted) are more accurate. At state s’ (blue), agents with a small discount factor (myopic) are more accurate. Error bars are half s.d. across 10,000 episodes; maxima are highlighted with stars. e, Mean performance in this task by the agent in Fig. 1d (see main text and Methods). f, Maze to highlight the myopic learning bias in cases where uncertainty arises due to incomplete exploration of the state space. Rewards are indicated with water and fire. An example trajectory is shown with transparent arrows. The red and blue bars to the right denote the states in the Lower and Upper half. g, True (grey) and estimated (green and brown) values for the example trajectory shown on top and in panel f. On the x-axis we highlight the starting timestep with s, the timestep when the fire is reached and the timestep when the water is reached. Image of fire in panels f,g was created by dstore via SVG Repo under a CC0 1.0 licence. Images of water droplet in panels f,g were adapted from SVG Repo under a CC0 1.0 licence. h, Accuracy (y-axis) is measured as the Kendall tau coefficient between the estimate with a specific gamma (x-axis) and the true value function Vγ = 0.99. Error bars are deviations across 300 sets of sampled trajectories. The red (blue) curve shows average accuracy for the states on the upper (lower) half of the maze, indicated with colour lines in panel f. i, As the sampled number of trajectories increases, the myopic learning bias disappears. j, Architecture that learns about multiple timescales as auxiliary tasks. k, States are separated according to the agent being close to the goal (blue) or far from the goal (orange). Images in panels j,k were adapted from Farama Foundation under an MIT licence. l, Accuracy of the Q-values in the Lunar Lander environment as a function of their discount factor, estimated as the fraction of concordant state pairs between the empirical value function and the discount-specific Q-value estimated by the network. Error bars are s.e.m. across 10 trained networks; maxima are highlighted with stars. See Methods for details.
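
A self-contained toy simulation of the myopic learning bias (an assumed MDP, not the task in panel c): when two branches are each evaluated from a single sampled trajectory, a large discount factor picks the correct branch more reliably when the decision-relevant reward is distant and deterministic while nearby rewards are noisy, and a small discount factor does better when the relevant reward is nearby and the noise lies in the far future.

```python
import numpy as np

rng = np.random.default_rng(1)

def pick_accuracy(gamma, noisy_first, episodes=10_000, horizon=6):
    """Fraction of episodes in which the branch with the larger deterministic
    reward (+2 vs +1) also has the larger single-sample discounted return."""
    correct = 0
    for _ in range(episodes):
        returns = []
        for big in (2.0, 1.0):                       # branch A (+2) and branch B (+1)
            r = rng.normal(0.0, 1.0, size=horizon)   # small stochastic rewards along the branch
            if noisy_first:
                r[-1] += big                          # deterministic reward lies in the far future
            else:
                r = np.concatenate(([big], r[:-1]))   # deterministic reward is immediate
            returns.append(np.sum(gamma ** np.arange(horizon) * r))
        correct += returns[0] > returns[1]
    return correct / episodes

for gamma in (0.5, 0.99):
    print(gamma,
          pick_accuracy(gamma, noisy_first=True),     # s-like case: far-sighted estimates win
          pick_accuracy(gamma, noisy_first=False))    # s'-like case: myopic estimates win
```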

Extended Data Fig. 3 Single neuron responses and robustness of fit in the cued delay task.

a, PSTHs of the responses of single selected neurons (n = 50) to the cues predicting a reward delay of 0.6 s, 1.5 s, 3.75 s and 9.375 s (from top to bottom). Neurons are sorted by the inferred value of the discount factor γ. Neural responses are normalized by z-scoring each neuron across its activity to all 4 conditions. b, PSTHs of the responses of single non-selected neurons (n = 23) to the cues predicting a reward delay of 0.6 s, 1.5 s, 3.75 s and 9.375 s (from top to bottom). Neurons are sorted by the inferred value of the discount factor γ. Neural responses are normalized by z-scoring each neuron across its activity to all 4 conditions. c, Variance explained for training vs testing data for the exponential model. For each bootstrap, the variance explained was computed on both the half of the trials used for fitting (train) and the other half of the trials (test). Neurons (n = 13) with a negative variance explained on the test data are excluded from the decoding analysis (grey dots). d, Same as panel c but for the fits of the hyperbolic model. e, Mean goodness of fit on held-out data across 100 bootstraps for each selected neuron for the exponential and hyperbolic models. The data lie above the diagonal line, suggesting a better fit from the exponential model, as shown in Fig. 2f. Error bars indicate 95% confidence intervals estimated by bootstrap; see Methods. f, The values of the inferred parameters in the exponential model are robust across bootstraps. Top row, inferred value of the parameters across two halves of the trials (single bootstrap) for the gain α, baseline b and discount factor γ, respectively. Bottom row, distribution across n = 100 bootstraps of the Pearson correlations across neurons between the inferred parameter values in the two halves of the trials. The reported mean is the mean correlation across bootstraps and the reported P value is the highest P value across bootstraps for a given parameter, assessed via Student’s t-test. Distribution of correlations for the gain α (mean = 0.84, P < 1 × 10−20), baseline b (v, mean = 0.9, P < 1.0 × 10−32) and discount factor γ (vi, mean = 0.93, P < 1.0 × 10−46). g, Same as panel f (lower row) but for the hyperbolic model, with distribution of correlations for the gain α (mean = 0.86, P < 1 × 10−26), baseline b (v, mean = 0.88, P < 1.0 × 10−28) and shape parameter k (vi, mean = 0.76, P < 1.0 × 10−11). h, Same as panel f (lower row) but for responses simulated from the exponential model, with distribution of correlations for the gain α (mean = 0.86, P < 1.0 × 10−10), baseline b (v, mean = 0.88, P < 1.0 × 10−24) and discount factor γ (vi, mean = 0.76, P < 1.0 × 10−26). Note that the distributions of inferred parameters are in a similar range to the fits to the data, suggesting that trial numbers constrain the accuracy of parameter estimation. i, Similar to panel f (lower row) but for the neurons recorded in mouse m3044, showing that across bootstraps the estimates are consistent for the gain α (mean = 0.64, P < 0.05 for 88/100 bootstraps, light blue P > 0.05, dark blue P < 0.05), baseline b (mean = 0.86, P < 0.012) and discount factor γ (mean = 0.72, P < 0.05 for 97/100 bootstraps, light blue P > 0.05, dark blue P < 0.05). j, Same as panel f (lower row) but for the neurons recorded in mouse m3054. The estimates are consistent for the gain α (mean = 0.90, P < 0.0048), baseline b (mean = 0.93, P < 1.0 × 10−5) and discount factor γ (mean = 0.79, P < 0.0069).
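
A sketch of the single-neuron model comparison, assuming the functional forms implied by the legend (exponential: α·γ^delay + b; hyperbolic: α/(1 + k·delay) + b) and using made-up responses for one neuron; the actual fitting and bootstrap procedure is described in the Methods.

```python
import numpy as np
from scipy.optimize import curve_fit

delays = np.array([0.6, 1.5, 3.75, 9.375])           # the four cued reward delays (s)

def exponential_model(d, alpha, b, gamma):
    return alpha * gamma ** d + b

def hyperbolic_model(d, alpha, b, k):
    return alpha / (1.0 + k * d) + b

# Example cue responses for one neuron (illustrative numbers only).
responses = np.array([8.1, 6.0, 3.9, 2.2])

p_exp, _ = curve_fit(exponential_model, delays, responses,
                     p0=[5.0, 1.0, 0.8], bounds=([0, -10, 0], [50, 10, 1]))
p_hyp, _ = curve_fit(hyperbolic_model, delays, responses,
                     p0=[5.0, 1.0, 0.5], bounds=([0, -10, 0], [50, 10, 10]))

def variance_explained(model, params):
    pred = model(delays, *params)
    return 1.0 - np.sum((responses - pred) ** 2) / np.sum((responses - responses.mean()) ** 2)

print(p_exp, variance_explained(exponential_model, p_exp))
print(p_hyp, variance_explained(hyperbolic_model, p_hyp))
```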

Extended Data Fig. 4 Decoding reward timing using the regularized pseudo-inverse of the discount matrix.

(a-c), Singular value decomposition (SVD) of the discount matrix. a, Left singular vectors (in the neuron space). b, Singular values. The black line at 2 indicates the value of the regularization term α. c, Right singular vectors (in the time space). d, Decoding matrix based on the regularized pseudo-inverse. e, Distribution of 1-Wasserstein distances between the reward timing and the predicted reward timing from the decoding on the test data from the exponential fits (shown in Fig. 2k, top row) and on the average exponential model (shown in Fig. 2k, bottom row). Decoding is better for the exponential model from Fig. 2 than for the average exponential model except for the shortest delay (P(t = 0.6 s) = 1, P(t = 1.5 s) < 1.0 × 10−31, P(t = 3.75 s) = 0.0135, P(t = 9.375 s) < 1.0 × 10−14; one-tailed Wilcoxon signed-rank test, see Methods). f, The ability to decode the timing of expected future reward is not due to a general property of the discounting matrix and collapses if we randomize the identity of the cue responses (see Methods). g, Distribution of 1-Wasserstein distances between the reward timing and the predicted reward timing from the decoding on the test-data exponential fits (shown in Fig. 2k, top row) and on the shuffled data (shown in panel f). The predictions from the test data are better (smaller 1-Wasserstein distance) than those from the shuffled data (P = 1.2 × 10−4 for the 0.6 s reward delay, P < 1.0 × 10−20 for the other delays; one-tailed Wilcoxon signed-rank test, see Methods).
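
A sketch of the decoding step under stated assumptions (a synthetic discount matrix and a Tikhonov-style shrinkage of the singular values; the exact regularization used in the paper may differ): the cue responses, which form an approximate Laplace transform of the reward-time distribution, are mapped back to a temporal profile through a regularized pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(2)

n_neurons, n_time = 60, 100
t = np.arange(n_time) * 0.1                        # time grid (s)
gammas = rng.uniform(0.55, 0.99, size=n_neurons)   # assumed per-neuron discount factors
F = gammas[:, None] ** t[None, :]                  # discount matrix (neurons x time)

# Regularized pseudo-inverse via SVD; alpha = 2 follows the legend, but the
# shrinkage form s / (s^2 + alpha^2) is an assumption made here for illustration.
alpha = 2.0
U, s, Vt = np.linalg.svd(F, full_matrices=False)
D = Vt.T @ np.diag(s / (s ** 2 + alpha ** 2)) @ U.T    # (time x neurons) decoding matrix

def decode_centre(delay):
    """Centre of mass of the decoded temporal profile for a cued delay."""
    responses = gammas ** delay + 0.02 * rng.normal(size=n_neurons)   # noisy cue values
    profile = np.clip(D @ responses, 0.0, None)
    return np.sum(t * profile) / np.sum(profile)

# Shorter cued delays should decode to earlier-centred profiles than longer ones.
print(decode_centre(1.5), decode_centre(7.0))
```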

Extended Data Fig. 5 Comparing behavioral and neural discounting and decoding reward timing in single animals.

(a-c), Mouse 3044. a, Left panel, normalized lick responses to the cues predicting reward delays across the population. For each neuron, the response was normalized to the highest response across the 4 possible delays. Neurons are sorted by the inferred behavioral discount factor. Right panel, normalized neural responses to the cues predicting reward delays across the population (sorted by the behavioral discount factor). b, The behavioral and neural discount factors are not correlated (r = −0.29, P = 0.27, Spearman’s rank correlation, two-tailed Student’s t-test). c, Discount matrix for the neurons recorded in mouse 3044. This is the matrix used for decoding in panel g, top row. (d-f), Same as panels (a-c) for mouse 3054. e, The behavioral and neural discount factors are not correlated in mouse 3054 (r = −0.029, P = 0.9, Spearman’s rank correlation, two-tailed Student’s t-test). f, Discount matrix for the neurons recorded in mouse 3054. This is the matrix used for decoding in panel g, bottom row. g, Decoding of reward timing at the single-animal level for mouse 3044 (top row) and mouse 3054 (bottom row). The decoding is present but slightly less accurate, as expected from the smaller number of neurons. h, Discount factor inferred for neurons in mouse m3044 when dividing trials between low and high anticipatory lick rate. Left panel, scatter plot of the values across neurons. Right panel, the distribution across neurons of differences in inferred discount across the two conditions is not significantly different from zero (mean = −0.0024, P = 0.96, two-tailed Student’s t-test). i, Same as panel h for mouse m3054. The difference in inferred value between low and high lick rate is significant (mean = 0.086, P = 0.0062, two-tailed Student’s t-test) but the mean effect is small compared with the standard deviation of inferred discount factors across neurons (s.d. = 0.19 for neurons in m3054).

Extended Data Fig. 6 Decoding reward timing from the hyperbolic model and exponential model simulations.

a, Distribution of the inferred discount parameter k across the neurons. b, Correlation between the discount factor γ inferred in the exponential model and the discount parameter k from the hyperbolic model (r = −0.9, P < 1.0 × 10−30, Student’s t-test). Note that in the hyperbolic model a larger value of k implies faster discounting, hence the negative correlation. c, Discount matrix for the hyperbolic model. For each neuron we plot the relative value of future events given its inferred discount parameter. Neurons are sorted by decreasing estimated value of the discount parameter. d, Decoded subjective expected timing of future reward E(r|t) using the discount matrix from the hyperbolic model (see Methods). e, Distribution of 1-Wasserstein distances between the reward timing and the predicted reward timing from the decoding on the test data with the exponential model (shown in Fig. 2k, top row) and on the test data with the hyperbolic model (shown in d). Decoding is better for the exponential model from Fig. 2 than for the hyperbolic model except for the shortest delay (P(t = 0.6 s) = 1, P(t = 1.5 s) < 1.0 × 10−31, P(t = 3.75 s) < 1.0 × 10−33, P(t = 9.375 s) < 1.0 × 10−3; one-tailed Wilcoxon signed-rank test, see Methods). f, Decoded subjective expected timing of future reward E(r|t) using simulated data based on the parameters of the exponential model (see Methods). g, Distribution of 1-Wasserstein distances between the reward timing and the predicted reward timing from the decoding on the test data from the exponential fits (shown in Fig. 2k, top row) and on the simulated data from the parameters of the exponential fits (shown in f). Decoding is marginally better for the data predictions (P(t = 0.6 s) = 0.002, P(t = 1.5 s) = 0.999, P(t = 3.75 s) < 1 × 10−12, P(t = 9.375 s) = 0.027; one-tailed Wilcoxon signed-rank test, see Methods), suggesting that decoding accuracy is limited by the number of trials.
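
A sketch of the comparison metric used here and in Extended Data Fig. 4, with made-up decoded profiles standing in for the bootstrap decodings: each decoded profile is treated as a distribution over time, its 1-Wasserstein distance to the true reward time is computed, and the per-bootstrap distances of two models are compared with a one-tailed Wilcoxon signed-rank test.

```python
import numpy as np
from scipy.stats import wasserstein_distance, wilcoxon

rng = np.random.default_rng(3)
t = np.arange(0.0, 10.0, 0.1)
true_delay = 3.75

def wdist_to_delay(profile, delay):
    """1-Wasserstein distance between a decoded profile (weights over the time grid)
    and a point mass at the true reward delay."""
    w = np.clip(profile, 0.0, None)
    return wasserstein_distance(t, [delay], u_weights=w, v_weights=[1.0])

# Made-up decoded profiles for two models across 100 bootstraps (illustration only):
# model A is sharper and better centred than model B.
model_a = [np.exp(-0.5 * ((t - true_delay - rng.normal(0, 0.3)) / 1.0) ** 2) for _ in range(100)]
model_b = [np.exp(-0.5 * ((t - true_delay - rng.normal(0, 0.8)) / 1.5) ** 2) for _ in range(100)]

w_a = [wdist_to_delay(p, true_delay) for p in model_a]
w_b = [wdist_to_delay(p, true_delay) for p in model_b]

# Paired, one-tailed comparison of the per-bootstrap distances.
print(np.mean(w_a), np.mean(w_b))
print(wilcoxon(w_a, w_b, alternative='less'))
```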

Extended Data Fig. 7 Ramping, discounting, anatomy and distributed RL models.

a, Ramping in the prediction error signal is controlled by the relative contribution of value increases and discounting. If the value increase (middle) exactly matches the discounting, there is no prediction error (middle equation, right). If the discounting is smaller than the value increase (large discount factor), there is a positive TD error (top equation, right). If the discounting is larger than the value increase (small discount factor), there is a negative TD error (bottom equation, right). A single-timescale agent with no state uncertainty will learn an exponential value function, but if there is state uncertainty (see ref. 69) or the global value function arises from combining the contributions of single-timescale agents, then the value function is likely to be non-exponential. Image of a magnifying glass was created by googlefonts via SVG Repo under an Apache licence. b, Intuition for the diversity of ramping with a hyperbolic value function. Agents with a small discount factor exhibit a monotonic downward ramp (pink), while those with a large discount factor exhibit a monotonic upward ramp (brown). Agents with an intermediate discount factor tend to exhibit a downward then upward ramp. The hyperbolic value function gets increasingly convex as the reward approaches, so an increasing fraction of the agents have a positive prediction error as they approach the reward. c, The discount factor inferred in the VR task is not correlated with the medio-lateral (ML) position of the implant (Pearson’s r = 0.015, P = 0.89, two-tailed Student’s t-test). d, The baseline parameter inferred in the VR task is not correlated with the medio-lateral (ML) position of the implant (Pearson’s r = −0.011, P = 0.92, two-tailed Student’s t-test). e, The inferred gain in the VR task decreases with increasing medio-lateral (ML) position, but the effect does not reach significance (Pearson’s r = −0.19, P = 0.069, two-tailed Student’s t-test). In panels c-e, the line corresponds to the best-fit linear regression and the uncertainty shading represents the 95% confidence interval of the linear regression fit. f-h, Ramping in the reward prediction error with mixing in distributed RL models. Inferred value functions (V) and RPEs (δ) for the mixed RL model as a function of the common value function sharing parameter λ, in a linear MDP of 30 steps (x-axis in the plots) with a deterministic reward equal to 1 in the last step and 0 everywhere else. Plots are shown after learning has empirically stabilized (after 3,000 TD-learning iterations with a learning rate of 0.1). The dashed value function is the exponential value function without common value sharing (λ = 0), which would lead to a flat RPE equal to 0 at every state. The actual value functions (solid lines) are not purely exponential, and thus lead to ramping RPEs. f, Circuit model in which each value estimate and its corresponding prediction error are part of completely independent loops (λ = 0). At convergence, there is no more prediction error in the reward anticipation period (bottom row). g, Circuit model in which the prediction error for each dopamine neuron is influenced by both the independent value signal and a shared common value signal (C, λ = 0.1). The dashed line indicates the value function corresponding to completely separate loops, and the solid line the actual value function due to the influence of the common value signal. The difference between them leads to ramping in the reward prediction error signals (bottom row). h, Circuit model with a strong influence of the common value signal (λ = 1), which also leads to ramping in the reward prediction error signals. See Methods for details. i, Decoded reward times for 4 experimental conditions with rewards at times 5, 10, 15 and 30 (pink to cyan), obtained by applying a regularized inverse Laplace decoder (analogous to the one used in Fig. 2 of the main text) to the values at the moment of the cue, under the model without mixing (λ = 0). j, Same as (i) but using a mixing factor of λ = 0.1. The small mixing factor does not affect the quality of the temporal decoding, while creating a ramping reward prediction error (panel g). Therefore, a small mixing factor constitutes a common model that can qualitatively account for the two tasks studied in the paper. k, Same as (i) but using a mixing factor of λ = 1. With a fully shared value function the relative differences between discount factors disappear, so temporal decoding is no longer possible.
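
A minimal sketch of the intuition in panels a,b, using an assumed hyperbolic value profile rather than the fitted circuit models: when a neuron with discount factor γ computes a TD error of the form γV(t+1) − V(t) against a value function that is not exponential in its own γ, the pre-reward error ramps downwards for small γ, upwards for large γ, and down-then-up in between.

```python
import numpy as np

T = 30                                          # linear track: reward of 1 after the last step
t = np.arange(T)
time_to_reward = (T - 1) - t
V_shared = 1.0 / (1.0 + 0.2 * time_to_reward)   # assumed hyperbolic-like shared value profile

for gamma in (0.6, 0.9, 0.99):
    # Pre-reward TD error, bootstrapping from the shared (non-exponential) value function.
    delta = gamma * V_shared[1:] - V_shared[:-1]
    print(gamma, delta[:3].round(4), delta[-3:].round(4))
# gamma = 0.6: delta is negative and grows more negative toward the reward (downward ramp).
# gamma = 0.99: delta is positive and grows toward the reward (upward ramp).
# gamma = 0.9: delta starts negative and turns positive near the reward (down-then-up).
```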

Extended Data Fig. 8 Behavioral and neural discounting at the single-animal level.

a, Time course of the lick rate in the VR task as mice approach the reward location. Gray lines, lick rate for individual neurons; blue line, mean lick rate. The three black lines on top indicate the three windows used to compute the early, middle and late lick rates in the analysis presented in panels c-d. b, The inferred discount factor and the slope in spiking activity (see Fig. 3b) are strongly correlated (r = 0.81, P = 0, Spearman rank correlation, two-tailed Student’s t-test), suggesting that the slope is a good measure of discounting. c, Correlations of measures of behavioral and neural discounting for mouse m3044 (Spearman rank correlation, two-tailed Student’s t-test). i-iii, The neural discount factor and the ramp in licking activity are not correlated, irrespective of the windows used to compute the ramp in licking activity: i, modulation between the late and early windows, r = 0.09, P = 0.64. ii, modulation between the late and middle windows, r = −0.05, P = 0.79. iii, modulation between the early and middle windows, r = 0.17, P = 0.37. iv-vi, The measures of ramping in licking activity are strongly correlated with each other: iv, correlation between the late-middle and late-early modulation measures, r = 0.65, P = 1.3 × 10−4. v, correlation between the middle-early and late-early modulation measures, r = 0.96, P = 7 × 10−17. vi, correlation between the middle-early and late-middle modulation measures, r = 0.53, P = 0.0028. d, Correlations of measures of behavioral and neural discounting for mouse m3054 (Spearman rank correlation, two-tailed Student’s t-test). i-iii, The neural discount factor and the ramp in licking activity are not correlated, irrespective of the windows used to compute the ramp in licking activity: i, modulation between the late and early windows, r = −0.1, P = 0.62. ii, modulation between the late and middle windows, r = 0.032, P = 0.88. iii, modulation between the early and middle windows, r = −0.11, P = 0.61. iv-vi, The measures of ramping in licking activity are strongly correlated with each other: iv, correlation between the late-middle and late-early modulation measures, r = 0.82, P = 6.7 × 10−7. v, correlation between the middle-early and late-early modulation measures, r = 0.94, P = 4.7 × 10−12. vi, correlation between the middle-early and late-middle modulation measures, r = 0.66, P = 4.2 × 10−4. e, The VR model fit (right panel) to m3044 neurons alone captures the diversity of ramping activity observed across single neurons (left panel). f, Inferred value function for m3044. Thin gray lines, individual bootstrap fits; blue line, mean value fit. g, The VR model fit (right panel) to m3054 neurons alone captures the diversity of ramping activity observed across single neurons (left panel). h, Inferred value function for m3054. Thin gray lines, individual bootstrap fits; blue line, mean value fit. i, The inferred parameter values from the fit for m3044 and the full population fit are strongly correlated (Spearman rank correlation, two-tailed Student’s t-test) for the gain parameter (left panel, r = 0.78, P = 2.1 × 10−6), the baseline parameter (middle panel, r = 0.75, P = 5.7 × 10−6) and the discount factor (right panel, r = 0.75, P = 6.8 × 10−6). j, The inferred parameter values from the fit for m3054 and the full population fit are strongly correlated (Spearman rank correlation, two-tailed Student’s t-test) for the gain parameter (left panel, r = 0.97, P = 1.3 × 10−6), the baseline parameter (middle panel, r = 0.98, P = 1.1 × 10−6) and the discount factor (right panel, r = 0.88, P = 2.6 × 10−6). All reported correlations are Spearman rank correlations.

Extended Data Fig. 9 Discounting heterogeneity explains ramping diversity in a common reward expectation model.

a, Uncertainty in reward timing reduces as mice approach the reward zone. Not only does the mean expected reward time decrease, but the standard deviation of the estimate also decreases. Distributions in the bottom row are from fitted data (see panels c-e). b, A model in which each neuron computes its individual value function but shares a common reward expectation predicts ramping heterogeneity across neurons. Left panel, as mice approach the reward, the uncertainty of reward timing, quantified by the standard deviation, decreases. Second panel from left, the expectation of reward timing takes the form of a folded normal distribution. As the mice approach the reward, both the mean and the standard deviation of the expected reward timing distribution decrease. Third panel from left, each neuron computes a distinct value function given its individual discount factor and the common expected reward timing distribution. Right panel, the diverse value functions across neurons lead to ramping heterogeneity across neurons in the reward prediction error (see Methods, ‘Common Reward Expectation model’). c, The inferred standard deviation of the reward expectation model decreases as a function of time to reward. The line indicates the mean inferred standard deviation and the shading indicates the standard error of the mean over 100 bootstraps. d, Expected timing of the reward as a function of true time to reward. As the mice approach the reward, not only does the mean expected time to reward decrease but the uncertainty of the reward timing, captured by the standard deviation shown in c, also decreases. This effect leads to increasingly convex value functions that produce the observed ramps in dopamine neuron activity. e, Value function for each individual neuron (same order as in h,i). f, Distribution of inferred discount factors under the common reward expectation model. g, Although the range of discount factors differs between the fits from the common value (x axis) and common reward expectation (y axis) models, the inferred discount factors are strongly correlated for single neurons (Spearman’s ρ = 0.93, P < 1.0 × 10−20, two-tailed Student’s t-test). h, Predicted ramping activity from the model fits under the common reward expectation model. i, Diversity of ramping activity across single neurons as mice approach the reward (aligned by the discount factor inferred in the common reward expectation model).
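
A sketch of the value computation in the common reward expectation model, with an assumed parameterization of how the mean and standard deviation of the time to reward shrink along the track (the fitted parameterization is in the Methods): each neuron's value is the expected discount E[γ_i^τ] under a folded normal distribution over the time to reward τ, and the resulting value functions differ in curvature across discount factors.

```python
import numpy as np
from scipy.stats import foldnorm

positions = np.linspace(0.0, 1.0, 50)        # track position, 1 = reward zone
mean_ttr = 10.0 * (1.0 - positions)          # assumed mean time to reward (s)
sd_ttr = 0.5 + 2.0 * (1.0 - positions)       # assumed uncertainty, shrinking near the reward
dt = mean_ttr[0] - mean_ttr[1]               # time elapsed per position step

tau = np.linspace(0.0, 40.0, 400)            # grid over possible times to reward
gammas = [0.7, 0.9, 0.98]

values = np.zeros((len(gammas), len(positions)))
for j, (m, s) in enumerate(zip(mean_ttr, sd_ttr)):
    # Folded normal over the time to reward; scipy's shape parameter is c = mu / sigma.
    w = foldnorm.pdf(tau, c=m / s, scale=s)
    w /= w.sum()
    for i, g in enumerate(gammas):
        values[i, j] = np.sum(g ** tau * w)  # V_i = E_tau[gamma_i ** tau]

# Pre-reward TD-like error, with the per-step discount matched to the step duration.
for i, g in enumerate(gammas):
    delta = (g ** dt) * values[i, 1:] - values[i, :-1]
    print(g, values[i, [0, 25, -1]].round(3), delta[[0, -1]].round(4))
# Values rise toward the reward for every gamma, but with gamma-dependent curvature,
# so the pre-reward errors differ across neurons -- the model's source of ramp diversity.
```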

Extended Data Fig. 10 Decoding reward timing in the cued delayed reward task using parameters inferred in the VR task and details of recordings.

a, Discount matrix computed using the parameters inferred in the VR task for neurons recorded across both tasks and used in the cross-task decoding. b, Dopamine neuron cue responses in the cued delay task. Neurons are aligned as in a according to increasing discount factor inferred in the VR task. c, Top row, decoded reward timing using discount factors inferred in the VR task. Bottom row, the ability to decode reward timing is lost when shuffling the identities of the cue responses. d, Except for the shortest delay, decoded reward timing is more accurate than the shuffle, as measured by the 1-Wasserstein distance (P(t = 0.6 s) = 1, P(t = 1.5 s) < 1.1 × 10−20, P(t = 3.75 s) < 3.8 × 10−20, P(t = 9.375 s) < 2.9 × 10−5). e, Breakdown of the number of recorded neurons per animal and task. The numbers in parentheses indicate the number of neurons included in the analysis; ± indicates the standard deviation across sessions. The maximum number of neurons recorded in a single session was 4 in both the cued delay task and the VR task.

Supplementary information

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Masset, P., Tano, P., Kim, H.R. et al. Multi-timescale reinforcement learning in the brain. Nature 642, 682–690 (2025). https://doi.org/10.1038/s41586-025-08929-9
