
Multi-timescale reinforcement learning in the brain

Abstract

To thrive in complex environments, animals and artificial agents must learn to act adaptively to maximize fitness and rewards. Such adaptive behaviour can be learned through reinforcement learning1, a class of algorithms that has been successful at training artificial agents2,3,4,5 and at characterizing the firing of dopaminergic neurons in the midbrain6,7,8. In classical reinforcement learning, agents discount future rewards exponentially according to a single timescale, known as the discount factor. Here we explore the presence of multiple timescales in biological reinforcement learning. We first show that reinforcement agents learning at a multitude of timescales possess distinct computational benefits. Next, we report that dopaminergic neurons in mice performing two behavioural tasks encode reward prediction error with a diversity of discount time constants. Our model explains the heterogeneity of temporal discounting in both cue-evoked transient responses and slower timescale fluctuations known as dopamine ramps. Crucially, the measured discount factor of individual neurons is correlated across the two tasks, suggesting that it is a cell-specific property. Together, our results provide a new paradigm for understanding functional heterogeneity in dopaminergic neurons and a mechanistic basis for the empirical observation that humans and animals use non-exponential discounts in many situations9,10,11,12, and open new avenues for the design of more-efficient reinforcement learning algorithms.
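
The contrast between exponential and non-exponential discounting can be made concrete with a minimal numerical sketch (illustrative parameters only, not taken from the paper): if the discount rates k of a population of exponential discounters follow an exponential prior with rate λ, the population-average weight given to a reward delayed by t is exactly hyperbolic, 1/(1 + t/λ). This is one standard way a multitude of exponential timescales connects to the non-exponential discounts cited above.

```python
# Illustrative sketch (assumed parameters): averaging exponential discounts across
# a population of timescales yields an approximately hyperbolic discount curve.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 101)                  # delay to reward (arbitrary units)

lam = 1.0                                        # rate of the assumed prior over discount rates k
k = rng.exponential(scale=1.0 / lam, size=10_000)

mean_exponential = np.exp(-np.outer(k, t)).mean(axis=0)   # population-average discount
hyperbolic = 1.0 / (1.0 + t / lam)                        # analytic expectation E[exp(-k t)]

print(np.max(np.abs(mean_exponential - hyperbolic)))      # small Monte-Carlo error
```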


Fig. 1: Computational advantages of multi-timescale RL.
Fig. 2: Dopaminergic neurons exhibit a diversity of discount factors that enables decoding of reward delays.
Fig. 3: The diversity of discount factors across dopaminergic neurons explains qualitatively different ramping activity.
Fig. 4: Discount factors of single dopaminergic neurons are correlated across behavioural contexts.

Data availability

The raw electrophysiological data can be found on DANDI Archive (https://dandiarchive.org/dandiset/000251). The curated electrophysiological data can be found at https://doi.org/10.17632/tc43t3s7c5.1 (see ref. 72).

Code availability

The code used for simulations can be found on GitHub (https://github.com/pablotano8/multi_timescale_RL). The data analysis code for the electrophysiological experiments can be found at https://doi.org/10.17632/tc43t3s7c5.1 (see ref. 72).

References

  1. Sutton, R. S. & Barto, A. G. Reinforcement Learning 2nd edn (MIT Press, 2018).

  2. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).

  3. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

  4. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

  5. Wurman, P. R. et al. Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature 602, 223–228 (2022).

  6. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).

  7. Schultz, W. Neuronal reward and decision signals: from theories to data. Physiol. Rev. 95, 853–951 (2015).

  8. Cohen, J. Y., Haesler, S., Vong, L., Lowell, B. B. & Uchida, N. Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature 482, 85–88 (2012).

  9. Ainslie, G. Specious reward: a behavioral theory of impulsiveness and impulse control. Psychol. Bull. 82, 463–496 (1975).

  10. Frederick, S., Loewenstein, G. & O’Donoghue, T. Time discounting and time preference: a critical review. J. Econ. Lit. 40, 351–401 (2002).

  11. Laibson, D. Golden eggs and hyperbolic discounting. Q. J. Econ. 112, 443–478 (1997).

  12. Sozou, P. D. On hyperbolic discounting and uncertain hazard rates. Proc. R. Soc. Lond. B 265, 2015–2020 (1998).

  13. Botvinick, M. et al. Reinforcement learning, fast and slow. Trends Cogn. Sci. 23, 408–422 (2019).

  14. Redish, A. D. Addiction as a computational process gone awry. Science 306, 1944–1947 (2004).

  15. Lempert, K. M., Steinglass, J. E., Pinto, A., Kable, J. W. & Simpson, H. B. Can delay discounting deliver on the promise of RDoC? Psychol. Med. 49, 190–199 (2019).

  16. Sutton, R. S. et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems Vol. 2 761–768 (International Foundation for Autonomous Agents and Multiagent Systems, 2011); https://dl.acm.org/doi/10.5555/2031678.2031726

  17. Bellemare, M. G., Dabney, W. & Rowland, M. Distributional Reinforcement Learning (MIT Press, 2023).

  18. Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations (ICLR, 2017).

  19. Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G. & Larochelle, H. Hyperbolic discounting and learning over multiple horizons. Preprint at https://arxiv.org/abs/1902.06865 (2019).

  20. Dabney, W. et al. A distributional code for value in dopamine-based reinforcement learning. Nature 577, 671–675 (2020).

  21. Tano, P., Dayan, P. & Pouget, A. A local temporal difference code for distributional reinforcement learning. Adv. Neural Inf. Process. Syst. 1146, 13662–13673 (2020).

  22. Brunec, I. K. & Momennejad, I. Predictive representations in hippocampal and prefrontal hierarchies. J. Neurosci. 42, 299–312 (2022).

  23. Lowet, A. S., Zheng, Q., Matias, S., Drugowitsch, J. & Uchida, N. Distributional reinforcement learning in the brain. Trends Neurosci. 43, 980–997 (2020).

  24. Masset, P. & Gershman, S. J. in The Handbook of Dopamine (Handbook of Behavioral Neuroscience) Vol. 32 (eds Cragg, S. J. & Walton, M.) Ch. 24 (Academic Press, 2025).

  25. Buhusi, C. V. & Meck, W. H. What makes us tick? Functional and neural mechanisms of interval timing. Nat. Rev. Neurosci. 6, 755–765 (2005).

  26. Tsao, A., Yousefzadeh, S. A., Meck, W. H., Moser, M.-B. & Moser, E. I. The neural bases for timing of durations. Nat. Rev. Neurosci. 23, 646–665 (2022).

  27. Fiorillo, C. D., Newsome, W. T. & Schultz, W. The temporal precision of reward prediction in dopamine neurons. Nat. Neurosci. 11, 966–973 (2008).

  28. Mello, G. B. M., Soares, S. & Paton, J. J. A scalable population code for time in the striatum. Curr. Biol. 25, 1113–1122 (2015).

  29. Soares, S., Atallah, B. V. & Paton, J. J. Midbrain dopamine neurons control judgment of time. Science 354, 1273–1277 (2016).

  30. Enomoto, K., Matsumoto, N., Inokawa, H., Kimura, M. & Yamada, H. Topographic distinction in long-term value signals between presumed dopamine neurons and presumed striatal projection neurons in behaving monkeys. Sci. Rep. 10, 8912 (2020).

  31. Mohebi, A., Wei, W., Pelattini, L., Kim, K. & Berke, J. D. Dopamine transients follow a striatal gradient of reward time horizons. Nat. Neurosci. 27, 737–746 (2024).

  32. Kiebel, S. J., Daunizeau, J. & Friston, K. J. A hierarchy of time-scales and the brain. PLoS Comput. Biol. 4, e1000209 (2008).

  33. Kurth-Nelson, Z. & Redish, A. D. Temporal-difference reinforcement learning with distributed representations. PLoS ONE 4, e7362 (2009).

  34. Shankar, K. H. & Howard, M. W. A scale-invariant internal representation of time. Neural Comput. 24, 134–193 (2012).

  35. Tanaka, C. S. et al. Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nat. Neurosci. 7, 887–893 (2004).

  36. Sherstan, C., Dohare, S., MacGlashan, J., Günther, J. & Pilarski, P. M. Gamma-Nets: generalizing value estimation over timescale. In Proceedings of the AAAI Conference on Artificial Intelligence 34, 5717–5725 (2020).

  37. Momennejad, I. & Howard, M. W. Predicting the future with multi-scale successor representations. Preprint at bioRxiv https://doi.org/10.1101/449470 (2018).

  38. Reinke, C., Uchibe, E. & Doya, K. Average reward optimization with multiple discounting reinforcement learners. In Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science (eds Liu, D. et al.) 789–800 (Springer, 2017).

  39. Kobayashi, S. & Schultz, W. Influence of reward delays on responses of dopamine neurons. J. Neurosci. 28, 7837–7846 (2008).

  40. Schultz, W. Dopamine reward prediction-error signalling: a two-component response. Nat. Rev. Neurosci. 17, 183–195 (2016).

  41. Howe, M. W., Tierney, P. L., Sandberg, S. G., Phillips, P. E. M. & Graybiel, A. M. Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature 500, 575–579 (2013).

  42. Berke, J. D. What does dopamine mean? Nat. Neurosci. 21, 787–793 (2018).

  43. Gershman, S. J. Dopamine ramps are a consequence of reward prediction errors. Neural Comput. 26, 467–471 (2014).

  44. Kim, H. G. R. et al. A unified framework for dopamine signals across timescales. Cell 183, 1600–1616.e25 (2020).

  45. Mikhael, J. G., Kim, H. R., Uchida, N. & Gershman, S. J. The role of state uncertainty in the dynamics of dopamine. Curr. Biol. 32, 1077–1087.e9 (2022).

  46. Guru, A. et al. Ramping activity in midbrain dopamine neurons signifies the use of a cognitive map. Preprint at bioRxiv https://doi.org/10.1101/2020.05.21.108886 (2020).

  47. Doya, K. Reinforcement learning in continuous time and space. Neural Comput. 12, 219–245 (2000).

  48. Lee, R. S., Sagiv, Y., Engelhard, B., Witten, I. B. & Daw, N. D. A feature-specific prediction error model explains dopaminergic heterogeneity. Nat. Neurosci. 27, 1574–1586 (2024).

  49. Cruz, B. F. et al. Action suppression reveals opponent parallel control via striatal circuits. Nature 607, 521–526 (2022).

  50. Millidge, B., Song, Y., Lak, A., Walton, M. E. & Bogacz, R. Reward bases: a simple mechanism for adaptive acquisition of multiple reward types. PLoS Comput. Biol. 20, e1012580 (2024).

  51. Engelhard, B. et al. Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons. Nature 570, 509–513 (2019).

  52. Eshel, N., Tian, J., Bukwich, M. & Uchida, N. Dopamine neurons share common response function for reward prediction error. Nat. Neurosci. 19, 479–486 (2016).

  53. Cox, J. & Witten, I. B. Striatal circuits for reward learning and decision-making. Nat. Rev. Neurosci. 20, 482–494 (2019).

  54. Collins, A. L. & Saunders, B. T. Heterogeneity in striatal dopamine circuits: form and function in dynamic reward seeking. J. Neurosci. Res. https://doi.org/10.1002/jnr.24587 (2020).

  55. Gershman, S. J. et al. Explaining dopamine through prediction errors and beyond. Nat. Neurosci. 27, 1645–1655 (2024).

  56. Watabe-Uchida, M. & Uchida, N. Multiple dopamine systems: weal and woe of dopamine. Cold Spring Harb. Symp. Quant. Biol. 83, 83–95 (2018).

  57. Xu, Z., van Hasselt, H. P. & Silver, D. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems Vol. 31 (Curran Associates, 2018).

  58. Yoshida, N., Uchibe, E. & Doya, K. Reinforcement learning with state-dependent discount factor. In 2013 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL) https://ieeexplore.ieee.org/document/6652533 (IEEE, 2013).

  59. Doya, K. Metalearning and neuromodulation. Neural Netw. 15, 495–506 (2002).

  60. Tanaka, S. C. et al. Serotonin differentially regulates short- and long-term prediction of rewards in the ventral and dorsal striatum. PLoS ONE https://doi.org/10.1371/journal.pone.0001333 (2007).

  61. Kvitsiani, D. et al. Distinct behavioural and network correlates of two interneuron types in prefrontal cortex. Nature 498, 363–366 (2013).

  62. Sutton, R. S. Learning to predict by the methods of temporal differences. Mach. Learn. 3, 9–44 (1988).

  63. Oppenheim, A. V., Willsky, A. S. & Nawab, S. H. Signals and Systems (Pearson, 1996).

  64. Dayan, P. Improving generalisation for temporal difference learning: the successor representation. Neural Comput. 5, 613–624 (1993).

  65. Gershman, S. J. The successor representation: its computational logic and neural substrates. J. Neurosci. https://doi.org/10.1523/JNEUROSCI.0151-18.2018 (2018).

  66. Amit, R., Meir, R. & Ciosek, K. Discount factor as a regularizer in reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning 269–278 (PMLR, 2020).

  67. Badia, A. P. et al. Agent57: outperforming the Atari human benchmark. In Proceedings of the 37th International Conference on Machine Learning 507–517 (PMLR, 2020).

  68. Reinke, C. Time adaptive reinforcement learning. In ICLR 2020 Workshop: Beyond Tabula Rasa RL. Preprint at https://doi.org/10.48550/arXiv.2004.08600 (2020).

  69. Gershman, S. J. & Uchida, N. Believing in dopamine. Nat. Rev. Neurosci. 20, 703–714 (2019).

  70. Leone, F. C., Nelson, L. S. & Nottingham, R. B. The folded normal distribution. Technometrics 3, 543–550 (1961).

  71. Lindsey, J. & Litwin-Kumar, A. Action-modulated midbrain dopamine activity arises from distributed control policies. Adv. Neural Inf. Process. Syst. 35, 5535–5548 (2022).

  72. Masset, P. et al. Data and code for ‘Multi-timescale reinforcement learning in the brain’, V1. Mendeley Data https://doi.org/10.17632/tc43t3s7c5.1 (2025).

Acknowledgements

We thank S. J. Gershman and J. Mikhael for their contributions to the preceding studies; M. Watabe-Uchida for advice on task design; members of the Uchida and Pouget laboratories, including A. Lowet and M. Burrell, for discussions and comments; and W. Carvalho, G. Reddy and T. Ott for their comments on the manuscript. This work is supported by US National Institutes of Health (NIH) BRAIN Initiative grants (R01NS226753 and U19NS113201), NIH grant 5R01DC017311 to N.U., and a grant from the Swiss National Science Foundation (315230_197296) to A.P. This research was carried out in part thanks to funding from the Canada First Research Excellence Fund, awarded to P.M. through the Healthy Brains, Healthy Lives initiative at McGill University.

Author information

Authors and Affiliations

Authors

Contributions

P.M., P.T., A.P. and N.U. conceived the project. P.M., H.R.K., A.N.M. and N.U. designed the electrophysiology experiments. A.N.M. and H.R.K. performed the electrophysiology experiments and curated the data. P.T. performed the simulations with artificial agents. P.M. performed the analysis of electrophysiological data. P.M., P.T., A.P. and N.U. wrote the paper with input from H.R.K.

Corresponding authors

Correspondence to Paul Masset, Alexandre Pouget or Naoshige Uchida.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Kenji Doya and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Decoding simulations for multi-timescale vs. single-timescale agents.

(a-c). Experiment corresponding to Task 1 in Fig. 1. a, MDP with reward R at time tR. b, Diagram of the decoding experiment. In each episode, the reward magnitude and time are randomly sampled from discrete uniform distributions, which defines the MDP in a. Values are learned until near convergence using TD-learning. Values with different discount factors are learned independently. The learned values for the cue (s) are fed into a non-linear decoder which learns, across MDPs, to report the reward time. c, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (d-f). Experiment corresponding to Task 2 in Fig. 1. d, MDP with reward R at time tR. e, Diagram of the decoding experiment. In each episode, the reward magnitude and time are randomly sampled from discrete uniform distributions, which defines the MDP in d. Values are learned until near convergence using TD-learning. Values with different discount factors are learned independently. The learned values for the cue (s) are fed into a non-linear decoder which learns, across MDPs, to report the hyperbolic value of the cue. f, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (g-i). Experiment corresponding to Task 3 in Fig. 1. g, MDP with reward equal to 1 at time tR. h, Diagram of the decoding experiment. In each episode, the reward time and the number of TD iterations (N) are sampled from discrete uniform distributions. Values are learned by performing N TD-learning backups on the MDP. Values with different discount factors are learned independently. The learned values for the cue (s) are fed into a non-linear decoder which learns, across MDPs, to report the reward time. i, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (j-l). Experiment corresponding to Task 4 in Fig. 1. j, MDP with reward equal to 1 at time tR, and a noisy reward added to every state. k, Diagram of the decoding experiment. In each episode, the reward time is sampled from a discrete uniform distribution. Values are learned by performing a single iteration of TD-learning backwards through the MDP. Values with different discount factors are learned independently. The learned values for the cue (s) are fed into a non-linear decoder which learns, across MDPs, to report the true value of the cue after experiencing the trajectory an infinite number of times (that is, ignoring the random rewards). l, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (m-o). Experiment with two rewards. m, MDP with two rewards of magnitude R1 and R2 at times tR1 < tR2. n, Value estimates Vγ(s) are fed into a non-linear decoder which learns, across MDPs, to report both reward times. o, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (p-r). Experiment to determine the shortest branch when learning from incomplete information. p, MDP with two possible trajectories. In this example, the upwards trajectory is longer than the downwards trajectory. q, Diagram of the decoding experiment. In each episode, the length of the two branches D and the number of times that TD-backups are performed for each branch are randomly sampled from uniform discrete distributions. Then, TD-backups are performed for the two branches the corresponding number of times. After this, the learned values are fed into a decoder which is trained, across episodes, to report the shorter branch. r, Decoding performance as the decoder is trained. Different colors indicate the discount factors used in TD-learning. (s-w). Temporal estimates are available before convergence for multi-timescale agents. s, Two experiments, one with a short wait between the cue and reward (pink), and one with a longer wait (cyan). t, The identity of the cue with the higher value for a single-timescale agent (here γ = 0.9) depends on the number of times that the experiments have been experienced. When the longer trajectory has been experienced significantly more often than the short one, the single-timescale agent can incorrectly believe that it has a larger value. u, For a multi-timescale agent, the pattern of values learned across discount factors is only affected by a multiplicative factor that depends on the learning rate, the prior values and the asymmetric learning experience. The pattern therefore contains unique information about outcome time. v,w, When plotted as a function of the number of times that trajectories are experienced, the pattern of values across discount factors is only affected by a multiplicative factor. In other words, for the pink cue, the larger discount factors are closer together than they are to the smaller discount factor, and the opposite for the cyan cue. This pattern is maintained at every point along the x-axis, and is therefore independent of the asymmetric experience, enabling a downstream system to decode reward timing. Error bars are the standard deviations (s.d.) across 100 test episodes and 3 trained policy gradient (PG) networks.
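
A minimal sketch of the decoding idea behind panels a-c, under simplifying assumptions (tabular TD(0) on a chain MDP, two discount factors and a closed-form decode instead of a trained non-linear decoder): because the learned cue value is approximately R·γ^tR, the pattern of values across discount factors identifies the reward time independently of the reward magnitude.

```python
import numpy as np

def td_cue_values(t_R, R, gammas, alpha=0.1, episodes=3000):
    """Tabular TD(0) on a chain: cue at state 0, reward R received t_R steps later."""
    V = np.zeros((len(gammas), t_R + 2))        # states 0..t_R plus an absorbing terminal state
    for _ in range(episodes):
        for s in range(t_R + 1):
            r = R if s == t_R else 0.0          # reward on reaching state t_R
            for i, g in enumerate(gammas):
                V[i, s] += alpha * (r + g * V[i, s + 1] - V[i, s])
    return V[:, 0]                              # learned cue values, ~= R * g ** t_R

gammas = np.array([0.7, 0.9])
v = td_cue_values(t_R=6, R=2.5, gammas=gammas)

# The ratio of cue values across discount factors recovers the reward time regardless of R.
t_hat = np.log(v[0] / v[1]) / np.log(gammas[0] / gammas[1])
print(v, round(float(t_hat), 2))                # t_hat is close to 6
```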

Extended Data Fig. 2 The myopic learning bias.

a, Illustration of the myopic learning bias. Consider a scenario in which the upwards and downwards paths from s and s’ are experienced only once, such that at least one path in the small bifurcations is left unexplored. In state s’ (blue) the far future is more uncertain than the near future, and in state s (red) the near future is more uncertain than the far future. b, When experiencing the upwards and downwards paths only once, there are 4 possible scenarios depending on which path of the corresponding bifurcations is visited. When learning from limited information, the myopic (low γ) and far-sighted (high γ) estimates would make different decisions depending on whether the VUP estimate is larger, smaller or approximately equal to VDOWN (VUP ≈ VDOWN occurs when both estimates have similar magnitudes and some small variations in the precise magnitude of rewards, the prior values or the learning parameters could change whether VUP < VDOWN or VUP > VDOWN). In both s and s’, the correct decision is to follow the upwards path (the ‘correct’ decision is the decision made by a hypothetical RL agent that experiences all possible trajectories an infinite number of times). Below we show the approximate probability that the agent chooses the correct path if it follows the myopic estimates (low γ) or the far-sighted estimates (high γ). Illustration of a mouse in panel a and silhouette of a raptor in panels a,b were adapted from the NIAID NIH BIOART Source. Illustration of a block of cheese in panels a,b was adapted from SVG Repo under a CC0 1.0 licence. c, Task structure to evaluate the myopic learning bias when uncertainty arises due to stochastic rewards. The three dots collapse 5 transitions between black states. Black states give a small stochastic reward and orange states give a large deterministic reward. d, Accuracy at selecting the branch with the large deterministic reward under incomplete learning conditions. At state s (orange), agents with larger discount factors (far-sighted) are more accurate. At state s’ (blue), agents with a small discount factor (myopic) are more accurate. Error bars are half s.d. across 10,000 episodes; maxima are highlighted with stars. e, Mean performance in this task by the agent in Fig. 1d (see main text and Methods). f, Maze to highlight the myopic learning bias in cases where uncertainty arises due to incomplete exploration of the state space. Rewards are indicated with water and fire. An example trajectory is shown with transparent arrows. The red and blue bars to the right denote the states in the Lower and Upper half. g, True (grey) and estimated (green and brown) values for the example trajectory shown on top and in panel f. On the x-axis we highlight the starting timestep with s, the timestep when the fire is reached and the timestep when the water is reached. Image of fire in panels f,g was created by dstore via SVG Repo under a CC0 1.0 licence. Images of water droplet in panels f,g were adapted from SVG Repo under a CC0 1.0 licence. h, Accuracy (y-axis) is measured as the Kendall tau coefficient between the estimate with a specific gamma (x-axis) and the true value function Vγ = 0.99. Error bars are deviations across 300 sets of sampled trajectories. The red (blue) curve shows average accuracy for the states on the upper (lower) half of the maze, indicated with colour lines in panel f. i, As the sampled number of trajectories increases, the myopic learning bias disappears. j, Architecture that learns about multiple timescales as auxiliary tasks. k, States are separated according to the agent being close to the goal (blue) or far from the goal (orange). Images in panels j,k were adapted from Farama Foundation under an MIT licence. l, Accuracy of the Q-values in the Lunar Lander environment as a function of their discount factor, estimated as the fraction of concordant state pairs between the empirical value function and the discount-specific Q-value estimated by the network. Error bars are s.e.m. across 10 trained networks; maxima are highlighted with stars. See Methods for details.
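
A self-contained toy simulation of the myopic learning bias (an assumed MDP, not the task in panel c): when two branches are each evaluated from a single sampled trajectory, a large discount factor picks the correct branch more reliably when the decision-relevant reward is distant and deterministic while nearby rewards are noisy, and a small discount factor does better when the relevant reward is nearby and the noise lies in the far future.

```python
import numpy as np

rng = np.random.default_rng(1)

def pick_accuracy(gamma, noisy_first, episodes=10_000, horizon=6):
    """Fraction of episodes in which the branch with the larger deterministic
    reward (+2 vs +1) also has the larger single-sample discounted return."""
    correct = 0
    for _ in range(episodes):
        returns = []
        for big in (2.0, 1.0):                       # branch A (+2) and branch B (+1)
            r = rng.normal(0.0, 1.0, size=horizon)   # small stochastic rewards along the branch
            if noisy_first:
                r[-1] += big                          # deterministic reward lies in the far future
            else:
                r = np.concatenate(([big], r[:-1]))   # deterministic reward is immediate
            returns.append(np.sum(gamma ** np.arange(horizon) * r))
        correct += returns[0] > returns[1]
    return correct / episodes

for gamma in (0.5, 0.99):
    print(gamma,
          pick_accuracy(gamma, noisy_first=True),     # s-like case: far-sighted estimates win
          pick_accuracy(gamma, noisy_first=False))    # s'-like case: myopic estimates win
```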

Extended Data Fig. 3 Single neuron responses and robustness of fit in the cued delay task.

a, PSTHs of the responses of single selected neurons (n = 50) to the cues predicting a reward delay of 0.6 s, 1.5 s, 3.75 s and 9.375 s (from top to bottom). Neurons are sorted by the inferred value of the discount factor γ. Neural responses are normalized by z-scoring each neuron across its activity to all 4 conditions. b, PSTHs of the responses of single non-selected neurons (n = 23) to the cues predicting a reward delay of 0.6 s, 1.5 s, 3.75 s and 9.375 s (from top to bottom). Neurons are sorted by the inferred value of the discount factor γ. Neural responses are normalized by z-scoring each neuron across its activity to all 4 conditions. c, Variance explained for training vs testing data for the exponential model. For each bootstrap, the variance explained was computed on both the half of the trials used for fitting (train) and the other half of the trials (test). Neurons (n = 13) with a negative variance explained on the test data are excluded from the decoding analysis (grey dots). d, Same as panel c but for the fits of the hyperbolic model. e, Mean goodness of fit on held-out data across 100 bootstraps for each selected neuron for the exponential and hyperbolic models. The data lie above the diagonal line, suggesting a better fit from the exponential model, as shown in Fig. 2f. Error bars indicate 95% confidence intervals estimated by bootstrap; see Methods. f, The values of the inferred parameters in the exponential model are robust across bootstraps. Top row, inferred value of the parameters across two halves of the trials (single bootstrap) for the gain α, baseline b and discount factor γ, respectively. Bottom row, distribution across n = 100 bootstraps of the Pearson correlations across neurons between the inferred parameter values in the two halves of the trials. The reported mean is the mean correlation across bootstraps and the reported P value is the highest P value across bootstraps for a given parameter, assessed via Student’s t-test. Distribution of correlations for the gain α (mean = 0.84, P < 1 × 10−20), baseline b (v, mean = 0.9, P < 1.0 × 10−32) and discount factor γ (vi, mean = 0.93, P < 1.0 × 10−46). g, Same as panel f (lower row) but for the hyperbolic model, with distribution of correlations for the gain α (mean = 0.86, P < 1 × 10−26), baseline b (v, mean = 0.88, P < 1.0 × 10−28) and shape parameter k (vi, mean = 0.76, P < 1.0 × 10−11). h, Same as panel f (lower row) but for responses simulated from the exponential model, with distribution of correlations for the gain α (mean = 0.86, P < 1.0 × 10−10), baseline b (v, mean = 0.88, P < 1.0 × 10−24) and discount factor γ (vi, mean = 0.76, P < 1.0 × 10−26). Note that the distributions of inferred parameters are in a similar range to the fits to the data, suggesting that trial numbers constrain the accuracy of parameter estimation. i, Similar to panel f (lower row) but for the neurons recorded in mouse m3044, showing that across bootstraps the estimates are consistent for the gain α (mean = 0.64, P < 0.05 for 88/100 bootstraps, light blue P > 0.05, dark blue P < 0.05), baseline b (mean = 0.86, P < 0.012) and discount factor γ (mean = 0.72, P < 0.05 for 97/100 bootstraps, light blue P > 0.05, dark blue P < 0.05). j, Same as panel f (lower row) but for the neurons recorded in mouse m3054. The estimates are consistent for the gain α (mean = 0.90, P < 0.0048), baseline b (mean = 0.93, P < 1.0 × 10−5) and discount factor γ (mean = 0.79, P < 0.0069).
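
A sketch of the single-neuron model comparison, assuming the functional forms implied by the legend (exponential: α·γ^delay + b; hyperbolic: α/(1 + k·delay) + b) and using made-up responses for one neuron; the actual fitting and bootstrap procedure is described in the Methods.

```python
import numpy as np
from scipy.optimize import curve_fit

delays = np.array([0.6, 1.5, 3.75, 9.375])           # the four cued reward delays (s)

def exponential_model(d, alpha, b, gamma):
    return alpha * gamma ** d + b

def hyperbolic_model(d, alpha, b, k):
    return alpha / (1.0 + k * d) + b

# Example cue responses for one neuron (illustrative numbers only).
responses = np.array([8.1, 6.0, 3.9, 2.2])

p_exp, _ = curve_fit(exponential_model, delays, responses,
                     p0=[5.0, 1.0, 0.8], bounds=([0, -10, 0], [50, 10, 1]))
p_hyp, _ = curve_fit(hyperbolic_model, delays, responses,
                     p0=[5.0, 1.0, 0.5], bounds=([0, -10, 0], [50, 10, 10]))

def variance_explained(model, params):
    pred = model(delays, *params)
    return 1.0 - np.sum((responses - pred) ** 2) / np.sum((responses - responses.mean()) ** 2)

print(p_exp, variance_explained(exponential_model, p_exp))
print(p_hyp, variance_explained(hyperbolic_model, p_hyp))
```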

Extended Data Fig. 4 Decoding reward timing using the regularized pseudo-inverse of the discount matrix.

(a-c), Singular value decomposition (SVD) of the discount matrix. a, Left singular vectors (in the neuron space). b, Singular values. The black line at 2 indicates the value of the regularization term α. c, Right singular vectors (in the time space). d, Decoding matrix based on the regularized pseudo-inverse. e, Distribution of 1-Wasserstein distances between the reward timing and the predicted reward timing from the decoding on the test data from the exponential fits (shown in Fig. 2k, top row) and on the average exponential model (shown in Fig. 2k, bottom row). Decoding is better for the exponential model from Fig. 2 than for the average exponential model except for the shortest delay (P(t = 0.6 s) = 1, P(t = 1.5 s) < 1.0 × 10−31, P(t = 3.75 s) = 0.0135, P(t = 9.375 s) < 1.0 × 10−14; one-tailed Wilcoxon signed-rank test, see Methods). f, The ability to decode the timing of expected future reward is not due to a general property of the discounting matrix and collapses if we randomize the identity of the cue responses (see Methods). g, Distribution of 1-Wasserstein distances between the reward timing and the predicted reward timing from the decoding on the test-data exponential fits (shown in Fig. 2k, top row) and on the shuffled data (shown in panel f). The predictions from the test data are better (smaller 1-Wasserstein distance) than those from the shuffled data (P = 1.2 × 10−4 for the 0.6 s reward delay, P < 1.0 × 10−20 for the other delays; one-tailed Wilcoxon signed-rank test, see Methods).
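
A sketch of the decoding step under stated assumptions (a synthetic discount matrix and a Tikhonov-style shrinkage of the singular values; the exact regularization used in the paper may differ): the cue responses, which form an approximate Laplace transform of the reward-time distribution, are mapped back to a temporal profile through a regularized pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(2)

n_neurons, n_time = 60, 100
t = np.arange(n_time) * 0.1                        # time grid (s)
gammas = rng.uniform(0.55, 0.99, size=n_neurons)   # assumed per-neuron discount factors
F = gammas[:, None] ** t[None, :]                  # discount matrix (neurons x time)

# Regularized pseudo-inverse via SVD; alpha = 2 follows the legend, but the
# shrinkage form s / (s^2 + alpha^2) is an assumption made here for illustration.
alpha = 2.0
U, s, Vt = np.linalg.svd(F, full_matrices=False)
D = Vt.T @ np.diag(s / (s ** 2 + alpha ** 2)) @ U.T    # (time x neurons) decoding matrix

def decode_centre(delay):
    """Centre of mass of the decoded temporal profile for a cued delay."""
    responses = gammas ** delay + 0.02 * rng.normal(size=n_neurons)   # noisy cue values
    profile = np.clip(D @ responses, 0.0, None)
    return np.sum(t * profile) / np.sum(profile)

# Shorter cued delays should decode to earlier-centred profiles than longer ones.
print(decode_centre(1.5), decode_centre(7.0))
```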

Extended Data Fig. 5 Comparing behavioral and neural discounting and decoding reward timing in single animals.

(a-c), Mouse 3044. a, Left panel, normalized lick responses to the cues predicting reward delays across the population. For each neuron, the response was normalized to the highest response across the 4 possible delays. Neurons are sorted by the inferred behavioral discount factor. Right panel, normalized neural responses to the cues predicting reward delays across the population (sorted by the behavioral discount factor). b, The behavioral and neural discount factors are not correlated (r = −0.29, P = 0.27, Spearman’s rank correlation, two-tailed Student’s t-test). c, Discount matrix for the neurons recorded in mouse 3044. This is the matrix used for decoding in panel g, top row. (d-f), Same as panels (a-c) for mouse 3054. e, The behavioral and neural discount factors are not correlated in mouse 3054 (r = −0.029, P = 0.9, Spearman’s rank correlation, two-tailed Student’s t-test). f, Discount matrix for the neurons recorded in mouse 3054. This is the matrix used for decoding in panel g, bottom row. g, Decoding of reward timing at the single-animal level for mouse 3044 (top row) and mouse 3054 (bottom row). The decoding is present but slightly less accurate, as expected from the smaller number of neurons. h, Discount factor inferred for neurons in mouse m3044 when dividing trials between low and high anticipatory lick rate. Left panel, scatter plot of the values across neurons. Right panel, the distribution across neurons of differences in inferred discount across the two conditions is not significantly different from zero (mean = −0.0024, P = 0.96, two-tailed Student’s t-test). i, Same as panel h for mouse m3054. The difference in inferred value between low and high lick rate is significant (mean = 0.086, P = 0.0062, two-tailed Student’s t-test) but the mean effect is small compared with the standard deviation of inferred discount factors across neurons (s.d. = 0.19 for neurons in m3054).

Extended Data Fig. 6 Decoding reward timing from the hyperbolic model and exponential model simulations.

a, Distribution of the inferred discount parameter k across the neurons. b, Correlation between the discount factor γ inferred in the exponential model and the discount parameter k from the hyperbolic model (r = −0.9, P < 1.0 × 10−30, Student’s t-test). Note that in the hyperbolic model a larger value of k implies faster discounting, hence the negative correlation. c, Discount matrix for the hyperbolic model. For each neuron we plot the relative value of future events given its inferred discount parameter. Neurons are sorted by decreasing estimated value of the discount parameter. d, Decoded subjective expected timing of future reward E(r|t) using the discount matrix from the hyperbolic model (see Methods). e, Distribution of 1-Wasserstein distances between the reward timing and the predicted reward timing from the decoding on the test data with the exponential model (shown in Fig. 2k, top row) and on the test data with the hyperbolic model (shown in d). Decoding is better for the exponential model from Fig. 2 than for the hyperbolic model except for the shortest delay (P(t = 0.6 s) = 1, P(t = 1.5 s) < 1.0 × 10−31, P(t = 3.75 s) < 1.0 × 10−33, P(t = 9.375 s) < 1.0 × 10−3; one-tailed Wilcoxon signed-rank test, see Methods). f, Decoded subjective expected timing of future reward E(r|t) using simulated data based on the parameters of the exponential model (see Methods). g, Distribution of 1-Wasserstein distances between the reward timing and the predicted reward timing from the decoding on the test data from the exponential fits (shown in Fig. 2k, top row) and on the simulated data from the parameters of the exponential fits (shown in f). Decoding is marginally better for the data predictions (P(t = 0.6 s) = 0.002, P(t = 1.5 s) = 0.999, P(t = 3.75 s) < 1 × 10−12, P(t = 9.375 s) = 0.027; one-tailed Wilcoxon signed-rank test, see Methods), suggesting that decoding accuracy is limited by the number of trials.
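
A sketch of the comparison metric used here and in Extended Data Fig. 4, with made-up decoded profiles standing in for the bootstrap decodings: each decoded profile is treated as a distribution over time, its 1-Wasserstein distance to the true reward time is computed, and the per-bootstrap distances of two models are compared with a one-tailed Wilcoxon signed-rank test.

```python
import numpy as np
from scipy.stats import wasserstein_distance, wilcoxon

rng = np.random.default_rng(3)
t = np.arange(0.0, 10.0, 0.1)
true_delay = 3.75

def wdist_to_delay(profile, delay):
    """1-Wasserstein distance between a decoded profile (weights over the time grid)
    and a point mass at the true reward delay."""
    w = np.clip(profile, 0.0, None)
    return wasserstein_distance(t, [delay], u_weights=w, v_weights=[1.0])

# Made-up decoded profiles for two models across 100 bootstraps (illustration only):
# model A is sharper and better centred than model B.
model_a = [np.exp(-0.5 * ((t - true_delay - rng.normal(0, 0.3)) / 1.0) ** 2) for _ in range(100)]
model_b = [np.exp(-0.5 * ((t - true_delay - rng.normal(0, 0.8)) / 1.5) ** 2) for _ in range(100)]

w_a = [wdist_to_delay(p, true_delay) for p in model_a]
w_b = [wdist_to_delay(p, true_delay) for p in model_b]

# Paired, one-tailed comparison of the per-bootstrap distances.
print(np.mean(w_a), np.mean(w_b))
print(wilcoxon(w_a, w_b, alternative='less'))
```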

Extended Data Fig. 7 Ramping, discounting, anatomy and distributed RL models.

a, Ramping in the prediction error signal is controlled by the relative contribution of value increases and discounting. If the value increase (middle) exactly matches the discounting, there is no prediction error (middle equation, right). If the discounting is smaller than the value increase (large discount factor), there is a positive TD error (top equation, right). If the discounting is larger than the value increase (small discount factor), there is a negative TD error (bottom equation, right). A single-timescale agent with no state uncertainty will learn an exponential value function, but if there is state uncertainty (see ref. 69) or the global value function arises from combining the contributions of single-timescale agents, then the value function is likely to be non-exponential. Image of a magnifying glass was created by googlefonts via SVG Repo under an Apache licence. b, Intuition for the diversity of ramping with a hyperbolic value function. Agents with a small discount factor exhibit a monotonic downward ramp (pink), while those with a large discount factor exhibit a monotonic upward ramp (brown). Agents with an intermediate discount factor tend to exhibit a downward then upward ramp. The hyperbolic value function gets increasingly convex as the reward approaches, so an increasing fraction of the agents have a positive prediction error as they approach the reward. c, The discount factor inferred in the VR task is not correlated with the medio-lateral (ML) position of the implant (Pearson’s r = 0.015, P = 0.89, two-tailed Student’s t-test). d, The baseline parameter inferred in the VR task is not correlated with the medio-lateral (ML) position of the implant (Pearson’s r = −0.011, P = 0.92, two-tailed Student’s t-test). e, The inferred gain in the VR task decreases with increasing medio-lateral (ML) position, but the effect does not reach significance (Pearson’s r = −0.19, P = 0.069, two-tailed Student’s t-test). In panels c-e, the line corresponds to the best-fit linear regression and the uncertainty shading represents the 95% confidence interval of the linear regression fit. f-h, Ramping in the reward prediction error with mixing in distributed RL models. Inferred value functions (V) and RPEs (δ) for the mixed RL model as a function of the common value function sharing parameter λ, in a linear MDP of 30 steps (x-axis in the plots) with a deterministic reward equal to 1 in the last step and 0 everywhere else. Plots are shown after learning has empirically stabilized (after 3,000 TD-learning iterations with a learning rate of 0.1). The dashed value function is the exponential value function without common value sharing (λ = 0), which would lead to a flat RPE equal to 0 at every state. The actual value functions (solid lines) are not purely exponential, and thus lead to ramping RPEs. f, Circuit model in which each value estimate and its corresponding prediction error are part of completely independent loops (λ = 0). At convergence, there is no more prediction error in the reward anticipation period (bottom row). g, Circuit model in which the prediction error for each dopamine neuron is influenced by both the independent value signal and a shared common value signal (C, λ = 0.1). The dashed line indicates the value function corresponding to completely separate loops, and the solid line the actual value function due to the influence of the common value signal. The difference between them leads to ramping in the reward prediction error signals (bottom row). h, Circuit model with a strong influence of the common value signal (λ = 1), which also leads to ramping in the reward prediction error signals. See Methods for details. i, Decoded reward times for 4 experimental conditions with rewards at times 5, 10, 15 and 30 (pink to cyan), obtained by applying a regularized inverse Laplace decoder (analogous to the one used in Fig. 2 of the main text) to the values at the moment of the cue, under the model without mixing (λ = 0). j, Same as (i) but using a mixing factor of λ = 0.1. The small mixing factor does not affect the quality of the temporal decoding, while creating a ramping reward prediction error (panel g). Therefore, a small mixing factor constitutes a common model that can qualitatively account for the two tasks studied in the paper. k, Same as (i) but using a mixing factor of λ = 1. With a fully shared value function the relative differences between discount factors disappear, so temporal decoding is no longer possible.
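
A minimal sketch of the intuition in panels a,b, using an assumed hyperbolic value profile rather than the fitted circuit models: when a neuron with discount factor γ computes a TD error of the form γV(t+1) − V(t) against a value function that is not exponential in its own γ, the pre-reward error ramps downwards for small γ, upwards for large γ, and down-then-up in between.

```python
import numpy as np

T = 30                                          # linear track: reward of 1 after the last step
t = np.arange(T)
time_to_reward = (T - 1) - t
V_shared = 1.0 / (1.0 + 0.2 * time_to_reward)   # assumed hyperbolic-like shared value profile

for gamma in (0.6, 0.9, 0.99):
    # Pre-reward TD error, bootstrapping from the shared (non-exponential) value function.
    delta = gamma * V_shared[1:] - V_shared[:-1]
    print(gamma, delta[:3].round(4), delta[-3:].round(4))
# gamma = 0.6: delta is negative and grows more negative toward the reward (downward ramp).
# gamma = 0.99: delta is positive and grows toward the reward (upward ramp).
# gamma = 0.9: delta starts negative and turns positive near the reward (down-then-up).
```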

Extended Data Fig. 8 Behavioral and neural discounting at the single-animal level.

a, Time course of the lick rate in the VR task as mice approach the reward location. Gray lines, lick rate for individual neurons; blue line, mean lick rate. The three black lines on top indicate the three windows used to compute the early, middle and late lick rates in the analysis presented in panels c-d. b, The inferred discount factor and the slope in spiking activity (see Fig. 3b) are strongly correlated (r = 0.81, P = 0, Spearman rank correlation, two-tailed Student’s t-test), suggesting that the slope is a good measure of discounting. c, Correlations of measures of behavioral and neural discounting for mouse m3044 (Spearman rank correlation, two-tailed Student’s t-test). i-iii, The neural discount factor and the ramp in licking activity are not correlated, irrespective of the windows used to compute the ramp in licking activity: i, modulation between the late and early windows, r = 0.09, P = 0.64. ii, modulation between the late and middle windows, r = −0.05, P = 0.79. iii, modulation between the early and middle windows, r = 0.17, P = 0.37. iv-vi, The measures of ramping in licking activity are strongly correlated with each other: iv, correlation between the late-middle and late-early modulation measures, r = 0.65, P = 1.3 × 10−4. v, correlation between the middle-early and late-early modulation measures, r = 0.96, P = 7 × 10−17. vi, correlation between the middle-early and late-middle modulation measures, r = 0.53, P = 0.0028. d, Correlations of measures of behavioral and neural discounting for mouse m3054 (Spearman rank correlation, two-tailed Student’s t-test). i-iii, The neural discount factor and the ramp in licking activity are not correlated, irrespective of the windows used to compute the ramp in licking activity: i, modulation between the late and early windows, r = −0.1, P = 0.62. ii, modulation between the late and middle windows, r = 0.032, P = 0.88. iii, modulation between the early and middle windows, r = −0.11, P = 0.61. iv-vi, The measures of ramping in licking activity are strongly correlated with each other: iv, correlation between the late-middle and late-early modulation measures, r = 0.82, P = 6.7 × 10−7. v, correlation between the middle-early and late-early modulation measures, r = 0.94, P = 4.7 × 10−12. vi, correlation between the middle-early and late-middle modulation measures, r = 0.66, P = 4.2 × 10−4. e, The VR model fit (right panel) to m3044 neurons alone captures the diversity of ramping activity observed across single neurons (left panel). f, Inferred value function for m3044. Thin gray lines, individual bootstrap fits; blue line, mean value fit. g, The VR model fit (right panel) to m3054 neurons alone captures the diversity of ramping activity observed across single neurons (left panel). h, Inferred value function for m3054. Thin gray lines, individual bootstrap fits; blue line, mean value fit. i, The inferred parameter values from the fit for m3044 and the full population fit are strongly correlated (Spearman rank correlation, two-tailed Student’s t-test) for the gain parameter (left panel, r = 0.78, P = 2.1 × 10−6), the baseline parameter (middle panel, r = 0.75, P = 5.7 × 10−6) and the discount factor (right panel, r = 0.75, P = 6.8 × 10−6). j, The inferred parameter values from the fit for m3054 and the full population fit are strongly correlated (Spearman rank correlation, two-tailed Student’s t-test) for the gain parameter (left panel, r = 0.97, P = 1.3 × 10−6), the baseline parameter (middle panel, r = 0.98, P = 1.1 × 10−6) and the discount factor (right panel, r = 0.88, P = 2.6 × 10−6). All reported correlations are Spearman rank correlations.

Extended Data Fig. 9 Discounting heterogeneity explains ramping diversity in a common reward expectation model.

a, Uncertainty in reward timing reduces as mice approach the reward zone. Not only does the mean expected reward time decrease, but the standard deviation of the estimate also decreases. Distributions in the bottom row are from fitted data (see panels c-e). b, A model in which each neuron computes its individual value function but shares a common reward expectation predicts ramping heterogeneity across neurons. Left panel, as mice approach the reward, the uncertainty of reward timing, quantified by the standard deviation, decreases. Second panel from left, the expectation of reward timing takes the form of a folded normal distribution. As the mice approach the reward, both the mean and the standard deviation of the expected reward timing distribution decrease. Third panel from left, each neuron computes a distinct value function given its individual discount factor and the common expected reward timing distribution. Right panel, the diverse value functions across neurons lead to ramping heterogeneity across neurons in the reward prediction error (see Methods, ‘Common Reward Expectation model’). c, The inferred standard deviation of the reward expectation model decreases as a function of time to reward. The line indicates the mean inferred standard deviation and the shading indicates the standard error of the mean over 100 bootstraps. d, Expected timing of the reward as a function of true time to reward. As the mice approach the reward, not only does the mean expected time to reward decrease but the uncertainty of the reward timing, captured by the standard deviation shown in c, also decreases. This effect leads to increasingly convex value functions that produce the observed ramps in dopamine neuron activity. e, Value function for each individual neuron (same order as in h,i). f, Distribution of inferred discount factors under the common reward expectation model. g, Although the range of discount factors differs between the fits from the common value (x axis) and common reward expectation (y axis) models, the inferred discount factors are strongly correlated for single neurons (Spearman’s ρ = 0.93, P < 1.0 × 10−20, two-tailed Student’s t-test). h, Predicted ramping activity from the model fits under the common reward expectation model. i, Diversity of ramping activity across single neurons as mice approach the reward (aligned by the discount factor inferred in the common reward expectation model).
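
A sketch of the value computation in the common reward expectation model, with an assumed parameterization of how the mean and standard deviation of the time to reward shrink along the track (the fitted parameterization is in the Methods): each neuron's value is the expected discount E[γ_i^τ] under a folded normal distribution over the time to reward τ, and the resulting value functions differ in curvature across discount factors.

```python
import numpy as np
from scipy.stats import foldnorm

positions = np.linspace(0.0, 1.0, 50)        # track position, 1 = reward zone
mean_ttr = 10.0 * (1.0 - positions)          # assumed mean time to reward (s)
sd_ttr = 0.5 + 2.0 * (1.0 - positions)       # assumed uncertainty, shrinking near the reward
dt = mean_ttr[0] - mean_ttr[1]               # time elapsed per position step

tau = np.linspace(0.0, 40.0, 400)            # grid over possible times to reward
gammas = [0.7, 0.9, 0.98]

values = np.zeros((len(gammas), len(positions)))
for j, (m, s) in enumerate(zip(mean_ttr, sd_ttr)):
    # Folded normal over the time to reward; scipy's shape parameter is c = mu / sigma.
    w = foldnorm.pdf(tau, c=m / s, scale=s)
    w /= w.sum()
    for i, g in enumerate(gammas):
        values[i, j] = np.sum(g ** tau * w)  # V_i = E_tau[gamma_i ** tau]

# Pre-reward TD-like error, with the per-step discount matched to the step duration.
for i, g in enumerate(gammas):
    delta = (g ** dt) * values[i, 1:] - values[i, :-1]
    print(g, values[i, [0, 25, -1]].round(3), delta[[0, -1]].round(4))
# Values rise toward the reward for every gamma, but with gamma-dependent curvature,
# so the pre-reward errors differ across neurons -- the model's source of ramp diversity.
```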

Extended Data Fig. 10 Decoding reward timing in the cued delayed reward task using parameters inferred in the VR task and details of recordings.

a, Discount matrix computed using the parameters inferred in the VR task for neurons recorded across both tasks and used in the cross-task decoding. b, Dopamine neuron cue responses in the cued delay task. Neurons are aligned as in a according to increasing discount factor inferred in the VR task. c, Top row, decoded reward timing using discount factors inferred in the VR task. Bottom row, the ability to decode reward timing is lost when shuffling the identities of the cue responses. d, Except for the shortest delay, decoded reward timing is more accurate than the shuffle, as measured by the 1-Wasserstein distance (P(t = 0.6 s) = 1, P(t = 1.5 s) < 1.1 × 10−20, P(t = 3.75 s) < 3.8 × 10−20, P(t = 9.375 s) < 2.9 × 10−5). e, Breakdown of the number of recorded neurons per animal and task. The numbers in parentheses indicate the number of neurons included in the analysis; ± indicates the standard deviation across sessions. The maximum number of neurons recorded in a single session was 4 in both the cued delay task and the VR task.

Supplementary information

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Masset, P., Tano, P., Kim, H.R. et al. Multi-timescale reinforcement learning in the brain. Nature 642, 682–690 (2025). https://doi.org/10.1038/s41586-025-08929-9
