Extended Data Fig. 10: A foraging mouse must decide which patch to visit to maximize cumulative collected reward in a non-stationary environment.
From: A multidimensional distributional map of future reward in dopamine neurons

a, Axes indicate the learnt joint probability distribution over reward time and magnitude associated with each patch. Rewards available from the three patches are dynamic during the day: patch 1 provides lower reward magnitudes early in the day; patch 2 provides variable reward magnitudes with the same mean as patch 1 early in the day; patch 3 provides larger reward magnitudes than either patch 1 or patch 2, but later in the day. b, In the agent using the SR, the value of each patch at the start of the day is obtained by multiplying the temporally discounted future occupancy, learnt through a temporal-difference algorithm, by the reward at each future time step. In the agent using TMRL, the probability distribution over future reward time and magnitude is weighted by a utility function to obtain a value estimate that depends on internal state and/or the dynamics of the environment (see Methods for a detailed description). The utility is represented using a colour gradient, ranging from low (black) to high (yellow). c, Adaptation of the SR and TMRL agents when the timescale of the environment changes from dusk to dawn. The SR future occupancy has to be relearnt with a lower temporal discount factor. The TMRL utility function discounts reward time more steeply at dawn. d, Illustration of how the SR and TMRL agents adapt when reward is over-valued, as may occur, for example, when a sated mouse becomes hungry. For the SR agent, the reward signal increases. For the TMRL agent, the utility function over reward magnitudes is linear when the mouse is sated and becomes convex when the mouse is hungry. e, When the mouse is hungry and has less time to forage at dawn, it may become more risk-prone. The SR agent has to relearn the future occupancy and over-value the reward. The TMRL utility function, in addition to discounting reward time more steeply at dawn, applies a convex function to reward magnitudes. f, Probability of choosing the optimal patch (patch 1 or 2) on the first trial after dawn for the three algorithms. g, Probability of choosing the optimal patch (patch 2) on the first trial after the mouse becomes hungry for the three algorithms. h, Probability of choosing the optimal patch (patch 2) on the first trial after dawn, when the agent is hungry, for the three algorithms. i, Value of each patch as a function of time since dawn for standard TDRL, SR and TMRL. Error bars represent the standard deviation over 10 runs and lines represent the mean. j, Value of each patch as a function of time since the mouse became hungry. k, Value of each patch as a function of time since dawn when the mouse is hungry. Data underlying the figure can be found in the Supplementary Data.
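The valuation step summarized in panels b–e can be sketched in a few lines of code. The snippet below is an illustrative reconstruction rather than the code used in the study: the parameter names (gamma, lam, alpha) and the specific utility form exp(-lam*t) * m**alpha are assumptions chosen to mirror the legend, with lam setting how steeply reward time is discounted (steeper at dawn) and alpha setting the curvature of the utility over reward magnitude (linear when sated, convex when hungry).

import numpy as np

times = np.arange(1, 11)                 # candidate reward delays (arbitrary units)
magnitudes = np.array([1.0, 2.0, 4.0])   # candidate reward magnitudes

def patch_value_sr(occupancy, expected_reward, gamma=0.9):
    # SR-style value (panel b): temporally discounted future occupancy multiplied
    # by the reward expected at each future time step, summed over time.
    return np.sum((gamma ** times) * occupancy * expected_reward)

def utility(t, m, lam=0.1, alpha=1.0):
    # Hypothetical utility over reward time and magnitude: lam sets how steeply
    # reward time is discounted; alpha sets the curvature over magnitude
    # (alpha = 1 is linear, alpha > 1 is convex and therefore risk-prone).
    return np.exp(-lam * t) * m ** alpha

def patch_value_tmrl(joint_p, lam=0.1, alpha=1.0):
    # TMRL-style value (panel b): the learnt joint distribution over reward time
    # and magnitude weighted by the utility function. A change in internal state
    # or in the timescale of the environment only changes (lam, alpha); the
    # learnt distribution itself does not need to be relearnt (panels c-e).
    T, M = np.meshgrid(times, magnitudes, indexing="ij")
    return np.sum(joint_p * utility(T, M, lam=lam, alpha=alpha))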
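Continuing the sketch above with hypothetical numbers (not the simulation parameters of the study), the same learnt distribution can be re-valued at dawn (steeper time discount), after the mouse becomes hungry (convex magnitude utility), or both, and the choice probabilities of panels f–h can then be read out, for example, with a softmax over patch values.

joint_p = np.zeros((len(times), len(magnitudes)))
joint_p[8, 2] = 1.0                      # a patch paying magnitude 4.0 late (t = 9)

v_baseline = patch_value_tmrl(joint_p, lam=0.05, alpha=1.0)  # sated, long horizon
v_dawn     = patch_value_tmrl(joint_p, lam=0.30, alpha=1.0)  # panel c: steeper discount
v_hungry   = patch_value_tmrl(joint_p, lam=0.05, alpha=2.0)  # panel d: convex utility
v_both     = patch_value_tmrl(joint_p, lam=0.30, alpha=2.0)  # panel e: both changes

def choice_probabilities(patch_values, beta=3.0):
    # Softmax read-out of patch choice; beta is an illustrative inverse temperature.
    v = np.asarray(patch_values, dtype=float)
    z = np.exp(beta * (v - v.max()))
    return z / z.sum()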