Fig. 5: RL models suggest that mice attempt to maximize dopamine during spontaneous behaviour.

From: Spontaneous behaviour is structured by reinforcement without explicit reward

a, Top, schematic describing modification of a standard RL model to explore relationships between DLS dopamine fluctuations and behavioural choices. Bottom, schematics of ‘reinforcement-only’ and ‘full’ model variants (Methods). b, Left, empirical transition matrix (TM) observed during open field behaviour. Centre, an example transition matrix learned by the full model (top), along with the squared difference between the empirically observed transition matrix and the example learned transition matrix (bottom). Right, as in centre, except for the reinforcement-only model. The average correlation between the observed transition matrix and the transition matrix learned by each model, along with the associated P-value computed via shuffle test, is given for each model type. For visualization, the model transition matrix is estimated by taking a softmax (see Methods) over the Q-table learned by the model. Here, the temperature parameter was set to 0.1 for visualization only. c, The distribution of correlations between the learned and observed transition matrices for both the reinforcement-only (blue) and full (orange) models, compared to a histogram of correlations between transition matrices learned with time-shuffled dLight traces (all models are statistically significant according to a shuffle test, defined as model fits exceeding 95% of shuffle correlation values, n = 100 shuffles). d, The performance of the full model after temporally shifting syllable-associated dLight amplitudes across syllables over various lags. e, The distribution of log-likelihoods for models that consider dopamine as a reward versus a reward-prediction error (RPE) signal (Methods). The log-likelihoods shown are for the best parameterization for each model type across 50 bootstraps of the dataset. On the basis of this comparison, models were formulated that treated dopamine transients as representing reward rather than reward-prediction error.
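The following is a minimal sketch, not the authors' implementation, of the ingredients the caption describes: a tabular Q-learning update over syllable transitions that treats the syllable-associated dLight amplitude as the reward (the reinforcement-only variant), a softmax over the Q-table with temperature 0.1 to read out a transition matrix as in panel b, and a shuffle test against time-shuffled dLight traces as in panel c. All function and variable names here are hypothetical; the paper's actual model and parameterization are given in Methods.

```python
import numpy as np

def softmax(q_row, temperature=0.1):
    """Convert one row of the Q-table into transition probabilities."""
    z = (q_row - q_row.max()) / temperature
    p = np.exp(z)
    return p / p.sum()

def fit_reinforcement_only(syllables, dlight_amp, n_syllables, alpha=0.1):
    """Tabular Q-learning over syllable transitions, treating the
    syllable-associated dLight amplitude as the reward signal.

    syllables  : 1-D int array, observed syllable sequence
    dlight_amp : 1-D float array, dLight amplitude aligned to each syllable
    """
    Q = np.zeros((n_syllables, n_syllables))
    for t in range(len(syllables) - 1):
        s, s_next = syllables[t], syllables[t + 1]
        r = dlight_amp[t + 1]                     # dopamine transient as reward
        Q[s, s_next] += alpha * (r - Q[s, s_next])
    return Q

def q_to_transition_matrix(Q, temperature=0.1):
    """Softmax each row of the Q-table (visualization step in panel b)."""
    return np.vstack([softmax(row, temperature) for row in Q])

def shuffle_test(syllables, dlight_amp, empirical_tm, n_syllables,
                 n_shuffles=100, seed=None):
    """Correlate model and empirical TMs, then compare against a null
    distribution of models fit on time-shuffled dLight traces (panel c)."""
    rng = np.random.default_rng(seed)

    def tm_correlation(Q):
        tm = q_to_transition_matrix(Q)
        return np.corrcoef(tm.ravel(), empirical_tm.ravel())[0, 1]

    observed = tm_correlation(
        fit_reinforcement_only(syllables, dlight_amp, n_syllables))
    null = [tm_correlation(
                fit_reinforcement_only(syllables,
                                       rng.permutation(dlight_amp),
                                       n_syllables))
            for _ in range(n_shuffles)]
    # Significance as in the caption: model fit must exceed 95% of shuffles.
    p_value = np.mean(np.array(null) >= observed)
    return observed, null, p_value
```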