Extended Data Fig. 6: APE model and tests.
From: Dopaminergic action prediction errors serve as a value-free teaching signal

a, The model comprises an actor that learns stimulus-action values and guides action choices, and a critic that learns a value function that is used to calculate the RPE. The RPE signal is broadcast to the actor and critic to update their respective value functions. A value-free system learns to predict actions from those taken in the past and updates its prediction using the difference between its prediction and the action taken (APE). APE and RPE equations are written with respect to time (t), as is common, for illustrative purposes. For the model equations we use dwell time in the state (k) to approximate temporal discounting; see Methods. b, The Markov decision process used to model the task. c, Correlation between the turn angle and the size of the dopamine response in the TS for all trials in all sessions of an example mouse. d, Correlation between the average speed of an example mouse and the TS dopamine response for all trials in all sessions. e, Linear regression coefficients for speed and turn angle on single-trial TS dopamine responses (n = 6 mice) for the first three sessions of training. Stats: one-sample two-sided t-test against zero, speed: (p = 0.448, Cohen’s d = −0.34), turn angle: (p = 0.033, Cohen’s d = 1.20). Filled circles represent significant correlations for individual mice. Error bars represent 95% confidence interval. f, Turn angle of an example mouse over the course of training, binned per 40 trials. g, Average speed during a choice of an example mouse over the course of training, binned per 40 trials. h, Linear regression coefficients for the effect of trial number on speed or turn angle at a single-trial level (n = 6 mice). Stats: one-sample two-sided t-test against zero, speed: (p = 0.154, Cohen’s d = 0.68), turn angle: (p = 0.340, Cohen’s d = −0.43). Filled circles represent significant correlations for individual mice. Error bars represent 95% confidence interval.
i, TS dopamine response of an example mouse over the course of training, binned per 40 trials (blue). A linear regression model was built using average speed and turn angle to predict the TS dopamine signal. The model prediction from just the movement parameters over the course of training is shown in gold (binned per 40 trials). j, The movement model used in panel i was subtracted per trial from the TS dopamine responses to give the remaining signal that was not explained by speed or turn angle (residuals in blue). A new linear regression model was built using log trial number to account for the remaining TS dopamine signal (purple). Both are shown binned per 40 trials. k, The correlation between the individual trial residual dopamine responses and log trial number for an example mouse. l, Regression coefficients for the effect of log trial number on the residual dopamine response (n = 6 mice) (filled circles show significant correlations for individual mice). One-sample two-sided t-test against 0, p = 0.003, Cohen’s d = −2.14. Error bars represent 95% confidence interval. m, Total model variance explained by each parameter in a model where speed, turn angle and trial number are used to predict the size of the TS dopamine response throughout learning (n = 6 mice). n, Difference in TS dopamine response between the last 40 trials of one session and the first 40 trials of the next session (between sessions), and between the first 40 trials and the last 40 trials of the same session (within session) (n = 6 mice; one-sample two-sided t-tests against 0; between sessions: p = 0.27, Cohen’s d = 0.51; turn angle: p = 0.05, Cohen’s d = −1.07). Error bars represent 95% confidence interval. o, Performance in the 50 trials before and after the state change (n = 13 mice, p = 1.98 × 10⁻⁴ paired two-sided t-test, Cohen’s d = 1.46).
p, Changes in turn angle before and after the state change (n = 13 mice, p = 0.85 paired two-sided t-test, Cohen’s d = −0.08). q, Changes in average speed before and after the state change (n = 13 mice, p = 0.02 paired two-sided t-test, Cohen’s d = 1.42). r, Performance before and after state change at trial 150 (black dashed line) binned per 20 trials (n = 13 mice). Green lines show mean, error bars represent sem, grey lines represent data from individual mice. s, Same as r but showing the response time following the state change. t, Same as r but showing the bias towards ipsilateral choices following the state change. u, Behavioral bias towards the large reward port before and after the change in value. The last 50 trials from each block are used for analysis, with blocks being a minimum of 70 trials (n = 10 mice, p = 0.002 paired two-sided t-test, Cohen’s d = 1.39). v, Percentage of trials where mice did not make a choice before and after the value change is introduced at trial 100 (black dashed line) binned per 20 trials (n = 10 mice). Green lines show mean, error bars represent sem, grey lines represent data from individual mice. w, Same as v but showing the change in performance. x, Same as v but showing the change in response time. y, Same as v but showing the change in choice bias over the course of the session. All boxplots show the range from the first to the third quartile (Q1 to Q3) and the median; whiskers extend to the farthest data point lying within 1.5× the inter-quartile range (IQR) from the box.
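
The two-stage regression described for panels i to l can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the helper names (ols, predict, two_stage_fit), the plain least-squares solver and all of the synthetic data are assumptions made for the example. Stage 1 fits the single-trial TS dopamine response from average speed and turn angle; stage 2 regresses the per-trial residuals on log trial number.

```python
import math
import random


def ols(columns, y):
    """Ordinary least squares with an intercept.

    Returns [intercept, b1, b2, ...], one slope per predictor column.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X^T X) beta = X^T y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))]
         for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][p] / A[r][r] for r in range(p)]


def predict(beta, columns):
    """Evaluate a fitted linear model on the given predictor columns."""
    n = len(columns[0])
    return [beta[0] + sum(b * col[i] for b, col in zip(beta[1:], columns))
            for i in range(n)]


def two_stage_fit(speed, turn_angle, dopamine):
    # Stage 1: movement-only model (speed and turn angle).
    movement_beta = ols([speed, turn_angle], dopamine)
    movement_pred = predict(movement_beta, [speed, turn_angle])
    # Residuals: the part of the signal not explained by movement.
    residuals = [d - m for d, m in zip(dopamine, movement_pred)]
    # Stage 2: regress the residuals on log trial number (trials start at 1).
    log_trial = [math.log(t) for t in range(1, len(dopamine) + 1)]
    trial_beta = ols([log_trial], residuals)
    return movement_beta, residuals, trial_beta


# Synthetic data, invented for illustration only (not real measurements):
# a response that depends weakly on turn angle and declines with log trial.
rng = random.Random(0)
n_trials = 400
speed = [rng.gauss(30.0, 5.0) for _ in range(n_trials)]
turn_angle = [rng.gauss(90.0, 10.0) for _ in range(n_trials)]
dopamine = [0.01 * a - 0.3 * math.log(t) + rng.gauss(0.0, 0.1)
            for t, a in zip(range(1, n_trials + 1), turn_angle)]

movement_beta, residuals, trial_beta = two_stage_fit(speed, turn_angle, dopamine)
print("movement coefficients (intercept, speed, turn angle):", movement_beta)
print("log-trial coefficients (intercept, slope):", trial_beta)
```

In this synthetic example the movement predictors are independent of trial number, so the stage 2 slope recovers the simulated decline. If speed or turn angle drifted with training, stage 1 would absorb part of that trend, which is the reason panel h tests for such drift.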