
Extended Data Fig. 15: Influence of policy on backward shifting of place fields.

From: Predictive coding of reward in the hippocampus


a, The task set-up is the same, except that: (1) the agent learns a navigation policy from place-cell activity via an actor with forward/backward actions, and (2) place-field peaks (λ) are modulated by both critic and actor weights (ref. 11). b, As trials progress, the agent learns to reach the reward faster (reduced latency). Performance increases (orange) or decreases (blue) are defined by trial-to-trial latency changes. c, The TD error shifts backwards with higher stochasticity across trials. d, The value estimate of each state is updated backwards across trials. e, At trial 0, basis-function peaks are uniformly distributed across states 0 to 9. As learning progresses, the backward-shifting TD error causes place-cell peaks to shift backwards, with pronounced backward shifts between states 4 and 8. Place-field centres cluster at the centre of each discrete state to support policy learning. f, Similar to the value-learning model (Fig. 5), we observe three shift patterns: (1) reward cells (yellow) shift backwards; (2) approach cells (green) overshoot and then shift back; (3) screen cells (purple) shift forwards later in learning. g, Twelve examples of individual reward place cells’ activity dynamics between trials 0 and 500. The white dashed line indicates state 7. h, Place cells initialized between states 7 and 8 show pronounced backward shifts across trials. i, Correlation between peak-shift dynamics and session number across cell types. j, Both reward and reward-approach cells show a significantly higher proportion of backward-shifting dynamics, whereas screen cells show a significantly higher proportion of forward-shifting dynamics, replicating the proportions in Fig. 4e. k, Reward cells increase rapidly early in learning and then decline. l,m, Quantification of the average backward shift of reward and reward-approach cells across all consecutive days (first column). When consecutive days are grouped by whether performance increased or decreased, days with reduced performance show greater backward shifting (second column) than days with improved performance (third column). l shows modelling results (two-sided t-test, P = 10^−173) and m shows experimental results (two-sided t-test, P = 0.0051). l: n (consecutive) = 998; n (decreased) = 476; n (increased) = 458. m: n (consecutive) = 121; n (decreased) = 53; n (increased) = 68. n,o, Hyperparameter sweep for the value-estimation agent. Because the critic’s weight vector w_v was initialized at zero and the policy consistently took the shortest path, there is no stochasticity in the value agent or place-cell dynamics; hence, the results are from a single seed. n, The change (Δ) in percentage of cells is the difference in the number of cells at a given state between trials 0 and 1,000. Increasing the total number of available place cells (N) caused a monotonic decrease in the number of cells at the reward state and an increase at the approach and screen states, similar to the experimental results (γ = 0.95, σ = 0.5). o, As the place-field spread (σ) increased, the number of cells at the reward state decreased while the numbers at the screen and approach states increased, similar to the experimental results. Beyond σ = 0.8, the decrease in the reward-cell percentage was smaller, although still negative (γ = 0.95, N = 1,000). Hence, the experimentally observed decrease in the number of cells at the reward state is robust across different numbers of cells and place-field widths in our numerical simulations. Bar graphs and error bars in i,l,m show mean ± s.e.m.
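To make the scheme in panel a concrete, the sketch below implements a minimal TD(0) actor-critic on a one-dimensional track of ten states with Gaussian place-cell basis functions, in which each place-field peak (λ) is nudged by the TD error weighted by that cell's critic and actor weights. This is an illustrative reconstruction, not the authors' code: the learning rates, field width, reward location and the exact peak-update rule are assumptions; only the overall structure (place-cell features feeding a critic and a forward/backward actor, with TD-error-driven backward shifts of the peaks) follows the caption.

```python
# Minimal sketch (not the authors' code) of the actor-critic place-field model
# described in panel a: Gaussian place cells tile a 1-D track of discrete states,
# a critic learns state values and an actor learns forward/backward actions from
# place-cell activity, and each place-field peak (lambda) is nudged by the TD
# error weighted by that cell's critic and actor weights. Learning rates, field
# width (sigma), reward location and the peak-update rule are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_cells, gamma = 10, 100, 0.95
sigma = 0.5
alpha_v, alpha_pi, alpha_lam = 0.05, 0.05, 0.01

lam = rng.uniform(0, n_states - 1, n_cells)    # place-field peaks (states 0-9)
w_v = np.zeros(n_cells)                        # critic weights
w_pi = np.zeros((2, n_cells))                  # actor weights: backward/forward

def phi(s):
    """Gaussian place-cell activity at (continuous) position s."""
    return np.exp(-0.5 * ((s - lam) / sigma) ** 2)

for trial in range(500):
    s = 0.0
    for t in range(200):
        x = phi(s)
        logits = w_pi @ x
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = rng.choice(2, p=p)                 # 0 = backward, 1 = forward
        s_next = np.clip(s + (1 if a == 1 else -1), 0, n_states - 1)
        r = 1.0 if s_next >= n_states - 1 else 0.0   # reward at track end (assumed)
        v, v_next = w_v @ x, (0.0 if r else w_v @ phi(s_next))
        delta = r + gamma * v_next - v         # TD error
        w_v += alpha_v * delta * x             # critic update
        grad = -p[:, None] * x; grad[a] += x   # softmax policy gradient
        w_pi += alpha_pi * delta * grad        # actor update
        # Peak shift: a positive TD error moves active fields backwards (towards
        # the start), scaled by each cell's critic/actor weights (assumed form).
        lam -= alpha_lam * delta * x * (np.abs(w_v) + np.abs(w_pi).sum(0))
        lam = np.clip(lam, 0, n_states - 1)
        s = s_next
        if r:
            break
```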

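The hyperparameter sweep in panels n,o can be sketched in the same spirit: a value-only agent with the critic initialized at zero and a deterministic forward (shortest-path) policy is run once per (N, σ) setting, and the change in the percentage of cells peaking at the reward state is compared between trial 0 and trial 1,000. The reward location, state labels and peak-update rule below are again assumptions for illustration, not the authors' exact definitions.

```python
# Illustrative sketch of the panel n,o sweep: a deterministic value-only agent
# (critic at zero, always stepping forward) run once per (N, sigma) setting,
# reporting the change in the percentage of cells peaking at the reward state
# between trial 0 and trial 1,000. All specifics are assumptions.
import numpy as np

def run_value_agent(n_cells, sigma, gamma=0.95, n_trials=1000,
                    n_states=10, alpha_v=0.05, alpha_lam=0.01):
    lam = np.linspace(0, n_states - 1, n_cells)   # uniform peaks at trial 0
    w_v = np.zeros(n_cells)                       # critic initialized at zero
    pct_reward_0 = np.mean(np.round(lam) == n_states - 1) * 100
    for _ in range(n_trials):
        for s in range(n_states - 1):             # shortest path: always forward
            x = np.exp(-0.5 * ((s - lam) / sigma) ** 2)
            x_next = np.exp(-0.5 * ((s + 1 - lam) / sigma) ** 2)
            r = 1.0 if s + 1 == n_states - 1 else 0.0
            v_next = 0.0 if r else w_v @ x_next
            delta = r + gamma * v_next - w_v @ x  # TD error
            w_v += alpha_v * delta * x            # critic update
            # Assumed peak-update rule: TD error shifts active fields backwards.
            lam = np.clip(lam - alpha_lam * delta * x * np.abs(w_v),
                          0, n_states - 1)
    pct_reward_T = np.mean(np.round(lam) == n_states - 1) * 100
    return pct_reward_T - pct_reward_0            # Δ% of reward cells

for n_cells in (100, 500, 1000):                  # sweep over N (cf. panel n)
    print(n_cells, run_value_agent(n_cells, sigma=0.5))
for sigma in (0.2, 0.5, 0.8, 1.2):                # sweep over sigma (cf. panel o)
    print(sigma, run_value_agent(1000, sigma))
```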