Fig. 1: Meta-learning theory and paradigm.

A Motor learning as a sequential decision-making process. The action \(u^{(k)}\) updates the memory and sensory-prediction-error states \(\{x^{(k)}, e^{(k)}\}\) to the next states \(\{x^{(k+1)}, e^{(k+1)}\}\) and generates a reward \(r(x^{(k+1)})\) in the given environment (p: perturbation). The action \(u^{(k)}\) responds to \(\{x^{(k)}, e^{(k)}\}\), is characterized by the meta-parameter \(\boldsymbol{\theta}\), and is influenced by memory noise \(n_x^{(k)}\); that is, it is drawn from a policy distribution \(u^{(k)} \sim \pi_{\boldsymbol{\theta}}(u^{(k)} \mid x^{(k)}, e^{(k)})\)42. This aligns with previous models of error-based motor learning, \(x^{(k+1)} = \alpha x^{(k)} + \beta e^{(k)} + n_x^{(k)}\) (α: retention rate, β: learning rate)17,21,23, when the learner has a linear policy function. B The primary hypothesis of this study is that the meta-parameter \(\boldsymbol{\theta} = [\alpha, \beta]^{T}\) is updated by a reinforcement learning rule (policy gradient18) to maximize rewards and minimize punishments: \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}} \cdot r(x^{(k+1)})\). C Simulated change of the meta-parameters under two opposite reward functions. Reward is given for learning in “Promote” (magenta) and for not learning in “Suppress” (cyan). Reinforcement learning upregulates \(\boldsymbol{\theta} = [\alpha, \beta]^{T}\) to learn faster in Promote, whereas it downregulates them to learn slower in Suppress. a.u. = arbitrary unit. D Meta-learning training. Learners experience a sensory-error (E) trial, in which the sensory prediction error e is induced by a cursor rotation p while the task error is clamped (TE clamp). They then experience a reward (R) trial, in which the updated memory u, manifested as an aftereffect h = T − x, is evaluated with the reward function r. Promote and Suppress were implemented by linking the aftereffect to reward in opposite directions. Reward is delivered as a numerical score tied to a monetary reward. E The task schedule. Learners repeat meta-learning training, which comprises pairs of E and R trials plus Null trials (in which veridical cursor feedback is given). After every six repetitions of training, they perform a probe task, developed from previously established motor learning paradigms, to estimate the learning parameters. The simulated reach behavior and changes in \(\boldsymbol{\theta}\) are plotted for Promote and Suppress. F The task is separated into four blocks, and behavior is analyzed block by block. The first block (pink) is the baseline condition, in which the score is absent in R trials.
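For concreteness, a minimal sketch of the single-trial learning rule in panel A is given below. The functional form of the sensory prediction error (e = p − x), the noise magnitude, and all numerical values are illustrative assumptions, not values taken from the experiment.

```python
import numpy as np

def motor_learning_step(x, p, alpha, beta, sigma_x=0.05, rng=None):
    """One trial of error-based motor learning (panel A).

    x       : current motor memory state
    p       : imposed perturbation (cursor rotation)
    alpha   : retention rate
    beta    : learning rate
    sigma_x : std of the memory noise n_x (illustrative value)
    """
    rng = np.random.default_rng() if rng is None else rng
    e = p - x                          # sensory prediction error (assumed form)
    n_x = rng.normal(0.0, sigma_x)     # memory noise sample
    x_next = alpha * x + beta * e + n_x
    return x_next, e
```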
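The policy-gradient meta-update in panels B and C can be sketched as follows, assuming a Gaussian policy \(u^{(k)} \sim \mathcal{N}(\alpha x^{(k)} + \beta e^{(k)}, \sigma_u^2)\), for which \(\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}} = (u - \mu)/\sigma_u^2 \cdot [x, e]^{T}\). The reward shapes for Promote and Suppress, the step size eta, the [0, 1] clipping of the rates, and the initial values of α and β are illustrative choices intended only to reproduce the qualitative up- and downregulation shown in panel C, not the paper's implementation.

```python
import numpy as np

def simulate_meta_learning(condition="Promote", n_trials=500, eta=0.01,
                           sigma_u=0.1, p=1.0, seed=0):
    """REINFORCE-style update of theta = [alpha, beta] (panels B, C)."""
    rng = np.random.default_rng(seed)
    alpha, beta = 0.5, 0.2              # illustrative initial meta-parameters
    x = 0.0                             # motor memory state
    for _ in range(n_trials):
        e = p - x                       # sensory prediction error (assumed form)
        mu = alpha * x + beta * e       # mean of the Gaussian policy
        u = rng.normal(mu, sigma_u)     # sampled action = updated memory
        x_next = u
        # Opposite reward mappings (illustrative): Promote rewards a large
        # aftereffect (learning); Suppress rewards a small one (not learning).
        r = abs(x_next) if condition == "Promote" else -abs(x_next)
        # theta <- theta + eta * grad_theta(log pi) * r
        score = (u - mu) / sigma_u**2
        alpha += eta * score * x * r
        beta += eta * score * e * r
        # Keep the rates in a plausible [0, 1] range (illustrative constraint).
        alpha = float(np.clip(alpha, 0.0, 1.0))
        beta = float(np.clip(beta, 0.0, 1.0))
        x = x_next
    return alpha, beta

# Example usage: alpha and beta drift upward under Promote and downward under Suppress.
print(simulate_meta_learning("Promote"))
print(simulate_meta_learning("Suppress"))
```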