Fig. 2: Motivation and framework of the proposed approach.
From: Discovery of the reward function for embodied reinforcement learning agents

a Humans observe the world and manually design a reward function, which becomes increasingly difficult as task complexity increases.
b Two schemes for designing reward functions automatically. One learns from expert demonstrations, and the other leverages human preference labels.
c Regret of policies learned with different reward functions. The red and blue lines correspond to a more effective and a less effective reward function, respectively. A better reward function is associated with lower policy regret and faster convergence.
d Overview of the proposed bilevel optimization framework for optimal reward function discovery. In the lower level, the embodied RL agent interacts with a world model, receives reward signals, and optimizes its policy. Interaction trajectories are stored in a buffer. In the upper level, a mini-batch of trajectories is sampled and decomposed into interaction steps to estimate the policy distribution and the advantage function. The reward function is then updated via the approximated meta-gradient (Theorem 2). This iterative process optimizes the reward function and the policy of the RL agent simultaneously.
e Comparison of well-designed and poorly designed reward functions. A well-designed reward function assigns a high reward (represented by the arrow length) to the optimal action. A poorly designed reward function fails to distinguish the optimal action, so the agent takes incorrect actions.
f Discovery of the optimal reward function. The reward function is randomly initialized and then learned by the proposed framework. The discovered reward function accurately identifies the optimal action and assigns an appropriate reward to each action: higher rewards are associated with more effective actions, and lower rewards with less effective ones.
Icons from Icons8.com.
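The alternating loop described for panel d can be illustrated with a toy sketch. The code below is not the paper's algorithm: it assumes a three-armed bandit instead of a world model, omits the trajectory buffer and mini-batch sampling, and uses an exact one-step meta-gradient in place of the approximated meta-gradient of Theorem 2, which this caption does not state. All names (`discover_reward`, `task_payoff`, `policy_lr`, `reward_lr`) and all numeric values are hypothetical choices made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected(pi, values):
    """Expectation of `values` under the action distribution `pi`."""
    return sum(p * v for p, v in zip(pi, values))


def policy_gradient(pi, values):
    """Exact gradient of E_pi[values] w.r.t. softmax logits: pi_a * (v_a - E_pi[v])."""
    baseline = expected(pi, values)
    return [p * (v - baseline) for p, v in zip(pi, values)]


def discover_reward(task_payoff, steps=2000, policy_lr=0.1, reward_lr=50.0, seed=0):
    """Toy bilevel loop (assumed setup, not the paper's method).

    Lower level: one policy-gradient step on the learned reward `phi`.
    Upper level: one-step meta-gradient of the task payoff w.r.t. `phi`.
    """
    rng = random.Random(seed)
    n = len(task_payoff)
    theta = [0.0] * n                               # policy logits
    phi = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # randomly initialized reward

    for _ in range(steps):
        pi = softmax(theta)

        # Lower level: improve the policy using only the learned reward.
        g = policy_gradient(pi, phi)
        theta_new = [t + policy_lr * gi for t, gi in zip(theta, g)]

        # Upper level: how does task performance after that step depend on phi?
        pi_new = softmax(theta_new)
        d_task = policy_gradient(pi_new, task_payoff)  # dJ / d theta_new
        # d theta_new / d phi = policy_lr * (diag(pi) - pi pi^T), so the chain
        # rule gives policy_lr * pi_c * (d_task_c - pi . d_task) per action c.
        dot = expected(pi, d_task)
        meta = [policy_lr * p * (d - dot) for p, d in zip(pi, d_task)]

        phi = [f + reward_lr * mg for f, mg in zip(phi, meta)]
        theta = theta_new

    return phi, softmax(theta)


if __name__ == "__main__":
    learned_reward, policy = discover_reward([0.1, 0.9, 0.3])
    print("learned reward:", learned_reward)
    print("final policy:  ", policy)
```

In this sketch the policy never sees `task_payoff` directly; it is trained only on the learned reward `phi`, while `phi` is adjusted so that the resulting policy step improves task performance. That separation is the point of the bilevel structure, even though the estimators used here are far simpler than those the caption describes.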