Fig. 5: Performance of the generalized world model across various environments.
From: Model-based reinforcement learning for ultrasound-driven autonomous microrobots

a, Reward progression over the number of training steps across ten distinct environments, including empty, four squares, racetrack, vascular and several maze configurations. Each coloured line represents the centred EWMA (α = 0.002) of the reward for each environment, with shaded regions indicating ±0.5 of the rolling standard deviation (window size of 1,000 steps). Convergence was faster in simpler environments, whereas more complex ones required additional training steps. After 4.5 million steps, marked by a vertical dashed red line, the model transitioned from pretraining across the ten simulation environments to adaptation within a new multi-output tributary channel. The black curve denotes the average pretraining performance across all environments. Following the transition, the model adapted rapidly, achieving stable performance within ~50,000 steps (approximately 30 min) in the new environment. b, Success rate of targets reached across different environments, plotted against the number of training steps. The box plots illustrate the variability and distribution of the performance of the MBRL algorithm in successfully reaching targets. Although simpler environments facilitated quicker convergence, our MBRL model consistently attained convergence across all scenarios. Steps are grouped into logarithmic bins from 0 to 4 million steps, and each box summarizes the target-reaching rates of a training run within each bin. Boxes indicate the IQR, the horizontal line marks the median, whiskers extend to 1.5 × IQR and outliers are omitted for clarity.
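The smoothing described for panel a (a centred EWMA with α = 0.002 plus a band of ±0.5 rolling standard deviations over a 1,000-step window) can be reproduced approximately as follows. This is a minimal sketch of one plausible implementation, not the authors' code: since pandas' `ewm` is causal, the sketch "centres" it by averaging forward and backward passes, and the function name `smooth_reward_curve` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def smooth_reward_curve(rewards, alpha=0.002, window=1000, band_scale=0.5):
    """Approximate the caption's smoothing: a centred EWMA of the reward
    trace plus a +/- band_scale * rolling-std band.

    Note: pandas' ewm() is causal (one-sided); averaging a forward and a
    backward pass is one common way to approximate a centred EWMA.
    """
    s = pd.Series(rewards, dtype=float)
    fwd = s.ewm(alpha=alpha).mean()           # forward (causal) EWMA
    bwd = s[::-1].ewm(alpha=alpha).mean()[::-1]  # backward pass, re-reversed
    centred = (fwd + bwd) / 2.0
    # Rolling standard deviation over a centred window of `window` steps.
    band = band_scale * s.rolling(window, center=True, min_periods=2).std()
    return centred, band
```

A curve like those in panel a would then be drawn as `centred` with `centred ± band` as the shaded region.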
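The box-plot summary in panel b follows standard conventions (median, IQR box, whiskers at 1.5 × IQR) over logarithmic step bins. The sketch below shows one way to compute those per-bin statistics; the helper name `log_binned_boxstats` and the choice of bin edges are assumptions for illustration only.

```python
import numpy as np

def log_binned_boxstats(steps, rates, n_bins=8):
    """Group per-run target-reaching rates into logarithmic step bins and
    return median, IQR and 1.5*IQR whisker bounds per bin (outliers beyond
    the whiskers are simply not reported, matching 'omitted for clarity')."""
    steps = np.asarray(steps, dtype=float)
    rates = np.asarray(rates, dtype=float)
    edges = np.logspace(np.log10(max(steps.min(), 1.0)),
                        np.log10(steps.max()), n_bins + 1)
    # digitize against the interior edges so every sample lands in 0..n_bins-1
    idx = np.digitize(steps, edges[1:-1])
    stats = []
    for b in range(n_bins):
        vals = rates[idx == b]
        if vals.size == 0:
            continue
        q1, med, q3 = np.percentile(vals, [25, 50, 75])
        iqr = q3 - q1
        in_fence = vals[(vals >= q1 - 1.5 * iqr) & (vals <= q3 + 1.5 * iqr)]
        stats.append({"bin": (edges[b], edges[b + 1]),
                      "q1": q1, "median": med, "q3": q3,
                      "whiskers": (in_fence.min(), in_fence.max())})
    return stats
```

Each dictionary in the returned list corresponds to one box in panel b.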