Extended Data Fig. 2: Details of the experiments based on embodied multimodal large language models (MLLMs).
From: Emulating human-like adaptive vision for efficient and flexible machine visual perception

a, The network architecture and inference procedure of the AdaptiveNN-based embodied MLLM, which mainly follow RoboFlamingo (ref. 77). The backbone network is based on a pre-trained OpenFlamingo-3B model (ref. 117). Every two adjacent network blocks, coupled with the shared vision encoder, are employed as one perception net of AdaptiveNN. b, The metric employed in our experiments on CALVIN. Model performance is quantified as the average successful length (0 to 5), that is, the mean number of consecutively completed tasks, across 1,000 five-task sequences. Images are constructed based on the code of ref. 78 (MIT License).
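To make the metric in b concrete, the following is a minimal Python sketch of how the average successful length can be computed; `rollout_task` is a hypothetical stand-in for an actual policy rollout and is not part of the paper's evaluation code. Following the CALVIN protocol, each sequence ends at the first failed task.

```python
import random  # only used for the illustrative dummy policy below


def average_successful_length(rollout_task, num_sequences=1000, tasks_per_sequence=5):
    """Mean number of consecutively completed tasks (0 to 5) over evaluation sequences.

    rollout_task(task_idx) should return True if the policy solves the task.
    """
    total = 0
    for _ in range(num_sequences):
        completed = 0
        for task_idx in range(tasks_per_sequence):
            # The sequence stops at the first failure, as in the CALVIN protocol.
            if rollout_task(task_idx):
                completed += 1
            else:
                break
        total += completed
    return total / num_sequences


# Usage with a hypothetical stochastic policy that solves each task 80% of the time.
print(average_successful_length(lambda task_idx: random.random() < 0.8))
```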