Extended Data Table 2 Comparison with baselines on CALVIN

From: What matters in building vision–language–action models for generalist robots

  1. Simulation performance on the CALVIN benchmark. All models are trained on split ABCD/ABC and evaluated on split D. KosMos P.H. denotes the VLA that uses KosMos-2 as its backbone and a policy head as its architecture; it is built with the RoboVLMs framework and trained for at most 5 epochs. Here, KosMos refers to the VLM backbone we used, and “P.H.” refers to the policy head formulation. We use this backbone-and-structure naming for VLAs built with RoboVLMs throughout the rest of the paper.