Extended Data Table 5 The performance of the built VLAs based on VLMs with different image token numbers and VL pre-train data scales

From: What matters in building vision–language–action models for generalist robots

  1. The first three rows are Flamingo backbones with encoder-decoder structures; the remaining backbones are decoder-only. For VLMs with multi-stage training, the data scale refers to the amount of data used in the final fine-tuning stage. “UNK” denotes unknown. Results are reported for model checkpoints trained for 5 epochs on the ABCD training splits. All models are trained with a single side-view image for fair comparison. Surprisingly, both LLaVA and Qwen perform poorly without an additional resampler to reduce the number of tokens.