Extended Data Fig. 2: The architectures of MoE and considered VLAs. | Nature Machine Intelligence

Extended Data Fig. 2: The architectures of MoE and considered VLAs.

From: What matters in building vision–language–action models for generalist robots

(a) Illustration of the Mixture-of-Experts (MoE) structure. In the original VLAs, vision-language tokens and action tokens share the weights of the original VLM's feed-forward network (FFN). In the MoE structure, vision-language tokens and action tokens have separate query, key and value projection layers for self-attention, as well as separate feed-forward networks, so action tokens interact with vision-language tokens only through self-attention. This structure preserves the original parameters and forward process of the VLM, and is claimed to benefit the generalization of the resulting VLAs. (b) Illustration of the considered VLA formulations, including several popular designs. For example, RoboFlamingo is a Policy-Head-Continuous-type VLA, RT-2 and OpenVLA correspond to the One-Step-Discrete-Action-type VLA, and Octo and GR correspond to the Interleaved-Continuous-Action-type VLA with a fixed window size.
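The routing described in panel (a) can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the names `make_expert` and `moe_block`, the single attention head, the ReLU FFN, the omission of layer normalization, and the causal mask with action tokens placed after the vision-language tokens are all assumptions made for illustration. Each token uses the query, key, value and FFN weights of its own expert, and the two token types mix only inside the shared attention step.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def make_expert(d, d_ff, rng):
    """Hypothetical expert: its own Q/K/V projections and its own FFN."""
    return {
        "Wq": rand_matrix(d, d, rng),
        "Wk": rand_matrix(d, d, rng),
        "Wv": rand_matrix(d, d, rng),
        "W1": rand_matrix(d_ff, d, rng),
        "W2": rand_matrix(d, d_ff, rng),
    }


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_block(tokens, is_action, vl_expert, action_expert):
    """One simplified transformer block with per-token-type weights.

    Assumptions (not from the source): single head, no layer norm,
    causal attention, ReLU feed-forward network.
    """
    d = len(tokens[0])
    # Route every token to the expert that matches its type.
    experts = [action_expert if a else vl_expert for a in is_action]
    q = [matvec(e["Wq"], x) for e, x in zip(experts, tokens)]
    k = [matvec(e["Wk"], x) for e, x in zip(experts, tokens)]
    v = [matvec(e["Wv"], x) for e, x in zip(experts, tokens)]

    out = []
    for i, x in enumerate(tokens):
        # Shared self-attention: token i attends to tokens 0..i of any type.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        attn = [sum(weights[j] * v[j][c] for j in range(i + 1)) for c in range(d)]
        h = [xi + ai for xi, ai in zip(x, attn)]  # residual connection
        # Type-specific feed-forward network.
        hidden = [max(0.0, z) for z in matvec(experts[i]["W1"], h)]
        ffn = matvec(experts[i]["W2"], hidden)
        out.append([hi + fi for hi, fi in zip(h, ffn)])
    return out


rng = random.Random(0)
vl_expert = make_expert(4, 8, rng)      # stands in for the pretrained VLM weights
action_expert = make_expert(4, 8, rng)  # stands in for the new action weights
vl_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
action_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
outputs = moe_block(
    vl_tokens + action_tokens,
    [False] * 3 + [True] * 2,
    vl_expert,
    action_expert,
)
```

Under these assumptions, the outputs at the vision-language positions are the same as when the vision-language expert processes those tokens alone, which is one way to read the caption's statement that the structure preserves the original forward process of the VLM.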
