Fig. 1: Emu3 framework.
From: Multimodal learning with next-token prediction for large multimodal models

Emu3 first tokenizes multimodal data such as images, text, video, and actions into discrete tokens, arranges these tokens in order into a single sequence, and performs unified next-token prediction at scale with a Transformer decoder. The framework also generalizes seamlessly to robotic manipulation by treating vision, language, and actions as unified token sequences.
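The packing described above can be sketched minimally in plain Python. This is an illustrative toy, not Emu3's actual tokenizer or vocabulary: the special boundary tokens and ID values below are hypothetical, and a real implementation would feed the packed sequence to a Transformer decoder trained with cross-entropy on each next-token target.

```python
# Toy sketch of unifying modalities into one discrete token sequence
# for next-token prediction. All token IDs here are hypothetical.

# Hypothetical special tokens marking sequence and modality boundaries.
BOS, BOT, BOV, BOA = 0, 1, 2, 3  # begin-of: sequence, text, vision, action

def pack_sequence(text_ids, image_ids, action_ids):
    """Concatenate per-modality discrete tokens, in order, into one sequence."""
    return [BOS, BOT, *text_ids, BOV, *image_ids, BOA, *action_ids]

def next_token_pairs(seq):
    """Build (context, target) pairs: predict token i from tokens < i."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

# Example: two text tokens, three image tokens, one action token.
seq = pack_sequence([10, 11], [20, 21, 22], [30])
pairs = next_token_pairs(seq)
```

Every position in the packed sequence, regardless of modality, becomes a uniform next-token prediction target, which is what lets a single decoder handle generation across vision, language, and action.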