Fig. 1: Emu3 framework. | Nature

Fig. 1: Emu3 framework.

From: Multimodal learning with next-token prediction for large multimodal models

Fig. 1: Emu3 framework.

Emu3 first tokenizes multimodal data such as images, text, video and actions into discrete tokens and then sequences these tokens by order and performs unified next-token prediction at scale with a Transformer decoder. We have also seamlessly generalized the framework to robotic manipulation by treating vision, language and actions as unified token sequences.

Back to article page