Emu3 enables large-scale text, image and video learning based solely on next-token prediction, matching the generation and perception performance of task-specific methods, with implications for the development of scalable and unified multimodal intelligence systems.
- Xinlong Wang
- Yufeng Cui
- Tiejun Huang