Fig. 3: Token-centric multimodal infrastructure and architectural comparisons with diffusion models and the encoder + LLM compositional paradigm.
From: Multimodal learning with next-token prediction for large multimodal models

a, Multimodal data tokenization can be performed directly on edge devices, so that only the resulting discrete token IDs are transmitted to large-scale servers for unified multimodal training and inference. b, GenEval overall scores as a function of training sample count for the image-generation task, comparing the latent diffusion and next-token prediction paradigms. c, Validation loss on text tokens as a function of training sample count for the image-understanding task, contrasting the decoder-only paradigm with the encoder + LLM compositional paradigm when the LLM is trained from scratch, with and without CLIP initialization. Init., initialization.
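
To make panel a concrete, the sketch below shows edge-side tokenization: an image is quantized against a toy codebook and only the discrete token IDs are serialized for transmission. The codebook size, patch size and numpy-based nearest-code quantizer are illustrative assumptions, not the tokenizer used in the paper.

```python
# Minimal sketch of panel a, assuming a toy vector-quantization codebook;
# the 1,024-code codebook and 4x4 patches are hypothetical choices.
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK = rng.normal(size=(1024, 48))  # 1,024 visual codes of dim 4*4*3 (illustrative)

def tokenize_on_edge(image: np.ndarray, patch: int = 4) -> np.ndarray:
    """Split an image into patches and map each patch to its nearest code ID."""
    h, w, c = image.shape
    patches = (image[:h - h % patch, :w - w % patch]
               .reshape(h // patch, patch, w // patch, patch, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch * patch * c))           # (n_patches, 48) for c = 3
    dists = ((patches[:, None, :] - CODEBOOK[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).astype(np.uint16)         # discrete token IDs

image = rng.random((32, 32, 3))
ids = tokenize_on_edge(image)
payload = ids.tobytes()                                   # only IDs leave the device
print(len(payload), "bytes of token IDs vs", image.nbytes, "bytes of raw pixels")
```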
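
Panels b and c both rest on training a decoder-only model with next-token prediction over a shared image+text token sequence. The PyTorch sketch below illustrates that objective under stated assumptions: the tiny model, the vocabulary split and the random token sequence are hypothetical stand-ins, not the paper's actual architecture or data.

```python
# Minimal sketch of the decoder-only next-token-prediction objective in
# panels b and c; model size and vocabulary split are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    """Toy decoder-only model over a shared image+text vocabulary."""
    def __init__(self, vocab: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        causal = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.head(self.blocks(self.embed(ids), mask=causal))

vocab = 1024 + 256                        # hypothetical visual + text vocabulary split
seq = torch.randint(0, vocab, (2, 16))    # [image tokens ..., text tokens]
logits = TinyDecoder(vocab)(seq)
# Next-token prediction: each position predicts the following token.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), seq[:, 1:].reshape(-1))
print(f"loss: {loss.item():.3f}")
# For panel c, validation loss would be computed on the text-token
# positions only, masking out the image-token targets.
```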