Abstract
Multimodal large language models have revolutionized AI research and industry, paving the way toward the next milestone. However, their large sizes and high computational costs restrict deployment to cloud servers, limiting use in mobile, offline, energy-sensitive, or privacy-critical scenarios. We present MiniCPM-V, efficient models for edge devices that integrate advancements in architecture, training, and data. The 8B model outperforms GPT-4V, Gemini Pro, and Claude 3 across 11 public benchmarks, processes high-resolution images at any aspect ratio, achieves robust optical character recognition, exhibits low hallucination rates, and supports over 30 languages while running efficiently on mobile phones. This progress reflects a broader trend: The sizes for high-performing models are rapidly decreasing alongside growing edge computation capacity, enabling advanced multimodal models to operate locally on consumer hardware. Such developments unlock applications across diverse real-world scenarios, from enhanced mobile AI to privacy-preserving solutions, marking a critical step toward democratizing powerful multimodal intelligence.
Introduction
The rapid development of Multimodal Large Language Models (MLLMs)1,2,3,4,5,6,7,8,9,10,11 has brought an impressive surge in multimodal capabilities in understanding, reasoning and interaction. This has not only fundamentally reshaped the landscape of AI research and industry, but also shed light on a promising path towards the next AI milestone. However, current MLLMs are still far from practical in real-world applications. One of the most prominent challenges is the heavy computational burden imposed by the massive number of parameters of MLLMs. As a result, most MLLMs can only be deployed on high-performing cloud servers, leading to significant energy consumption and carbon emissions. This significantly constrains potential application scopes, such as mobile devices, energy-sensitive scenarios, offline settings without stable network connections, and privacy- or security-sensitive scenarios for both personal and industrial users.
In light of these limitations, there is growing interest in exploring more efficient lightweight MLLMs1,3,11,12 that can run on edge devices. Edge scenarios encompass a broad range of equipment, including mobile phones, personal computers, vehicles and robots, which are ubiquitous in users’ daily lives and are experiencing rapid advancements in computation capacity. On-device MLLMs provide a promising path towards more practical applications due to their broader usage scope, better computation efficiency, more robust offline behavior, and better privacy/security protection.
However, developing capable on-device MLLMs is challenging due to significantly constrained parameter and inference computation budgets. As a result, more careful architecture designs and training recipes are required to fully unleash the potential of on-device MLLMs. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on edge devices. The philosophy of MiniCPM-V is to achieve a good balance between performance and efficiency, an objective that matters even more in real-world applications. From February to May 2024, we unveiled three models: (1) In February, we launched MiniCPM-V 1.0 2B, an early prototype of on-device MLLMs. (2) In April, we released MiniCPM-V 2.0 2B, which outperforms strong larger MLLMs such as Qwen-VL 9B7, CogVLM 17B5, and Yi-VL 34B13. This iteration also introduces support for high-resolution image perception and exhibits promising OCR capabilities. (3) Most recently, in May, we introduced MiniCPM-Llama3-V 2.5 8B, which outperforms the proprietary GPT-4V-1106, Gemini Pro and Claude 3 on the OpenCompass evaluation. Noteworthy features of this model include strong OCR capability, high-resolution image perception, trustworthy behavior, multilingual support, and efficient edge deployment optimization. The capabilities of on-device MLLMs have grown even stronger in our later releases since May 2024.
More importantly, MiniCPM-V can be viewed as a representative example of a promising miniaturization trend of MLLMs. Figure 1 summarizes the recent development of MLLMs3,12,14 in terms of performance, parameters and release time. We observe an interesting trend akin to Moore’s Law15 indicated by the red line: the sizes of models reaching GPT-4V level performance are rapidly decreasing over time. This phenomenon could perhaps be called Moore’s Law of MLLMs. Simultaneously, the computational capacity of edge devices such as phones and personal computers is steadily increasing (qualitatively depicted by the blue line). The convergence of these two trends indicates usable (e.g., GPT-4V level) MLLMs deployable on edge devices are soon within reach, opening up broader possibilities and benefiting more application scenarios in the near future. From a historical perspective of human technology development, this trend can also be viewed as the human pursuit of miniaturization of state-of-the-art technologies, which has been repeatedly witnessed in other science and technology fields. For example, in aerospace, the latest SpaceX Raptor 2 rocket engine can achieve a strong thrust of 2,256 kN with a mass of 1.6 tons, whereas 20 years ago, the RD-0750 rocket engine could only achieve a thrust of 1,413 kN with a mass exceeding 4 tons16.
The red line shows the decreasing model sizes for achieving GPT-4V level performance, while the blue line represents the growing edge device computation capacity. This jointly shows that GPT-4V level MLLMs deployed on edge devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.
MiniCPM-V Series Techniques
In this paper, we will take MiniCPM-Llama3-V 2.5 as an example, and systematically introduce the notable features of MiniCPM-V series and the key techniques behind them:
- Leading Performance. MiniCPM-Llama3-V 2.5 achieves better performance than GPT-4V-1106, Gemini Pro and Claude 3 on the OpenCompass collection, a comprehensive evaluation over 11 popular benchmarks. This is jointly contributed by its careful design in architecture, data and training recipes, which we detail in the following.
- Strong OCR Capability. MiniCPM-Llama3-V 2.5 outperforms GPT-4V, Gemini Pro and Qwen-VL-Max on OCRBench. It also supports high-utility functions such as table-to-markdown conversion and full OCR content transcription. These are largely attributed to its high-resolution (up to 1.8 million pixels, e.g., 1344 × 1344) image perception technique at any aspect ratio17.
- Trustworthy Behavior. Based on the RLAIF-V18 and RLHF-V19 techniques that align MLLM behaviors with high-quality AI/human feedback, MiniCPM-Llama3-V 2.5 exhibits more trustworthy behaviors, achieving lower hallucination rates than GPT-4V-1106 on Object HalBench.
- Multilingual Support. Inspired by the findings from VisCPM8, the integration of a multilingual LLM significantly alleviates the heavy reliance on multimodal training data in low-resource languages. Building on this foundation, lightweight multilingual multimodal instruction tuning helps MiniCPM-Llama3-V 2.5 generalize its multimodal capabilities to more than 30 languages.
- Efficient Edge Deployment. We systematically integrate a suite of on-device optimization techniques, encompassing quantization, memory optimization, compilation optimization and NPU acceleration, enabling efficient deployment on edge devices.
We hope the MiniCPM-V series can serve as an example for unveiling the potential of on-device MLLMs, and help draw more attention to promote research in this direction. Following the Moore’s Law of MLLMs, we believe there will be increasingly powerful on-device MLLMs with reduced sizes, bringing efficient, safe, and trustworthy AI services to devices soon.
Results
Overview of MiniCPM-V
As shown in Fig. 2b, MiniCPM-V comprises three key modules: the visual encoder, compression layer, and LLM. The input image is first encoded by a visual encoder, utilizing the adaptive visual encoding approach. The visual tokens are then compressed by the compression layer, which adopts a perceiver resampler structure with one layer cross-attention. Finally, the compressed visual tokens, along with the text input, are fed into the LLM for conditional text generation.
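To make this dataflow concrete, the following PyTorch-style sketch is our own illustration rather than the released implementation: the module sizes, names, and the use of nn.MultiheadAttention as a stand-in perceiver resampler are assumptions. It shows how a fixed set of learnable queries compresses each slice’s ViT features before they are concatenated with the text embeddings for conditional generation.

```python
import torch
import torch.nn as nn

class VisualCompressor(nn.Module):
    """One-layer cross-attention resampler: a fixed set of learnable queries
    attends to the ViT patch features of one image slice."""
    def __init__(self, num_queries=96, dim=1024, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feats):                      # (B, N_patches, dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, patch_feats, patch_feats)
        return out                                       # (B, num_queries, dim)

def multimodal_forward(vit, compressor, llm, image_slices, text_embeds):
    """image_slices: (B, S, 3, H, W) pre-partitioned slices; text_embeds: (B, T, dim).
    `vit` and `llm` are any callables with the shapes assumed in the comments."""
    B, S = image_slices.shape[:2]
    feats = vit(image_slices.flatten(0, 1))              # (B*S, N_patches, dim)
    vis_tokens = compressor(feats).reshape(B, -1, feats.size(-1))  # (B, S*num_queries, dim)
    inputs = torch.cat([vis_tokens, text_embeds], dim=1)
    return llm(inputs_embeds=inputs)                     # conditional text generation
```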
a Conventional visual encoding requires a large number of tokens when encoding high-resolution images. Our proposed adaptive visual encoding strategy employs adaptive image partitioning and compressed encoding, significantly reducing computational costs when processing high-resolution images. b Overall structure presents the architecture of the model including the visual encoder, shared compression layer, and LLM. c The progressive multimodal learning strategy is applied to train MiniCPM-V, encompassing three phases: the pre-training phase, supervised fine-tuning phase, and alignment phase. d RLAIF-V framework for hallucination reduction. (1) Response generation produces multiple responses for an instruction using the policy model. (2) Feedback collection evaluates the correctness of each response in a divide-and-conquer fashion. (3) DPO optimizes the model on the preference dataset.
Encoding high-resolution images poses two major challenges. In terms of efficiency, directly encoding high-resolution images results in an excessive number of visual tokens, rendering it computationally prohibitive for edge devices. In terms of effectiveness, the considerable discrepancy between the image resolution and the resolution employed during ViT pre-training can lead to out-of-distribution problems and therefore substantially degrade encoding performance. To address these challenges, we take advantage of the adaptive visual encoding strategy17 as shown in Fig. 2a. To handle high-resolution images with different aspect ratios, we divide images into slices, where each slice better matches ViT’s pre-training setting in terms of resolution and aspect ratio. Each image is divided into a maximum of 10 slices, supporting up to 1.8 million pixels (e.g., 1344 × 1344 resolution) in total during encoding, which covers most real-world application scenarios. We then adjust each slice by resizing it proportionally so that the resultant area matches the ViT pre-training area, and interpolate the ViT’s position embeddings to adapt to the slice’s aspect ratio. After visual encoding, each slice is encoded into 1,024 tokens, so 10 slices can yield over 10k tokens collectively. To manage this high token count, we employ a compression module comprising one-layer cross-attention and a moderate number of queries with 2D positional information7. In practice, the visual tokens of each slice are compressed into 64 queries for MiniCPM-V 1.0 & 2.0 and 96 queries for MiniCPM-Llama3-V 2.5 through this layer. Compared with other MLLMs of competitive performance, the significantly smaller number of visual tokens in the MiniCPM-V series enables superior efficiency in terms of GPU memory consumption, inference speed, first-token latency and power consumption, making it more friendly to wider application scopes and communities. Finally, we introduce a spatial schema inspired by20 to indicate each slice’s position relative to the whole image. We first wrap the tokens of each slice with two special tokens “<slice>” and “</slice>”, and then employ a special token “\n” to separate slices from different rows.
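As a back-of-the-envelope check of the efficiency claim (our own arithmetic based on the figures quoted above, not a reported measurement):

```python
# Rough visual-token budget implied by the numbers above.
RAW_TOKENS_PER_SLICE = 1024     # ViT outputs per slice before compression
QUERIES_PER_SLICE = 96          # MiniCPM-Llama3-V 2.5 (64 for MiniCPM-V 1.0 & 2.0)
MAX_SLICES = 10

raw = RAW_TOKENS_PER_SLICE * MAX_SLICES       # 10,240 tokens without compression
compressed = QUERIES_PER_SLICE * MAX_SLICES   # at most 960 tokens after compression
print(raw, compressed, round(raw / compressed, 1))   # 10240 960 10.7
```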
We adopt a three-phase progressive multimodal learning strategy as shown in Fig. 2c, which consists of the pre-training phase, supervised fine-tuning phase, and alignment phase. In the first phase, we utilize large-scale image-text pairs for MLLM pre-training to align the visual modules (i.e., visual encoder and compression layer) with the input space of the LLM and to acquire foundational multimodal knowledge. The pre-training phase can be further divided into three stages. In the first stage, the compression layer is warmed up. The second stage extends the input resolution of the pre-trained visual encoder. Finally, in the third stage, the visual modules are trained using the adaptive visual encoding strategy, allowing them to effectively handle high-resolution inputs with any aspect ratio. In the second phase, we perform Supervised Fine-Tuning (SFT) on high-quality visual question answering datasets to further learn knowledge and interaction capability from human annotations. We unlock all model parameters to better exploit the data and learn rich knowledge during the SFT phase. We also conduct a lightweight yet high-quality SFT process as in VisCPM8 to enhance alignment with languages beyond English and Chinese, achieving strong multimodal performance across more than 30 languages. In the alignment phase, we employ the recent RLAIF-V18 approach to address the hallucination problem (Fig. 2d), where the MLLM generates responses that are not factually grounded in the input image19. The first step of RLAIF-V is to generate multiple responses for a given instruction using the policy model. Then a divide-and-conquer strategy is applied for response scoring. After collecting the high-quality AI feedback, we perform preference learning via the DPO21 method.
This paper introduces the first three models in the MiniCPM-V series: MiniCPM-V 1.0, MiniCPM-V 2.0, and MiniCPM-Llama3-V 2.5. MiniCPM-V 1.0 is trained with pre-training stages 1 & 2 and SFT, without adaptive visual encoding or RLAIF-V. For MiniCPM-V 2.0, we include all training stages and the adaptive visual encoding strategy to further improve performance. MiniCPM-Llama3-V 2.5 adopts Llama3-Instruct 8B as its base LLM, showcasing strong multimodal understanding capabilities, as illustrated in Fig. 3.
Evaluation across diverse multimodal understanding benchmarks
We perform a comprehensive evaluation on popular benchmarks covering visual question answering, multimodal conversation, knowledge and reasoning, OCR, and hallucination. (1) General benchmarks. We adopt OpenCompass22 as the general evaluation indicator, which is a comprehensive collection of 11 popular multimodal benchmarks, including MME23, MMBench24, MMMU25, MathVista26, LLaVA Bench4, etc. We also report results on RealWorldQA for real-world spatial understanding capabilities. (2) OCR benchmarks. We adopt three widely used benchmarks for OCR capability evaluation, including OCRBench27, TextVQA28 and DocVQA29. (3) Hallucination benchmarks. We also include Object HalBench19,30 to evaluate the trustworthiness of the models.
We compare with strong baselines in different series: For open-source models, we compare with strong models including Yi-VL-6B/34B13, Qwen-VL-Chat-9B7, DeepSeek-VL-7B3, TextMonkey31, CogVLM-Chat-17B5, CogVLM2-Llama3-19B5, Idefics2-8B9, Bunny-Llama-3-8B32, XTuner-Llama-3-8B-v1.133, LLaVA-NeXT-Llama-3-8B34, Cambrian-8B/34B35, LLaVA-NeXT-Yi-34B36, DeepSeek-VL-1.3B3, MobileVLM V237, Mini-Gemini14 and Phi-3-Vision-128k-instruct12. For proprietary models, we compare with GPT-4V-11062, Gemini-Pro1 and Claude 3 Opus38.
From the experimental results in Fig. 4, we have the following observations: (1) MiniCPM-Llama3-V 2.5 outperforms strong open-source models by a notable margin. For instance, MiniCPM-Llama3-V 2.5 surpasses the recent strong Idefics2-8B by 7.9 points on the OpenCompass benchmark, with similar model sizes. It also achieves better results than significantly larger models such as Cambrian-34B, LLaVA-NeXT-Yi-34B, Yi-VL-34B and CogVLM2-Llama3-19B. (2) Compared with powerful proprietary models, such as GPT-4V-1106 and Gemini Pro, MiniCPM-Llama3-V 2.5 achieves better performance on the OpenCompass benchmark with significantly fewer parameters. In addition, MiniCPM-Llama3-V 2.5 also achieves lower hallucination rates than GPT-4V-1106 on Object HalBench, indicating its trustworthiness for real-world applications. (3) The smaller MiniCPM-V 2.0 with 2B parameters achieves significantly better performance compared with other 2B ~ 4B models, and is even comparable with 8B MLLMs such as Bunny-Llama-3-8B. In summary, the results show that MiniCPM-V series achieves a good balance between performance and efficiency, making it more friendly for broader communities and applications.
MiniCPM-Llama3-V 2.5, with only 8 billion parameters, outperforms leading open-source MLLMs and achieves superior results on the OpenCompass benchmark compared to proprietary models like GPT-4V-1106 and Gemini Pro. In addition, MiniCPM-V 2.0, with 2 billion parameters, significantly outperforms other MLLMs with fewer than 4 billion parameters.
MiniCPM-V models also show strong OCR capabilities, including scene-text, document and screenshot understanding. As shown in Fig. 5a, MiniCPM-Llama3-V 2.5 outperforms open-source MLLMs ranging from 1.7B to 34B parameters on OCRBench, TextVQA, and DocVQA. Its performance on these datasets is even comparable to proprietary models like GPT-4V-1106 and Gemini Pro. MiniCPM-V 2.0 also achieves significantly better performance among models in the 2B–4B parameter range (Fig. 5b).
Based on the multilingual multimodal generalization approach from VisCPM, MiniCPM-Llama3-V 2.5 extends its multimodal capability to over 30 languages. As shown in Fig. 5c, MiniCPM-Llama3-V 2.5 can outperform Yi-VL 34B and Phi-3-vision-128k-instruct on the multilingual LLaVA Bench. The promising multilingual multimodal capability makes MiniCPM-Llama3-V 2.5 useful in serving larger linguistic groups.
To investigate the effectiveness of key components, we perform an ablation study on high-resolution perception and the multi-stage training pipeline. To ablate high-resolution perception, we follow the standard method in LLaVA-1.539 to downsample high-resolution images into low-resolution versions (i.e., 448 × 448) in both training and inference. To ablate the multi-stage training pipeline, we follow the standard two-stage training (i.e., pre-training and instruction tuning) in LLaVA-1.539. Due to the high computational costs of model training on full data, we perform the ablation on a subset of the full training data by randomly sampling 10% of the data from each dataset, resulting in 70M samples in total, which is sufficient for validating a frontier MLLM. From the experimental results in Table 1, we can see that both high-resolution perception and the multi-stage training pipeline contribute to the final performance. The reason is that high-resolution perception is crucial for MLLMs to perceive fine-grained visual details, especially for OCR-related tasks, and multi-stage training can better fit and exploit training data of different forms and qualities.
Moreover, it is worth noting that MiniCPM-Llama3-V 2.5 requires significantly less inference computation. For example, the visual token number range of MiniCPM-Llama3-V 2.5 is (96, 960), which is lower than LLaVA-NeXT-Llama-3-8B’s (1728, 2880). This can be important especially for real-world on-device applications in terms of inference speed, first-token latency, memory usage, and power consumption.
Specifically, we provide a comparison of the computational costs between the adaptive visual encoding method and the vanilla visual encoding method (i.e., visual features of the original image from ViT are directly projected and input into the LLM, as in LLaVA-1.539). From the results in Fig. 2a, we can see that compared with the standard method, adaptive visual encoding largely reduces both FLOPs and GPU memory usage in both ViT and LLM for high-resolution images. The reason is that, compared with the standard method, image slicing prevents the quadratic computation growth of ViT, and the compression layer largely reduces the number of visual tokens to LLMs.
Efficient Deployment of MiniCPM-V on Edge Devices
In this section, we investigate the deployment of MiniCPM-V on edge devices. Edge devices such as smartphones and computers often face resource constraints due to factors like heat dissipation, size limitations, and power consumption. When deploying models, the two most critical limitations are memory capacity and CPU/GPU processing speed. High-performance servers typically boast extensive memory capacities, often exceeding 100 GB or even 1 TB. In contrast, the memory available on mobile phones typically ranges from 12 GB to 16 GB, which can be insufficient for MLLM deployment. Moreover, the overall processing speeds of smartphone CPUs are notably slower: the Snapdragon 8 Gen 3 features 8 CPU cores, whereas a high-performance server CPU such as the Intel Xeon Platinum 8580 has 60 cores. Similarly, mobile phone GPUs are not as powerful as server GPUs. For example, the Qualcomm Adreno 750 delivers only about 6 TFLOPS, while an NVIDIA RTX 4090 can reach 83 TFLOPS.
To deploy the MLLM on edge devices, we first employ quantization to reduce memory cost. For MiniCPM-Llama3-V 2.5, the fp16 model typically demands 16–17 GB of memory. We opt for the Q4_K_M 4-bit quantization strategy within the GGML framework, which reduces the memory requirement to around 5 GB and is friendly to mobile phone usage. We then empirically investigate the deployment results on different frameworks. Several frameworks have been proposed for on-device deployment. As illustrated in Fig. 6, we make a thorough investigation of different frameworks for different chip types, including CPU, GPU, and NPU. Given the ubiquity of CPU usage across devices, we prioritize this chip type and opt for the llama.cpp40 framework. Combining quantization and llama.cpp on a Xiaomi 14 Pro (Snapdragon 8 Gen 3), the model achieves a text encoding latency of 64.2 s and a text decoding speed of 1.3 tokens/s (as depicted in Fig. 6f), which is still far from acceptable for users.
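The memory figures can be sanity-checked with simple arithmetic (an estimate under stated assumptions: roughly 8.5B total parameters including the visual modules, and Q4_K_M averaging about 4.5 bits per weight; runtime buffers and the KV cache add further overhead):

```python
params = 8.5e9                       # ~8B LLM plus visual encoder and compression layer (approximate)
fp16_gb = params * 2 / 1e9           # 2 bytes per weight   -> ~17 GB
q4_km_gb = params * 4.5 / 8 / 1e9    # ~4.5 bits per weight -> ~4.8 GB
print(f"fp16: {fp16_gb:.1f} GB, Q4_K_M: {q4_km_gb:.1f} GB")
```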
a–d Advanced techniques used in the deployment of MiniCPM-V on edge devices, including (a) memory usage optimization, (b) compilation optimization, (c) configuration optimization, and (d) NPU acceleration. e The influence of these techniques on the encoding latency and decoding throughput. The results are tested on the Xiaomi 14 Pro (Snapdragon 8 Gen 3). No opt.: non-optimized, mem. opt.: memory usage optimization, comp. opt.: compilation optimization, config. opt.: configuration optimization, NPU: NPU acceleration. f Results on different edge devices. We show the encoding latency and decoding throughput across different device types. Xiaomi 14 Pro is the only device with NPU.
To achieve better acceleration, we further investigate a series of advanced techniques including memory usage optimization, compilation optimization, configuration optimization, and NPU acceleration, as shown in Fig. 6a–d. We first explore memory usage optimization strategies to address the image processing bottleneck of the inference speed due to limited memory resources on mobile phones. Instead of loading both ViT and LLM simultaneously into memory, we adopt a sequential loading approach. Specifically, we first load ViT for visual encoding, followed by the LLM for visual and text token encoding. By releasing the large amount of memory occupied by LLM, we can prevent frequent paging (swapping in and out) during ViT encoding, thereby improving the program efficiency. This optimization technique, as illustrated in Fig. 6e, results in a notable reduction of image processing time from 45.2s to 31.5s. We also find that directly compiling the models on the target devices can significantly improve the encoding latency and the decoding throughput. This can be attributed to better consistency between the compilation and target device instruction set architecture. As depicted in Fig. 6e, this optimization endeavor yields promising results. Encoding latency shows a notable reduction from 50.5s to 17.0s, while decoding throughput experiences a significant boost from 1.3 tokens/s to 3.2 tokens/s. Next, instead of relying on a single default configuration of the llama.cpp framework, we propose an automated parameter search algorithm that dynamically identifies the optimal configurations (e.g., computational distribution across CPU cores) tailored to various edge devices. Through configuration optimization, we can achieve good improvements. Specifically, decoding throughput surged from 3.2 tokens/s to an impressive 8.2 tokens/s, surpassing the typical human reading speed. Finally, we leverage Neural Processing Units (NPUs), a class of specialized hardware designed to accelerate AI applications, available in certain smartphones. Recognized for their ability to address computational bottlenecks, NPUs enable accelerated visual encoding. Specifically, we replace the backend framework of ViT with QNN while retaining the llama.cpp backend for the language model component. On mobile devices equipped with Qualcomm NPUs, this optimization yields a notable reduction in visual encoding time, decreasing from 3.7 to 1.3 s.
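The automated configuration search can be pictured as a benchmark loop over runtime knobs; the sketch below is schematic only (the run_decode_benchmark callback and the searched parameters are hypothetical stand-ins, not the actual tooling):

```python
import itertools

def search_best_config(run_decode_benchmark, thread_counts=(2, 4, 6, 8),
                       batch_sizes=(1, 8, 16)):
    """Grid-search llama.cpp-style runtime knobs on the target device and keep
    the configuration with the highest measured decoding throughput (tokens/s)."""
    best_cfg, best_tps = None, 0.0
    for n_threads, n_batch in itertools.product(thread_counts, batch_sizes):
        tps = run_decode_benchmark(n_threads=n_threads, n_batch=n_batch)  # hypothetical callback
        if tps > best_tps:
            best_cfg, best_tps = {"n_threads": n_threads, "n_batch": n_batch}, tps
    return best_cfg, best_tps
```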
For a comprehensive assessment of MiniCPM-Llama3-V 2.5’s performance across various edge devices, we present test results on the Xiaomi 14 Pro (Snapdragon 8 Gen 3), vivo X100 Pro (MediaTek Dimensity 9300), MacBook Pro (M1), and Jetson AGX Orin 32G in Fig. 6f. Thanks to the deployment optimization techniques, MiniCPM-Llama3-V 2.5 can operate efficiently on both mobile phones and personal computers, delivering acceptable latency and throughput. For instance, leveraging the NPU on the Xiaomi 14 Pro enables it to achieve a similar encoding speed to the M1 MacBook Pro. Furthermore, nearly all devices exhibit comparable or higher throughput compared with human reading speed. Upon analyzing the results, it becomes evident that the current computation bottleneck primarily stems from LLM prefilling, which mainly involves encoding image and text tokens for LLM inference. Promising research directions involve developing more efficient visual encoding methods with fewer visual tokens, and better leveraging GPU/NPU acceleration for LLM encoding. With increasing attention to on-device MLLMs and the rapid advancement of GPU/NPU acceleration techniques, we believe that real-time interaction with on-device MLLMs can be reached soon.
Discussion
The MiniCPM-V series models represent an initial exploration of powerful on-device MLLMs. Thanks to techniques such as adaptive visual encoding, multilingual generalization, and the RLAIF-V method, MiniCPM-Llama3-V 2.5 can achieve GPT-4V level performance with significantly fewer parameters. Leveraging diverse optimization techniques for edge deployment, the model ensures an acceptable user experience on mobile phones.
Despite promising performance, several limitations remain with the current MiniCPM-V models. (1) Capability Depth. There is still plenty of room for improvement in multimodal understanding capability and inference efficiency. (2) Capability Width. In addition to the image modality, it is promising to expand MLLM capabilities to other modalities, such as video and audio, where GPT-4o41 and Google Astra42 have given good examples.
In addition to MLLM capabilities, edge deployment also presents unique challenges. Inference speed and latency are still far from ideal, and the model service can be limited by battery capacity. In addition, previous efforts on chips and deployment frameworks mainly target CNNs and LSTMs, which can be sub-optimal for MLLMs. Efforts tailored to MLLMs can bring ample room for improvement.
Considering the current limitations and the promising future of on-device MLLMs, we anticipate increasing efforts from both academia and industry in enhancing model capabilities in terms of depth and width, and improving smartphone chips and deployment frameworks. We believe that simultaneous advancements in model capability and edge device capacity will lead to on-device applications providing a satisfying user experience in the near future.
Methods
Adaptive visual encoding
Image partition
To process high-resolution images with varying aspect ratios, we divide them into slices17. Each slice is adjusted to more closely align with ViT’s pre-training settings in terms of both resolution and aspect ratio. Specifically, we first calculate the ideal number of slices based on the input image size. Given an image with resolution \(({W}_{I},{H}_{I})\) and a ViT pre-trained on images with resolution \(({W}_{v},{H}_{v})\), we calculate the ideal slice number \(N=\lceil \frac{{W}_{I}\times {H}_{I}}{{W}_{v}\times {H}_{v}}\rceil\). Then, we choose the combination of rows n and columns m from the set \({{\mathbb{C}}}_{N}=\{(m,n)| m\times n=N,m\in {\mathbb{N}},n\in {\mathbb{N}}\}\). A good partition (m, n) should result in slices that match well with ViT’s pre-training setting. To achieve this, we use a score function to evaluate each potential partition:
\(S(m,n)=-\left|\log \frac{{W}_{I}/m}{{H}_{I}/n}-\log \frac{{W}_{v}}{{H}_{v}}\right|\)
We select the partition with the highest score from all possible candidates:
\(({m}^{*},{n}^{*})=\mathop{\arg \max }\limits_{(m,n)\in \bar{{\mathbb{C}}}}S(m,n)\)
where \(\bar{{\mathbb{C}}}\) is the possible (m, n) combinations with the product N. However, when N is a prime number, the feasible solutions can be limited to (N, 1) and (1, N). Therefore, we additionally introduce \({{\mathbb{C}}}_{N-1}\) and \({{\mathbb{C}}}_{N+1}\), and set \(\bar{{\mathbb{C}}}={{\mathbb{C}}}_{N-1}\cup {{\mathbb{C}}}_{N}\cup {{\mathbb{C}}}_{N+1}\). In practice, we set N < 10, supporting 1.8 million pixels (e.g., 1344 × 1344 resolution) at most during encoding. Although we can encompass more image slices for higher resolutions, we purposely impose this resolution upper-bound, since it already well covers most real-world application scenarios, and the benefit of further increasing encoding resolution is marginal considering the performance and overhead.
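A minimal Python sketch of this partition selection under the definitions above (the 448 × 448 ViT resolution and the N < 10 cap are taken from the text; the implementation details are our own illustration):

```python
import math

def best_partition(W_I, H_I, W_v=448, H_v=448):
    """Pick the slice grid (m columns x n rows) whose slices best match the ViT
    pre-training aspect ratio -- a sketch of the adaptive partition described above."""
    N = min(math.ceil((W_I * H_I) / (W_v * H_v)), 9)   # N < 10 in practice

    def factor_pairs(k):
        return {(m, k // m) for m in range(1, k + 1) if k % m == 0}

    # candidate grids with N-1, N, or N+1 slices
    candidates = factor_pairs(N - 1) | factor_pairs(N) | factor_pairs(N + 1)

    def score(mn):
        m, n = mn
        # deviation of the slice aspect ratio from the ViT aspect ratio (log scale)
        return -abs(math.log((W_I / m) / (H_I / n)) - math.log(W_v / H_v))

    return max(candidates, key=score)

print(best_partition(1344, 1344))   # -> (3, 3): a 3 x 3 grid of ~448 x 448 slices
```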
Slice encoding
Although image partitioning can ensure a good match between the slices and the ViT pre-training setting, each slice’s size is not precisely equal to \(({W}_{v},{H}_{v})\). To feed the slices into ViT, we first adjust each slice by resizing it proportionally so that the resultant area matches the ViT pre-training area \({W}_{v}\times {H}_{v}\). This adjustment helps prevent a significant gap between the number of encoded patches and the ViT’s pre-training setting. Subsequently, we interpolate the ViT’s position embeddings to adapt to the slice’s aspect ratio. This involves reshaping the ViT’s 1D embedding \({{{{\bf{P}}}}}_{1}\in {{\mathbb{R}}}^{Q\times l}\) back to its 2D format \({{{{\bf{P}}}}}_{2}\in {{\mathbb{R}}}^{q\times q\times l}\), where the number of position embeddings Q = q × q. Then, we interpolate \({{{{\bf{P}}}}}_{2}\) to fit the size of each slice via 2D interpolation. We also include the original image as an additional slice to provide holistic information about the entire image.
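The interpolation step can be sketched as follows (a minimal illustration following the P1/P2 notation above; the patch size of 14 and the bicubic interpolation mode are assumptions):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed_1d, slice_h, slice_w, patch=14):
    """Reshape the ViT's 1D position embeddings (Q, l) to 2D (q, q, l) and
    interpolate them to the slice's patch grid."""
    Q, l = pos_embed_1d.shape
    q = int(Q ** 0.5)                                                # Q = q * q
    p2 = pos_embed_1d.view(q, q, l).permute(2, 0, 1).unsqueeze(0)    # (1, l, q, q)
    target = (slice_h // patch, slice_w // patch)
    p2 = F.interpolate(p2, size=target, mode="bicubic", align_corners=False)
    return p2.squeeze(0).permute(1, 2, 0).reshape(-1, l)             # (h'*w', l)

pos = torch.randn(32 * 32, 1024)                        # e.g. 448/14 = 32 patches per side
print(interpolate_pos_embed(pos, 602, 336, patch=14).shape)   # torch.Size([1032, 1024])
```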
Token compression
After visual encoding, each slice is encoded into 1,024 tokens, so 10 slices can yield over 10k tokens collectively. To manage this high token count, we employ a compression module comprising one-layer cross-attention and a moderate number of queries with 2D positional information. In practice, the visual tokens of each slice are compressed into 64 queries for MiniCPM-V 1.0 & 2.0 and 96 queries for MiniCPM-Llama3-V 2.5 through this layer. Compared with other MLLMs of competitive performance, the significantly smaller number of visual tokens in the MiniCPM-V series enables superior efficiency in terms of GPU memory consumption, inference speed, first-token latency and power consumption, making it more friendly to wider application scopes and communities.
Spatial schema
To indicate each slice’s position relative to the whole image, inspired by20, we introduce a spatial schema. We first wrap the tokens of each slice with two special tokens “<slice>” and “</slice>”, and then employ a special token “\n” to separate slices from different rows.
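A string-level sketch of the schema (the actual model operates on token IDs; the placeholder strings here are illustrative):

```python
def apply_spatial_schema(slice_grid):
    """slice_grid: 2D list of per-slice token placeholders (rows x cols).
    Wrap each slice with <slice>...</slice> and separate rows with '\n'."""
    rows = []
    for row in slice_grid:
        rows.append("".join(f"<slice>{s}</slice>" for s in row))
    return "\n".join(rows)

grid = [[f"[img{r}{c}]" for c in range(3)] for r in range(2)]
print(apply_spatial_schema(grid))
# <slice>[img00]</slice><slice>[img01]</slice><slice>[img02]</slice>
# <slice>[img10]</slice><slice>[img11]</slice><slice>[img12]</slice>
```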
Pre-training
In this phase, we utilize large-scale image-text pairs for MLLM pre-training. The primary goal of this phase is to align the visual modules (i.e., visual encoder and compression layer) with the input space of the LLM and learn foundational multimodal knowledge. We show the pre-training data composition in Table 2. The pre-training phase is further divided into 3 stages.
Stage-1
The role of stage-1 is to warm up the compression layer, primarily connecting the visual encoder and LLMs. (1) Trainable Modules. We randomly initialize the compression layer and train this module in stage-1, keeping other parameters frozen. The visual encoder’s resolution is set to 224 × 224, which is the same as the visual encoder’s pre-training setting. (2) Data. To warm up the compression layer, we randomly select 200M data from the Image Captioning data in Table 2. Data cleaning is performed to remove image-text pairs with poor correlation and ill-formatted text data, ensuring the data quality.
Stage-2
After the warm-up training of the compression layer, the role of stage-2 is to extend the input resolution of the pre-trained visual encoder. (1) Trainable Modules. In stage-2, we extend the image resolution from 224 × 224 to 448 × 448. The whole visual encoder is trained, leaving other parameters frozen. (2) Data. To extend the pre-trained resolution, we additionally select 200M data from the Image Captioning data in Table 2.
Stage-3
After extending the primary input resolution of the visual encoder, we finally train the visual modules using the adaptive visual encoding strategy, which can further accommodate high-resolution inputs with any aspect ratio. (1) Trainable Modules. During the stage-3 training, both the compression layer and the visual encoder are trained to adapt to the language model embedding space. The LLM is kept frozen to avoid disruption from the relatively low-quality pre-training data. (2) Data. Different from the previous stages with only image captioning data, during the high-resolution pre-training stage, we additionally introduce OCR data to enhance the visual encoders’ OCR capability.
Caption rewriting
Image-text pairs sourced from the Web43,44 can suffer from quality issues in the caption data, including non-fluent content, grammatical errors, and duplicated words. Such low-quality data can lead to unstable training dynamics. To address the issue, we introduce an auxiliary model for low-quality caption rewriting. The rewriting model takes the raw caption as input and is asked to convert it into a question-answer pair. The answer from this process is adopted as the updated caption. In practice, we leverage GPT-445 to annotate a small number of seed samples, which are then used to fine-tune an LLM for the rewriting task.
Data packing
Samples from different data sources usually have different lengths. The high variance of sample lengths across batches leads to inefficiency in memory usage and the risk of out-of-memory errors. To address the issue, we pack multiple samples into a single sequence with a fixed length. By truncating the last sample in the sequence, we ensure uniformity in sequence lengths, facilitating more consistent memory consumption and computational efficiency. Meanwhile, we modify the position ids and attention masks to avoid interference between different samples. In our experiments, the data packing strategy brings a 2–3× acceleration in the pre-training phase.
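A minimal sketch of this packing scheme (the sequence length, padding id, and mask construction are illustrative; for decoder-only training a causal mask would additionally be applied):

```python
import torch

def pack_samples(samples, seq_len=512, pad_id=0):
    """Greedily pack tokenized samples into one fixed-length sequence.
    Position ids restart per sample, and the attention mask is block-diagonal so
    tokens from different samples cannot attend to each other."""
    input_ids, position_ids, segment_ids = [], [], []
    for seg, ids in enumerate(samples):
        room = seq_len - len(input_ids)
        if room <= 0:
            break
        ids = ids[:room]                       # truncate the last sample if it overflows
        input_ids += ids
        position_ids += list(range(len(ids)))  # positions restart at 0 for each sample
        segment_ids += [seg] * len(ids)

    pad = seq_len - len(input_ids)
    input_ids += [pad_id] * pad
    position_ids += [0] * pad
    segment_ids += [-1] * pad                  # padding belongs to no segment

    seg = torch.tensor(segment_ids)
    attn_mask = (seg[:, None] == seg[None, :]) & (seg[:, None] >= 0)
    return torch.tensor(input_ids), torch.tensor(position_ids), attn_mask

ids, pos, mask = pack_samples([[5, 6, 7], [8, 9], [10, 11, 12, 13]], seq_len=8)
```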
Multilingual generalization
Multimodal capability across multiple languages is essential for serving users from broader communities. Traditional solutions involve extensive multimodal data collection and cleaning, and training for the target languages. Fortunately, recent findings8 have shown that the multimodal capabilities can be efficiently generalized across languages via a strong multilingual LLM pivot, largely alleviating the heavy reliance on multimodal data in low-resource languages. In practice, we only pre-train our model on English and Chinese multimodal data, and then perform a lightweight but high-quality multilingual supervised fine-tuning to align to the target languages. Despite its simplicity, we find the resultant MiniCPM-Llama3-V 2.5 can achieve good performance in over 30 languages as compared with significantly larger MLLMs.
Supervised fine-tuning
After learning foundational capabilities from pre-training, we perform supervised fine-tuning (SFT) on high-quality visual question answering datasets to further learn knowledge and interaction capability from human annotations.
Trainable Modules
Compared with the pre-training phase, which mainly uses data crawled from the Web, the SFT phase mainly utilizes high-quality datasets annotated by either human labelers or strong models such as GPT-4. Therefore, we unlock all model parameters to better exploit the data and learn rich knowledge during the SFT phase.
Data
Recent works1,46 show that data near the end of training plays a more important role in shaping the models’ capabilities and response styles. We categorize the SFT data into two parts, as shown in Table 3. Part-1 focuses on bolstering the models’ basic recognition capabilities, while part-2 is tailored to enhance their capabilities in generating detailed responses and following human instructions. Specifically, part-1 data consists of the traditional question answering and captioning datasets with relatively short response lengths, which helps enhance the model’s basic recognition capabilities. In comparison, part-2 encompasses datasets featuring long responses with complex interactions, either in text or multimodal context. During SFT, these two parts of data are concatenated and sequentially fed into the model. For MiniCPM-Llama3-V 2.5, we integrate 2M data from the recent Cauldron dataset9 for multimodal knowledge augmentation, and 90K multilingual data over 36 languages for boosting the multilingual conversation capability.
Alignment
MLLMs are typically prone to hallucination problems, generating responses that are not factually grounded in the input image19. The issue greatly limits the wide application of MLLMs, especially in high-stakes scenarios, such as autonomous driving and assistance for visually impaired groups. To address the hallucination problem, we employ the recent RLAIF-V18 approach, where the key is to obtain scalable high-quality feedback from open-source models for preference learning21.
Response generation
The first step of RLAIF-V is to generate multiple responses for a given instruction using the policy model. Specifically, given a policy model M to be aligned, we sample 10 responses \(Y=\{{y}_{1},{y}_{2},\cdots ,{y}_{n}\}\) from M using sampling-based decoding with a high temperature. There are several benefits of using the policy model M for response generation: (1) Feedback collection and learning can better focus on trustworthiness, since differing text styles from multiple MLLMs are avoided. (2) Feedback learning is more efficient since preference is directly collected on the distribution of the policy model.
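For instance, with a Hugging Face-style policy model, candidate responses could be drawn roughly as follows (a text-only simplification; the generation hyperparameters are illustrative, and the image inputs required by an MLLM are omitted):

```python
def sample_candidate_responses(policy_model, tokenizer, prompt, n=10, temperature=1.2):
    """Draw n diverse responses from the policy model via high-temperature sampling."""
    inputs = tokenizer(prompt, return_tensors="pt").to(policy_model.device)
    outputs = policy_model.generate(
        **inputs, do_sample=True, temperature=temperature,
        top_p=0.95, num_return_sequences=n, max_new_tokens=512)
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]
```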
Feedback collection
Collecting high-quality feedback from open-source MLLMs can be challenging due to their typically weaker capabilities compared with proprietary models. To address the issue, RLAIF-V uses a divide-and-conquer strategy for response scoring. Specifically, each response \({y}_{i}\) is divided into atomic claims \({C}_{i}=\{{c}_{1},{c}_{2},\ldots ,{c}_{m}\}\) using Llama-3 8B, where the correctness of atomic claims is much easier to evaluate. Then, we verify the claims by converting each claim to a yes/no question and employing an open-source MLLM to score each claim. In practice, we adopt OmniLMM 12B for MiniCPM-V 2.0 scoring and LLaVA-NeXT-Yi 34B for MiniCPM-Llama3-V 2.5 scoring. The final score \({s}_{i}\) of the response \({y}_{i}\) is given by \(-{n}_{{{\rm{rej}}}}\), where \({n}_{{{\rm{rej}}}}\) is the number of invalid atomic claims.
Direct preference optimization
After collecting the high-quality AI feedback, we perform preference learning via the DPO method. The DPO algorithm requires training on preference pairs, where one sample \({y}_{w}\) is preferred over the other \({y}_{l}\). To compose the preference dataset, we randomly sample pairs from each response set \(Y=\{{y}_{1},{y}_{2},\ldots ,{y}_{n}\}\), and determine \(({y}_{w},{y}_{l})\) based on their relative scores. Finally, we construct a preference dataset consisting of 6K preference pairs from 3K unique images for preference learning.
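A compact sketch of the pair construction and the standard DPO objective (the β value, pair sampling, and log-probability inputs are illustrative; log-probabilities are summed over response tokens):

```python
import itertools, random
import torch.nn.functional as F

def build_preference_pairs(responses, scores, max_pairs=2):
    """Sample (chosen, rejected) pairs from one response set using relative
    RLAIF-V scores (a higher score means fewer rejected claims)."""
    pairs = [(a, b) if scores[a] > scores[b] else (b, a)
             for a, b in itertools.combinations(range(len(responses)), 2)
             if scores[a] != scores[b]]
    random.shuffle(pairs)
    return [(responses[w], responses[l]) for w, l in pairs[:max_pairs]]

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on summed log-probs of chosen (w) and rejected (l) responses
    under the policy and the frozen reference model (tensors of shape (batch,))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```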
Data availability
The datasets used for training are described in detail in the Methods section. While most of the training data are publicly available, a small portion originates from proprietary datasets licensed from a commercial provider and cannot be shared due to legal and contractual restrictions. These restricted datasets were used exclusively for model training and do not affect the reproducibility of the core findings presented in this study. Source data are provided with this paper.
Code availability
Code for training and evaluating our model is publicly available on GitHub at https://github.com/OpenBMB/MiniCPM-o, and has been archived at https://doi.org/10.5281/zenodo.15525638.
References
Reid, M. et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Preprint at https://arxiv.org/abs/2403.05530 (2024).
Achiam, J. et al. GPT-4 Technical Report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Lu, H. et al. DeepSeek-VL: Towards real-world vision-language understanding. Preprint at https://arxiv.org/abs/2403.05525 (2024).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. NeurIPS 36 (2024).
Wang, W. et al. CogVLM: Visual expert for pretrained language models. In: Proc. NeurIPS (2023).
Chen, Z. et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. In Science China Information Sciences (2024).
Bai, J. et al. Qwen-VL: a frontier large vision-language model with versatile abilities. Preprint at https://arxiv.org/abs/2308.12966 (2023).
Hu, J. et al. Large multilingual models pivot zero-shot multimodal learning across languages. In Proc. ICLR (2024).
Laurençon, H., Tronchon, L., Cord, M. & Sanh, V. What matters when building vision-language models? In Proc. NeurIPS (2024).
McKinzie, B. et al. MM1: methods, analysis & insights from multimodal LLM pre-training. In Proc. ECCV (2024).
Beyer, L. et al. PaliGemma: a versatile 3b vlm for transfer. Preprint at https://arxiv.org/abs/2407.07726 (2024).
Abdin, M. et al. Phi-3 technical report: a highly capable language model locally on your phone. Preprint at https://arxiv.org/abs/2404.14219 (2024).
Young, A. et al. Yi: open foundation models by 01.AI. Preprint at https://arxiv.org/abs/2403.04652 (2024).
Li, Y. et al. Mini-Gemini: mining the potential of multi-modality vision language models. Preprint at https://arxiv.org/abs/2403.18814 (2024).
Wikipedia. Moore’s law. In Wikipedia, the Free Encyclopedia (2001).
Wikipedia. Thrust‑to‑weight ratio. In Wikipedia, the Free Encyclopedia (2024).
Xu, R. et al. LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images. In Proc. ECCV (2024).
Yu, T. et al. RLAIF-V: aligning MLLMs through open-source AI feedback for super GPT-4V trustworthiness. In Proc. CVPR (2025).
Yu, T. et al. RLHF-V: towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback. In: Proc. CVPR (2024).
Bavishi, R. et al. Introducing our multimodal models. In Adept Blog (2023).
Rafailov, R. et al. Direct preference optimization: your language model is secretly a reward model. NeurIPS. 36 (2024).
OpenCompass Contributors. OpenCompass: a universal evaluation platform for foundation models. In GitHub repository (2023).
Fu, C. et al. MME: a comprehensive evaluation benchmark for multimodal large language models. Preprint at https://arxiv.org/abs/2306.13394 (2023).
Liu, Y. et al. MMBench: Is your multi-modal model an all-around player? Preprint at https://arxiv.org/abs/2307.06281 (2023).
Yue, X. et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proc. CVPR (2024).
Lu, P. et al. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In Proc. ICLR (2024).
Liu, Y. et al. OCRBench: on the hidden mystery of OCR in large multimodal models. In Science China Information Sciences (2023).
Singh, A. et al. Towards VQA models that can read. In: Proc. CVPR (2019).
Mathew, M., Karatzas, D. & Jawahar, C. DocVQA: a dataset for VQA on document images. In: Proc. WACV (2021).
Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T. & Saenko, K. Object hallucination in image captioning. In Proc. EMNLP (2018).
Liu, Y. et al. TextMonkey: An OCR-free large multimodal model for understanding document. Preprint at https://arxiv.org/abs/2403.04473 (2024).
He, M. et al. Efficient multimodal learning from data-centric perspective. Preprint at https://arxiv.org/abs/2402.11530 (2024).
XTuner Contributors. XTuner: a toolkit for efficiently fine‑tuning LLM. In GitHub repository (2023).
Li, B. et al. LLaVA‑NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild. In LLaVA blog (2024).
Tong, S. et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. In Proc. NeurIPS (2024).
Liu, H. et al. LLaVA‑NeXT: Improved reasoning, OCR, and world knowledge. In LLaVA blog (January 30, 2024).
Chu, X. et al. MobileVLM: a fast, reproducible and strong vision language assistant for mobile devices. Preprint at https://arxiv.org/abs/2312.16886 (2023).
Anthropic. Introducing the next generation of Claude. In Claude 3 blog (2024).
Liu, H., Li, C., Li, Y. & Lee, Y. J. Improved baselines with visual instruction tuning. In: Proc. CVPR (2024).
Georgi Gerganov and the llama.cpp Group. llama.cpp: LLM inference in C/C++ with cross‑platform support and hardware-optimized quantization. In GitHub repository (2023).
OpenAI. Hello GPT‑4o: our new flagship, multimodal model for real-time reasoning across text, vision, and audio. In OpenAI blog (May 13, 2024).
DeepMind. Project Astra: a research prototype exploring capabilities toward a universal AI assistant. In Google DeepMind Models (2024).
Schuhmann, C. et al. LAION-5B: An open large-scale dataset for training next generation image-text models. NeurIPS 35, 25278–25294 (2022).
Byeon, M. et al. COYO-700M: Image-text pair dataset. In GitHub repository (2022).
Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://arxiv.org/abs/2303.12712 (2023).
Hu, S. et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. In Proc. COLM (2024).
Lin, T.-Y. et al. Microsoft COCO: Common objects in context. In: ECCV (Eds. Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) 740–755 (Springer, 2014).
Krishna, R. et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123, 32–73 (2017).
Sharma, P., Ding, N., Goodman, S. & Soricut, R. Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018).
Changpinyo, S., Sharma, P., Ding, N. & Soricut, R. Conceptual 12M: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: CVPR (2021).
Wu, J. et al. AI Challenger: a large-scale dataset for going deeper in image understanding. Preprint at https://arxiv.org/abs/1711.06475 (2017).
Gu, J. et al. Wukong: a 100 million large-scale Chinese cross-modal pre-training benchmark. NeurIPS 35, 26418–26431 (2022).
Xie, C., Li, J. & Zhang, B. CCMB: A Large-scale Chinese Cross-modal Benchmark. Preprint at https://arxiv.org/abs/2205.03860 (2022).
Srinivasan, K., Raman, K., Chen, J., Bendersky, M. & Najork, M. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In: SIGIR 2443–2449 (2021).
Biten, A. F., Tito, R., Gomez, L., Valveny, E. & Karatzas, D. OCR-IDL: OCR annotations for industry document library dataset. In: ECCV (Eds. Avidan, S., Brostow, G., Cissé, M., Farinella, G. M. & Hassner, T.) 241–252 (Springer, 2022).
Gupta, A., Vedaldi, A. & Zisserman, A. Synthetic data for text localisation in natural images. In: CVPR 2315–2324 (2016).
Kim, G. et al. OCR-free document understanding transformer. In: ECCV (Eds. Avidan, S., Brostow, G., Cissé, M., Farinella, G. M. & Hassner, T.) (2022).
Li, L. et al. Multimodal ArXiv: a dataset for improving scientific comprehension of large vision-language models. In Proc. ACL (2024).
Plummer, B. A. et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: ICCV 2641–2649 (2015).
Gao, H. et al. Are you talking to a machine? dataset and methods for multilingual image question. NeurIPS 28 (2015).
Lu, P. et al. IconQA: a new benchmark for abstract diagram understanding and visual language reasoning. In Proc. NeurIPS Datasets and Benchmarks Track (2021).
Hudson, D. A. & Manning, C. D. GQA: a new dataset for real-world visual reasoning and compositional question answering. In: CVPR 6700–6709 (2019).
Antol, S. et al. VQA: Visual question answering. In: ICCV 2425–2433 (2015).
Johnson, J. et al. CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR 2901–2910 (2017).
Gurari, D. et al. VizWiz Grand Challenge: answering visual questions from blind people. In: CVPR 3608–3617 (2018).
Zhu, Y., Groth, O., Bernstein, M. & Fei-Fei, L. Visual7W: Grounded question answering in images. In: CVPR (2016).
Ren, M., Kiros, R. & Zemel, R. Exploring models and data for image question answering. NeurIPS 28 (2015).
Marino, K., Rastegari, M., Farhadi, A. & Mottaghi, R. OK-VQA: A visual question answering benchmark requiring external knowledge. In: CVPR 3195–3204 (2019).
Schwenk, D., Khandelwal, A., Clark, C., Marino, K. & Mottaghi, R. A-OKVQA: A benchmark for visual question answering using world knowledge. In: ECCV (Eds. Avidan, S., Brostow, G., Cissé, M., Farinella, G. M. & Hassner, T.) 146–162 (Springer, 2022).
Shah, S., Mishra, A., Yadati, N. & Talukdar, P. P. KVQA: knowledge-aware visual question answering. AAAI 33, 8876–8884 (2019).
Lu, P. et al. Learn to explain: multimodal reasoning via thought chains for science question answering. NeurIPS 35, 2507–2521 (2022).
Yu, L., Poirson, P., Yang, S., Berg, A. C. & Berg, T. L. Modeling context in referring expressions. In: ECCV. (Eds. Leibe, B., Matas, J., Sebe, N. & Welling, M.). 69–85 (Springer, 2016).
Du, Y. et al. What makes for good visual instructions? Synthesizing complex visual reasoning instructions for visual instruction tuning. In Proc. COLING (2025).
Zellers, R., Bisk, Y., Farhadi, A. & Choi, Y. From recognition to cognition: visual commonsense reasoning. In: CVPR (2019).
Suhr, A., Lewis, M., Yeh, J. & Artzi, Y. A corpus of natural language for visual reasoning. In: ACL 217–223 (2017).
Liu, F. et al. Mitigating hallucination in large multi-modal models via robust instruction tuning. In Proc. ICLR (2024).
Chen, J. et al. GeoQA: a geometric question answering benchmark towards multimodal numerical reasoning. In Proc. ACL Findings (2021).
Cherian, A., Peng, K.-C., Lohit, S., Smith, K. A. & Tenenbaum, J. B. Are deep neural networks smarter than second graders? In: CVPR 10834–10844 (2023).
Mishra, A., Shekhar, S., Singh, A. K. & Chakraborty, A. OCR-VQA: Visual question answering by reading text in images. In: ICDAR. (2019).
Biten, A. F. et al. Scene text visual question answering. In: CVPR 4291–4301 (2019).
Tanaka, R., Nishida, K. & Yoshida, S. VisualMRC: Machine reading comprehension on document image. AAAI 35, 13878–13888 (2021).
Kafle, K., Price, B., Cohen, S. & Kanan, C. DVQA: understanding data visualizations via question answering. In: CVPR 5648–5656 (2018).
Kahou, S. E. et al. FigureQA: an annotated figure dataset for visual reasoning. In Proc. ICLR (2018).
Masry, A., Long, D. X., Tan, J. Q., Joty, S. & Hoque, E. ChartQA: a benchmark for question answering about charts with visual and logical reasoning. In Proc. ACL Findings (2022).
Svetlichnaya, S. DeepForm: understand structured documents at scale. In Weights & Biases report (2020).
Chen, W. et al. TabFact: a large-scale dataset for table-based fact verification. In Proc. ICLR (2020).
Mathew, M. et al. InfographicVQA. In: WACV 1697–1706 (2022).
Stanisławek, T. et al. Kleister: Key information extraction datasets involving long documents with complex layouts. In: ICDAR (Eds. Lladós, J., Lopresti, D. & Uchida, S.). 564–579 (Springer, 2021).
Pasupat, P. & Liang, P. Compositional semantic parsing on semi-structured tables. In Proc. ACL (2015).
Ahmed, S., Jawade, B., Pandey, S., Setlur, S. & Govindaraju, V. RealCQA: Scientific chart question answering as a test-bed for first-order logic. In: ICDAR (Eds. Fink, G. A., Jain, R., Kise, K. & Zanibbi, R.). 66–83 (Springer, 2023).
Kembhavi, A. et al. A diagram is worth a dozen images. In: ECCV (Eds. Leibe, B., Matas, J., Sebe, N. & Welling, M.). 235–251 (Springer, 2016).
Shin, A., Ushiku, Y. & Harada, T. The color of the cat is gray: 1 million full-sentences visual question answering (FSVQA). Preprint at https://arxiv.org/abs/1609.06657 (2016).
Das, A. et al. Visual Dialog. In: CVPR 326–335 (2017).
Zhang, Y. et al. LLaVAR: enhanced visual instruction tuning for text-rich image understanding. Preprint at https://arxiv.org/abs/2306.17107 (2023).
Carter, J. TextOCR‑GPT4V. In Hugging Face dataset repository (2024).
Zhao, B., Wu, B. & Huang, T. SVIT: scaling up visual instruction tuning. Preprint at https://arxiv.org/abs/2307.04087 (2023).
Yu, T. et al. Reformulating vision-language foundation models and datasets towards universal multimodal assistants. Preprint at https://arxiv.org/abs/2310.00653 (2023).
Chen, L. et al. ShareGPT4V: Improving large multi-modal models with better captions. In Proc. ECCV (2024).
Gupta, A., Dollar, P.& Girshick, R. LVIS: a dataset for large vocabulary instance segmentation. In: CVPR 5356–5364 (2019).
Chen, G. H. et al. ALLaVA: harnessing GPT4V-synthesized data for a lite vision-language model. Preprint at https://arxiv.org/abs/2402.11684 (2024).
Ding, N. et al. Enhancing chat language models by scaling high-quality instructional conversations. In Proc. EMNLP (2023).
Taori, R. et al. Stanford Alpaca: An instruction‑following LLaMA model. In GitHub repository (2023).
Zheng, L. et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS 36 (2024).
BELLEGroup. BELLE: Be Everyone’s Large Language model Engine. In GitHub repository (2023).
Lian, W. et al. OpenOrca: an open dataset of GPT‑augmented FLAN reasoning traces. In Hugging Face dataset repository (2023).
Teknium. OpenHermes 2.5: An open dataset of synthetic data for generalist LLM assistants. In Hugging Face dataset repository (2023).
Acknowledgements
Z.L. and M.S. are supported by the Research Project 2025QGS16007. Y.Y. is supported by the Shanghai Qi Zhi Institute Innovation Program SQZ202410.
Author information
Contributions
Yuan Yao initiated and led the research project, and designed the models and experiments. Tianyu Yu, Chongyi Wang, Junbo Cui, Hongji Zhu, Haoyu Li, Zhihui He, Haoye Zhang, Zhi Zheng, Jie Zhou and Jie Cai contributed to the experiments. Chi Chen, Ao Zhang and Yuan Yao wrote the paper. Tianchi Cai, Weilin Zhao, Qianyu Chen, Ronghua Zhou and Zhensheng Zou contributed to the open-source work. Shengding Hu, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu and Maosong Sun provided valuable suggestions and proofread the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yao, Y., Yu, T., Zhang, A. et al. Efficient GPT-4V level multimodal large language model for deployment on edge devices. Nat Commun 16, 5509 (2025). https://doi.org/10.1038/s41467-025-61040-5