Abstract
Multimodal large language models have revolutionized AI research and industry, paving the way toward the next milestone. However, their large sizes and high computational costs restrict deployment to cloud servers, limiting use in mobile, offline, energy-sensitive, or privacy-critical scenarios. We present MiniCPM-V, efficient models for edge devices that integrate advancements in architecture, training, and data. The 8B model outperforms GPT-4V, Gemini Pro, and Claude 3 across 11 public benchmarks, processes high-resolution images at any aspect ratio, achieves robust optical character recognition, exhibits low hallucination rates, and supports over 30 languages while running efficiently on mobile phones. This progress reflects a broader trend: The sizes for high-performing models are rapidly decreasing alongside growing edge computation capacity, enabling advanced multimodal models to operate locally on consumer hardware. Such developments unlock applications across diverse real-world scenarios, from enhanced mobile AI to privacy-preserving solutions, marking a critical step toward democratizing powerful multimodal intelligence.
Introduction
The rapid development of Multimodal Large Language Models (MLLMs)1,2,3,4,5,6,7,8,9,10,11 has brought an impressive surge in multimodal capabilities in understanding, reasoning and interaction. This has not only fundamentally reshaped the landscape of AI research and industry, but also shed light on a promising path towards the next AI milestone. However, current MLLMs are still far from practical in real-world applications. One of the most prominent challenges is the heavy computational burden imposed by the massive number of parameters of MLLMs. As a result, most MLLMs can only be deployed on high-performing cloud servers, leading to significant energy consumption and carbon emissions. This significantly constrains potential application scopes, such as mobile devices, energy-sensitive scenarios, offline settings without stable network connections, and privacy- or security-sensitive scenarios for both personal and industrial users.
In light of these limitations, there is growing interest in exploring more efficient lightweight MLLMs1,3,11,12 that can run on edge devices. Edge scenarios encompass a broad range of equipment, including mobile phones, personal computers, vehicles and robots, which are ubiquitous in users’ daily lives and are experiencing rapid advancements in computation capacity. On-device MLLMs provide a promising path towards more practical applications due to their broader usage scope, better computation efficiency, more robust offline behavior, and better privacy/security protection.
However, developing capable on-device MLLMs is challenging due to significantly constrained parameter and inference computation budgets. As a result, more careful architecture designs and training recipes are required to fully unleash the potential of on-device MLLMs. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on edge devices. The philosophy of MiniCPM-V is to achieve a good balance between performance and efficiency, an objective that matters even more in real-world applications. From February to May 2024, we unveiled three models: (1) In February, we launched MiniCPM-V 1.0 2B, an early prototype of on-device MLLMs. (2) In April, we released MiniCPM-V 2.0 2B, which outperforms strong larger MLLMs such as Qwen-VL 9B7, CogVLM 17B5, and Yi-VL 34B13. This iteration also introduces support for high-resolution image perception and exhibits promising OCR capabilities. (3) Most recently, in May, we introduced MiniCPM-Llama3-V 2.5 8B, which outperforms the proprietary GPT-4V-1106, Gemini Pro and Claude 3 on the OpenCompass evaluation. Noteworthy features of this model include strong OCR capability, high-resolution image perception, trustworthy behavior, multilingual support, and efficient edge deployment optimization. The capabilities of on-device MLLMs have grown even stronger in our later releases since May 2024.
More importantly, MiniCPM-V can be viewed as a representative example of a promising miniaturization trend of MLLMs. Figure 1 summarizes the recent development of MLLMs3,12,14 in terms of performance, parameters and release time. We observe an interesting trend akin to Moore’s Law15 indicated by the red line: the sizes of models reaching GPT-4V level performance are rapidly decreasing over time. This phenomenon could perhaps be called Moore’s Law of MLLMs. Simultaneously, the computational capacity of edge devices such as phones and personal computers is steadily increasing (qualitatively depicted by the blue line). The convergence of these two trends indicates usable (e.g., GPT-4V level) MLLMs deployable on edge devices are soon within reach, opening up broader possibilities and benefiting more application scenarios in the near future. From a historical perspective of human technology development, this trend can also be viewed as the human pursuit of miniaturization of state-of-the-art technologies, which has been repeatedly witnessed in other science and technology fields. For example, in aerospace, the latest SpaceX Raptor 2 rocket engine can achieve a strong thrust of 2,256 kN with a mass of 1.6 tons, whereas 20 years ago, the RD-0750 rocket engine could only achieve a thrust of 1,413 kN with a mass exceeding 4 tons16.
The red line shows the decreasing model sizes for achieving GPT-4V level performance, while the blue line represents the growing edge device computation capacity. This jointly shows that GPT-4V level MLLMs deployed on edge devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.
MiniCPM-V Series Techniques
In this paper, we will take MiniCPM-Llama3-V 2.5 as an example, and systematically introduce the notable features of MiniCPM-V series and the key techniques behind them:
- Leading Performance. MiniCPM-Llama3-V 2.5 achieves better performance than GPT-4V-1106, Gemini Pro and Claude 3 on the OpenCompass collection, a comprehensive evaluation over 11 popular benchmarks. This is jointly contributed by its careful design in architecture, data and training recipes, which we detail in the following.
- Strong OCR Capability. MiniCPM-Llama3-V 2.5 outperforms GPT-4V, Gemini Pro and Qwen-VL-Max on OCRBench. It also supports high-utility functions such as table-to-markdown conversion and full OCR content transcription. These are largely attributed to its high-resolution (up to 1.8 million pixels, e.g., 1344 × 1344) image perception technique at any aspect ratio17.
- Trustworthy Behavior. Based on the RLAIF-V18 and RLHF-V19 techniques that align MLLM behaviors with high-quality AI/human feedback, MiniCPM-Llama3-V 2.5 exhibits more trustworthy behaviors, achieving lower hallucination rates than GPT-4V-1106 on Object HalBench.
- Multilingual Support. Inspired by the findings from VisCPM8, the integration of a multilingual LLM significantly alleviates the heavy reliance on multimodal training data in low-resource languages. Building on this foundation, lightweight multilingual multimodal instruction tuning helps MiniCPM-Llama3-V 2.5 generalize its multimodal capabilities to more than 30 languages.
- Efficient Edge Deployment. We systematically integrate a suite of on-device optimization techniques, encompassing quantization, memory optimization, compilation optimization and NPU acceleration, enabling efficient deployment on edge devices.
We hope the MiniCPM-V series can serve as an example for unveiling the potential of on-device MLLMs, and help draw more attention to promote research in this direction. Following the Moore’s Law of MLLMs, we believe there will be increasingly powerful on-device MLLMs with reduced sizes, bringing efficient, safe, and trustworthy AI services to devices soon.
Results
Overview of MiniCPM-V
As shown in Fig. 2b, MiniCPM-V comprises three key modules: the visual encoder, compression layer, and LLM. The input image is first encoded by a visual encoder, utilizing the adaptive visual encoding approach. The visual tokens are then compressed by the compression layer, which adopts a perceiver resampler structure with one layer cross-attention. Finally, the compressed visual tokens, along with the text input, are fed into the LLM for conditional text generation.
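To make this dataflow concrete, the following PyTorch-style sketch is our own illustration rather than the released implementation: the module sizes, names, and the use of nn.MultiheadAttention as a stand-in perceiver resampler are assumptions. It shows how a fixed set of learnable queries compresses each slice’s ViT features before they are concatenated with the text embeddings for conditional generation.

```python
import torch
import torch.nn as nn

class VisualCompressor(nn.Module):
    """One-layer cross-attention resampler: a fixed set of learnable queries
    attends to the ViT patch features of one image slice."""
    def __init__(self, num_queries=96, dim=1024, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feats):                      # (B, N_patches, dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, patch_feats, patch_feats)
        return out                                       # (B, num_queries, dim)

def multimodal_forward(vit, compressor, llm, image_slices, text_embeds):
    """image_slices: (B, S, 3, H, W) pre-partitioned slices; text_embeds: (B, T, dim).
    `vit` and `llm` are any callables with the shapes assumed in the comments."""
    B, S = image_slices.shape[:2]
    feats = vit(image_slices.flatten(0, 1))              # (B*S, N_patches, dim)
    vis_tokens = compressor(feats).reshape(B, -1, feats.size(-1))  # (B, S*num_queries, dim)
    inputs = torch.cat([vis_tokens, text_embeds], dim=1)
    return llm(inputs_embeds=inputs)                     # conditional text generation
```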
a Conventional visual encoding requires a large number of tokens when encoding high-resolution images. Our proposed adaptive visual encoding strategy employs adaptive image partitioning and compressed encoding, significantly reducing computational costs when processing high-resolution images. b Overall structure presents the architecture of the model including the visual encoder, shared compression layer, and LLM. c The progressive multimodal learning strategy is applied to train MiniCPM-V, encompassing three phases: the pre-training phase, supervised fine-tuning phase, and alignment phase. d RLAIF-V framework for hallucination reduction. (1) Response generation produces multiple responses for an instruction using the policy model. (2) Feedback collection evaluates the correctness of each response in a divide-and-conquer fashion. (3) DPO optimizes the model on the preference dataset.
Encoding high-resolution images poses two major challenges. In terms of efficiency, directly encoding high-resolution images results in an excessive number of visual tokens, rendering it computationally prohibitive for edge devices. In terms of effectiveness, the considerable discrepancy between the image resolution and the resolution employed during ViT pre-training can lead to out-of-distribution problems and therefore substantially degrade encoding performance. To address these challenges, we take advantage of the adaptive visual encoding strategy17 as shown in Fig. 2a. To handle high-resolution images with different aspect ratios, we divide images into slices, where each slice better matches ViT’s pre-training setting in terms of resolution and aspect ratio. Each image is divided into a maximum of 10 slices, supporting up to 1.8 million pixels (e.g., 1344 × 1344 resolution) in total during encoding, which covers most real-world application scenarios. We then adjust each slice by resizing it proportionally so that the resultant area matches the ViT pre-training area, and interpolate the ViT’s position embeddings to adapt to the slice’s aspect ratio. After visual encoding, each slice is encoded into 1,024 tokens, so 10 slices can yield over 10k tokens collectively. To manage this high token count, we employ a compression module comprising one-layer cross-attention and a moderate number of queries with 2D positional information7. In practice, the visual tokens of each slice are compressed into 64 queries for MiniCPM-V 1.0 & 2.0 and 96 queries for MiniCPM-Llama3-V 2.5 through this layer. Compared with other MLLMs of competitive performance, the significantly smaller number of visual tokens in the MiniCPM-V series enables superior efficiency in terms of GPU memory consumption, inference speed, first-token latency and power consumption, making it more friendly to wider application scopes and communities. Finally, we introduce a spatial schema inspired by20 to indicate each slice’s position relative to the whole image. We first wrap the tokens of each slice with two special tokens “<slice>” and “</slice>”, and then employ a special token “\n” to separate slices from different rows.
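As a back-of-the-envelope check of the efficiency claim (our own arithmetic based on the figures quoted above, not a reported measurement):

```python
# Rough visual-token budget implied by the numbers above.
RAW_TOKENS_PER_SLICE = 1024     # ViT outputs per slice before compression
QUERIES_PER_SLICE = 96          # MiniCPM-Llama3-V 2.5 (64 for MiniCPM-V 1.0 & 2.0)
MAX_SLICES = 10

raw = RAW_TOKENS_PER_SLICE * MAX_SLICES       # 10,240 tokens without compression
compressed = QUERIES_PER_SLICE * MAX_SLICES   # at most 960 tokens after compression
print(raw, compressed, round(raw / compressed, 1))   # 10240 960 10.7
```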
We adopt a three-phase progressive multimodal learning strategy as shown in Fig. 2c, which consists of the pre-training phase, supervised fine-tuning phase, and alignment phase. In the first phase, we utilize large-scale image-text pairs for MLLM pre-training to align the visual modules (i.e., visual encoder and compression layer) with the input space of the LLM and to acquire foundational multimodal knowledge. The pre-training phase can be further divided into three stages. In the first stage, the compression layer is warmed up. The second stage extends the input resolution of the pre-trained visual encoder. Finally, in the third stage, the visual modules are trained using the adaptive visual encoding strategy, allowing them to effectively handle high-resolution inputs with any aspect ratio. In the second phase, we perform Supervised Fine-Tuning (SFT) on high-quality visual question answering datasets to further learn knowledge and interaction capability from human annotations. We unlock all model parameters to better exploit the data and learn rich knowledge during the SFT phase. We also conduct a lightweight yet high-quality SFT process as in VisCPM8 to enhance alignment with languages beyond English and Chinese, achieving strong multimodal performance across more than 30 languages. In the alignment phase, we employ the recent RLAIF-V18 approach to address the hallucination problem (Fig. 2d), where the MLLM generates responses that are not factually grounded in the input image19. The first step of RLAIF-V is to generate multiple responses for a given instruction using the policy model. Then a divide-and-conquer strategy is applied for response scoring. After collecting the high-quality AI feedback, we perform preference learning via the DPO21 method.
This paper introduces the first three models in the MiniCPM-V series: MiniCPM-V 1.0, MiniCPM-V 2.0, and MiniCPM-Llama3-V 2.5. MiniCPM-V 1.0 is trained with pre-training stages 1 & 2 and SFT, without adaptive visual encoding or RLAIF-V. For MiniCPM-V 2.0, we include all training stages and the adaptive visual encoding strategy to further improve performance. MiniCPM-Llama3-V 2.5 adopts Llama3-Instruct 8B as its base LLM, showcasing strong multimodal understanding capabilities, as illustrated in Fig. 3.
Evaluation across diverse multimodal understanding benchmarks
We perform a comprehensive evaluation on popular benchmarks covering visual question answering, multimodal conversation, knowledge and reasoning, OCR, and hallucination. (1) General benchmarks. We adopt OpenCompass22 as the general evaluation indicator, which is a comprehensive collection of 11 popular multimodal benchmarks, including MME23, MMBench24, MMMU25, MathVista26, LLaVA Bench4, etc. We also report results on RealWorldQA for real-world spatial understanding capabilities. (2) OCR benchmarks. We adopt three widely used benchmarks for OCR capability evaluation, including OCRBench27, TextVQA28 and DocVQA29. (3) Hallucination benchmarks. We also include Object HalBench19,30 to evaluate the trustworthiness of the models.
We compare with strong baselines in different series: For open-source models, we compare with strong models including Yi-VL-6B/34B13, Qwen-VL-Chat-9B7, DeepSeek-VL-7B3, TextMonkey31, CogVLM-Chat-17B5, CogVLM2-Llama3-19B5, Idefics2-8B9, Bunny-Llama-3-8B32, XTuner-Llama-3-8B-v1.133, LLaVA-NeXT-Llama-3-8B34, Cambrian-8B/34B35, LLaVA-NeXT-Yi-34B36, DeepSeek-VL-1.3B3, MobileVLM V237, Mini-Gemini14 and Phi-3-Vision-128k-instruct12. For proprietary models, we compare with GPT-4V-11062, Gemini-Pro1 and Claude 3 Opus38.
From the experimental results in Fig. 4, we have the following observations: (1) MiniCPM-Llama3-V 2.5 outperforms strong open-source models by a notable margin. For instance, MiniCPM-Llama3-V 2.5 surpasses the recent strong Idefics2-8B by 7.9 points on the OpenCompass benchmark, with similar model sizes. It also achieves better results than significantly larger models such as Cambrian-34B, LLaVA-NeXT-Yi-34B, Yi-VL-34B and CogVLM2-Llama3-19B. (2) Compared with powerful proprietary models, such as GPT-4V-1106 and Gemini Pro, MiniCPM-Llama3-V 2.5 achieves better performance on the OpenCompass benchmark with significantly fewer parameters. In addition, MiniCPM-Llama3-V 2.5 also achieves lower hallucination rates than GPT-4V-1106 on Object HalBench, indicating its trustworthiness for real-world applications. (3) The smaller MiniCPM-V 2.0 with 2B parameters achieves significantly better performance compared with other 2B ~ 4B models, and is even comparable with 8B MLLMs such as Bunny-Llama-3-8B. In summary, the results show that MiniCPM-V series achieves a good balance between performance and efficiency, making it more friendly for broader communities and applications.
MiniCPM-Llama3-V 2.5, with only 8 billion parameters, outperforms leading open-source MLLMs and achieves superior results on the OpenCompass benchmark compared to proprietary models like GPT-4V-1106 and Gemini Pro. In addition, MiniCPM-V 2.0, with 2 billion parameters, significantly outperforms other MLLMs with fewer than 4 billion parameters.
MiniCPM-V models also show strong OCR capabilities, including scene-text, document and screenshot understanding. As shown in Fig. 5a, MiniCPM-Llama3-V 2.5 outperforms open-source MLLMs ranging from 1.7B to 34B parameters on OCRBench, TextVQA, and DocVQA. Its performance on these datasets is even comparable to proprietary models like GPT-4V-1106 and Gemini Pro. MiniCPM-V 2.0 also achieves significantly better performance among models in the 2B–4B parameter range (Fig. 5b).
Based on the multilingual multimodal generalization approach from VisCPM, MiniCPM-Llama3-V 2.5 extends its multimodal capability to over 30 languages. As shown in Fig. 5c, MiniCPM-Llama3-V 2.5 can outperform Yi-VL 34B and Phi-3-vision-128k-instruct on the multilingual LLaVA Bench. The promising multilingual multimodal capability makes MiniCPM-Llama3-V 2.5 useful in serving larger linguistic groups.
To investigate the effectiveness of key components, we perform an ablation study on high-resolution perception and the multi-stage training pipeline. To ablate high-resolution perception, we follow the standard method in LLaVA-1.539 to downsample high-resolution images into low-resolution versions (i.e., 448 × 448) in both training and inference. To ablate the multi-stage training pipeline, we follow the standard two-stage training (i.e., pre-training and instruction tuning) in LLaVA-1.539. Due to the high computational costs of model training on full data, we perform the ablation on a subset of the full training data by randomly sampling 10% of the data from each dataset, resulting in 70M samples in total, which is sufficient for validating a frontier MLLM. From the experimental results in Table 1, we can see that both high-resolution perception and the multi-stage training pipeline contribute to the final performance. The reason is that high-resolution perception is crucial for MLLMs to perceive fine-grained visual details, especially for OCR-related tasks, and multi-stage training can better fit and exploit training data of different forms and qualities.
Moreover, it is worth noting that MiniCPM-Llama3-V 2.5 requires significantly less inference computation. For example, the visual token number range of MiniCPM-Llama3-V 2.5 is (96, 960), which is lower than LLaVA-NeXT-Llama-3-8B’s (1728, 2880). This can be important especially for real-world on-device applications in terms of inference speed, first-token latency, memory usage, and power consumption.
Specifically, we provide a comparison of the computational costs between the adaptive visual encoding method and the vanilla visual encoding method (i.e., visual features of the original image from ViT are directly projected and input into the LLM, as in LLaVA-1.539). From the results in Fig. 2a, we can see that compared with the standard method, adaptive visual encoding largely reduces both FLOPs and GPU memory usage in both ViT and LLM for high-resolution images. The reason is that, compared with the standard method, image slicing prevents the quadratic computation growth of ViT, and the compression layer largely reduces the number of visual tokens to LLMs.
Efficient Deployment of MiniCPM-V on Edge Devices
In this section, we investigate the deployment of MiniCPM-V on edge devices. Edge devices such as smartphones and computers often face resource constraints due to factors like heat dissipation, size limitations, and power consumption. When deploying models, the two most critical limitations are memory capacity and CPU/GPU processing speed. High-performance servers typically boast extensive memory capacities, often exceeding 100 GB or even 1 TB. In contrast, the memory available on mobile phones typically ranges from 12 GB to 16 GB, which can be insufficient for MLLM deployment. Moreover, the overall processing speeds of smartphone CPUs are notably slower: the Snapdragon 8 Gen 3 features 8 CPU cores, whereas a high-performance server CPU such as the Intel Xeon Platinum 8580 has 60 cores. Similarly, mobile phone GPUs are not as powerful as server GPUs. For example, the Qualcomm Adreno 750 delivers only about 6 TFLOPS, while an NVIDIA RTX 4090 can reach 83 TFLOPS.
To deploy the MLLM on edge devices, we first employ quantization to reduce memory cost. For MiniCPM-Llama3-V 2.5, the fp16 model typically demands 16–17 GB of memory. We opt for the Q4_K_M 4-bit quantization strategy within the GGML framework, which reduces the memory requirement to around 5 GB and is friendly to mobile phone usage. We then empirically investigate the deployment results on different frameworks. Several frameworks have been proposed for on-device deployment. As illustrated in Fig. 6, we make a thorough investigation of different frameworks for different chip types, including CPU, GPU, and NPU. Given the ubiquity of CPU usage across devices, we prioritize this chip type and opt for the llama.cpp40 framework. Combining quantization and llama.cpp on a Xiaomi 14 Pro (Snapdragon 8 Gen 3), the model achieves a text encoding latency of 64.2 s and a text decoding speed of 1.3 tokens/s (as depicted in Fig. 6f), which is still far from acceptable for users.
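The memory figures can be sanity-checked with simple arithmetic (an estimate under stated assumptions: roughly 8.5B total parameters including the visual modules, and Q4_K_M averaging about 4.5 bits per weight; runtime buffers and the KV cache add further overhead):

```python
params = 8.5e9                       # ~8B LLM plus visual encoder and compression layer (approximate)
fp16_gb = params * 2 / 1e9           # 2 bytes per weight   -> ~17 GB
q4_km_gb = params * 4.5 / 8 / 1e9    # ~4.5 bits per weight -> ~4.8 GB
print(f"fp16: {fp16_gb:.1f} GB, Q4_K_M: {q4_km_gb:.1f} GB")
```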
a–d Advanced techniques used in the deployment of MiniCPM-V on edge devices, including (a) memory usage optimization, (b) compilation optimization, (c) configuration optimization, and (d) NPU acceleration. e The influence of these techniques on the encoding latency and decoding throughput. The results are tested on the Xiaomi 14 Pro (Snapdragon 8 Gen 3). No opt.: non-optimized, mem. opt.: memory usage optimization, comp. opt.: compilation optimization, config. opt.: configuration optimization, NPU: NPU acceleration. f Results on different edge devices. We show the encoding latency and decoding throughput across different device types. Xiaomi 14 Pro is the only device with NPU.
To achieve better acceleration, we further investigate a series of advanced techniques including memory usage optimization, compilation optimization, configuration optimization, and NPU acceleration, as shown in Fig. 6a–d. We first explore memory usage optimization strategies to address the image processing bottleneck of the inference speed due to limited memory resources on mobile phones. Instead of loading both ViT and LLM simultaneously into memory, we adopt a sequential loading approach. Specifically, we first load ViT for visual encoding, followed by the LLM for visual and text token encoding. By releasing the large amount of memory occupied by LLM, we can prevent frequent paging (swapping in and out) during ViT encoding, thereby improving the program efficiency. This optimization technique, as illustrated in Fig. 6e, results in a notable reduction of image processing time from 45.2s to 31.5s. We also find that directly compiling the models on the target devices can significantly improve the encoding latency and the decoding throughput. This can be attributed to better consistency between the compilation and target device instruction set architecture. As depicted in Fig. 6e, this optimization endeavor yields promising results. Encoding latency shows a notable reduction from 50.5s to 17.0s, while decoding throughput experiences a significant boost from 1.3 tokens/s to 3.2 tokens/s. Next, instead of relying on a single default configuration of the llama.cpp framework, we propose an automated parameter search algorithm that dynamically identifies the optimal configurations (e.g., computational distribution across CPU cores) tailored to various edge devices. Through configuration optimization, we can achieve good improvements. Specifically, decoding throughput surged from 3.2 tokens/s to an impressive 8.2 tokens/s, surpassing the typical human reading speed. Finally, we leverage Neural Processing Units (NPUs), a class of specialized hardware designed to accelerate AI applications, available in certain smartphones. Recognized for their ability to address computational bottlenecks, NPUs enable accelerated visual encoding. Specifically, we replace the backend framework of ViT with QNN while retaining the llama.cpp backend for the language model component. On mobile devices equipped with Qualcomm NPUs, this optimization yields a notable reduction in visual encoding time, decreasing from 3.7 to 1.3 s.
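The automated configuration search can be pictured as a benchmark loop over runtime knobs; the sketch below is schematic only (the run_decode_benchmark callback and the searched parameters are hypothetical stand-ins, not the actual tooling):

```python
import itertools

def search_best_config(run_decode_benchmark, thread_counts=(2, 4, 6, 8),
                       batch_sizes=(1, 8, 16)):
    """Grid-search llama.cpp-style runtime knobs on the target device and keep
    the configuration with the highest measured decoding throughput (tokens/s)."""
    best_cfg, best_tps = None, 0.0
    for n_threads, n_batch in itertools.product(thread_counts, batch_sizes):
        tps = run_decode_benchmark(n_threads=n_threads, n_batch=n_batch)  # hypothetical callback
        if tps > best_tps:
            best_cfg, best_tps = {"n_threads": n_threads, "n_batch": n_batch}, tps
    return best_cfg, best_tps
```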
For a comprehensive assessment of MiniCPM-Llama3-V 2.5’s performance across various edge devices, we present test results on the Xiaomi 14 Pro (Snapdragon 8 Gen 3), vivo X100 Pro (MediaTek Dimensity 9300), MacBook Pro (M1), and Jetson AGX Orin 32G in Fig. 6f. Thanks to the deployment optimization techniques, MiniCPM-Llama3-V 2.5 can operate efficiently on both mobile phones and personal computers, delivering acceptable latency and throughput. For instance, leveraging the NPU on the Xiaomi 14 Pro enables it to achieve a similar encoding speed to the M1 MacBook Pro. Furthermore, nearly all devices exhibit comparable or higher throughput compared with human reading speed. Upon analyzing the results, it becomes evident that the current computation bottleneck primarily stems from LLM prefilling, which mainly involves encoding image and text tokens for LLM inference. Promising research directions involve developing more efficient visual encoding methods with fewer visual tokens, and better leveraging GPU/NPU acceleration for LLM encoding. With increasing attention to on-device MLLMs and the rapid advancement of GPU/NPU acceleration techniques, we believe that real-time interaction with on-device MLLMs can be reached soon.
Discussion
The MiniCPM-V series models represent an initial exploration of powerful on-device MLLMs. Thanks to techniques such as adaptive visual encoding, multilingual generalization, and the RLAIF-V method, MiniCPM-Llama3-V 2.5 can achieve GPT-4V level performance with significantly fewer parameters. Leveraging diverse optimization techniques for edge deployment, the model ensures an acceptable user experience on mobile phones.
Despite promising performance, several limitations remain with the current MiniCPM-V models. (1) Capability Depth. There is still plenty of room for improvement in multimodal understanding capability and inference efficiency. (2) Capability Width. In addition to the image modality, it is promising to expand MLLM capabilities to other modalities, such as video and audio, where GPT-4o41 and Google Astra42 have given good examples.
In addition to MLLM capabilities, edge deployment also presents unique challenges. Inference speed and latency are still far from ideal, and the model service can be limited by battery capacity. In addition, previous efforts on chips and deployment frameworks mainly target CNNs and LSTMs, which can be sub-optimal for MLLMs. Efforts tailored to MLLMs can bring ample room for improvement.
Considering the current limitations and the promising future of on-device MLLMs, we anticipate increasing efforts from both academia and industry in enhancing model capabilities in terms of depth and width, and improving smartphone chips and deployment frameworks. We believe that simultaneous advancements in model capability and edge device capacity will lead to on-device applications providing a satisfying user experience in the near future.
Methods
Adaptive visual encoding
Image partition
To process high-resolution images with varying aspect ratios, we divide them into slices17. Each slice is adjusted to more closely align with ViT’s pre-training settings in terms of both resolution and aspect ratio. Specifically, we first calculate the ideal number of slices based on the input image size. Given an image with resolution \(({W}_{I},{H}_{I})\) and a ViT pre-trained on images with resolution \(({W}_{v},{H}_{v})\), we calculate the ideal slice number \(N=\lceil \frac{{W}_{I}\times {H}_{I}}{{W}_{v}\times {H}_{v}}\rceil\). Then, we choose the combination of rows n and columns m from the set \({{\mathbb{C}}}_{N}=\{(m,n)| m\times n=N,m\in {\mathbb{N}},n\in {\mathbb{N}}\}\). A good partition (m, n) should result in slices that match well with ViT’s pre-training setting. To achieve this, we use a score function to evaluate each potential partition:
\(S(m,n)=-\left|\log \frac{{W}_{I}/m}{{H}_{I}/n}-\log \frac{{W}_{v}}{{H}_{v}}\right|\)
We select the partition with the highest score from all possible candidates:
\(({m}^{*},{n}^{*})=\mathop{\arg \max }\limits_{(m,n)\in \bar{{\mathbb{C}}}}S(m,n)\)
where \(\bar{{\mathbb{C}}}\) is the possible (m, n) combinations with the product N. However, when N is a prime number, the feasible solutions can be limited to (N, 1) and (1, N). Therefore, we additionally introduce \({{\mathbb{C}}}_{N-1}\) and \({{\mathbb{C}}}_{N+1}\), and set \(\bar{{\mathbb{C}}}={{\mathbb{C}}}_{N-1}\cup {{\mathbb{C}}}_{N}\cup {{\mathbb{C}}}_{N+1}\). In practice, we set N < 10, supporting 1.8 million pixels (e.g., 1344 × 1344 resolution) at most during encoding. Although we can encompass more image slices for higher resolutions, we purposely impose this resolution upper-bound, since it already well covers most real-world application scenarios, and the benefit of further increasing encoding resolution is marginal considering the performance and overhead.
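A minimal Python sketch of this partition selection under the definitions above (the 448 × 448 ViT resolution and the N < 10 cap are taken from the text; the implementation details are our own illustration):

```python
import math

def best_partition(W_I, H_I, W_v=448, H_v=448):
    """Pick the slice grid (m columns x n rows) whose slices best match the ViT
    pre-training aspect ratio -- a sketch of the adaptive partition described above."""
    N = min(math.ceil((W_I * H_I) / (W_v * H_v)), 9)   # N < 10 in practice

    def factor_pairs(k):
        return {(m, k // m) for m in range(1, k + 1) if k % m == 0}

    # candidate grids with N-1, N, or N+1 slices
    candidates = factor_pairs(N - 1) | factor_pairs(N) | factor_pairs(N + 1)

    def score(mn):
        m, n = mn
        # deviation of the slice aspect ratio from the ViT aspect ratio (log scale)
        return -abs(math.log((W_I / m) / (H_I / n)) - math.log(W_v / H_v))

    return max(candidates, key=score)

print(best_partition(1344, 1344))   # -> (3, 3): a 3 x 3 grid of ~448 x 448 slices
```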
Slice encoding
Although image partitioning can ensure a good match between the slices and the ViT pre-training setting, each slice’s size is not precisely equal to \(({W}_{v},{H}_{v})\). To feed the slices into ViT, we first adjust each slice by resizing it proportionally so that the resultant area matches the ViT pre-training area \({W}_{v}\times {H}_{v}\). This adjustment helps prevent a significant gap between the number of encoded patches and the ViT’s pre-training setting. Subsequently, we interpolate the ViT’s position embeddings to adapt to the slice’s aspect ratio. This involves reshaping the ViT’s 1D embedding \({{{{\bf{P}}}}}_{1}\in {{\mathbb{R}}}^{Q\times l}\) back to its 2D format \({{{{\bf{P}}}}}_{2}\in {{\mathbb{R}}}^{q\times q\times l}\), where the number of position embeddings Q = q × q. Then, we interpolate \({{{{\bf{P}}}}}_{2}\) to fit the size of each slice via 2D interpolation. We also include the original image as an additional slice to provide holistic information about the entire image.
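The interpolation step can be sketched as follows (a minimal illustration following the P1/P2 notation above; the patch size of 14 and the bicubic interpolation mode are assumptions):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed_1d, slice_h, slice_w, patch=14):
    """Reshape the ViT's 1D position embeddings (Q, l) to 2D (q, q, l) and
    interpolate them to the slice's patch grid."""
    Q, l = pos_embed_1d.shape
    q = int(Q ** 0.5)                                                # Q = q * q
    p2 = pos_embed_1d.view(q, q, l).permute(2, 0, 1).unsqueeze(0)    # (1, l, q, q)
    target = (slice_h // patch, slice_w // patch)
    p2 = F.interpolate(p2, size=target, mode="bicubic", align_corners=False)
    return p2.squeeze(0).permute(1, 2, 0).reshape(-1, l)             # (h'*w', l)

pos = torch.randn(32 * 32, 1024)                        # e.g. 448/14 = 32 patches per side
print(interpolate_pos_embed(pos, 602, 336, patch=14).shape)   # torch.Size([1032, 1024])
```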
Token compression
After visual encoding, each slice is encoded into 1,024 tokens, so 10 slices can yield over 10k tokens collectively. To manage this high token count, we employ a compression module comprising one-layer cross-attention and a moderate number of queries with 2D positional information. In practice, the visual tokens of each slice are compressed into 64 queries for MiniCPM-V 1.0 & 2.0 and 96 queries for MiniCPM-Llama3-V 2.5 through this layer. Compared with other MLLMs of competitive performance, the significantly smaller number of visual tokens in the MiniCPM-V series enables superior efficiency in terms of GPU memory consumption, inference speed, first-token latency and power consumption, making it more friendly to wider application scopes and communities.
Spatial schema
To indicate each slice’s position relative to the whole image, inspired by20, we introduce a spatial schema. We first wrap the tokens of each slice with two special tokens “<slice>” and “</slice>”, and then employ a special token “\n” to separate slices from different rows.
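A string-level sketch of the schema (the actual model operates on token IDs; the placeholder strings here are illustrative):

```python
def apply_spatial_schema(slice_grid):
    """slice_grid: 2D list of per-slice token placeholders (rows x cols).
    Wrap each slice with <slice>...</slice> and separate rows with '\n'."""
    rows = []
    for row in slice_grid:
        rows.append("".join(f"<slice>{s}</slice>" for s in row))
    return "\n".join(rows)

grid = [[f"[img{r}{c}]" for c in range(3)] for r in range(2)]
print(apply_spatial_schema(grid))
# <slice>[img00]</slice><slice>[img01]</slice><slice>[img02]</slice>
# <slice>[img10]</slice><slice>[img11]</slice><slice>[img12]</slice>
```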
Pre-training
In this phase, we utilize large-scale image-text pairs for MLLM pre-training. The primary goal of this phase is to align the visual modules (i.e., visual encoder and compression layer) with the input space of the LLM and learn foundational multimodal knowledge. We show the pre-training data composition in Table 2. The pre-training phase is further divided into 3 stages.
Stage-1
The role of stage-1 is to warm up the compression layer, primarily connecting the visual encoder and LLMs. (1) Trainable Modules. We randomly initialize the compression layer and train this module in stage-1, keeping other parameters frozen. The visual encoder’s resolution is set to 224 × 224, which is the same as the visual encoder’s pre-training setting. (2) Data. To warm up the compression layer, we randomly select 200M data from the Image Captioning data in Table 2. Data cleaning is performed to remove image-text pairs with poor correlation and ill-formatted text data, ensuring the data quality.
Stage-2
After the warm-up training of the compression layer, the role of stage-2 is to extend the input resolution of the pre-trained visual encoder. (1) Trainable Modules. In stage-2, we extend the image resolution from 224 × 224 to 448 × 448. The whole visual encoder is trained, leaving other parameters frozen. (2) Data. To extend the pre-trained resolution, we additionally select 200M data from the Image Captioning data in Table 2.
Stage-3
After extending the primary input resolution of the visual encoder, we finally train the visual modules using the adaptive visual encoding strategy, which can further accommodate high-resolution inputs with any aspect ratio. (1) Trainable Modules. During the stage-3 training, both the compression layer and the visual encoder are trained to adapt to the language model embedding space. The LLM is kept frozen to avoid disruption from the relatively low-quality pre-training data. (2) Data. Different from the previous stages with only image captioning data, during the high-resolution pre-training stage, we additionally introduce OCR data to enhance the visual encoders’ OCR capability.
Caption rewriting
Image-text pairs sourced from the Web43,44 can suffer from quality issues in the caption data, including non-fluent content, grammatical errors, and duplicated words. Such low-quality data can lead to unstable training dynamics. To address the issue, we introduce an auxiliary model for low-quality caption rewriting. The rewriting model takes the raw caption as input and is asked to convert it into a question-answer pair. The answer from this process is adopted as the updated caption. In practice, we leverage GPT-445 to annotate a small number of seed samples, which are then used to fine-tune an LLM for the rewriting task.
Data packing
Samples from different data sources usually have different lengths. The high variance of sample lengths across batches leads to inefficiency in memory usage and the risk of out-of-memory errors. To address the issue, we pack multiple samples into a single sequence with a fixed length. By truncating the last sample in the sequence, we ensure uniformity in sequence lengths, facilitating more consistent memory consumption and computational efficiency. Meanwhile, we modify the position ids and attention masks to avoid interference between different samples. In our experiments, the data packing strategy brings a 2–3× acceleration in the pre-training phase.
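A minimal sketch of this packing scheme (the sequence length, padding id, and mask construction are illustrative; for decoder-only training a causal mask would additionally be applied):

```python
import torch

def pack_samples(samples, seq_len=512, pad_id=0):
    """Greedily pack tokenized samples into one fixed-length sequence.
    Position ids restart per sample, and the attention mask is block-diagonal so
    tokens from different samples cannot attend to each other."""
    input_ids, position_ids, segment_ids = [], [], []
    for seg, ids in enumerate(samples):
        room = seq_len - len(input_ids)
        if room <= 0:
            break
        ids = ids[:room]                       # truncate the last sample if it overflows
        input_ids += ids
        position_ids += list(range(len(ids)))  # positions restart at 0 for each sample
        segment_ids += [seg] * len(ids)

    pad = seq_len - len(input_ids)
    input_ids += [pad_id] * pad
    position_ids += [0] * pad
    segment_ids += [-1] * pad                  # padding belongs to no segment

    seg = torch.tensor(segment_ids)
    attn_mask = (seg[:, None] == seg[None, :]) & (seg[:, None] >= 0)
    return torch.tensor(input_ids), torch.tensor(position_ids), attn_mask

ids, pos, mask = pack_samples([[5, 6, 7], [8, 9], [10, 11, 12, 13]], seq_len=8)
```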
Multilingual generalization
Multimodal capability across multiple languages is essential for serving users from broader communities. Traditional solutions involve extensive multimodal data collection and cleaning, and training for the target languages. Fortunately, recent findings8 have shown that the multimodal capabilities can be efficiently generalized across languages via a strong multilingual LLM pivot, largely alleviating the heavy reliance on multimodal data in low-resource languages. In practice, we only pre-train our model on English and Chinese multimodal data, and then perform a lightweight but high-quality multilingual supervised fine-tuning to align to the target languages. Despite its simplicity, we find the resultant MiniCPM-Llama3-V 2.5 can achieve good performance in over 30 languages as compared with significantly larger MLLMs.
Supervised fine-tuning
After learning foundational capabilities from pre-training, we perform supervised fine-tuning (SFT) on high-quality visual question answering datasets to further learn knowledge and interaction capability from human annotations.
Trainable Modules
Compared with the pre-training phase, which mainly uses data crawled from the Web, the SFT phase mainly utilizes high-quality datasets annotated by either human labelers or strong models such as GPT-4. Therefore, we unlock all model parameters to better exploit the data and learn rich knowledge during the SFT phase.
Data
Recent works1,46 show that data near the end of training plays a more important role in shaping the models’ capabilities and response styles. We categorize the SFT data into two parts, as shown in Table 3. Part-1 focuses on bolstering the models’ basic recognition capabilities, while part-2 is tailored to enhance their capabilities in generating detailed responses and following human instructions. Specifically, part-1 data consists of the traditional question answering and captioning datasets with relatively short response lengths, which helps enhance the model’s basic recognition capabilities. In comparison, part-2 encompasses datasets featuring long responses with complex interactions, either in text or multimodal context. During SFT, these two parts of data are concatenated and sequentially fed into the model. For MiniCPM-Llama3-V 2.5, we integrate 2M data from the recent Cauldron dataset9 for multimodal knowledge augmentation, and 90K multilingual data over 36 languages for boosting the multilingual conversation capability.
Alignment
MLLMs are typically prone to hallucination problems, generating responses that are not factually grounded in the input image19. The issue greatly limits the wide application of MLLMs, especially in high-stakes scenarios, such as autonomous driving and assistance for visually impaired groups. To address the hallucination problem, we employ the recent RLAIF-V18 approach, where the key is to obtain scalable high-quality feedback from open-source models for preference learning21.
Response generation
The first step of RLAIF-V is to generate multiple responses for a given instruction using the policy model. Specifically, given a policy model M to be aligned, we sample 10 responses \(Y=\{{y}_{1},{y}_{2},\cdots ,{y}_{n}\}\) from M using sampling-based decoding with a high temperature. There are several benefits of using the policy model M for response generation: (1) Feedback collection and learning can better focus on trustworthiness, since differing text styles from multiple MLLMs are avoided. (2) Feedback learning is more efficient since preference is directly collected on the distribution of the policy model.
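For instance, with a Hugging Face-style policy model, candidate responses could be drawn roughly as follows (a text-only simplification; the generation hyperparameters are illustrative, and the image inputs required by an MLLM are omitted):

```python
def sample_candidate_responses(policy_model, tokenizer, prompt, n=10, temperature=1.2):
    """Draw n diverse responses from the policy model via high-temperature sampling."""
    inputs = tokenizer(prompt, return_tensors="pt").to(policy_model.device)
    outputs = policy_model.generate(
        **inputs, do_sample=True, temperature=temperature,
        top_p=0.95, num_return_sequences=n, max_new_tokens=512)
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]
```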
Feedback collection
Collecting high-quality feedback from open-source MLLMs can be challenging due to their typically weaker capabilities compared with proprietary models. To address the issue, RLAIF-V uses a divide-and-conquer strategy for response scoring. Specifically, each response \({y}_{i}\) is divided into atomic claims \({C}_{i}=\{{c}_{1},{c}_{2},\ldots ,{c}_{m}\}\) using Llama-3 8B, where the correctness of atomic claims is much easier to evaluate. Then, we verify the claims by converting each claim to a yes/no question and employing an open-source MLLM to score each claim. In practice, we adopt OmniLMM 12B for MiniCPM-V 2.0 scoring and LLaVA-NeXT-Yi 34B for MiniCPM-Llama3-V 2.5 scoring. The final score \({s}_{i}\) of the response \({y}_{i}\) is given by \(-{n}_{{{\rm{rej}}}}\), where \({n}_{{{\rm{rej}}}}\) is the number of invalid atomic claims.
Direct preference optimization
After collecting the high-quality AI feedback, we perform preference learning via the DPO method. The DPO algorithm requires training on preference pairs, where one sample \({y}_{w}\) is preferred over the other \({y}_{l}\). To compose the preference dataset, we randomly sample pairs from each response set \(Y=\{{y}_{1},{y}_{2},\ldots ,{y}_{n}\}\), and determine \(({y}_{w},{y}_{l})\) based on their relative scores. Finally, we construct a preference dataset consisting of 6K preference pairs from 3K unique images for preference learning.
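A compact sketch of the pair construction and the standard DPO objective (the β value, pair sampling, and log-probability inputs are illustrative; log-probabilities are summed over response tokens):

```python
import itertools, random
import torch.nn.functional as F

def build_preference_pairs(responses, scores, max_pairs=2):
    """Sample (chosen, rejected) pairs from one response set using relative
    RLAIF-V scores (a higher score means fewer rejected claims)."""
    pairs = [(a, b) if scores[a] > scores[b] else (b, a)
             for a, b in itertools.combinations(range(len(responses)), 2)
             if scores[a] != scores[b]]
    random.shuffle(pairs)
    return [(responses[w], responses[l]) for w, l in pairs[:max_pairs]]

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on summed log-probs of chosen (w) and rejected (l) responses
    under the policy and the frozen reference model (tensors of shape (batch,))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```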
Data availability
The datasets used for training are described in detail in the Methods section. While most of the training data are publicly available, a small portion originates from proprietary datasets licensed from a commercial provider and cannot be shared due to legal and contractual restrictions. These restricted datasets were used exclusively for model training and do not affect the reproducibility of the core findings presented in this study. Source data are provided with this paper.
Code availability
Code for training and evaluating our model is publicly available on GitHub at https://github.com/OpenBMB/MiniCPM-o, and has been archived at https://doi.org/10.5281/zenodo.15525638.
References
Reid, M. et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Preprint at https://arxiv.org/abs/2403.05530 (2024).
Achiam, J. et al. GPT-4 Technical Report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Lu, H. et al. DeepSeek-VL: Towards real-world vision-language understanding. Preprint at https://arxiv.org/abs/2403.05525 (2024).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. NeurIPS 36 (2024).
Wang, W. et al. CogVLM: Visual expert for pretrained language models. In: Proc. NeurIPS (2023).
Chen, Z. et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. In Science China Information Sciences (2024).
Bai, J. et al. Qwen-VL: a frontier large vision-language model with versatile abilities. Preprint at https://arxiv.org/abs/2308.12966 (2023).
Hu, J. et al. Large multilingual models pivot zero-shot multimodal learning across languages. In Proc. ICLR (2024).
Laurençon, H., Tronchon, L., Cord, M. & Sanh, V. What matters when building vision-language models? In Proc. NeurIPS (2024).
McKinzie, B. et al. MM1: methods, analysis & insights from multimodal LLM pre-training. In Proc. ECCV (2024).
Beyer, L. et al. PaliGemma: a versatile 3b vlm for transfer. Preprint at https://arxiv.org/abs/2407.07726 (2024).
Abdin, M. et al. Phi-3 technical report: a highly capable language model locally on your phone. Preprint at https://arxiv.org/abs/2404.14219 (2024).
Young, A. et al. Yi: open foundation models by 01.AI. Preprint at https://arxiv.org/abs/2403.04652 (2024).
Li, Y. et al. Mini-Gemini: mining the potential of multi-modality vision language models. Preprint at https://arxiv.org/abs/2403.18814 (2024).
Wikipedia. Moore’s law. In Wikipedia, the Free Encyclopedia (2001).
Wikipedia. Thrust‑to‑weight ratio. In Wikipedia, the Free Encyclopedia (2024).
Xu, R. et al. LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images. In Proc. ECCV (2024).
Yu, T. et al. RLAIF-V: aligning MLLMs through open-source AI feedback for super GPT-4V trustworthiness. In Proc. CVPR (2025).
Yu, T. et al. RLHF-V: towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback. In: Proc. CVPR (2024).
Bavishi, R. et al. Introducing our multimodal models. In Adept Blog (2023).
Rafailov, R. et al. Direct preference optimization: your language model is secretly a reward model. NeurIPS. 36 (2024).
OpenCompass Contributors. OpenCompass: a universal evaluation platform for foundation models. In GitHub repository (2023).
Fu, C. et al. MME: a comprehensive evaluation benchmark for multimodal large language models. Preprint at https://arxiv.org/abs/2306.13394 (2023).
Liu, Y. et al. MMBench: Is your multi-modal model an all-around player? Preprint at https://arxiv.org/abs/2307.06281 (2023).
Yue, X. et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proc. CVPR (2024).
Lu, P. et al. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In Proc. ICLR (2024).
Liu, Y. et al. OCRBench: on the hidden mystery of OCR in large multimodal models. In Science China Information Sciences (2023).
Singh, A. et al. Towards VQA models that can read. In: Proc. CVPR (2019).
Mathew, M., Karatzas, D. & Jawahar, C. DocVQA: a dataset for VQA on document images. In: Proc. WACV (2021).
Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T. & Saenko, K. Object hallucination in image captioning. In Proc. EMNLP (2018).
Liu, Y. et al. TextMonkey: An OCR-free large multimodal model for understanding document. Preprint at https://arxiv.org/abs/2403.04473 (2024).
He, M. et al. Efficient multimodal learning from data-centric perspective. Preprint at https://arxiv.org/abs/2402.11530 (2024).
XTuner Contributors. XTuner: a toolkit for efficiently fine‑tuning LLM. In GitHub repository (2023).
Li, B. et al. LLaVA‑NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild. In LLaVA blog (2024).
Tong, S. et al. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. In Proc. NeurIPS (2024).
Liu, H. et al. LLaVA‑NeXT: Improved reasoning, OCR, and world knowledge. In LLaVA blog (January 30, 2024).
Chu, X. et al. MobileVLM: a fast, reproducible and strong vision language assistant for mobile devices. Preprint at https://arxiv.org/abs/2312.16886 (2023).
Anthropic. Introducing the next generation of Claude. In Claude 3 blog (2024).
Liu, H., Li, C., Li, Y. & Lee, Y. J. Improved baselines with visual instruction tuning. In: Proc. CVPR (2024).
Georgi Gerganov and the llama.cpp Group. llama.cpp: LLM inference in C/C++ with cross‑platform support and hardware-optimized quantization. In GitHub repository (2023).
OpenAI. Hello GPT‑4o: our new flagship, multimodal model for real-time reasoning across text, vision, and audio. In OpenAI blog (May 13, 2024).
DeepMind. Project Astra: a research prototype exploring capabilities toward a universal AI assistant. In Google DeepMind Models (2024).
Schuhmann, C. et al. LAION-5B: An open large-scale dataset for training next generation image-text models. NeurIPS 35, 25278–25294 (2022).
Byeon, M. et al. COYO-700M: Image-text pair dataset. In GitHub repository (2022).
Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with GPT-4. Preprint at https://arxiv.org/abs/2303.12712 (2023).
Hu, S. et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. In Proc. COLM (2024).
Lin, T.-Y. et al. Microsoft COCO: Common objects in context. In: ECCV (Eds. Fleet, D., Pajdla, T., Schiele, B. & Tuytelaars, T.) 740–755 (Springer, 2014).
Krishna, R. et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123, 32–73 (2017).
Sharma, P., Ding, N., Goodman, S. & Soricut, R. Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018).
Changpinyo, S., Sharma, P., Ding, N. & Soricut, R. Conceptual 12M: pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: CVPR (2021).
Wu, J. et al. AI Challenger: a large-scale dataset for going deeper in image understanding. Preprint at https://arxiv.org/abs/1711.06475 (2017).
Gu, J. et al. Wukong: a 100 million large-scale Chinese cross-modal pre-training benchmark. NeurIPS 35, 26418–26431 (2022).
Xie, C., Li, J. & Zhang, B. CCMB: A Large-scale Chinese Cross-modal Benchmark. Preprint at https://arxiv.org/abs/2205.03860 (2022).
Srinivasan, K., Raman, K., Chen, J., Bendersky, M. & Najork, M. WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In: SIGIR 2443–2449 (2021).
Biten, A. F., Tito, R., Gomez, L., Valveny, E. & Karatzas, D. OCR-IDL: OCR annotations for industry document library dataset. In: ECCV (Eds. Avidan, S., Brostow, G., Cissé, M., Farinella, G. M. & Hassner, T.) 241–252 (Springer, 2022).
Gupta, A., Vedaldi, A. & Zisserman, A. Synthetic data for text localisation in natural images. In: CVPR 2315–2324 (2016).
Kim, G. et al. OCR-free document understanding transformer. In: ECCV (Eds. Avidan, S., Brostow, G., Cissé, M., Farinella, G. M. & Hassner, T.) (2022).
Li, L. et al. Multimodal ArXiv: a dataset for improving scientific comprehension of large vision-language models. In Proc. ACL (2024).
Plummer, B. A. et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: ICCV 2641–2649 (2015).
Gao, H. et al. Are you talking to a machine? dataset and methods for multilingual image question. NeurIPS 28 (2015).
Lu, P. et al. IconQA: a new benchmark for abstract diagram understanding and visual language reasoning. In Proc. NeurIPS Datasets and Benchmarks Track (2021).
Hudson, D. A. & Manning, C. D. GQA: a new dataset for real-world visual reasoning and compositional question answering. In: CVPR 6700–6709 (2019).
Antol, S. et al. VQA: Visual question answering. In: ICCV 2425–2433 (2015).
Johnson, J. et al. CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR 2901–2910 (2017).
Gurari, D. et al. VizWiz Grand Challenge: answering visual questions from blind people. In: CVPR 3608–3617 (2018).
Zhu, Y., Groth, O., Bernstein, M. & Fei-Fei, L. Visual7W: Grounded question answering in images. In: CVPR (2016).
Ren, M., Kiros, R. & Zemel, R. Exploring models and data for image question answering. NeurIPS 28 (2015).
Marino, K., Rastegari, M., Farhadi, A. & Mottaghi, R. OK-VQA: A visual question answering benchmark requiring external knowledge. In: CVPR 3195–3204 (2019).
Schwenk, D., Khandelwal, A., Clark, C., Marino, K. & Mottaghi, R. A-OKVQA: A benchmark for visual question answering using world knowledge. In: ECCV (Eds. Avidan, S., Brostow, G., Cissé, M., Farinella, G. M. & Hassner, T.) 146–162 (Springer, 2022).
Shah, S., Mishra, A., Yadati, N. & Talukdar, P. P. KVQA: knowledge-aware visual question answering. AAAI 33, 8876–8884 (2019).
Lu, P. et al. Learn to explain: multimodal reasoning via thought chains for science question answering. NeurIPS 35, 2507–2521 (2022).
Yu, L., Poirson, P., Yang, S., Berg, A. C. & Berg, T. L. Modeling context in referring expressions. In: ECCV. (Eds. Leibe, B., Matas, J., Sebe, N. & Welling, M.). 69–85 (Springer, 2016).
Du, Y. et al. What makes for good visual instructions? Synthesizing complex visual reasoning instructions for visual instruction tuning. In Proc. COLING (2025).
Zellers, R., Bisk, Y., Farhadi, A. & Choi, Y. From recognition to cognition: visual commonsense reasoning. In: CVPR (2019).
Suhr, A., Lewis, M., Yeh, J. & Artzi, Y. A corpus of natural language for visual reasoning. In: ACL 217–223 (2017).
Liu, F. et al. Mitigating hallucination in large multi-modal models via robust instruction tuning. In Proc. ICLR (2024).
Chen, J. et al. GeoQA: a geometric question answering benchmark towards multimodal numerical reasoning. In Proc. ACL Findings (2021).
Cherian, A., Peng, K.-C., Lohit, S., Smith, K. A. & Tenenbaum, J. B. Are deep neural networks smarter than second graders? In: CVPR 10834–10844 (2023).
Mishra, A., Shekhar, S., Singh, A. K. & Chakraborty, A. OCR-VQA: Visual question answering by reading text in images. In: ICDAR. (2019).
Biten, A. F. et al. Scene text visual question answering. In: CVPR 4291–4301 (2019).
Tanaka, R., Nishida, K. & Yoshida, S. VisualMRC: Machine reading comprehension on document image. AAAI 35, 13878–13888 (2021).
Kafle, K., Price, B., Cohen, S. & Kanan, C. DVQA: understanding data visualizations via question answering. In: CVPR 5648–5656 (2018).
Kahou, S. E. et al. FigureQA: an annotated figure dataset for visual reasoning. In Proc. ICLR (2018).
Masry, A., Long, D. X., Tan, J. Q., Joty, S. & Hoque, E. ChartQA: a benchmark for question answering about charts with visual and logical reasoning. In Proc. ACL Findings (2022).
Svetlichnaya, S. DeepForm: understand structured documents at scale. In Weights & Biases report (2020).
Chen, W. et al. TabFact: a large-scale dataset for table-based fact verification. In Proc. ICLR (2020).
Mathew, M. et al. InfographicVQA. In: WACV 1697–1706 (2022).
Stanisławek, T. et al. Kleister: Key information extraction datasets involving long documents with complex layouts. In: ICDAR (Eds. Lladós, J., Lopresti, D. & Uchida, S.). 564–579 (Springer, 2021).
Pasupat, P. & Liang, P. Compositional semantic parsing on semi-structured tables. In Proc. ACL (2015).
Ahmed, S., Jawade, B., Pandey, S., Setlur, S. & Govindaraju, V. RealCQA: Scientific chart question answering as a test-bed for first-order logic. In: ICDAR (Eds. Fink, G. A., Jain, R., Kise, K. & Zanibbi, R.). 66–83 (Springer, 2023).
Kembhavi, A. et al. A diagram is worth a dozen images. In: ECCV (Eds. Leibe, B., Matas, J., Sebe, N. & Welling, M.). 235–251 (Springer, 2016).
Shin, A., Ushiku, Y. & Harada, T. The color of the cat is gray: 1 million full-sentences visual question answering (FSVQA). Preprint at https://arxiv.org/abs/1609.06657 (2016).
Das, A. et al. Visual Dialog. In: CVPR 326–335 (2017).
Zhang, Y. et al. LLaVAR: enhanced visual instruction tuning for text-rich image understanding. Preprint at https://arxiv.org/abs/2306.17107 (2023).
Carter, J. TextOCR‑GPT4V. In Hugging Face dataset repository (2024).
Zhao, B., Wu, B. & Huang, T. SVIT: scaling up visual instruction tuning. Preprint at https://arxiv.org/abs/2307.04087 (2023).
Yu, T. et al. Reformulating vision-language foundation models and datasets towards universal multimodal assistants. Preprint at https://arxiv.org/abs/2310.00653 (2023).
Chen, L. et al. ShareGPT4V: Improving large multi-modal models with better captions. In Proc. ECCV (2024).
Gupta, A., Dollar, P.& Girshick, R. LVIS: a dataset for large vocabulary instance segmentation. In: CVPR 5356–5364 (2019).
Chen, G. H. et al. ALLaVA: harnessing GPT4V-synthesized data for a lite vision-language model. Preprint at https://arxiv.org/abs/2402.11684 (2024).
Ding, N. et al. Enhancing chat language models by scaling high-quality instructional conversations. In Proc. EMNLP (2023).
Taori, R. et al. Stanford Alpaca: An instruction‑following LLaMA model. In GitHub repository (2023).
Zheng, L. et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. NeurIPS 36 (2024).
BELLEGroup. BELLE: Be Everyone’s Large Language model Engine. In GitHub repository (2023).
Lian, W. et al. OpenOrca: an open dataset of GPT‑augmented FLAN reasoning traces. In Hugging Face dataset repository (2023).
Teknium. OpenHermes 2.5: An open dataset of synthetic data for generalist LLM assistants. In Hugging Face dataset repository (2023).
Acknowledgements
Z.L. and M.S. are supported by the Research Project 2025QGS16007. Y.Y. is supported by the Shanghai Qi Zhi Institute Innovation Program SQZ202410.
Author information
Contributions
Yuan Yao initiated and led the research project, and designed the models and experiments. Tianyu Yu, Chongyi Wang, Junbo Cui, Hongji Zhu, Haoyu Li, Zhihui He, Haoye Zhang, Zhi Zheng, Jie Zhou and Jie Cai contributed to the experiments. Chi Chen, Ao Zhang and Yuan Yao wrote the paper. Tianchi Cai, Weilin Zhao, Qianyu Chen, Ronghua Zhou and Zhensheng Zou contributed to the open-source work. Shengding Hu, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu and Maosong Sun provided valuable suggestions and proofread the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yao, Y., Yu, T., Zhang, A. et al. Efficient GPT-4V level multimodal large language model for deployment on edge devices. Nat Commun 16, 5509 (2025). https://doi.org/10.1038/s41467-025-61040-5