Visual information extraction from documents via classification-guided large vision-language models

Li, Huafu; Chen, Guo; Xia, Jia; Wang, Lei; Du, Wei; Yao, Yun; Peng, Weijun; Li, Liming

doi:10.1038/s41598-026-49319-z

Article
Open access
Published: 02 May 2026

Huafu Li¹^na1,
Guo Chen¹,
Jia Xia¹,
Lei Wang¹,
Wei Du¹,
Yun Yao¹,
Weijun Peng¹ &
Liming Li²^na1

Scientific Reports volume 16, Article number: 14158 (2026)

Abstract

Visual information extraction (VIE) from visually rich documents remains challenging due to high layout variability and real-world impairments. Existing methods typically rely on sequential OCR pipelines or end-to-end models requiring extensive labeled data and layout-specific training, limiting their scalability. We propose a classification-guided large vision-language model (LVLM) framework for multi-type VIE that achieves high accuracy with minimal supervision. The approach decouples document-type classification from content extraction and employs in-context learning (ICL)-based dynamic prompt engineering to inject task-specific knowledge, enabling robust zero-shot inference across diverse layouts. From a theoretical perspective, the proposed method can be viewed as a form of conditional computation that reduces task uncertainty and improves information efficiency during prompt-based inference. Evaluated on a real-world bidding dataset with 16 certificate types, our zero-shot method (based on Qwen2.5-VL-7B) outperforms a strong supervised baseline by 18.35 percentage points in F1-score (86.43% vs. 68.08%) and 0.23 in normalized edit distance (0.90 vs. 0.67). Optional domain-specific fine-tuning further improves performance to 93.65% F1 and 0.93 NED, demonstrating superior robustness against seals, watermarks, and low contrast. The framework offers an efficient, scalable solution for complex document understanding in office automation. Code is available at https://github.com/FairmeHIT/Multi-VIE, and fine-tuned models at https://huggingface.co/fairme/Qwen2.5-VL-7B-SFT.

Introduction

Visually rich document images – such as business licenses, financial statements, contracts, tickets, and invoices – serve as crucial information carriers in diverse application scenarios, containing abundant data essential for business and decision-making processes. Visual information extraction (VIE) aims to identify and extract predefined semantic entities from these images, forming a foundational component of document understanding and office automation systems^1,2,3.

Most existing VIE methods adopt a two-stage pipeline: optical character recognition (OCR) for text detection and recognition, followed by natural language processing or unified information extraction (UIE) for entity extraction^4,5. While numerous solutions target these stages individually, integrating them efficiently remains challenging. Visually rich documents incorporate multimodal elements (e.g., layout, fonts, charts, and colors) and real-world noise (e.g., watermarks, seals, blur, and distortions), complicating extraction^2,6.

Deep learning has driven significant progress, with multimodal models integrating semantic, visual, and layout features achieving strong performance in practical applications^6,7,8. However, these approaches typically require large labeled datasets and task-specific training, with evaluations often limited to single-layout documents and few entity types. In real-world settings, extracting information from multiple heterogeneous document types simultaneously (termed multi-VIE) poses a key challenge.

The emergence of large vision-language models (LVLMs), such as GPT-4V⁹ and the Qwen2-VL series¹⁰, has sparked interest in their application to VIE. Yet, general-purpose LVLMs exhibit limitations in complex document tasks, including non-Latin script recognition, table understanding, fine-grained perception, and robustness to perturbations, often generating hallucinations^9,11,12.

Current frameworks lack robust multi-type real-world evaluation and minimally supervised solutions for diverse layouts. We address these gaps by proposing a classification-guided LVLM framework that achieves high-accuracy structured extraction with minimal task-specific training.

The main contributions of this work are summarized as follows:

We propose a classification-guided conditioning paradigm for LVLM-based multi-VIE, which formulates document understanding as a modular decomposition of task uncertainty into document-type prediction and conditional generation, providing a principled alternative to monolithic prompting.
We introduce a dynamic prompt construction strategy based on predicted class, which serves as an implicit conditioning mechanism that improves task relevance and reduces context redundancy without modifying LVLM parameters.
We provide a theoretical analysis that interprets prompt construction as conditional computation, showing how relevance-aware prompting improves information efficiency, mitigates attention dilution, and enhances in-context learning alignment.
Extensive experiments on real-world and public benchmarks demonstrate the effectiveness, robustness, and generalizability of the proposed framework under both zero-shot and fine-tuned settings.

Related work

Traditional VIE methods for scene-specific documents often rely on rule-based strategies, leveraging fixed layouts for dictionary lookup, regular expressions, or pattern matching^2,13. While effective in constrained settings, these approaches require substantial manual adaptation across diverse scenarios, limiting scalability and robustness.

Early deep learning methods advanced VIE by incorporating multimodal features. Grid-based representations, such as Chargrid¹⁴, BERTgrid¹⁵, and ViBERTgrid¹⁶, fuse text and layout in 2D grids but struggle with complex semantic relationships. Graph neural networks (GNNs), including GraphIE¹⁷, PICK¹, MatchVIE¹⁸, and GraphDoc¹⁹, model inter-component dependencies effectively yet suffer from over-smoothing and training instabilities.

Transformer-based architectures have emerged as dominant, dynamically fusing semantic, visual, and layout features. Representative works include LayoutLM²⁰, StrucTexT²¹, UDoc²², LayoutXLM²³, and LiLT²⁴, offering strong generalization at the cost of high computation. End-to-end models like EATEN²⁵, TRIE²⁶, and Donut²⁷ integrate detection, recognition, and extraction to minimize error propagation, though they typically demand extensive labeled data. Few-shot methods^28,29 and unified information extraction frameworks⁵ address data scarcity, enabling efficient adaptation across tasks and domains.

The rise of large vision-language models (LVLMs) has opened new possibilities for VIE through techniques like chain-of-thought³⁰ and in-context learning (ICL)³¹. These approaches enhance multimodal reasoning via carefully designed prompts and demonstrations. However, general-purpose LVLMs exhibit limitations in complex document tasks, including OCR for non-Latin scripts, table understanding, and robustness to real-world perturbations⁹. Recent studies have explored advanced prompt learning strategies to improve robustness and generalization. For instance, prompt-based meta-learning methods have been proposed to mitigate label noise and enhance adaptability under distribution shifts³². Other works investigate automated prompt optimization or task-specific prompt tuning to better align model outputs with desired objectives. Despite these advances, most existing approaches focus on optimizing prompts in a static or task-agnostic manner, without explicitly considering input-dependent task variation. In complex document understanding scenarios, where multiple document types coexist, a single prompt or globally optimized prompt often fails to capture type-specific extraction requirements.

Recent specialized document LVLMs, such as DocLLM³³, TextMonkey³⁴, and the Monkey series^35,36, incorporate layout-aware designs and OCR-free processing, achieving strong performance on structured documents. Despite these advances, they often rely on task-specific fine-tuning or prompts and struggle with zero-shot adaptation to highly variable multi-type documents without dynamic, classification-guided knowledge injection – the gap addressed by our work.

Methodology

To tackle the multi-VIE challenge, we propose a classification-guided LVLM framework that achieves high accuracy and generalization with minimal task-specific training. This serves as our primary contribution. For comparison, we also present an enhanced OCR&UIE pipeline that requires more supervised training but provides a strong trainable baseline. We describe the main LVLM-based framework first, followed by the alternative OCR&UIE approach.

Classification-guided LVLM framework for multi-VIE

Our core framework leverages the pre-trained multimodal reasoning capabilities of LVLMs to enable efficient, end-to-end multi-VIE without task-specific model training. As illustrated in Fig. 1, the pipeline consists of three key components: (1) an image classification module to identify the document type, (2) an ICL-based prompt engineering module to construct task-specific prompts, and (3) an LVLM inference and post-processing module to generate structured predictions.

Classification

The classification module serves two critical roles: rapidly filtering out non-target images and providing accurate document-type labels to guide downstream prompt construction, thereby reducing computational overhead and improving relevance.

We offer two practical implementation options:

Option 1 (training-free) classifies images via feature similarity matching. Features are extracted from the input image using a pre-trained model, then matched against a reference feature index with cosine similarity, thresholding, and majority voting. The procedure is as follows:

Feature Extraction: $\textbf{x}=M(I)$. A pre-trained model M (such as ResNet) is utilized to extract the feature vector $\textbf{x}$ from the input image I.
Feature Vector Search: $D, F = \mathrm {Index.search}(\textbf{x}, K)$. A feature vector engine (e.g., Faiss) is used to identify the K feature vectors most similar to $\textbf{x}$. The resulting distance matrix $D \in \mathbb {R}^{1 \times K}$ and index matrix $F \in \mathbb {R}^{1 \times K}$ are then obtained.
Similarity Calculation: ${R_i} = \frac{\textbf{x} \cdot \textbf{y}_i}{\Vert \textbf{x}\Vert \, \Vert \textbf{y}_i\Vert }$, $\textbf{y}_i=\mathrm {Index.reconstruct}(F_i)$. The cosine similarity between the input feature vector $\textbf{x}$ and the reconstructed feature vector $\textbf{y}_i$ is calculated for each search result $i$, where $F_i$ is the $i$-th entry of the index matrix $F$.
Sorting: $O = \textrm{sort}({R})$. The similarity results are sorted in descending order, yielding the K most similar image labels in the feature library, which are subsequently used to determine the final image category.
Thresholding and Majority Voting: A similarity threshold $\tau$ is applied to filter the sorted similarity scores $O$, retaining only those where $R_i \ge \tau$. From the retained results, the corresponding image labels are collected, and a majority voting scheme is employed: $C = \textrm{mode}(\{L_j \mid R_j \ge \tau , j \in \{1, \dots , K\}\})$, where $L_j$ is the label associated with the $j$-th reference vector, and $\textrm{mode}$ selects the most frequent label.
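
The five steps above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it replaces the pre-trained feature extractor and the Faiss index with plain Python lists so that it runs with the standard library alone, and the names `cosine`, `classify_by_matching`, and `reference` are hypothetical. The defaults `k=5` and `tau=0.9` follow the values reported in the model-training section.

```python
import math
from collections import Counter


def cosine(x, y):
    """Cosine similarity R_i between two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def classify_by_matching(x, reference, k=5, tau=0.9):
    """Training-free classification by similarity matching.

    x         -- feature vector of the query image (assumed precomputed)
    reference -- list of (feature_vector, label) pairs; stands in for
                 the feature index (an assumption of this sketch)
    Returns the majority label, or None when no neighbour reaches tau
    (i.e. the image is treated as a non-target image).
    """
    # Search and sort: keep the K most similar reference vectors.
    scored = sorted(
        ((cosine(x, y), label) for y, label in reference),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Thresholding: retain only neighbours with R_i >= tau.
    kept = [label for score, label in scored if score >= tau]
    if not kept:
        return None
    # Majority voting over the retained labels.
    return Counter(kept).most_common(1)[0][0]
```

Returning `None` when nothing passes the threshold is one way to realize the module's filtering role for non-target images; how rejected images are handled in practice is not specified in the text.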

Option 2 trains a supervised image classifier (e.g., EfficientNet³⁷ or ConvNeXt³⁸) on labeled samples, achieving higher accuracy at the cost of additional training.

Either option can be complemented by a lightweight, training-free keyword-based classifier applied to OCR-extracted text, which identifies the document type as:

$$\begin{aligned} \hat{d} = \arg \max _{d_i \in \mathscr {D}} \sum _{j=1}^{|\mathscr {T}|} \sum _{l=1}^{|\mathscr {K}(d_i)|} I(w_j, k_{il}) \end{aligned}$$

(1)

where $\hat{d}$ is the predicted type, $\mathscr {T}$ is the set of recognized text tokens, $\mathscr {K}(d_i)$ is the predefined keyword set for type $d_i$, and $I(\cdot )$ is the indicator function. This simple yet effective mechanism matches distinctive terms (e.g., “risk assessment” vs. “emergency response” for similar certificate variants) and is easily extensible to new types by adding keyword sets.
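
As a rough illustration of Eq. (1), the sketch below counts keyword hits per document type and returns the type with the highest count. It assumes that the indicator $I(w_j, k_{il})$ fires when keyword $k_{il}$ occurs as a substring of token $w_j$ (the text does not state whether matching is exact or partial), and the function name, type names, and keyword sets are hypothetical.

```python
def classify_by_keywords(tokens, keyword_sets):
    """Return the document type whose keywords match the OCR tokens most often.

    tokens       -- recognized text tokens (e.g. OCR text lines)
    keyword_sets -- dict mapping document type -> list of keywords
    Substring matching is an assumption of this sketch.
    """
    scores = {
        doc_type: sum(1 for w in tokens for k in keywords if k in w)
        for doc_type, keywords in keyword_sets.items()
    }
    best = max(scores, key=scores.get) if scores else None
    # No keyword matched at all: leave the type undecided.
    if best is None or scores[best] == 0:
        return None
    return best


# Hypothetical keyword sets for two visually similar certificate variants.
KEYWORDS = {
    "security_cert_risk_assessment": ["risk assessment"],
    "security_cert_emergency_response": ["emergency response"],
}
```

Supporting a new document type then amounts to adding one more entry to the keyword dictionary.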

ICL-based prompt engineering

Once the document type is determined, we dynamically assemble a concise, task-specific prompt by combining a shared instruction block with type-specific components, as shown in Fig. 2. The shared block defines the extraction task, provides general document background and purpose cues, describes common layout patterns and spatial hints (e.g., top/main/bottom areas), and enforces rigorous output rules: pure JSON format only, standardized date fields (“year-month”), extraction of specified landmarks exclusively, and “unidentified” for missing/unreadable values.

The predicted document type then injects the corresponding predefined entity list (landmarks) and appends 2–4 carefully selected in-context demonstrations. These demonstrations illustrate correct entity mapping, formatting adherence, and handling of missing information.

This classification-guided dynamic injection delivers highly relevant, focused context to the frozen LVLM, mitigating hallucinations caused by irrelevant or overloaded instructions while ensuring consistent, reliable extraction across diverse document layouts.
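
The assembly step described above might look like the following sketch. The instruction wording, the registry `TYPE_SPECS`, the entity names, and the demonstration contents are all hypothetical placeholders rather than the prompts used in the paper, and demonstrations are represented here only by their expected JSON outputs.

```python
import json

# Shared block: task definition and output rules (paraphrased, hypothetical wording).
SHARED_INSTRUCTIONS = (
    "You extract structured information from a certificate image.\n"
    "Output pure JSON only. Write dates as year-month.\n"
    "Extract only the listed landmarks. "
    "Use 'unidentified' for missing or unreadable values."
)

# Type-specific components, keyed by predicted document type (hypothetical).
TYPE_SPECS = {
    "business_license": {
        "landmarks": ["company_name", "registration_date"],
        "demos": [
            {"company_name": "Acme Ltd", "registration_date": "2019-06"},
            {"company_name": "unidentified", "registration_date": "2021-03"},
        ],
    },
    "social_security_certificate": {
        "landmarks": ["insured_count", "payment_period"],
        "demos": [{"insured_count": "42", "payment_period": "2023-01"}],
    },
}


def build_prompt(doc_type, type_specs=TYPE_SPECS, max_demos=4):
    """Combine the shared block with the components of one document type."""
    spec = type_specs[doc_type]
    parts = [
        SHARED_INSTRUCTIONS,
        f"Document type: {doc_type}",
        "Landmarks to extract: " + ", ".join(spec["landmarks"]),
    ]
    # Append at most max_demos in-context demonstrations for this type only.
    for i, demo in enumerate(spec["demos"][:max_demos], start=1):
        parts.append(f"Example {i}:\n{json.dumps(demo, ensure_ascii=False)}")
    return "\n\n".join(parts)
```

Because only one entry of the registry is read per call, the prompt length stays roughly constant as more document types are registered.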

LVLM inference and post-processing

The constructed prompt and input image are fed to the LVLM, which generates structured predictions. Post-processing standardizes outputs (e.g., date formats, punctuation, capitalization) and ensures consistency for downstream applications such as database storage, knowledge graph construction, or intelligent retrieval.
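
A post-processing step of this kind could be sketched as follows. The concrete rules are assumptions made for illustration: the raw model output is searched for a JSON object, date fields are normalized to a `YYYY-MM` string (one possible reading of the "year-month" rule), and any landmark that is absent or empty is filled with "unidentified". The function and field names are hypothetical.

```python
import json
import re


def postprocess(raw_output, landmarks, date_fields=()):
    """Turn raw LVLM text into a dict with exactly the requested landmarks."""
    # Tolerate extra text around the JSON object (assumed failure mode).
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    try:
        data = json.loads(match.group(0)) if match else {}
    except json.JSONDecodeError:
        data = {}

    result = {}
    for name in landmarks:
        raw_value = data.get(name)
        value = "" if raw_value is None else str(raw_value).strip()
        if name in date_fields:
            # Keep the year and month only, zero-padding the month.
            found = re.search(r"(\d{4})\D+(\d{1,2})", value)
            value = f"{found.group(1)}-{int(found.group(2)):02d}" if found else ""
        result[name] = value or "unidentified"
    return result
```

Restricting the result to the requested landmarks also discards any extra keys the model may have produced.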

This design enables robust adaptation to diverse document types, complex layouts, and real-world noise while maintaining high efficiency.

Domain-specific enhancement via supervised fine-tuning

For scenarios demanding maximum accuracy, we provide an optional enhancement through supervised fine-tuning of the base LVLM using low-rank adaptation (LoRA). Training incorporates multi-granular annotations: entity-level labels for fixed and variable fields, full-document OCR transcripts, and detailed image descriptions. This targeted adaptation significantly improves precision and further reduces hallucinations on the target domain while retaining much of the original model’s generalization capability.

Theoretical perspective on classification-guided prompting

In this section, we provide a theoretical perspective on why classification-guided prompt construction improves LVLM-based visual information extraction.

Problem Formulation. Let $I$ denote an input document image, $d \in \mathscr {D}$ its latent document type, and $Y$ the structured output. An LVLM performs conditional generation:

$$\begin{aligned} P(Y \mid I, P), \end{aligned}$$

(2)

where P denotes the input prompt.

In conventional prompting, a universal prompt $P_{\text {all}}$ encodes instructions for all document types:

$$\begin{aligned} P_{\text {all}} = \bigcup _{d \in \mathscr {D}} P(d). \end{aligned}$$

(3)

In contrast, our method constructs a conditional prompt:

$$\begin{aligned} P = P(\hat{d}), \quad \hat{d} = g(I), \end{aligned}$$

(4)

where $g(\cdot )$ is the document classifier.

View as Conditional Computation. This formulation can be interpreted as a two-stage factorization:

$$\begin{aligned} P(Y \mid I) = \sum _{d \in \mathscr {D}} P(Y \mid I, P(d)) P(d \mid I). \end{aligned}$$

(5)

Our approach approximates this marginalization via a hard routing mechanism:

$$\begin{aligned} P(Y \mid I) \approx P(Y \mid I, P(\hat{d})), \end{aligned}$$

(6)

which is analogous to mixture-of-experts with a deterministic gating function.

This reduces the hypothesis space of the LVLM from all possible tasks to a task-specific subspace, improving sample efficiency in zero-shot settings.

Information-Theoretic Analysis. We analyze prompt effectiveness from an information perspective. Let Z denote the token sequence of the prompt. The mutual information between prompt and output is:

$$\begin{aligned} I(Y; Z \mid I). \end{aligned}$$

(7)

A universal prompt $P_{\text {all}}$ contains both relevant and irrelevant information:

$$\begin{aligned} Z = Z_{\text {rel}} \cup Z_{\text {irr}}. \end{aligned}$$

(8)

Assuming irrelevant tokens are independent of the target output:

$$\begin{aligned} I(Y; Z_{\text {irr}} \mid I) \approx 0, \end{aligned}$$

(9)

but they still contribute to the entropy:

$$\begin{aligned} H(Z) = H(Z_{\text {rel}}) + H(Z_{\text {irr}}). \end{aligned}$$

(10)

Thus, the signal-to-noise ratio of the prompt can be defined as:

$$\begin{aligned} \text {SNR} = \frac{I(Y; Z_{\text {rel}} \mid I)}{H(Z)}. \end{aligned}$$

(11)

Classification-guided prompting removes $Z_{\text {irr}}$, yielding:

$$\begin{aligned} \text {SNR}_{\text {guided}} \gg \text {SNR}_{\text {all}}, \end{aligned}$$

(12)

which leads to more efficient conditioning.
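
To make the size of this gain concrete, the ratio of the two signal-to-noise ratios can be worked out directly from Eqs. (9) to (11). This is a sketch that takes the additivity in Eq. (10) at face value, i.e., it treats $Z_{\text {rel}}$ and $Z_{\text {irr}}$ as independent:

$$\begin{aligned} \frac{\text {SNR}_{\text {guided}}}{\text {SNR}_{\text {all}}} = \frac{I(Y; Z_{\text {rel}} \mid I) / H(Z_{\text {rel}})}{I(Y; Z_{\text {rel}} \mid I) / \left( H(Z_{\text {rel}}) + H(Z_{\text {irr}})\right) } = 1 + \frac{H(Z_{\text {irr}})}{H(Z_{\text {rel}})}. \end{aligned}$$

Under the further illustrative assumption (not a measured quantity) that each of the 16 document types contributes type-specific prompt content of comparable entropy and the shared block is ignored, $H(Z_{\text {irr}}) \approx 15\, H(Z_{\text {rel}})$ and the ratio is about 16. The strict inequality in Eq. (12) therefore holds when irrelevant content dominates the universal prompt.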

Attention Dilution Effect. Transformer-based LVLMs rely on attention mechanisms:

$$\begin{aligned} \text {Attn}(Q, K, V) = \text {softmax}\left( \frac{QK^T}{\sqrt{d}}\right) V. \end{aligned}$$

(13)

Let the prompt tokens be $\{z_1, \dots , z_n\}$. In long universal prompts, attention weights are distributed across both relevant and irrelevant tokens:

$$\begin{aligned} \sum _{i=1}^{n} \alpha _i = 1, \quad \alpha _i = \text {softmax}(q \cdot k_i). \end{aligned}$$

(14)

When n increases due to irrelevant tokens, the expected attention mass assigned to relevant tokens decreases:

$$\begin{aligned} \mathbb {E}\left[ \sum _{i \in \text {rel}} \alpha _i\right] \downarrow . \end{aligned}$$

(15)

We refer to this phenomenon as attention dilution, where useful signals are weakened by the presence of unrelated instructions.
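
A small numerical illustration of Eqs. (14) and (15) follows. It is a toy model rather than a measurement on an LVLM: a single query attends to tokens with fixed, hypothetical attention logits, and the function reports how much softmax mass lands on the relevant tokens as irrelevant tokens are added.

```python
import math


def relevant_attention_mass(rel_scores, irr_scores):
    """Share of softmax attention assigned to the relevant tokens.

    rel_scores -- attention logits q.k_i of relevant prompt tokens
    irr_scores -- attention logits of irrelevant prompt tokens
    """
    scores = list(rel_scores) + list(irr_scores)
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return sum(weights[: len(rel_scores)]) / sum(weights)


# Hypothetical logits: relevant tokens score higher than irrelevant ones.
relevant = [2.0] * 10
for n_irrelevant in (0, 50, 150):
    mass = relevant_attention_mass(relevant, [1.0] * n_irrelevant)
    print(n_irrelevant, round(mass, 3))
```

Even though each irrelevant token receives a lower logit than each relevant one, their number alone pulls attention mass away from the relevant tokens, which is the dilution effect described above.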

Error Propagation Trade-off. The proposed method introduces a dependency on classification accuracy. Let $\epsilon = P(\hat{d} \ne d)$ denote classification error. Then:

$$\begin{aligned} P(Y \mid I) = (1 - \epsilon ) P(Y \mid I, P(d)) + \epsilon P(Y \mid I, P(\hat{d} \ne d)). \end{aligned}$$

(16)

This reveals a trade-off:

High classification accuracy leads to strong gains via focused prompting.
Misclassification introduces structured errors due to incorrect prompts.

Empirically, Table 2 shows $\epsilon$ is small ($<2\%$), making the benefits dominant.
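
The trade-off in Eq. (16) can be made tangible with a short calculation. The sketch below treats extraction quality as a scalar per-document score (for example, the probability of a correct extraction), which is a simplification, and all numeric inputs are hypothetical rather than taken from the experiments. It also derives the classification error rate below which guided prompting still beats a universal prompt.

```python
def expected_quality(eps, q_correct, q_wrong):
    """Mixture of Eq. (16): quality under hard routing with error rate eps.

    q_correct -- quality when the prompt of the true type is used
    q_wrong   -- quality when a wrong type's prompt is used
    """
    return (1.0 - eps) * q_correct + eps * q_wrong


def break_even_error(q_correct, q_wrong, q_universal):
    """Largest eps for which guided prompting matches a universal prompt.

    Solves (1 - eps) * q_correct + eps * q_wrong = q_universal for eps.
    """
    if q_correct <= q_wrong:
        raise ValueError("q_correct must exceed q_wrong")
    return (q_correct - q_universal) / (q_correct - q_wrong)


# Hypothetical values for illustration only.
print(expected_quality(0.02, 0.9, 0.3))
print(break_even_error(0.9, 0.3, 0.7))
```

With a classification error of a few percent, the mixture stays close to `q_correct`, so the routing step costs little compared with the gain from a focused prompt.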

Connection to In-Context Learning. From an ICL perspective, prompt construction defines a task-specific context distribution:

$$\begin{aligned} P_{\text {context}} = P(Z \mid d). \end{aligned}$$

(17)

By conditioning on d, we align the context distribution with the input distribution:

$$\begin{aligned} P(Z \mid d) \approx P(Z \mid I), \end{aligned}$$

(18)

which reduces distribution mismatch and improves generalization.

Overall, classification-guided prompting improves LVLM performance by reducing the hypothesis space through conditional routing, increasing the signal-to-noise ratio of prompts, mitigating attention dilution in long prompts, and better aligning the context distribution for in-context learning. These factors together provide a theoretical foundation for the empirical gains observed in our experiments.

Alternative OCR&UIE pipeline as baseline

To establish a strong trainable baseline, we develop an enhanced two-stage pipeline that integrates mature OCR with unified information extraction (UIE), as depicted in Fig. 3.

The UIE module replaces traditional semantic entity recognition, requiring only a small number of annotated samples (e.g., 10 per type) for robust performance. Combined with mature pre-trained OCR models, this pipeline offers a practical solution for small-sample multi-VIE scenarios.

Experiment

This study evaluates the effectiveness of the proposed method using a real-world electronic bidding and tendering platform. The system handles extensive document-based information exchange, where a key challenge lies in accurately extracting and analyzing information from image-based documents. These documents include, but are not limited to, business licenses, professional qualification certificates, and social security certificates. They serve two critical purposes: (1) as legal proof of compliance and eligibility for bidders, and (2) as fundamental data for tendering entities to perform qualification reviews and risk assessments.

Dataset

Public benchmarks such as SROIE³⁹ and SCID⁴⁰ are limited in document variety, layout complexity, and entity types, and may have been exposed during large vision-language model pretraining. To enable rigorous evaluation under realistic conditions, we construct a new domain-specific dataset collected from a real-world electronic bidding and tendering platform.

The dataset comprises 98,600 images of 16 common certificate types, sourced from production environments across diverse regions, industries, and enterprise scales. These samples exhibit authentic real-world variations, including differences in resolution, watermarks, seal imprints, and photographic distortions. Original documents were provided in PDF, PPT, XLS, DOC, and image formats; embedded images were extracted using PyMuPDF, pdf2image, and Pillow, followed by preprocessing with OpenCV. The data were split into training and test sets in a 7:3 ratio. Due to commercial sensitivity, the original images are not publicly released; however, representative synthetic examples generated via GPT-4V are provided in our open-source repository for illustrative purposes.

Table 1 summarizes the per-category distribution and key characteristics. The dataset maintains reasonable balance across categories while exhibiting substantial variation in extraction complexity: the average number of target entities ranges from 4 to 11 (overall 8.5), and OCR-extracted text length varies significantly (overall 312.81 characters), with particularly dense tabular content in categories like Social Security Certificates.

Table 1 Statistics of the real-world bidding dataset.

Model training

For the detection and recognition tasks, we selected the differentiable binarization algorithm, which facilitates efficient post-processing, and the convolutional recurrent neural network algorithm, which integrates convolutional and sequential features. Accuracy was improved by fine-tuning a pre-trained model from PaddleOCR⁴¹, which showed the best accuracy and generalization performance on publicly available Chinese-language datasets. For the UIE task, following the approach outlined in ref.⁵, a separate UIE model was trained for each image type to extract the corresponding landmark fields.

In the training-free feature-matching classification strategy (Option 1), a pre-trained ResNet-101 model extracts deep features from pre-processed images, with the final fully connected layer removed to produce feature vectors. Image preprocessing involves resizing to $400 \times 300$ pixels, converting to tensor format, and normalizing using mean $[0.485, 0.456, 0.406]$ and standard deviation $[0.229, 0.224, 0.225]$. To classify a query image, cosine similarity is computed against a Faiss-based index, and the top-5 nearest neighbors are retrieved and passed through thresholding ($\tau = 0.9$) and majority voting to enhance prediction reliability.

The PaddleLabel tool⁴² was used for the detection and recognition labeling tasks, and the Doccano tool⁴³ for sequence labeling in the UIE task. The detection, recognition, and UIE models were trained on a single NVIDIA 3090 GPU. For the supervised image classifier (Option 2), we used the pre-trained ConvNeXt-B model (ImageNet-22K, 224 configuration)⁴⁴, fine-tuned on a single NVIDIA 3090 GPU.

To further reduce LVLM inference time while improving accuracy, we fine-tuned the Qwen2.5-VL-7B model using LoRA. We first produced preliminary annotations of the raw data with a larger LVLM (Qwen2.5-VL-72B), then manually verified the annotations and corrected errors. In total, 394,400 annotated training samples were compiled for the experiment. The four types of annotated data (the multi-granular annotations described in the section on domain-specific enhancement) not only enhance the model’s understanding of document information from different perspectives but also help mitigate the loss of generalization ability after fine-tuning. Training was conducted on 8 NVIDIA A100 GPUs over a period of 6 days.

Evaluation metrics

In the context of image classification, let $\mathscr {L}$ represent the set of all possible labels. For each label $l \in \mathscr {L}$, we define $N_t^{l}$ as the number of true positives, $N_f^{l}$ as the number of false positives, and $N_n^{l}$ as the number of false negatives. The F1-score for each l is computed as follows:

$$\begin{aligned} \textrm{F1}^{l} = 2 \times \frac{\textrm{Precision}^{l} \times \textrm{Recall}^{l}}{\textrm{Precision}^{l} + \textrm{Recall}^{l}} \end{aligned}$$

(19)

where $\textrm{Precision}^{l} = \frac{N_t^{l}}{N_t^{l}+N_f^{l}}$ and $\textrm{Recall}^{l} = \frac{N_t^{l}}{N_t^{l}+N_n^{l}}$. The weight $w^{l}$ assigned to each label $l$ is given by:

$$\begin{aligned} w^{l} = \frac{N_t^{l}+ N_n^{l}}{\sum _{l \in \mathscr {L}} (N_t^{l}+ N_n^{l})} \end{aligned}$$

(20)

The weighted-average F1-score is then computed as:

$$\begin{aligned} \mathrm {F1_{weighted}} = \sum _{l\in \mathscr {L}}w^{l} \textrm{F1}^{l} \end{aligned}$$

(21)
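
Eqs. (19) to (21) translate directly into code. The following sketch computes the support-weighted F1-score from per-label counts; the function name and the input layout are hypothetical, and labels with an undefined precision or recall are given an F1 of zero, which is a common convention rather than something stated in the text.

```python
def weighted_f1(counts):
    """Support-weighted F1 over labels.

    counts -- dict mapping label -> (true_pos, false_pos, false_neg),
              i.e. (N_t, N_f, N_n) in the notation of the text
    """
    total_support = sum(tp + fn for tp, _, fn in counts.values())
    if total_support == 0:
        return 0.0
    score = 0.0
    for tp, fp, fn in counts.values():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight w^l: share of ground-truth samples carrying this label.
        score += (tp + fn) / total_support * f1
    return score
```

For example, two labels with equal support and F1-scores of 0.8 and 2/3 yield a weighted F1 of about 0.733.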

For the VIE task, predictions are compared with ground truth labels, and both entity-level F1-scores and the average normalized edit distance (NED) are used to evaluate performance^7,9. An entity is considered a true positive if its predicted content exactly matches the corresponding ground truth. The $\textrm{Precision}$ and $\textrm{Recall}$ are defined as:

$$\begin{aligned} \left\{ \begin{aligned}&\textrm{Precision} = \frac{N_t}{N_p} \\&\textrm{Recall} = \frac{N_t}{N_g} \end{aligned} \right. \end{aligned}$$

(22)

where $N_t$ denotes the number of true positive samples, $N_p$ represents the total number of predictions, and $N_g$ is the number of ground truth instances. The entity-level F1-score is computed by combining equations (22) and (19).
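
The entity-level metric can be sketched as below. Each document is represented as a dictionary from entity name to extracted string; this layout, the function name, and the choice to count every predicted and every ground-truth key (including values such as "unidentified") are assumptions of the sketch, since the text does not specify how missing fields are counted.

```python
def entity_f1(predictions, ground_truths):
    """Exact-match entity-level precision, recall, and F1.

    predictions, ground_truths -- parallel lists of dicts (entity -> value)
    """
    n_t = n_p = n_g = 0
    for pred, gold in zip(predictions, ground_truths):
        n_p += len(pred)   # total number of predictions
        n_g += len(gold)   # total number of ground-truth entities
        # True positive: same entity name and exactly the same content.
        n_t += sum(1 for key, value in pred.items() if gold.get(key) == value)
    precision = n_t / n_p if n_p else 0.0
    recall = n_t / n_g if n_g else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Exact matching makes this metric strict: a single wrong character turns an otherwise correct entity into an error, which is why the NED score is reported alongside it.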

Moreover, the average NED-score is given by:

$$\begin{aligned} \begin{aligned} \textrm{NED} = \frac{1}{N} \sum _{k=1}^N \left\{ 1 - \frac{I_k + D_k + M_k}{L_{g,k}} \right\} \end{aligned} \end{aligned}$$

(23)

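where $N$ represents the total number of entities; $I_k$, $D_k$, and $M_k$ denote the numbers of character insertions, deletions, and modifications (substitutions) between the prediction and the ground truth of the $k$-th entity; and $L_{g,k}$ is the length of the $k$-th ground-truth entity. Higher values indicate closer agreement, with 1 corresponding to an exact match.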
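
A minimal sketch of Eq. (23) is given below. It takes $I_k + D_k + M_k$ to be the Levenshtein (minimum edit) distance between the predicted and ground-truth strings, which is the usual reading but an assumption here; the handling of empty ground-truth strings is likewise an assumption, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def average_ned(predictions, ground_truths):
    """Average of 1 - distance / len(ground truth) over all entities."""
    scores = []
    for pred, gold in zip(predictions, ground_truths):
        if not gold:
            # Assumed convention for empty ground truth.
            scores.append(1.0 if not pred else 0.0)
        else:
            scores.append(1.0 - edit_distance(pred, gold) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0
```

As written in Eq. (23), the per-entity score is not clamped, so a prediction much longer than its ground truth can produce a negative term.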

Results and discussion

Classification performance

To evaluate the proposed classification framework, we conducted experiments under four configurations:

Retrieved Feature Matching (RFM) (classification Option 1): A baseline method using a deep residual network to extract image features, followed by cosine-similarity matching against a database. It serves as a visual-only reference.
Feature Matching-Text Fusion (FM-TF) (classification Option 1 with OCR-based classifier): Extends RFM by integrating an OCR module to extract textual information, thereby refining classification using both visual and textual cues.
Trained End-to-End Multi-Classification (TEEMC) (classification Option 2 with multi-classification mode): Employs a lightweight ConvNeXt-B network pre-trained on ImageNet-22K and fine-tuned on our dataset in multi-class mode, enhanced with an OCR-based module.
Trained End-to-End Binary-Classification (TEEBC) (classification Option 2 with binary-classification mode): Similar to TEEMC but uses multiple independent binary classifiers, each dedicated to a specific class.

Table 2 F1-scores for the classification tasks.

As shown in Table 2, the RFM method, relying solely on visual features, yields the lowest F1-score. In contrast, FM-TF, which incorporates OCR-extracted text, improves accuracy substantially. This fusion addresses a key challenge in supply chain bid evaluation: distinguishing visually similar certificates (e.g., Telecommunication Network Security Service Capability Certificates) that differ only in sparse keywords (e.g., “risk assessment,” “emergency response”). The end-to-end trained models TEEMC and TEEBC achieve further gains, with TEEBC attaining the highest performance (98.45%). This indicates that dedicated binary classifiers outperform a single multi-class model on our dataset.

In summary, combining end-to-end training with textual feature fusion is crucial for high classification accuracy. While the feature-matching approach (Option 1) offers good scalability without extra labeled data, the training-based approach (Option 2) delivers superior effectiveness. Method selection in practice should therefore align with task-specific demands and constraints.

Multi-VIE performance

We compare our fine-tuned 7B model against two baseline versions of Qwen2.5-VL (7B and 72B) on a general document information extraction task. Results are shown in Table 3.

Table 3 F1-score and NED for multi-VIE.

Our fine-tuned 7B model outperforms both the untuned 7B and the larger 72B model. We also measured inference latency to assess practical deployability. The Qwen2.5-VL-72B model requires an average of 6.5 seconds per image, while both the base Qwen2.5-VL-7B and our fine-tuned 7B model average 2.6 seconds per image. This approximately 2.5$\times$ speedup of the 7B models, combined with their superior or comparable accuracy on our real-world dataset, demonstrates a favorable trade-off for deployment scenarios where throughput and resource constraints are critical.

Cross-dataset generalization analysis

To evaluate generalizability, we conduct experiments on multiple public benchmarks with diverse layouts and domains (Table 4).

The results show that our method maintains competitive performance across datasets without task-specific adaptation. While slight performance degradation is observed on datasets such as FUNSD and EPHOIE, this is expected due to domain shift in layout structure and annotation schema.

Notably, the performance gap remains relatively small compared to base LVLMs, indicating that classification-guided prompting does not overfit to the source domain. Instead, it preserves the general reasoning capability of LVLMs while improving task alignment. These results demonstrate that the proposed framework generalizes well across datasets and is robust to distribution shifts, which is critical for real-world deployment.

Table 4 Benchmark results (F1/NED) on public datasets.

OCR&UIE vs. classification-guided LVLM

Table 5 compares the conventional OCR&UIE pipeline with our proposed classification-guided LVLM approach across 16 certificate types.

Table 5 F1 and NED comparison between OCR&UIE and LVLM-based multi-VIE.

The classification-guided LVLM-based method consistently outperforms OCR&UIE, with average F1 improving from 68.08% to 93.65% and average NED from 0.67 to 0.93. The gain is especially pronounced for form-like documents (e.g., Social Security Certificate, +44.37 percentage points F1). Combining the results in Table 3 and Table 5 shows that, even without any fine-tuning, the LVLM-based method improves on OCR&UIE by 18.35 percentage points in average F1-score and by 0.23 in average NED. Despite the fine-grained fine-tuning employed in the OCR&UIE-based method, its performance remains suboptimal in real-world multi-VIE scenarios. This limitation stems from the inherent dependency of VIE accuracy on the outputs of the detection and recognition stages. In practical applications, results are influenced not only by text attributes, such as color, size, font, shape, orientation, and multilingual content, but also by various image distortions, including blurring, low resolution, shadows, brightness variations, watermarks, and seal obstructions. In contrast, the LVLM-based approach benefits from extensive pre-training on large-scale image-text datasets, providing a robust understanding of image elements and enhanced predictive capabilities for low-quality text information.

Ablation study

To validate the effectiveness of key components, we conduct ablation experiments on the test set using the zero-shot Qwen2.5-VL-7B model.

Table 6 Ablation study on core components (zero-shot setting).


As shown in Table 6, removing document-type classification causes the most significant performance drop (18.01 percentage points in F1-score). In this ablated setting (“single universal prompt”), we construct one monolithic prompt that includes landmarks, background descriptions, layout hints, and in-context examples for all 16 document types simultaneously (see Fig. 2 for the per-type structure). This results in an extremely long prompt, forcing the LVLM to infer the correct document type and corresponding extraction rules entirely from the input image and the overloaded prompt context. The substantial degradation is likely due to diluted attention: the model struggles to focus on the relevant subset of instructions amid the large volume of irrelevant information for the given document type.

Removing ICL (“only task definition + format”) also notably harms performance: ICL contributes 6.78 percentage points of F1. Here, we retain document-type classification and dynamically select the correct landmarks but omit all in-context examples and task-specific knowledge injection (e.g., document background and purpose, and layout and spatial hints such as “top area, main information area, bottom area” illustrated in the shared prompt component of Fig. 2). Without these structured demonstrations and prior knowledge, the zero-shot LVLM lacks sufficient guidance to reliably locate and extract entities in visually rich, domain-specific certificates, leading to increased hallucinations and formatting errors.
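The three prompt settings compared above (full classification-guided prompt, single universal prompt, and task definition plus format only) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TypeKnowledge` structure, the field names, the example strings, and the `business_license` type are hypothetical and are not taken from the released prompts.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TypeKnowledge:
    """Hypothetical per-type prompt material (all values are placeholders)."""
    landmarks: Tuple[str, ...]
    background: str
    layout_hints: str
    examples: Tuple[str, ...] = ()


TASK_HEADER = "Extract the listed fields from the document image and answer in JSON."

# Two illustrative types only; the paper's dataset has 16.
KNOWLEDGE: Dict[str, TypeKnowledge] = {
    "social_security_certificate": TypeKnowledge(
        landmarks=("insured_unit", "payment_period"),
        background="Shows that an employer paid social insurance.",
        layout_hints="top area: title; main information area: table; bottom area: seal",
        examples=('{"insured_unit": "Example Co.", "payment_period": "2024-01 to 2024-06"}',),
    ),
    "business_license": TypeKnowledge(
        landmarks=("company_name", "credit_code"),
        background="Registers a company with the authorities.",
        layout_hints="top area: emblem; main information area: key-value rows; bottom area: date",
        examples=('{"company_name": "Example Co.", "credit_code": "0000"}',),
    ),
}


def render_block(doc_type: str, with_icl: bool = True) -> str:
    """Render one type's instructions; without ICL only the landmarks remain."""
    k = KNOWLEDGE[doc_type]
    lines = [f"Document type: {doc_type}", "Fields: " + ", ".join(k.landmarks)]
    if with_icl:
        lines.append("Background: " + k.background)
        lines.append("Layout: " + k.layout_hints)
        lines.extend("Example: " + e for e in k.examples)
    return "\n".join(lines)


def build_guided_prompt(predicted_type: str, with_icl: bool = True) -> str:
    """Classification-guided prompt: only the predicted type's block is included."""
    return TASK_HEADER + "\n" + render_block(predicted_type, with_icl)


def build_universal_prompt() -> str:
    """Ablated setting: one monolithic prompt covering every type at once."""
    return TASK_HEADER + "\n" + "\n\n".join(render_block(t) for t in KNOWLEDGE)
```

Even with two placeholder types the universal prompt is noticeably longer than the guided one, and its length grows linearly with the number of types, which is the dilution effect the ablation points to.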

Post-processing and image rotation preprocessing provide smaller but consistent gains, confirming their value in handling real-world variations (e.g., inconsistent formatting and vertically oriented scans). Finally, the trained classifier (Option 2) slightly outperforms the training-free feature-matching approach (Option 1), indicating that higher classification accuracy directly benefits downstream extraction in the zero-shot pipeline.

Limitations and mitigations

Although the proposed classification-guided LVLM framework demonstrates strong performance and robustness in real-world multi-VIE scenarios, it inherits several limitations common to generative vision-language models.

The primary issue is hallucination, where the model generates plausible but incorrect or incomplete entity values. This problem is particularly pronounced in documents with degraded visual quality (e.g., blur, low contrast, or occlusion), where the model tends to rely on prior knowledge rather than faithfully grounding its predictions in the input.

We observe several characteristic failure modes:

Long-text instability: When extracting fields with long textual content (e.g., addresses or descriptions), the LVLM may produce degenerate outputs, including repetitive token generation. In extreme cases, the model repeatedly outputs a single word or phrase until reaching the token limit, indicating instability in long-sequence decoding.
Visual ambiguity: Documents with highly similar layouts but subtle semantic differences (e.g., certificate variants) may lead to incorrect field assignments, especially when classification confidence is low.
Error propagation from classification: Since prompt construction depends on predicted document type, misclassification can result in systematically incorrect prompts and structured extraction errors.

To mitigate these issues, we adopt several practical strategies. First, automatic rotation preprocessing is applied to align images horizontally, yielding an approximately 3 percentage point improvement in average F1-score. Second, post-processing enforces format normalization and removes obvious repetition artifacts in generated outputs.
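As an illustration of the second strategy, the sketch below collapses runs in which the same word or short phrase repeats back to back. It is not the authors' post-processing code: the function name, the thresholds, and the whitespace tokenization are assumptions, and text without spaces (e.g., Chinese) would need character-level tokens instead.

```python
def collapse_repeats(text: str, max_repeats: int = 2, max_ngram: int = 4) -> str:
    """Keep one copy of any n-gram repeated consecutively more than max_repeats times.

    Assumption: tokens are whitespace-separated; thresholds are illustrative.
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        collapsed = False
        for n in range(1, max_ngram + 1):
            gram = tokens[i:i + n]
            if len(gram) < n:
                break  # not enough tokens left for an n-gram of this size
            # Count how many times this n-gram repeats back to back.
            count = 1
            j = i + n
            while tokens[j:j + n] == gram:
                count += 1
                j += n
            if count > max_repeats:
                out.extend(gram)  # keep a single copy of the repeated unit
                i = j
                collapsed = True
                break
        if not collapsed:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(collapse_repeats("No. 5 Main Road Road Road Road Road"))
```

A filter like this only cleans up the surface of a degenerate output; it cannot recover the field content the model failed to produce, which is why decoding-level remedies are discussed next.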

For more fundamental improvements, future work could explore constrained decoding strategies (e.g., JSON-schema-guided generation, repetition penalties, or length-aware stopping criteria) to stabilize long-text generation. Additionally, improving classification confidence estimation and incorporating soft or multi-hypothesis prompting may further reduce error propagation.
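One possible reading of multi-hypothesis prompting is sketched below: when the classifier's top probability falls below a threshold, extraction is run for the top-k candidate types and one result is kept. This is a hypothetical design, not something evaluated in the paper; the threshold `tau`, the value of `top_k`, the `extract` callable, and the "most non-empty fields" selection rule are all assumptions.

```python
from typing import Callable, Dict, List, Tuple


def select_hypotheses(probs: Dict[str, float], tau: float = 0.8, top_k: int = 2) -> List[str]:
    """Return one type when the classifier is confident, otherwise the top-k types."""
    if not probs:
        raise ValueError("probs must not be empty")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    best_type, best_p = ranked[0]
    if best_p >= tau:
        return [best_type]
    return [t for t, _ in ranked[:top_k]]


def extract_with_fallback(
    probs: Dict[str, float],
    extract: Callable[[str], Dict[str, str]],
    tau: float = 0.8,
    top_k: int = 2,
) -> Tuple[str, Dict[str, str]]:
    """Run extraction per hypothesis and keep the result with the most filled fields.

    `extract` stands in for prompt construction plus LVLM inference for one
    document type. Ties go to the higher-probability type because `max`
    returns the first maximal element and candidates are ranked.
    """
    candidates = select_hypotheses(probs, tau, top_k)
    results = {t: extract(t) for t in candidates}
    best = max(candidates, key=lambda t: sum(1 for v in results[t].values() if v))
    return best, results[best]
```

The cost is one extra LVLM call per additional hypothesis, so a scheme like this would only pay off if low-confidence classifications are rare.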

Conclusion

We propose a classification-guided LVLM framework for real-world multi-VIE tasks in visually rich documents. By decoupling document-type identification from content extraction and introducing dynamic prompt-based knowledge injection, our approach achieves superior efficiency and generalization without task-specific training in its base form. On a challenging real-world bidding dataset comprising 16 diverse certificate types, the zero-shot LVLM configuration substantially outperforms the trainable OCR&UIE baseline, improving average F1-score from 68.08% to 86.43% and normalized edit distance from 0.67 to 0.90. Optional domain-specific fine-tuning further elevates performance to 93.65% F1-score and 0.93 normalized edit distance, demonstrating remarkable robustness against real-world impairments such as seals, watermarks, and low contrast. This work establishes an accessible, scalable paradigm for complex document understanding, offering a practical evolutionary path for intelligent document processing systems in office automation and beyond. More broadly, our findings suggest that input-dependent prompt conditioning can be viewed as an effective approximation to conditional computation in large vision-language models, highlighting a promising direction for improving information efficiency and controllability in prompt-based multimodal inference.

Data availability

The source code is publicly available on GitHub at https://github.com/FairmeHIT/Multi-VIE, and the fine-tuned models can be accessed via Hugging Face at https://huggingface.co/fairme/Qwen2.5-VL-7B-SFT.


Acknowledgements

This work was supported by the Ningbo Natural Science Foundation under Grant 2023J285 and the General Scientific Research Project of Department of Education of Zhejiang Province under Grant Y202352276.


Author information

Huafu Li and Liming Li contributed equally to this work.

Authors and Affiliations

China Mobile Information Technology Co., Ltd., Shenzhen, 518000, China
Huafu Li, Guo Chen, Jia Xia, Lei Wang, Wei Du, Yun Yao & Weijun Peng
School of Information Science and Engineering, NingboTech University, Ningbo, 315100, China
Liming Li


Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were conducted by Huafu Li, Guo Chen, Jia Xia, and Liming Li. Lei Wang was responsible for statistical analysis and validation of data interpretation. Wei Du oversaw experimental methodology development and quality control during data collection. Yun Yao and Weijun Peng participated in manuscript revision, provided critical review of the scientific content, and approved the final version for publication. The first draft of the manuscript was written by Huafu Li, and all authors reviewed and commented on subsequent versions. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Huafu Li or Liming Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Li, H., Chen, G., Xia, J. et al. Visual information extraction from documents via classification-guided large vision-language models. Sci Rep 16, 14158 (2026). https://doi.org/10.1038/s41598-026-49319-z


Received: 21 July 2025
Accepted: 14 April 2026
Published: 02 May 2026
Version of record: 04 May 2026
DOI: https://doi.org/10.1038/s41598-026-49319-z
