Abstract
Shaped by three millennia of evolution, ancient Chinese characters present unique challenges for intelligent recognition systems: severe degradation and stylistic variation cause notable intra-class variability and reduced inter-class separability. To address this, we propose a novel framework with two key components: a local content transformation (LCT) module that enhances salient regions through learnable spatial attention weights, and a local correlation reasoning (LCR) module that employs graph convolutional networks (GCNs) to model patch-wise spatial-semantic dependencies via Transformer-derived attention maps. Experiments on the MACT benchmark show that the framework achieves state-of-the-art performance, improving top-1 accuracy by 2.02% over existing methods. The proposed architecture is particularly robust in handling degraded samples and distinguishing visually similar character classes, showing substantial potential for applications in archeological documentation, digital paleography, and intelligent cultural heritage preservation systems. Code is available at https://github.com/wtq123-git/AttGraph.
Introduction
As a foundational medium of human civilization, text has long been a critical epistemic vessel, enabling the codification and cross-generational transmission of cultural knowledge. With advances in image processing and computer vision, intelligent text recognition1,2,3 has become a key tool in domains like literary analysis, digitization, and assistive reading, and is extensively deployed in fields such as image retrieval, autonomous driving and accessible human-computer interaction4,5. Within this field, scene text recognition6,7 focuses on detecting and interpreting text from complex, cluttered visual environments, while handwritten text recognition8,9 addresses challenges from high variability and deformation caused by individualized writing styles.
As shown in Fig. 1, ancient text recognition presents greater challenges: it involves not only material degradation (e.g., stains, fading, damage) but also significant paleographic variation across historical periods—from oracle bone script to cursive styles—leading to notable intra-class variability and inter-class similarity (Fig. 2). Recent text recognition research has expanded to natural scene text10, handwritten text11,12 and artistic calligraphy13,14,15. Traditional unsupervised methods work for specific tasks (e.g., document analysis, digitization) but lack accuracy and robustness because they cannot adaptively extract features under diverse visual conditions. While approaches such as the ensemble model MME16 and the domain-adaptive CA-CF17 have been explored, challenges persist in handling the complexities of ancient Chinese characters, especially the wide intra-class variations and subtle inter-class differences arising from historical writing practices and multi-scene distribution. Although recent methods18,19,20,21 help by letting networks localize key regions and capture fine-grained features to distinguish subtle inter-class differences, they neglect inter-region correlations—critical for robust understanding of characters with shared components (e.g., 叨/召/另, 旮/旯, 枷/架, 景/晾, 愀/愁, and 愉/愈).
Fig. 1: Various chirography styles of ancient Chinese characters and their corresponding carriers, including seal script, clerical script, running script, and cursive script found on bamboo slips, stone carvings, plaques, bronze artifacts, ancient coins, and calligraphy and painting works.
With the advent of deep learning, transformers have revolutionized natural language processing and computer vision22, addressing limitations of traditional convolutional neural networks (CNNs)23 via dynamic self-attention24,25. Vision transformer (ViT)26 performs well in image tasks, and Swin transformer27 boosts efficiency with hierarchical features. However, their large parameter demands hinder deployment in resource-constrained scenarios—a gap filled by TinyViT28 through lightweight design and knowledge distillation.
Additionally, graph convolutional networks (GCNs)29 (e.g., GazeGCN30, GKA31, RGEN32) are powerful for processing graph-structured data and enhancing relation feature representation. Despite their ability to model contextual and structural relationships, applying GCNs to ancient Chinese characters remains challenging due to insufficient modeling of complex stroke relationships, poor handling of component spatial configurations, and limited adaptation to paleographic variations.
To address these challenges, we propose a novel framework that synergistically integrates a local content transformation (LCT) module for enhanced local region representation and a local correlation reasoning (LCR) module for comprehensive cross-patch relationship modeling. Built upon the lightweight TinyViT backbone, the contributions can be summarized as follows:
1. The LCT module introduces an attention-guided mechanism that detects and enhances discriminative local features, significantly improving the recognition of visually similar characters across classes.

2. The LCR module integrates GCNs with Transformer attention weights to capture structural relationships among character components.

3. Extensive experiments demonstrate that the proposed method outperforms previous approaches, achieving state-of-the-art (SOTA) performance with a 2.02% improvement in top-1 accuracy, and shows particular robustness in handling degraded samples and stylistically similar characters.
Methods
Inspired by the cognitive process of character learning, which proceeds from holistic observation through component analysis to structural relation building, AttGraph is proposed for homoglyphic inter-class recognition of ancient Chinese characters. As illustrated in Fig. 3, the architecture comprises three key components: a TinyViT backbone for global structure feature extraction; an LCT module for salient region enhancement, which adaptively detects and amplifies salient regions through learnable spatial attention weights; and an LCR module for spatial-semantic relationship modeling, which employs a GCN to model patch-wise spatial-semantic dependencies and uses Transformer attention maps to establish contextual correlation graphs.
TinyViT backbone network
The TinyViT backbone employs a four-stage hierarchical architecture with progressively reduced feature resolution. The initial stage uses a convolutional module incorporating an MBConv block for low-level feature extraction, combined with downsampling to enhance representational capability. The following three stages consist of different numbers of Transformer blocks, forming the BasicLayer. To optimize computational efficiency, the architecture implements shifted window-based local self-attention augmented with residual connections, maintaining performance while reducing resource requirements. Higher-order feature transformations are performed by an MLP module incorporating layer normalization and nonlinear activation. The outputs of the final Transformer module are processed by a classification head to generate prediction vectors, and the global structure loss is computed with the cross-entropy loss shown in Eq. (1):
where \(C\) denotes the number of classes, \({y}_{i}\) represents the ground truth labels, and \({z}_{i}\) corresponds to the raw logits from the final Transformer module in TinyViT.
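Since the rendered equation is not reproduced above, a plausible reconstruction of Eq. (1), consistent with these definitions, is the standard softmax cross-entropy for the global structure loss:

```latex
\mathcal{L}_{glb} \;=\; -\sum_{i=1}^{C} y_i \,\log\!\left(\frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}\right)
```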
Local content transformation module
The LCT module employs attention mechanisms to identify and enhance discriminative regions within character images. Building upon the feature representations from the last two Transformer blocks containing H attention heads, the module computes attention weights through a scaled dot-product operation, as formulated in Eqs. (2) and (3).
where \(Q\in {{\mathbb{R}}}^{N\times D}\) and \(K\in {{\mathbb{R}}}^{N\times D}\) represent the query and key vectors, respectively, with \(D\) denoting the feature dimension. Here, \({d}_{k}=D/H\) normalizes the dot-product results, preventing gradient vanishing issues, while \(h\in [1,H]\) indexes the attention heads. The concatenation operation ⊕ combines the multi-head attention matrices, yielding a normalized attention matrix \({\alpha }_{i}\in {{\mathbb{R}}}^{N\times D}\) for each Transformer block \(i\). These matrices capture the contextual relationships between all \(N\) image patches, effectively quantifying each patch’s contribution to the final classification decision.
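As the equations themselves are not reproduced here, one plausible reading of Eqs. (2) and (3) given these definitions—per-head scaled dot-product attention followed by head concatenation—is:

```latex
\mathrm{Attn}_{h} \;=\; \mathrm{softmax}\!\left(\frac{Q_{h} K_{h}^{\top}}{\sqrt{d_k}}\right),
\qquad
\alpha_{i} \;=\; \mathop{\oplus}\limits_{h=1}^{H} \mathrm{Attn}_{h}
```

The exact normalization applied to \(\alpha_i\) after concatenation is not specified in the text, so this form should be read as a sketch rather than the paper's exact formulation.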
The attention weights from both blocks are aggregated through stacking and averaging to produce a composite attention weight map A. The average pixel value \(\overline{b}\) of the map is defined in Eq. (4):
where A(x, y) represents the attention value at spatial location (x, y), with h and w denoting the map dimensions. Regions with higher activation values correspond to semantically significant character components that most influence the model’s decisions.
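Given this description, Eq. (4) is presumably the spatial mean of the composite attention map:

```latex
\overline{b} \;=\; \frac{1}{h\times w}\sum_{x=1}^{h}\sum_{y=1}^{w} A(x, y)
```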
To extract these discriminative regions, the module implements an adaptive thresholding mechanism, as denoted in Eq. (5).
where ε is an experimentally optimized scaling factor. The resulting binary mask isolates contiguous high-response regions, which are then cropped from the original image to generate focused feature areas. This process preserves the structural integrity of critical character components while eliminating irrelevant background information, significantly enhancing the model’s ability to discern subtle inter-class differences in ancient Chinese characters.
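A minimal PyTorch sketch of this thresholding-and-cropping step is given below. The exact thresholding rule of Eq. (5) is an assumption here (responses above an ε-scaled mean are kept), and the function name and tensor shapes are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F

def lct_crop(image: torch.Tensor, attn_map: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Sketch of the LCT salient-region cropping (Eqs. (4)-(5), assumed form).

    image:    (C, H, W) original character image.
    attn_map: (h, w) composite attention map A from the last two Transformer blocks.
    eps:      scaling factor for the adaptive threshold (0.2 in the experiments).
    """
    b_mean = attn_map.mean()                      # Eq. (4): average pixel value of A
    mask = attn_map > (1.0 + eps) * b_mean        # Eq. (5), assumed: eps-scaled mean threshold

    # Upsample the patch-level mask to pixel resolution.
    H, W = image.shape[-2:]
    mask_px = F.interpolate(mask[None, None].float(), size=(H, W), mode="nearest")[0, 0].bool()

    # Crop the bounding box of the high-response region; fall back to the full image if empty.
    ys, xs = torch.nonzero(mask_px, as_tuple=True)
    if ys.numel() == 0:
        return image
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1
    return image[:, y0:y1, x0:x1]
```

In the full pipeline, the cropped region presumably serves as the input to the local discriminative branch described below.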
Local correlation reasoning module
The LCR module employs a sophisticated graph-based approach to model structural relationships between character components. Considering that Chinese characters typically feature a centro-radial structure, with radicals and strokes arranged around core components, the directional and relative positional relationships are parameterized directly in polar coordinates. The structural information is extracted using a relative coordinate predictor (RCP) based on the attention weight map A from the LCT module. The RCP establishes a polar coordinate system centered at the reference point \({P}_{0}=({x}_{0},{y}_{0})\), corresponding to the patch with the maximum average feature response. For each remaining patch, its polar coordinate \({p}_{xy}=({\rho }_{xy},{\theta }_{xy})\) is computed as in Eqs. (6) and (7).
where \({\rho }_{xy}\) represents the normalized radial distance (\(0\le {\rho }_{xy}\le 1\)) and \({\theta }_{xy}\) denotes the normalized azimuthal angle (\(0\le {\theta }_{xy}\le 1\)). This polar-based representation improves over traditional Cartesian coordinates by providing a rotation-invariant and fine-grained description of relative spatial relationships.
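A plausible reconstruction of Eqs. (6) and (7), assuming the radius is normalized by the maximum patch-to-center distance \(\rho_{\max}\) and the angle is mapped from atan2 into \([0,1]\) (both assumptions, since the equations are not reproduced here), is:

```latex
\rho_{xy} \;=\; \frac{\sqrt{(x - x_0)^2 + (y - y_0)^2}}{\rho_{\max}},
\qquad
\theta_{xy} \;=\; \frac{\operatorname{atan2}(y - y_0,\; x - x_0) + \pi}{2\pi}
```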
The polar coordinate features are integrated with attention-based positional weights to construct a relational graph structure, where nodes correspond to image patches and edges model spatial-semantic dependencies. This graph representation is processed through a GCN that simultaneously learns from both the topological connectivity patterns and localized node features. The resulting GCN embeddings are concatenated with the final feature representations from the TinyViT backbone, forming a unified feature space that effectively captures fine-grained structural relationships between character components while preserving global contextual understanding.
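The following PyTorch sketch illustrates the core LCR idea under stated assumptions: patch features (augmented with the polar encodings) as graph nodes, the Transformer attention map reused as a soft adjacency matrix, and one symmetrically normalized GCN layer in the style of Kipf and Welling29. The class name, dimensions, and the direct use of the attention map as the adjacency are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LocalCorrelationReasoning(nn.Module):
    """Minimal sketch: attention-derived graph plus one GCN propagation step."""

    def __init__(self, in_dim: int = 192, out_dim: int = 192):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, node_feats: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, D) patch features fused with polar-coordinate encodings.
        # attn:       (N, N) attention map reused as a soft adjacency matrix.
        adj = attn + torch.eye(attn.size(0), device=attn.device)    # add self-loops
        deg = adj.sum(dim=-1)                                       # node degrees
        d_inv_sqrt = deg.clamp(min=1e-6).rsqrt()
        norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]  # D^-1/2 (A+I) D^-1/2
        return torch.relu(norm_adj @ self.proj(node_feats))         # one GCN layer
```

In the full model, the resulting node embeddings would then be concatenated with the TinyViT features before classification, as described above.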
Loss function and fusion
The total global-local consistency loss establishes dual supervision signals, where the global structure loss preserves holistic character morphology while the local discriminative loss enhances fine-grained feature learning. This complementary design ensures balanced optimization of macroscopic and microscopic characteristics, as formulated in Eq. (8):
where \({{y}^{{\prime} }}_{glb}\) and \({{y}^{{\prime} }}_{loc}\) denote the predictions from the global structure and local discriminative branches, respectively. The learnable parameters λ1, λ2 balance the contribution of each branch during training.
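Given these definitions, Eq. (8) presumably takes the weighted-sum form

```latex
\mathcal{L}_{total} \;=\; \lambda_{1}\,\mathcal{L}_{glb}\big(y,\, y^{\prime}_{glb}\big)
\;+\; \lambda_{2}\,\mathcal{L}_{loc}\big(y,\, y^{\prime}_{loc}\big)
```

where each branch loss is assumed to be a cross-entropy term against the ground-truth label \(y\).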
The final prediction probability is obtained through adaptive fusion of both branches’ outputs, as formulated in Eq. (9).
Here, \({P}_{x}={\text{softmax}}({z}_{x})\) denotes the probability distribution of branch \(x\) (global or local). The weighting parameters ω1 and ω2 dynamically balance the contributions of the two branches, with their sum constrained to equal 1 so that the output remains a valid probability distribution. This fusion strategy ensures robust performance across varying character conditions by leveraging both macroscopic structural patterns and microscopic discriminative features.
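With these constraints, Eq. (9) can be written as the convex combination

```latex
P \;=\; \omega_{1}\,\mathrm{softmax}(z_{glb}) \;+\; \omega_{2}\,\mathrm{softmax}(z_{loc}),
\qquad \omega_{1} + \omega_{2} = 1
```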
Results
Experimental setup
The experimental evaluation was conducted on the MACT dataset containing 2884 categories of ancient Chinese characters, with 106,696 training images and 14,315 test images. The training process incorporated random horizontal flipping and brightness adjustments for data augmentation, using SGD optimization with an initial learning rate of 0.001 that decayed by 0.1 at epochs 15, 30, and 50. The model was trained with weight decay of 5e-4 and batch size 8 on NVIDIA RTX 3090 GPUs using PyTorch, with key parameters including the LCT clipping threshold ε = 0.2 and balanced branch fusion weights ω1 = ω2 = 0.5.
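A hedged PyTorch sketch of this optimization setup follows; the nn.Linear stand-in for the network, the momentum value, the brightness-jitter strength, and the total epoch count are placeholders or assumptions not stated in the text.

```python
import torch
from torch import nn
from torchvision import transforms

# Data augmentation as described: random horizontal flip and brightness adjustment
# (jitter strength assumed).
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])

model = nn.Linear(3 * 224 * 224, 2884)  # stand-in for the AttGraph network (2884 classes)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[15, 30, 50], gamma=0.1)

for epoch in range(60):  # total epoch count assumed
    # ... one pass over the MACT training split with batch size 8 ...
    scheduler.step()
```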
Experimental results and analysis
Comparative analysis with state-of-the-art methods revealed significant improvements achieved by our approach, as shown in Table 1. The proposed AttGraph model attained 78.33% top-1 accuracy when initialized with ImageNet-22K pretrained weights, representing a 2.02% absolute improvement over the previous best method and a 1.26% gain over the TinyViT baseline. Notably, AttGraph exhibits lower dispersion and a more concentrated performance distribution; the corresponding standard deviations (SD) are provided in Table 1 and Fig. 4 to verify performance stability. These advantages are consistent across all evaluation metrics (top-2 to top-5) as well as all distribution scenes and calligraphic styles, as shown in Table 2, and can be attributed to the effective integration of local feature enhancement and structural relationship modeling in our framework. Results in Table 3 further demonstrate that AttGraph also outperforms competing methods on other datasets with distinct visual characteristics, confirming its robustness and adaptability. While a recent study, HUNet33, reports competitive results, its model was trained on 7 million samples; when evaluated under equivalent training conditions using only the original MACT dataset, HUNet’s performance drops to 36.13% top-1 accuracy. Because we were unable to contact the authors, comparative experiments on their self-built MTACCR dataset could not be conducted.
Ablation studies demonstrated the complementary contributions of the proposed modules, as shown in Table 4. Compared with the baseline TinyViT, which achieves 77.07% top-1 accuracy, improvements of 0.42% and 0.75% are obtained with the LCR and LCT modules, respectively, demonstrating that the attention-guided cropping mechanism effectively directs the network’s focus to the most discriminative local regions and that the graph-based reasoning enhances correlation modeling among components. Combining both modules yields a 1.26% improvement over the baseline, confirming the synergistic benefit of joint local content transformation and correlation reasoning. As formulated in Eq. (9), the optimal result is achieved when ω1 = ω2 = 0.5, where the two branches contribute equally, as shown in Fig. 5. This enhancement stems from the method’s ability to simultaneously focus on discriminative local regions while modeling their structural relationships.
Visualization
Visual analysis demonstrates the model’s robust feature learning capabilities. As shown in Fig. 6, which examines the relationship between stroke complexity and recognition performance, the majority of Chinese characters (only the complex characters with more than 10 strokes are shown) are recognized with high confidence scores. Notably, the model maintains strong performance even for more complex characters with higher stroke counts, indicating its effectiveness in handling intricate character structures. The visualization reveals that while recognition difficulty generally increases with stroke number, the system consistently achieves reliable predictions across varying complexity levels.
The synergistic integration of the LCT and LCR modules enables the model to capture both discriminative local features and structural relationships, significantly enhancing its representational power. This is further illustrated by the t-SNE visualizations in Fig. 7, where the proposed method shows markedly improved feature-space organization compared to the baseline. Quantitative analysis reveals substantially tighter intra-class clustering and greater inter-class separation between cluster centroids; these improvements in feature-space geometry directly correlate with the model’s enhanced classification performance, particularly for visually similar character categories that require fine-grained discrimination.
Character recognition results
The recognition results in Fig. 8 exhibit the model’s ability to handle the fundamental challenges in ancient text analysis. Figure 8a highlights successful recognition across substantial intra-class variations between historical and modern character forms, demonstrating remarkable robustness to stylistic differences spanning centuries of evolution. Conversely, Fig. 8b presents correctly classified examples of visually similar inter-class pairs, underscoring the method’s fine-grained discrimination capabilities.
Figure 9 reveals remaining challenges, with misclassifications primarily occurring in cases of (1) poor text-background contrast and extreme image degradation, such as (R2, C3), where the key part of the character is missing, (R3, C5), where only the right part is detected, and (R3, C7), where only the background texture is detected; (2) extreme exaggeration and highly personal writing styles, such as (R1, C3), (R1, C8), (R2, C6), and (R2, C7), which are most common in cursive script, where only 59.82% accuracy is achieved; and (3) the lack of variant samples in the training set, such as (R1, C2), (R1, C4), and (R1, C5), each of which is a variant form of a character not represented in the training data. Many variant characters, used only in specific periods and not passed down, as well as traditional characters, are absent from the training set, leading to incorrect recognition.
Discussion
This paper presents AttGraph, an enhanced TinyViT-based framework for homoglyphic ancient Chinese character recognition. The proposed method introduces an LCT module that adaptively enhances discriminative regions through learned attention mechanisms, and an LCR module that explicitly models structural dependencies between character components using GCNs. By synergistically integrating global visual features with localized structural representations, AttGraph effectively bridges the critical gap between fine-grained local analysis and holistic contextual understanding. Experimental results demonstrate the framework’s superior performance, especially on substantial intra-class variations and subtle inter-class differences in structurally similar characters. Future research directions include extending the framework to handle cursive script variations and developing specialized data augmentation techniques for degraded historical materials.
Data availability
The datasets for this study are available at https://drive.google.com/file/d/1giKePEa_trYbNMtetNHZfExAycOfs9-Z/view?usp=sharing.
Code availability
The code for AttGraph model, experimental scripts and declaration are available at https://github.com/wtq123-git/AttGraph.
References
Chen, X., Jin, L., Zhu, Y., Luo, C. & Wang, T. Text recognition in the wild: a survey. ACM Comput. Surv. 54, 42:1–42:35 (2022).
Narang, S. R., Jindal, M. K. & Kumar, M. Ancient text recognition: a review. Artif. Intell. Rev. 53, 5517–5558 (2020).
Zhang, D., Li, M., Tian, D., Song, L. & Shen, Y. Intelligent text recognition based on multi-feature channels network for construction quality control. Adv. Eng. Inform. 53, 101669 (2022).
Atakishiyev, S., Salameh, M., Yao, H. & Goebel, R. Explainable artificial intelligence for autonomous driving: a comprehensive overview and field guide for future research directions. IEEE Access 12, 101603–101625 (2024).
Tang, X. et al. Interacting-enhancing feature transformer for cross-modal remote-sensing image and text retrieval. IEEE Trans. Geosci. Remote. Sens. 61, 1–15 (2023).
Du, Y. et al. Context perception parallel decoder for scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 47, 4668–4683 (2025).
Jiang, Q., Wang, J., Peng, D., Liu, C. & Jin, L. Revisiting scene text recognition: a data perspective. In Proc. IEEE International Conference on Computer Vision 20543–20554 (IEEE, 2023).
Li, Y., Chen, D., Tang, T. & Shen, X. HTR-VT: handwritten text recognition with vision transformer. Pattern Recognit. 158, 110967 (2025).
AlKendi, W., Gechter, F., Heyberger, L. & Guyeux, C. Advancements and challenges in handwritten text recognition: a comprehensive survey. J. Imaging 10, 18 (2024).
Yang, X., Yin, C., Tian, D. & Liang, W. Rule-based perspective rectification for chinese text in natural scene images. Multim. Tools Appl. 80, 18243–18262 (2021).
Martin, M. et al. Data-efficient handwritten text recognition of diplomatic historical text. Multim. Tools Appl. 84, 39107–39122 (2025).
Qiao, W., Han, L. & Bi, X. Dual-branch similarity network with bi-correlative attention for few-shot shui script recognition. npj Herit. Sci. 13, 361 (2025).
Yang, L., Wu, Z., Xu, T., Du, J. & Wu, E. Easy recognition of artistic Chinese calligraphic characters. Vis. Comput. 39, 3755–3766 (2023).
Si, H. Analysis of calligraphy Chinese character recognition technology based on deep learning and computer-aided technology. Soft Comput. 28, 721–736 (2024).
Wang, Z., Li, Y. & Li, H. Chinese inscription restoration based on artificial intelligent models. npj Herit. Sci. 13, 326 (2025).
Wang, K., Yi, Y., Liu, J., Lu, L. & Song, Y. Multi-scene ancient Chinese text recognition. Neurocomputing 377, 64–72 (2020).
Wang, K., Yi, Y., Tang, Z. & Peng, J. Multi-scene ancient Chinese text recognition with deep coupled alignments. Appl. Soft Comput. 108, 107475 (2021).
Fu, Z., Chen, Q., Wang, M. & Huang, C. Transformer based on multi-scale local feature for colon cancer histopathological image classification. Biomed. Signal Process. Control. 100, 106970 (2025).
Banerjee, A., Shivakumara, P., Bhattacharya, S., Pal, U. & Liu, C. An end-to-end model for multi-view scene text recognition. Pattern Recognit. 149, 110206 (2024).
Chou, P., Lin, C. & Kao, W. A novel plug-in module for fine-grained visual classification. Preprint at https://doi.org/10.48550/arXiv.2202.03822 (2022).
Li, S., Zhou, C. & Wang, K. Wa-net: wavelet integrated attention network for silk and bamboo character recognition. Eng. Appl. Artif. Intell. 140, 109674 (2025).
Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems 5998–6008 (MIT Press, 2017).
Gu, X., Zhang, L., Guan, P., Lin, Q. & Yang, R. Removing ghost markers in trinocular optical tracking using convolutional neural network. Signal Image Video Process. 19, 464 (2025).
Yin, X. et al. Rsaformer: a method of polyp segmentation with region self-attention transformer. Comput. Biol. Med. 172, 108268 (2024).
Guo, M., Liu, Z., Mu, T. & Hu, S. Beyond self-attention: external attention using two linear layers for visual tasks. IEEE Trans. Pattern Anal. Mach. Intell. 45, 5436–5447 (2023).
Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. In Proc. International Conference on Learning Representations 1–21 (OpenReview.net, 2021).
Liu, Z. et al. Swin transformer: hierarchical vision transformer using shifted windows. In Proc. International Conference on Computer Vision 9992–10002 (IEEE, 2021).
Wu, K. et al. TinyViT: fast pretraining distillation for small vision transformers. In Proc. European Conference on Computer Vision Vol. 13681 68–85 (Springer, 2022).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In Proc. International Conference on Learning Representations (OpenReview.net, 2017).
Wang, B. et al. GazeGCN: gaze-aware graph convolutional network for text classification. Neurocomputing 611, 128680 (2025).
Wang, Y., Ye, S., Hou, W., Xu, D. & You, X. GKA: graph-guided knowledge association for fine-grained visual categorization. Neurocomputing 634, 129819 (2025).
Guo, J. et al. On the element-wise representation and reasoning in zero-shot image recognition: a systematic survey. Preprint at https://doi.org/10.48550/arXiv.2408.04879 (2024).
Wang, Z., Zhang, C., Lang, Q., Jin, L. & Qi, H. HUNet: hierarchical universal network for multi-type ancient Chinese character recognition. npj Herit. Sci. 13, 284 (2025).
Wang, P. et al. An open dataset for oracle bone character recognition and decipherment. Sci. Data 11, 976 (2024).
Acknowledgements
This work was supported by the Scientific Research Plan of Hubei Provincial Department of Education (Grant No. Q20231510), Graduate Innovative Fund of Wuhan Institute of Technology (project number CX2024174), and National Natural Science Foundation of China (Grant No. 52201415).
Author information
Contributions
Kaili Wang: Writing—original draft, methodology, supervision, funding acquisition; Tianquan Wu: Writing—original draft, methodology, software. Yuanlin Shi: Validation, data curation. Chen Chen: Resources, supervision, project administration, writing review and editing. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, K., Wu, T., Shi, Y. et al. AttGraph: disentangling confusable ancient Chinese characters via component-correlation synergy. npj Herit. Sci. 14, 7 (2026). https://doi.org/10.1038/s40494-025-02278-6