Introduction

As a foundational medium of human civilization, text has long been a critical epistemic vessel, enabling the codification and cross-generational transmission of cultural knowledge. With advances in image processing and computer vision, intelligent text recognition1,2,3 has become a key tool in domains like literary analysis, digitization, and assistive reading, and is extensively deployed in fields such as image retrieval, autonomous driving and accessible human-computer interaction4,5. Within this field, scene text recognition6,7 focuses on detecting and interpreting text from complex, cluttered visual environments, while handwritten text recognition8,9 addresses challenges from high variability and deformation caused by individualized writing styles.

As shown in Fig. 1, ancient text recognition presents further challenges: it not only involves material degradation (e.g., stains, fading, damage), but also exhibits significant paleographic variation across historical periods, from oracle bone script to cursive styles, leading to notable intra-class variability and inter-class similarity (Fig. 2). Recent text recognition research has expanded to natural scene text10, handwritten text11,12 and artistic calligraphy13,14,15. Traditional unsupervised methods work for specific tasks (e.g., document analysis, digitization) but lack accuracy and robustness because they cannot adaptively extract features under diverse visual conditions. While approaches such as the ensemble model MME16 and the domain-adaptive CA-CF17 have been explored, challenges persist in handling the complexities of ancient Chinese, especially the wide intra-class variations and subtle inter-class differences arising from historical writing styles and multi-scene distribution. Although recent methods18,19,20,21 help by letting networks localize key regions and capture fine-grained features to distinguish subtle inter-class differences, they neglect inter-region correlations, which are critical for robust understanding of characters with shared components (e.g., 叨/召/另, 旮/旯, 枷/架, 景/晾, 愀/愁, and 愉/愈).

Fig. 1: Multi-scene distribution of ancient Chinese characters.

This figure displays the various chirography styles of ancient Chinese characters and their corresponding distribution carriers, including seal script, clerical script, running script, and cursive script found on bamboo slips, stone carvings, plaques, bronze artifacts, ancient coins, and calligraphy and painting works.

Fig. 2: The inherent characteristics of Chinese characters.

a Intra-class variability; b Inter-class similarity.

With the advent of deep learning, transformers have revolutionized natural language processing and computer vision22, addressing limitations of traditional convolutional neural networks (CNNs)23 via dynamic self-attention24,25. Vision transformer (ViT)26 performs well in image tasks, and Swin transformer27 boosts efficiency with hierarchical features. However, their large parameter demands hinder deployment in resource-constrained scenarios—a gap filled by TinyViT28 through lightweight design and knowledge distillation.

Additionally, graph convolutional networks (GCNs)29 (e.g., GazeGCN30, GKA31, RGEN32) are powerful for processing graph-structured data and enhancing relation feature representation. Despite their ability to model contextual and structural relationships, applying GCNs to ancient Chinese characters remains challenging due to insufficient modeling of complex stroke relationships, poor handling of component spatial configurations, and limited adaptation to paleographic variations.

To address these challenges, we propose a novel framework that synergistically integrates a local content transformation (LCT) module for enhanced local region representation and a local correlation reasoning (LCR) module for comprehensive cross-patch relationship modeling. Built upon the lightweight TinyViT backbone, the framework offers the following contributions:

  1. The LCT module introduces an attention-guided mechanism that detects and enhances discriminative local features, significantly improving the recognition of visually similar characters across classes.

  2. The LCR module integrates GCNs with Transformer attention weights to learn and capture structural relationships among character components.

  3. Extensive experiments demonstrate that the proposed method outperforms previous approaches, achieving state-of-the-art (SOTA) performance with a 2.02% improvement in top-1 accuracy, and shows particular robustness in handling degraded samples and stylistically similar characters.

Methods

Inspired by the cognitive process of character learning, which proceeds from holistic observation through component analysis to structural relation building, AttGraph is proposed for homoglyphic inter-class recognition of ancient Chinese characters. As illustrated in Fig. 3, the architecture comprises three key components: a TinyViT backbone for global structure feature extraction; an LCT module for salient region enhancement, which adaptively detects and amplifies salient regions through learnable spatial attention weights; and an LCR module for spatial-semantic relationship modeling, which employs a GCN to model patch-wise spatial-semantic dependencies and uses Transformer attention maps to establish contextual correlation graphs.

Fig. 3: AttGraph model architecture, which comprises a TinyViT backbone, a Local Content Transformation (LCT) module, and a Local Correlation Reasoning (LCR) module.

The LCT module employs attention mechanisms to identify and enhance discriminative regions within character images, while the LCR module adopts a graph-based approach to model structural relationships between character components.

TinyViT backbone network

The TinyViT backbone employs a four-stage hierarchical architecture with progressively reduced feature resolution. The initial stage uses a convolutional module incorporating MBConv blocks for low-level feature extraction, combined with downsampling to enhance representation capability. The following three stages consist of different numbers of Transformer blocks, forming the BasicLayer. To optimize computational efficiency, the architecture implements shifted window-based local self-attention augmented with residual connections, maintaining performance while reducing resource requirements. Higher-order feature transformations are performed by an MLP module incorporating layer normalization and nonlinear activation. The outputs of the final Transformer module are processed by a classification head to generate prediction vectors, and the global structure loss is computed with cross-entropy, as shown in Eq. (1):

$${L}_{CE}=-\mathop{\sum }\limits_{i=1}^{C}{y}_{i}\log \left(\frac{{{\rm{e}}}^{{z}_{i}}}{{\sum }_{j=1}^{C}{{\rm{e}}}^{{z}_{j}}}\right)$$
(1)

where C denotes the number of classes, yi represents ground truth labels, and zi corresponds to the raw logits from the final Transformer module in TinyViT.
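
For illustration, a minimal PyTorch sketch of the global structure loss in Eq. (1) is given below; the variable names (`logits`, `targets`) are placeholders for this sketch rather than identifiers from the paper.

```python
# Minimal sketch of the global structure loss in Eq. (1), assuming integer
# class labels; `logits` stands in for the raw outputs z of the TinyViT
# classification head.
import torch
import torch.nn.functional as F

def global_structure_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over C classes: -sum_i y_i * log softmax(z)_i."""
    log_probs = F.log_softmax(logits, dim=-1)   # log of the softmax in Eq. (1)
    return F.nll_loss(log_probs, targets)       # equivalent to F.cross_entropy(logits, targets)

# Example: batch of 4 samples over C = 2884 character classes
logits = torch.randn(4, 2884)
targets = torch.randint(0, 2884, (4,))
loss = global_structure_loss(logits, targets)
```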

Local content transformation module

The LCT module employs attention mechanisms to identify and enhance discriminative regions within character images. Building upon the feature representations from the last two Transformer blocks, each containing H attention heads, the module computes attention weights through a scaled dot-product operation, as formulated in Eqs. (2) and (3).

$${\text{Attn}}_{i}^{(h)}=\frac{{Q}_{i}^{(h)}{({K}_{i}^{(h)})}^{\top }}{\sqrt{{d}_{k}}}$$
(2)
$${\alpha }_{i}={\text{softmax}}\left(\mathop{\mathop{\oplus }\limits^{H}}\limits_{h=1}{{\text{Attn}}}_{i}^{(h)}\right)$$
(3)

where \(Q\in {{\mathbb{R}}}^{N\times D}\) and \(K\in {{\mathbb{R}}}^{N\times D}\) represent the query and key vectors, respectively, with D denoting the feature dimension. Here, dk = D/H normalizes the dot-product results, preventing gradient vanishing issues, while h ∈ [1, H] indexes the attention heads. The concatenation operation combines the multi-head attention matrices, yielding a normalized attention matrix \({\alpha }_{i}\in {{\mathbb{R}}}^{N\times N}\) for each Transformer block i. These matrices capture the contextual relationships between all N image patches, effectively quantifying each patch’s contribution to the final classification decision.
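
The computation in Eqs. (2) and (3) can be sketched as follows. Since the text does not spell out how the ⊕ operator merges the H heads, this sketch averages the per-head attention logits before the softmax; that choice, like the tensor shapes, is an assumption.

```python
# Sketch of Eqs. (2)-(3): per-head scaled dot-product attention logits,
# merged across the H heads (here by averaging, an assumption) and
# normalized with a softmax over the patch dimension.
import torch

def block_attention_map(Q: torch.Tensor, K: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Q, K: (N, D) patch queries/keys of one Transformer block.
    Returns alpha: (N, N) normalized patch-to-patch attention weights."""
    N, D = Q.shape
    d_k = D // num_heads
    Qh = Q.view(N, num_heads, d_k).transpose(0, 1)       # (H, N, d_k)
    Kh = K.view(N, num_heads, d_k).transpose(0, 1)       # (H, N, d_k)
    attn = Qh @ Kh.transpose(-2, -1) / d_k ** 0.5        # Eq. (2): (H, N, N)
    alpha = torch.softmax(attn.mean(dim=0), dim=-1)      # Eq. (3): heads merged by averaging
    return alpha

# Example with 196 patches and an assumed feature dimension of 320
alpha_i = block_attention_map(torch.randn(196, 320), torch.randn(196, 320), num_heads=8)
```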

The attention weights from both blocks are aggregated through stacking and averaging to produce a composite attention weight map A. The average pixel value \(\overline{b}\) of the map is defined in Eq. (4):

$$\overline{b}=\frac{1}{h\times w}\mathop{\sum }\limits_{x=1}^{h}\mathop{\sum }\limits_{y=1}^{w}A(x,y)$$
(4)

where A(x, y) represents the attention value at spatial location (x, y), with h and w denoting the map dimensions. Regions with higher activation values correspond to semantically significant character components that most influence the model’s decisions.

To extract these discriminative regions, the module implements an adaptive thresholding mechanism, as denoted in Eq. (5).

$${\text{Mask}}_{(x,y)}^{loc}=\left\{\begin{array}{ll}1, & A(x,y) > \varepsilon \bar{b}\\ 0, & {\text{otherwise}}\end{array}\right.$$
(5)

where ε is an experimentally optimized scaling factor. The resulting binary mask isolates contiguous high-response regions, which are then cropped from the original image to generate focused feature areas. This process preserves the structural integrity of critical character components while eliminating irrelevant background information, significantly enhancing the model’s ability to discern subtle inter-class differences in ancient Chinese characters.
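
A possible implementation of the thresholding and cropping step in Eqs. (4) and (5) is sketched below; mapping the patch-grid mask back to pixel coordinates and resizing the crop to the input resolution are assumptions of this sketch, since the text only states that focused regions are cropped from the original image.

```python
# Sketch of Eqs. (4)-(5): threshold the composite attention map A at
# epsilon times its mean value, then crop the tightest box around the
# surviving high-response region.
import torch
import torch.nn.functional as F

def attention_crop(image: torch.Tensor, A: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """image: (3, H_img, W_img); A: (h, w) attention map on the patch grid."""
    b_mean = A.mean()                        # Eq. (4): average pixel value of A
    mask = A > eps * b_mean                  # Eq. (5): binary mask
    if not mask.any():                       # degenerate case: keep the full image
        return image
    ys, xs = torch.nonzero(mask, as_tuple=True)
    h, w = A.shape
    _, H_img, W_img = image.shape
    # Map patch-grid coordinates back to pixel coordinates (assumed mapping)
    y0, y1 = ys.min() * H_img // h, (ys.max() + 1) * H_img // h
    x0, x1 = xs.min() * W_img // w, (xs.max() + 1) * W_img // w
    crop = image[:, y0:y1, x0:x1]
    # Resize the focused region back to the input resolution (assumption)
    return F.interpolate(crop.unsqueeze(0), size=(H_img, W_img),
                         mode="bilinear", align_corners=False).squeeze(0)
```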

Local correlation reasoning module

The LCR module employs a graph-based approach to model structural relationships between character components. Considering that Chinese characters typically feature a centro-radial structure, with radicals and strokes arranged around core components, directional and relative positional relationships are parameterized directly in polar coordinates. The structural information is extracted by a relative coordinate predictor (RCP) based on the attention weight map A from the LCT module. The RCP establishes a polar coordinate system centered at the reference point P0 = (x0, y0), corresponding to the patch with the maximum average feature response. For each remaining patch, its polar coordinate pxy = (ρxy, θxy) is computed as in Eqs. (6) and (7).

$${\rho }_{xy}=\sqrt{{\left(\frac{x-{x}_{0}}{W}\right)}^{2}+{\left(\frac{y-{y}_{0}}{H}\right)}^{2}}$$
(6)
$${\theta }_{xy}=\frac{\arctan 2(y-{y}_{0},x-{x}_{0})+\pi }{2\pi }$$
(7)

where ρxy ≥ 0 represents the normalized radial distance and 0 ≤ θxy ≤ 1 denotes the normalized azimuthal angle. This polar-based representation improves over traditional Cartesian coordinates by providing a rotation-invariant and fine-grained description of relative spatial relationships.
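
The relative coordinate predictor of Eqs. (6) and (7) can be sketched as follows, assuming the attention map A is given on the patch grid; tensor shapes and function names are illustrative.

```python
# Sketch of the RCP: every patch is described by a normalized radius and
# angle relative to the highest-response patch P0 (Eqs. (6)-(7)).
import math
import torch

def polar_coordinates(A: torch.Tensor) -> torch.Tensor:
    """A: (H, W) attention map over the patch grid.
    Returns an (H*W, 2) tensor of (rho, theta) per patch."""
    H, W = A.shape
    idx = torch.argmax(A)                    # reference point P0 = (x0, y0)
    y0, x0 = idx // W, idx % W
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dx, dy = (xs - x0).float(), (ys - y0).float()
    rho = torch.sqrt((dx / W) ** 2 + (dy / H) ** 2)              # Eq. (6)
    theta = (torch.atan2(dy, dx) + math.pi) / (2 * math.pi)      # Eq. (7)
    return torch.stack([rho.flatten(), theta.flatten()], dim=-1)
```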

The polar coordinate features are integrated with attention-based positional weights to construct a relational graph structure, where nodes correspond to image patches and edges model spatial-semantic dependencies. This graph representation is processed through a GCN that simultaneously learns from both the topological connectivity patterns and localized node features. The resulting GCN embeddings are concatenated with the final feature representations from the TinyViT backbone, forming a unified feature space that effectively captures fine-grained structural relationships between character components while preserving global contextual understanding.
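
A minimal sketch of the graph construction and a single GCN propagation step is shown below. How the adjacency is derived from the attention matrix, the layer width, and the final fusion with backbone features are assumptions of this sketch, since those details are not specified in the text.

```python
# Minimal sketch of the LCR idea: patches become graph nodes, the attention
# matrix provides edge weights, and one GCN layer propagates the polar
# node features before fusion with the backbone representation.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        """X: (N, in_dim) node features; A: (N, N) attention-derived adjacency."""
        A_hat = A + torch.eye(A.size(0), device=A.device)      # add self-loops
        deg = A_hat.sum(dim=-1)
        D_inv_sqrt = torch.diag(deg.clamp(min=1e-6).rsqrt())
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt               # symmetric normalization
        return torch.relu(A_norm @ self.linear(X))

# Usage sketch: node features = (rho, theta) per patch, edges = attention weights
N = 196
node_feats = torch.rand(N, 2)        # polar coordinates from the RCP
adjacency = torch.rand(N, N)         # stand-in for the normalized attention matrix
gcn = SimpleGCNLayer(2, 64)
graph_embed = gcn(node_feats, adjacency)                       # (N, 64)
backbone_feat = torch.randn(576)                               # stand-in for TinyViT features
fused = torch.cat([graph_embed.mean(dim=0), backbone_feat])    # unified feature space
```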

Loss function and fusion

The total global-local consistency loss establishes dual supervision signals, where the global structure loss preserves holistic character morphology while the local discriminative loss enhances fine-grained feature learning. This complementary design ensures balanced optimization of macroscopic and microscopic characteristics, as formulated in Eq. (8):

$${L}_{glc}={\lambda }_{1}\cdot {L}_{gs}(y,{{y}^{{\prime} }}_{glb})+{\lambda }_{2}\cdot {L}_{ld}(y,{{y}^{{\prime} }}_{loc})$$
(8)

where \({{y}^{{\prime} }}_{glb}\) and \({{y}^{{\prime} }}_{loc}\) denote the predictions from the global structure and local discriminative branches, respectively. The learnable parameters λ1, λ2 balance the contribution of each branch during training.
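
A compact sketch of the loss in Eq. (8) is given below, with λ1 and λ2 implemented as learnable parameters as stated above; their initial values and the absence of explicit constraints are assumptions of this sketch.

```python
# Sketch of the global-local consistency loss in Eq. (8): a weighted sum of
# the global structure loss and the local discriminative loss, both
# cross-entropy here, with learnable branch weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.lambda1 = nn.Parameter(torch.tensor(1.0))  # initial value assumed
        self.lambda2 = nn.Parameter(torch.tensor(1.0))

    def forward(self, logits_glb, logits_loc, targets):
        loss_gs = F.cross_entropy(logits_glb, targets)  # global structure loss
        loss_ld = F.cross_entropy(logits_loc, targets)  # local discriminative loss
        return self.lambda1 * loss_gs + self.lambda2 * loss_ld
```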

The final prediction probability is obtained through adaptive fusion of both branches’ outputs, as formulated in Eq. (9).

$${\text{Fuse}}_{prob}={\omega }_{1}\cdot {P}_{glb}+{\omega }_{2}\cdot {P}_{loc}$$
(9)

where \({P}_{x}={\text{softmax}}({z}_{x})\) denotes the probability distribution of the corresponding branch (global or local), and the weighting parameters ω1 and ω2 dynamically balance the contributions of the two branches, with their sum constrained to 1 so that the output remains a valid probability distribution. This synergy strategy ensures robust performance across varying character conditions by leveraging both macroscopic structural patterns and microscopic discriminative features.
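
The fusion in Eq. (9) reduces to a convex combination of the two branch distributions, as in this brief sketch; ω1 = ω2 = 0.5 is the setting reported in the experiments below.

```python
# Sketch of Eq. (9): weighted fusion of the softmax probabilities from the
# global and local branches, with omega2 = 1 - omega1 so the weights sum to 1.
import torch

def fuse_predictions(z_glb: torch.Tensor, z_loc: torch.Tensor, omega1: float = 0.5) -> torch.Tensor:
    p_glb = torch.softmax(z_glb, dim=-1)
    p_loc = torch.softmax(z_loc, dim=-1)
    return omega1 * p_glb + (1.0 - omega1) * p_loc
```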

Results

Experimental setup

The experimental evaluation was conducted on the MACT dataset containing 2884 categories of ancient Chinese characters, with 106,696 training images and 14,315 test images. The training process incorporated random horizontal flipping and brightness adjustment for data augmentation, using SGD optimization with an initial learning rate of 0.001 decayed by a factor of 0.1 at epochs 15, 30, and 50. The model was trained with a weight decay of 5e-4 and a batch size of 8 on NVIDIA RTX 3090 GPUs using PyTorch, with key parameters including the LCT clipping threshold ε = 0.2 and balanced branch fusion weights ω1 = ω2 = 0.5.
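
The setup above corresponds roughly to the following PyTorch configuration sketch; the momentum value, brightness-jitter range, and total number of epochs are not reported and are therefore assumptions, and the model line is a placeholder.

```python
# Configuration sketch matching the reported setup: flip/brightness
# augmentation, SGD (lr 0.001, weight decay 5e-4), step decay by 0.1 at
# epochs 15/30/50. Momentum, jitter range, and epoch count are assumed.
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2),   # brightness adjustment (range assumed)
    transforms.ToTensor(),
])

model = torch.nn.Linear(10, 2884)             # placeholder for the AttGraph model
optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)
scheduler = MultiStepLR(optimizer, milestones=[15, 30, 50], gamma=0.1)

for epoch in range(60):                       # total epoch count assumed
    # ... one training epoch over the MACT training split (batch size 8) ...
    scheduler.step()
```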

Experimental results and analysis

Comparative analysis with state-of-the-art methods revealed significant improvements achieved by our approach, as shown in Table 1. The proposed AttGraph model attained 78.33% top-1 accuracy when initialized with ImageNet-22K pretrained weights, representing a 2.02% absolute improvement over the previous best method and a 1.26% gain over the TinyViT baseline. Notably, AttGraph exhibits lower data dispersion and a more concentrated accuracy distribution. These performance advantages are consistent across all evaluation metrics (top-2 to top-5) as well as all distribution scenes and calligraphic styles, as shown in Table 2. The corresponding standard deviations (SD) are provided in Table 1 and Fig. 4 to verify performance stability, which can be attributed to the effective integration of local feature enhancement and structural relationship modeling in our framework. The results in Table 3 demonstrate that AttGraph also outperforms previous methods on other datasets with distinct visual characteristics, further confirming its robustness and adaptability. While a recent study, HUNet33, reports competitive results, its model was trained on 7 million samples. When evaluated under equivalent training conditions using only the original MACT dataset, HUNet’s performance drops to 36.13% top-1 accuracy; because we were unable to contact the authors, comparative experiments on their self-built MTACCR dataset could not be conducted.

Fig. 4: Comparison of accuracy distribution and performance stability between the baseline and the proposed AttGraph.

Results obtained from the AttGraph approach demonstrate higher accuracy, smaller fluctuations, and more stable outcomes, which validates the effectiveness of the proposed synergistic design.

Table 1 Comparative experiment results (%) from different network structures
Table 2 Top-1 recognition performance comparison (%) under different calligraphic styles and distribution scenes
Table 3 Top-1 comparative experiment results (%) on other ancient Chinese text datasets, where 1781 and 1558 are the categories w/o traditional and Simplified Differentiation in HUST-OBC, and the best results are highlighted in bold

Ablation studies demonstrated the complementary contributions of the proposed modules, as shown in Table 4. Compared with the baseline TinyViT, which achieves 77.07% top-1 accuracy, improvements of 0.42% and 0.75% are obtained with the LCR and LCT modules, respectively, demonstrating that the attention-guided cropping mechanism effectively directs the network’s focus to the most discriminative local regions and that the graph-based reasoning enhances correlation modeling among components. Combining both modules yields a 1.26% improvement over the baseline, confirming the synergistic benefit of joint local content transformation and correlation reasoning. As formulated in Eq. (9), the best result is achieved when ω1 = ω2 = 0.5, i.e., when the two branches contribute equally, as shown in Fig. 5. This enhancement stems from the method’s ability to simultaneously focus on discriminative local regions while modeling their structural relationships.

Fig. 5: Comparison of accuracy obtained with different fusion weight parameters (ω), using ω1 = 0.5 as the reference (its accuracy corresponds to y = 0).

Accuracy results achieved with other ω values are lower than those obtained when ω1 = 0.5, indicating that the optimal synergistic performance is reached when the two branches are assigned equal weights.

Table 4 Ablation experiments

Visualization

Visual analysis demonstrates the model’s robust feature learning capabilities. As shown in Fig. 6, which examines the relationship between stroke complexity and recognition performance, the majority of Chinese characters (only the complex characters with more than 10 strokes are shown) are recognized with high confidence scores. Notably, the model maintains strong performance even for more complex characters with higher stroke counts, indicating its effectiveness in handling intricate character structures. The visualization reveals that while recognition difficulty generally increases with stroke number, the system consistently achieves reliable predictions across varying complexity levels.

Fig. 6: The correlation between stroke counts of characters, experiment accuracy and confidence.

The symbol size represents sample quantity and color intensity corresponds to confidence score.

The synergistic integration of the LCT and LCR modules enables the model to capture both discriminative local features and structural relationships, significantly enhancing its representational power. This is further illustrated by the t-SNE visualizations in Fig. 7, where the proposed method shows markedly improved feature-space organization compared to the baseline. Quantitative analysis reveals substantially tighter intra-class clustering and greater inter-class separation between cluster centroids. These improvements in feature-space geometry directly correlate with the model’s enhanced classification performance, particularly for visually similar character categories that require fine-grained discrimination.
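
For reference, a visualization of this kind can be produced with a standard t-SNE projection of the learned embeddings, as in the sketch below; the embedding source, dimensionality, and perplexity are assumptions.

```python
# Sketch of a t-SNE feature-space visualization: project penultimate-layer
# embeddings of test samples to 2-D and color points by character class.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.random.randn(1000, 576)          # stand-in for extracted embeddings
labels = np.random.randint(0, 20, size=1000)   # stand-in for character class ids

emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab20")
plt.title("t-SNE of character embeddings")
plt.show()
```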

Fig. 7: The t-SNE visualization results.

a Tested on the backbone model; b Illustrating the proposed method.

Character recognition results

The recognition results in Fig. 8 exhibit the model’s ability to handle the fundamental challenges in ancient text analysis. Figure 8a highlights successful recognition across substantial intra-class variations between historical and modern character forms, demonstrating remarkable robustness to stylistic differences spanning centuries of evolution. Conversely, Fig. 8b presents correctly classified examples of visually similar inter-class pairs, underscoring the method’s fine-grained discrimination capabilities.

Fig. 8: Correct recognition results.

a Notable intra-class discrepancies and b significant inter-class similarities, with the predicted top-1 result (confidence score) for each sample.

Figure 9 reveals the remaining challenges, with misclassifications primarily occurring in cases of (1) poor text-background contrast and extreme image degradation, such as (R2, C3), where a key part of the character is missing, (R3, C5), where only the right part is detected, and (R3, C7), where only the background texture is detected; (2) extreme exaggeration and personal writing style, such as (R1, C3), (R1, C8), (R2, C6), and (R2, C7), which is most common in cursive script, where only 59.82% accuracy is achieved; and (3) the lack of variant samples in the training set, such as (R1, C2), (R1, C4), and (R1, C5), which are variant forms of their corresponding standard characters. Many variant characters, which were used only in specific periods and not passed down, as well as traditional characters, are absent from the training set, leading to incorrect recognition.

Fig. 9: Misrecognition cases of the AttGraph model, with ground-truth and predicted top-1 results (confidence scores).

This figure highlights the key failure scenarios including severe image degradation, extreme stylistic variations, and insufficient variant character samples in the training set.

Discussion

This paper presents AttGraph, an enhanced TinyViT-based framework for homoglyphic ancient Chinese character recognition. The proposed method introduces an LCT module that adaptively enhances discriminative regions through learned attention mechanisms, and an LCR module that explicitly models structural dependencies between character components using GCNs. By synergistically integrating global visual features with localized structural representations, AttGraph effectively bridges the gap between fine-grained local analysis and holistic contextual understanding. Experimental results demonstrate the framework’s superior performance, especially on substantial intra-class variations and subtle inter-class differences in structurally similar characters. Future research directions include extending the framework to handle cursive script variations and developing specialized data augmentation techniques for degraded historical materials.