Introduction

As a foundational medium of human civilization, text has long been a critical epistemic vessel, enabling the codification and cross-generational transmission of cultural knowledge. With advances in image processing and computer vision, intelligent text recognition1,2,3 has become a key tool in domains like literary analysis, digitization, and assistive reading, and is extensively deployed in fields such as image retrieval, autonomous driving and accessible human-computer interaction4,5. Within this field, scene text recognition6,7 focuses on detecting and interpreting text from complex, cluttered visual environments, while handwritten text recognition8,9 addresses challenges from high variability and deformation caused by individualized writing styles.

As shown in Fig. 1, ancient text recognition presents further challenges: it not only involves material degradation (e.g., stains, fading, damage), but also exhibits significant paleographic variation across historical periods, from oracle bone script to cursive styles, leading to notable intra-class variability and inter-class similarity (Fig. 2). Recent text recognition research has expanded to natural scene text10, handwritten text11,12 and artistic calligraphy13,14,15. Traditional unsupervised methods work for specific tasks (e.g., document analysis, digitization) but lack accuracy and robustness because they cannot adaptively extract features under diverse visual conditions. While approaches such as the ensemble model MME16 and the domain-adaptive CA-CF17 have been explored, challenges persist in handling the complexities of ancient Chinese, especially the wide intra-class variations and subtle inter-class differences arising from historical writing styles and multi-scene distribution. Although recent methods18,19,20,21 help by letting networks localize key regions and capture fine-grained features to distinguish subtle inter-class differences, they neglect inter-region correlations, which are critical for robust understanding of characters with shared components (e.g., 叨/召/另, 旮/旯, 枷/架, 景/晾, 愀/愁, and 愉/愈).

Fig. 1: Multi-scene distribution of ancient Chinese characters.

This figure displays the various chirography styles of ancient Chinese characters and their corresponding distribution carriers, including seal script, clerical script, running script, and cursive script found on bamboo slips, stone carvings, plaques, bronze artifacts, ancient coins, and calligraphy and painting works.

Fig. 2: The inherent characteristics of Chinese characters.

a Intra-class variability; b Inter-class similarity.

With the advent of deep learning, transformers have revolutionized natural language processing and computer vision22, addressing limitations of traditional convolutional neural networks (CNNs)23 via dynamic self-attention24,25. Vision transformer (ViT)26 performs well in image tasks, and Swin transformer27 boosts efficiency with hierarchical features. However, their large parameter demands hinder deployment in resource-constrained scenarios—a gap filled by TinyViT28 through lightweight design and knowledge distillation.

Additionally, graph convolutional networks (GCNs)29 (e.g., GazeGCN30, GKA31, RGEN32) are powerful for processing graph-structured data and enhancing relation feature representation. Despite their ability to model contextual and structural relationships, applying GCNs to ancient Chinese characters remains challenging due to insufficient modeling of complex stroke relationships, poor handling of component spatial configurations, and limited adaptation to paleographic variations.

To address these challenges, we propose a novel framework that synergistically integrates a local content transformation (LCT) module for enhanced local region representation and a local correlation reasoning (LCR) module for comprehensive cross-patch relationship modeling. Built upon the lightweight TinyViT backbone, the framework offers the following contributions:

  1. The LCT module introduces an attention-guided mechanism that detects and enhances discriminative local features, significantly improving the recognition of visually similar characters across classes.

  2. The LCR module integrates GCNs with Transformer attention weights to learn and capture structural relationships among character components.

  3. Extensive experiments demonstrate that the proposed method outperforms previous approaches, achieving state-of-the-art (SOTA) performance with a 2.02% improvement in top-1 accuracy, and shows particular robustness in handling degraded samples and stylistically similar characters.

Methods

Inspired by the cognitive process of character learning, which proceeds from holistic observation through component analysis to structural relation building, AttGraph is proposed for homoglyphic inter-class recognition of ancient Chinese characters. As illustrated in Fig. 3, the architecture comprises three key components: a TinyViT backbone for global structure feature extraction; an LCT module for salient region enhancement, which adaptively detects and amplifies salient regions through learnable spatial attention weights; and an LCR module for spatial-semantic relationship modeling, which employs a GCN to model patch-wise spatial-semantic dependencies and uses Transformer attention maps to establish contextual correlation graphs.

Fig. 3: AttGraph model architecture, which comprises a TinyViT backbone, a Local Content Transformation (LCT) module, and a Local Correlation Reasoning (LCR) module.

The LCT module employs attention mechanisms to identify and enhance discriminative regions within character images, while the LCR module adopts a graph-based approach to model structural relationships between character components.

TinyViT backbone network

The TinyViT backbone employs a four-stage hierarchical architecture with progressively reduced feature resolution. The initial stage uses a convolutional module incorporating MBConv blocks for low-level feature extraction, combined with downsampling to enhance representation capability. The following three stages consist of different numbers of Transformer blocks, forming the BasicLayer. To optimize computational efficiency, the architecture implements shifted window-based local self-attention augmented with residual connections, maintaining performance while reducing resource requirements. Higher-order feature transformations are performed by an MLP module incorporating layer normalization and nonlinear activation. The outputs of the final Transformer module are processed by a classification head to generate prediction vectors, and the global structure loss is computed with cross-entropy, as shown in Eq. (1):

$${L}_{CE}=-\mathop{\sum }\limits_{i=1}^{C}{y}_{i}\log \left(\frac{{{\rm{e}}}^{{z}_{i}}}{{\sum }_{j=1}^{C}{{\rm{e}}}^{{z}_{j}}}\right)$$
(1)

where C denotes the number of classes, yi represents ground truth labels, and zi corresponds to the raw logits from the final Transformer module in TinyViT.
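
For illustration, a minimal PyTorch sketch of the global structure loss in Eq. (1) is given below; the variable names (`logits`, `targets`) are placeholders for this sketch rather than identifiers from the paper.

```python
# Minimal sketch of the global structure loss in Eq. (1), assuming integer
# class labels; `logits` stands in for the raw outputs z of the TinyViT
# classification head.
import torch
import torch.nn.functional as F

def global_structure_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over C classes: -sum_i y_i * log softmax(z)_i."""
    log_probs = F.log_softmax(logits, dim=-1)   # log of the softmax in Eq. (1)
    return F.nll_loss(log_probs, targets)       # equivalent to F.cross_entropy(logits, targets)

# Example: batch of 4 samples over C = 2884 character classes
logits = torch.randn(4, 2884)
targets = torch.randint(0, 2884, (4,))
loss = global_structure_loss(logits, targets)
```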

Local content transformation module

The LCT module employs attention mechanisms to identify and enhance discriminative regions within character images. Building upon the feature representations from the last two Transformer blocks, each containing H attention heads, the module computes attention weights through a scaled dot-product operation, as formulated in Eqs. (2) and (3).

$${\text{Attn}}_{i}^{(h)}=\frac{{Q}_{i}^{(h)}{({K}_{i}^{(h)})}^{\top }}{\sqrt{{d}_{k}}}$$
(2)
$${\alpha }_{i}={\text{softmax}}\left(\mathop{\mathop{\oplus }\limits^{H}}\limits_{h=1}{{\text{Attn}}}_{i}^{(h)}\right)$$
(3)

where \(Q\in {{\mathbb{R}}}^{N\times D}\) and \(K\in {{\mathbb{R}}}^{N\times D}\) represent the query and key vectors, respectively, with D denoting the feature dimension. Here, dk = D/H normalizes the dot-product results, preventing gradient vanishing issues, while h ∈ [1, H] indexes the attention heads. The concatenation operation combines the multi-head attention matrices, yielding a normalized attention matrix \({\alpha }_{i}\in {{\mathbb{R}}}^{N\times N}\) for each Transformer block i. These matrices capture the contextual relationships between all N image patches, effectively quantifying each patch’s contribution to the final classification decision.
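
The computation in Eqs. (2) and (3) can be sketched as follows. Since the text does not spell out how the ⊕ operator merges the H heads, this sketch averages the per-head attention logits before the softmax; that choice, like the tensor shapes, is an assumption.

```python
# Sketch of Eqs. (2)-(3): per-head scaled dot-product attention logits,
# merged across the H heads (here by averaging, an assumption) and
# normalized with a softmax over the patch dimension.
import torch

def block_attention_map(Q: torch.Tensor, K: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Q, K: (N, D) patch queries/keys of one Transformer block.
    Returns alpha: (N, N) normalized patch-to-patch attention weights."""
    N, D = Q.shape
    d_k = D // num_heads
    Qh = Q.view(N, num_heads, d_k).transpose(0, 1)       # (H, N, d_k)
    Kh = K.view(N, num_heads, d_k).transpose(0, 1)       # (H, N, d_k)
    attn = Qh @ Kh.transpose(-2, -1) / d_k ** 0.5        # Eq. (2): (H, N, N)
    alpha = torch.softmax(attn.mean(dim=0), dim=-1)      # Eq. (3): heads merged by averaging
    return alpha

# Example with 196 patches and an assumed feature dimension of 320
alpha_i = block_attention_map(torch.randn(196, 320), torch.randn(196, 320), num_heads=8)
```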

The attention weights from both blocks are aggregated through stacking and averaging to produce a composite attention weight map A. The average pixel value \(\overline{b}\) of the map is defined in Eq. (4):

$$\overline{b}=\frac{1}{h\times w}\mathop{\sum }\limits_{x=1}^{h}\mathop{\sum }\limits_{y=1}^{w}A(x,y)$$
(4)

where A(x, y) represents the attention value at spatial location (x, y), with h and w denoting the map dimensions. Regions with higher activation values correspond to semantically significant character components that most influence the model’s decisions.

To extract these discriminative regions, the module implements an adaptive thresholding mechanism, as denoted in Eq. (5).

$${\text{Mask}}_{(x,y)}^{loc}=\left\{\begin{array}{ll}1, & A(x,y) > \varepsilon \bar{b}\\ 0, & {\text{otherwise}}\end{array}\right.$$
(5)

where ε is an experimentally optimized scaling factor. The resulting binary mask isolates contiguous high-response regions, which are then cropped from the original image to generate focused feature areas. This process preserves the structural integrity of critical character components while eliminating irrelevant background information, significantly enhancing the model’s ability to discern subtle inter-class differences in ancient Chinese characters.
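
A possible implementation of the thresholding and cropping step in Eqs. (4) and (5) is sketched below; mapping the patch-grid mask back to pixel coordinates and resizing the crop to the input resolution are assumptions of this sketch, since the text only states that focused regions are cropped from the original image.

```python
# Sketch of Eqs. (4)-(5): threshold the composite attention map A at
# epsilon times its mean value, then crop the tightest box around the
# surviving high-response region.
import torch
import torch.nn.functional as F

def attention_crop(image: torch.Tensor, A: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """image: (3, H_img, W_img); A: (h, w) attention map on the patch grid."""
    b_mean = A.mean()                        # Eq. (4): average pixel value of A
    mask = A > eps * b_mean                  # Eq. (5): binary mask
    if not mask.any():                       # degenerate case: keep the full image
        return image
    ys, xs = torch.nonzero(mask, as_tuple=True)
    h, w = A.shape
    _, H_img, W_img = image.shape
    # Map patch-grid coordinates back to pixel coordinates (assumed mapping)
    y0, y1 = ys.min() * H_img // h, (ys.max() + 1) * H_img // h
    x0, x1 = xs.min() * W_img // w, (xs.max() + 1) * W_img // w
    crop = image[:, y0:y1, x0:x1]
    # Resize the focused region back to the input resolution (assumption)
    return F.interpolate(crop.unsqueeze(0), size=(H_img, W_img),
                         mode="bilinear", align_corners=False).squeeze(0)
```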

Local correlation reasoning module

The LCR module employs a graph-based approach to model structural relationships between character components. Considering that Chinese characters typically feature a centro-radial structure, with radicals and strokes arranged around core components, directional and relative positional relationships are parameterized directly in polar coordinates. The structural information is extracted by a relative coordinate predictor (RCP) based on the attention weight map A from the LCT module. The RCP establishes a polar coordinate system centered at the reference point P0 = (x0, y0), corresponding to the patch with the maximum average feature response. For each remaining patch, its polar coordinate pxy = (ρxy, θxy) is computed as in Eqs. (6) and (7).

$${\rho }_{xy}=\sqrt{{\left(\frac{x-{x}_{0}}{W}\right)}^{2}+{\left(\frac{y-{y}_{0}}{H}\right)}^{2}}$$
(6)
$${\theta }_{xy}=\frac{\arctan 2(y-{y}_{0},x-{x}_{0})+\pi }{2\pi }$$
(7)

where ρxy ≥ 0 represents the normalized radial distance and 0 ≤ θxy ≤ 1 denotes the normalized azimuthal angle. This polar-based representation improves over traditional Cartesian coordinates by providing a rotation-invariant and fine-grained description of relative spatial relationships.
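
The relative coordinate predictor of Eqs. (6) and (7) can be sketched as follows, assuming the attention map A is given on the patch grid; tensor shapes and function names are illustrative.

```python
# Sketch of the RCP: every patch is described by a normalized radius and
# angle relative to the highest-response patch P0 (Eqs. (6)-(7)).
import math
import torch

def polar_coordinates(A: torch.Tensor) -> torch.Tensor:
    """A: (H, W) attention map over the patch grid.
    Returns an (H*W, 2) tensor of (rho, theta) per patch."""
    H, W = A.shape
    idx = torch.argmax(A)                    # reference point P0 = (x0, y0)
    y0, x0 = idx // W, idx % W
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dx, dy = (xs - x0).float(), (ys - y0).float()
    rho = torch.sqrt((dx / W) ** 2 + (dy / H) ** 2)              # Eq. (6)
    theta = (torch.atan2(dy, dx) + math.pi) / (2 * math.pi)      # Eq. (7)
    return torch.stack([rho.flatten(), theta.flatten()], dim=-1)
```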

The polar coordinate features are integrated with attention-based positional weights to construct a relational graph structure, where nodes correspond to image patches and edges model spatial-semantic dependencies. This graph representation is processed through a GCN that simultaneously learns from both the topological connectivity patterns and localized node features. The resulting GCN embeddings are concatenated with the final feature representations from the TinyViT backbone, forming a unified feature space that effectively captures fine-grained structural relationships between character components while preserving global contextual understanding.
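
A minimal sketch of the graph construction and a single GCN propagation step is shown below. How the adjacency is derived from the attention matrix, the layer width, and the final fusion with backbone features are assumptions of this sketch, since those details are not specified in the text.

```python
# Minimal sketch of the LCR idea: patches become graph nodes, the attention
# matrix provides edge weights, and one GCN layer propagates the polar
# node features before fusion with the backbone representation.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        """X: (N, in_dim) node features; A: (N, N) attention-derived adjacency."""
        A_hat = A + torch.eye(A.size(0), device=A.device)      # add self-loops
        deg = A_hat.sum(dim=-1)
        D_inv_sqrt = torch.diag(deg.clamp(min=1e-6).rsqrt())
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt               # symmetric normalization
        return torch.relu(A_norm @ self.linear(X))

# Usage sketch: node features = (rho, theta) per patch, edges = attention weights
N = 196
node_feats = torch.rand(N, 2)        # polar coordinates from the RCP
adjacency = torch.rand(N, N)         # stand-in for the normalized attention matrix
gcn = SimpleGCNLayer(2, 64)
graph_embed = gcn(node_feats, adjacency)                       # (N, 64)
backbone_feat = torch.randn(576)                               # stand-in for TinyViT features
fused = torch.cat([graph_embed.mean(dim=0), backbone_feat])    # unified feature space
```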

Loss function and fusion

The total global-local consistency loss establishes dual supervision signals, where the global structure loss preserves holistic character morphology while the local discriminative loss enhances fine-grained feature learning. This complementary design ensures balanced optimization of macroscopic and microscopic characteristics, as formulated in Eq. (8):

$${L}_{glc}={\lambda }_{1}\cdot {L}_{gs}(y,{{y}^{{\prime} }}_{glb})+{\lambda }_{2}\cdot {L}_{ld}(y,{{y}^{{\prime} }}_{loc})$$
(8)

where \({{y}^{{\prime} }}_{glb}\) and \({{y}^{{\prime} }}_{loc}\) denote the predictions from the global structure and local discriminative branches, respectively. The learnable parameters λ1, λ2 balance the contribution of each branch during training.
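
A compact sketch of the loss in Eq. (8) is given below, with λ1 and λ2 implemented as learnable parameters as stated above; their initial values and the absence of explicit constraints are assumptions of this sketch.

```python
# Sketch of the global-local consistency loss in Eq. (8): a weighted sum of
# the global structure loss and the local discriminative loss, both
# cross-entropy here, with learnable branch weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.lambda1 = nn.Parameter(torch.tensor(1.0))  # initial value assumed
        self.lambda2 = nn.Parameter(torch.tensor(1.0))

    def forward(self, logits_glb, logits_loc, targets):
        loss_gs = F.cross_entropy(logits_glb, targets)  # global structure loss
        loss_ld = F.cross_entropy(logits_loc, targets)  # local discriminative loss
        return self.lambda1 * loss_gs + self.lambda2 * loss_ld
```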

The final prediction probability is obtained through adaptive fusion of both branches’ outputs, as formulated in Eq. (9).

$${\text{Fuse}}_{prob}={\omega }_{1}\cdot {P}_{glb}+{\omega }_{2}\cdot {P}_{loc}$$
(9)

where \({P}_{x}={\text{softmax}}({z}_{x})\) denotes the probability distribution of the corresponding branch (global or local), and the weighting parameters ω1 and ω2 dynamically balance the contributions of the two branches, with their sum constrained to 1 so that the output remains a valid probability distribution. This synergy strategy ensures robust performance across varying character conditions by leveraging both macroscopic structural patterns and microscopic discriminative features.
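
The fusion in Eq. (9) reduces to a convex combination of the two branch distributions, as in this brief sketch; ω1 = ω2 = 0.5 is the setting reported in the experiments below.

```python
# Sketch of Eq. (9): weighted fusion of the softmax probabilities from the
# global and local branches, with omega2 = 1 - omega1 so the weights sum to 1.
import torch

def fuse_predictions(z_glb: torch.Tensor, z_loc: torch.Tensor, omega1: float = 0.5) -> torch.Tensor:
    p_glb = torch.softmax(z_glb, dim=-1)
    p_loc = torch.softmax(z_loc, dim=-1)
    return omega1 * p_glb + (1.0 - omega1) * p_loc
```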

Results

Experimental setup

The experimental evaluation was conducted on the MACT dataset containing 2884 categories of ancient Chinese characters, with 106,696 training images and 14,315 test images. The training process incorporated random horizontal flipping and brightness adjustment for data augmentation, using SGD optimization with an initial learning rate of 0.001 decayed by a factor of 0.1 at epochs 15, 30, and 50. The model was trained with a weight decay of 5e-4 and a batch size of 8 on NVIDIA RTX 3090 GPUs using PyTorch, with key parameters including the LCT clipping threshold ε = 0.2 and balanced branch fusion weights ω1 = ω2 = 0.5.
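
The setup above corresponds roughly to the following PyTorch configuration sketch; the momentum value, brightness-jitter range, and total number of epochs are not reported and are therefore assumptions, and the model line is a placeholder.

```python
# Configuration sketch matching the reported setup: flip/brightness
# augmentation, SGD (lr 0.001, weight decay 5e-4), step decay by 0.1 at
# epochs 15/30/50. Momentum, jitter range, and epoch count are assumed.
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2),   # brightness adjustment (range assumed)
    transforms.ToTensor(),
])

model = torch.nn.Linear(10, 2884)             # placeholder for the AttGraph model
optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)
scheduler = MultiStepLR(optimizer, milestones=[15, 30, 50], gamma=0.1)

for epoch in range(60):                       # total epoch count assumed
    # ... one training epoch over the MACT training split (batch size 8) ...
    scheduler.step()
```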

Experimental results and analysis

Comparative analysis with state-of-the-art methods revealed significant improvements achieved by our approach, as shown in Table 1. The proposed AttGraph model attained 78.33% top-1 accuracy when initialized with ImageNet-22K pretrained weights, representing a 2.02% absolute improvement over the previous best method and a 1.26% gain over the TinyViT baseline. Notably, AttGraph exhibits lower data dispersion and a more concentrated accuracy distribution. These performance advantages are consistent across all evaluation metrics (top-2 to top-5) as well as all distribution scenes and calligraphic styles, as shown in Table 2. The corresponding standard deviations (SD) are provided in Table 1 and Fig. 4 to verify performance stability, which can be attributed to the effective integration of local feature enhancement and structural relationship modeling in our framework. The results in Table 3 demonstrate that AttGraph also outperforms previous methods on other datasets with distinct visual characteristics, further confirming its robustness and adaptability. While a recent study, HUNet33, reports competitive results, its model was trained on 7 million samples. When evaluated under equivalent training conditions using only the original MACT dataset, HUNet’s performance drops to 36.13% top-1 accuracy; because we were unable to contact the authors, comparative experiments on their self-built MTACCR dataset could not be conducted.

Fig. 4: Comparison of accuracy distribution and performance stability between the baseline and the proposed AttGraph.

Results obtained from the AttGraph approach demonstrate higher accuracy, smaller fluctuations, and more stable outcomes, which validates the effectiveness of the proposed synergistic design.

Table 1 Comparative experiment results (%) from different network structures
Table 2 Top-1 recognition performance comparison (%) under different calligraphic styles and distribution scenes
Table 3 Top-1 comparative experiment results (%) on other ancient Chinese text datasets, where 1781 and 1558 are the categories w/o traditional and Simplified Differentiation in HUST-OBC, and the best results are highlighted in bold

Ablation studies demonstrated the complementary contributions of the proposed modules, as shown in Table 4. Compared with the baseline TinyViT, which achieves 77.07% top-1 accuracy, improvements of 0.42% and 0.75% are obtained with the LCR and LCT modules, respectively, demonstrating that the attention-guided cropping mechanism effectively directs the network’s focus to the most discriminative local regions and that the graph-based reasoning enhances correlation modeling among components. Combining both modules yields a 1.26% improvement over the baseline, confirming the synergistic benefit of joint local content transformation and correlation reasoning. As formulated in Eq. (9), the best result is achieved when ω1 = ω2 = 0.5, i.e., when the two branches contribute equally, as shown in Fig. 5. This enhancement stems from the method’s ability to simultaneously focus on discriminative local regions while modeling their structural relationships.

Fig. 5: Comparison of accuracy obtained with different fusion weight parameters (ω), using ω1 = 0.5 as the reference (its accuracy corresponds to y = 0).

Accuracy results achieved with other ω values are lower than those obtained when ω1 = 0.5, indicating that the optimal synergistic performance is reached when the two branches are assigned equal weights.

Table 4 Ablation experiments

Visualization

Visual analysis demonstrates the model’s robust feature learning capabilities. As shown in Fig. 6, which examines the relationship between stroke complexity and recognition performance, the majority of Chinese characters (only the complex characters with more than 10 strokes are shown) are recognized with high confidence scores. Notably, the model maintains strong performance even for more complex characters with higher stroke counts, indicating its effectiveness in handling intricate character structures. The visualization reveals that while recognition difficulty generally increases with stroke number, the system consistently achieves reliable predictions across varying complexity levels.

Fig. 6: The correlation between stroke counts of characters, experiment accuracy and confidence.

The symbol size represents sample quantity and color intensity corresponds to confidence score.

The synergistic integration of the LCT and LCR modules enables the model to capture both discriminative local features and structural relationships, significantly enhancing its representational power. This is further illustrated by the t-SNE visualizations in Fig. 7, where the proposed method shows markedly improved feature-space organization compared to the baseline. Quantitative analysis reveals substantially tighter intra-class clustering and greater inter-class separation between cluster centroids. These improvements in feature-space geometry directly correlate with the model’s enhanced classification performance, particularly for visually similar character categories that require fine-grained discrimination.
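
For reference, a visualization of this kind can be produced with a standard t-SNE projection of the learned embeddings, as in the sketch below; the embedding source, dimensionality, and perplexity are assumptions.

```python
# Sketch of a t-SNE feature-space visualization: project penultimate-layer
# embeddings of test samples to 2-D and color points by character class.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.random.randn(1000, 576)          # stand-in for extracted embeddings
labels = np.random.randint(0, 20, size=1000)   # stand-in for character class ids

emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab20")
plt.title("t-SNE of character embeddings")
plt.show()
```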

Fig. 7: The t-SNE visualization results.

a Tested on the backbone model; b Illustrating the proposed method.

Character recognition results

The recognition results in Fig. 8 exhibit the model’s ability to handle the fundamental challenges in ancient text analysis. Figure 8a highlights successful recognition across substantial intra-class variations between historical and modern character forms, demonstrating remarkable robustness to stylistic differences spanning centuries of evolution. Conversely, Fig. 8b presents correctly classified examples of visually similar inter-class pairs, underscoring the method’s fine-grained discrimination capabilities.

Fig. 8: Correct recognition results.

a Notable intra-class discrepancies and b significant inter-class similarities, with the predicted top-1 result (confidence score) for each sample.

Figure 9 reveals the remaining challenges, with misclassifications primarily occurring in cases of (1) poor text-background contrast and extreme image degradation, such as (R2, C3), where a key part of the character is missing, (R3, C5), where only the right part is detected, and (R3, C7), where only the background texture is detected; (2) extreme exaggeration and personal writing style, such as (R1, C3), (R1, C8), (R2, C6), and (R2, C7), which is most common in cursive script, where only 59.82% accuracy is achieved; and (3) the lack of variant samples in the training set, such as (R1, C2), (R1, C4), and (R1, C5), which are variant forms of their corresponding standard characters. Many variant characters, which were used only in specific periods and not passed down, as well as traditional characters, are absent from the training set, leading to incorrect recognition.

Fig. 9: Misrecognition cases of the AttGraph model, with ground-truth and predicted top-1 results (confidence scores).

This figure highlights the key failure scenarios including severe image degradation, extreme stylistic variations, and insufficient variant character samples in the training set.

Discussion

This paper presents AttGraph, an enhanced TinyViT-based framework for homoglyphic ancient Chinese character recognition. The proposed method introduces an LCT module that adaptively enhances discriminative regions through learned attention mechanisms, and an LCR module that explicitly models structural dependencies between character components using GCNs. By synergistically integrating global visual features with localized structural representations, AttGraph effectively bridges the gap between fine-grained local analysis and holistic contextual understanding. Experimental results demonstrate the framework’s superior performance, especially on substantial intra-class variations and subtle inter-class differences in structurally similar characters. Future research directions include extending the framework to handle cursive script variations and developing specialized data augmentation techniques for degraded historical materials.