Fine grained representation learning for low resource Yi script detection and dataset construction

Sun, Haipeng; Ding, Xueyan; Yu, Hua; Yang, Zukang; Zhang, Jianxin

doi:10.1038/s40494-026-02418-6

Article
Open access
Published: 26 March 2026

Haipeng Sun^1,2,3,
Xueyan Ding³,
Hua Yu⁴,
Zukang Yang⁵ &
…
Jianxin Zhang³

npj Heritage Science volume 14, Article number: 183 (2026)

Abstract

Yi character detection in historical documents is challenged by complex morphology, dense strokes, and multi-scale layouts. To address these issues, we propose a novel fine-grained representation learning framework for Yi character detection (FGRL-YiNet) that integrates dynamic convolution and adaptive multi-scale fusion modules. This design enables the model to adaptively refine receptive fields to capture elusive stroke topology while suppressing background interference, directly addressing the fundamental limitations of static feature extraction in existing methods. Integrated with multi-scale feature fusion and a differentiable binarization head, our end-to-end system achieves robust character localization under severe degradation. Furthermore, we develop the YiPrint-694 dataset to support training in this low-resource domain. Extensive experiments show that FGRL-YiNet significantly outperforms state-of-the-art models on Yi benchmarks, particularly for weak strokes, and demonstrates strong generalizability on the public MTHv2 dataset. This work establishes a benchmark and architectural paradigm for underserved scripts, enabling practical solutions for digital heritage preservation.

Introduction

The Yi ethnic group, the sixth-largest in China, has preserved a vast corpus of historical texts spanning medicine, astronomy, religion, and other domains, making the digitization of Yi script essential for cultural preservation and interdisciplinary research^1,2. However, this task poses unique challenges that make it distinctly more difficult than recognizing standardized scripts such as Latin or Chinese. As shown in Fig. 1, the left panel presents a densely arranged Yi dictionary page with a complex multi-column layout, while the right panel isolates individual characters for closer inspection. These examples reveal the defining traits of Yi script: densely curved, frequently interweaving strokes that create complex topologies, extreme visual similarity between distinct characters, and subtle glyph variations across historical periods. Unlike the relatively isolated strokes of Chinese or the simple character structures of Latin, these characteristics blur inter-character boundaries and introduce a high degree of intra-class variance.

**Fig. 1: Challenges in Yi script detection.**

These challenges suggest that Yi script recognition presents a more complex and demanding problem than the recognition of most common scripts. Conventional detectors that rely on static convolution kernels, designed for consistent feature extraction, struggle to capture the diverse and evolving glyph patterns. This high feature diversity necessitates an adaptive learning mechanism. Dynamic convolution (DConv) addresses this fundamental need by dynamically aggregating multiple kernels based on the input, thereby specializing its feature extraction for the specific topological nuances of each character instance. This adaptive capability stands in direct contrast to the rigidity of static kernels, making it particularly well-suited for modeling the subtle topological variations and intricate inter-character relationships inherent to the Yi script^3,4. Consequently, the architectural solutions developed to overcome the extreme difficulties of Yi character detection are inherently robust and possess strong potential for transfer learning, making them applicable to a wide range of other document analysis and script recognition tasks. This combination of challenges, rarely encountered in common script recognition, substantially increases the risk of misclassification and demands such specialized architectural solutions for reliable detection^5,6.

Traditional methods and early deep learning approaches to Yi script detection encounter fundamental limitations due to the script’s unique structural and spatial characteristics⁷. Yi characters often exhibit highly irregular proportions, complex curved strokes, and inconsistent inter-character spacing, yet conventional clustering or projection-based methods rely on fixed heuristics that assume uniformity of shape and alignment. As a result, these techniques frequently mis-segment strokes, merge adjacent symbols, or fail altogether when confronted with dense arrangements, multi-directional orientations, or degraded manuscripts. Attempts to improve performance, such as the entropy-enhanced clustering of Jia et al.⁸, or the RnnNet-Yi and ParallelRnnNet-Yi models of Yin et al.⁹, mitigate annotation burdens and capture local curve information, but they remain constrained by local grouping assumptions and regression-based segmentation rules. Consequently, they struggle to generalize beyond isolated or clean samples, leading to fragmentation, over- or under-segmentation, and poor robustness when characters overlap, appear faint, or are distorted by age and noise.

Recently, substantial progress in text detection for languages such as Chinese and English has produced a wealth of techniques that offer valuable insights for Yi character detection^{10,11,12,13,14}. Many of these methods are based on Convolutional Neural Networks (CNNs), which excel at extracting local spatial features from images^15,16,17,18. For example, Long et al.¹⁹ proposed a Fully Convolutional Network (FCN) that can detect text at arbitrary angles by identifying character centers and estimating text-line dimensions. Renton et al.²⁰ introduced a text-detection approach based on dilated convolutions that is particularly effective at annotating small targets and detecting fine text lines. Similarly, Tian et al.²¹ designed the Connectionist Text Proposal Network (CTPN), which uses anchor boxes for candidate selection and processes images of arbitrary size to facilitate the direct localization of text lines. Zhou et al.²² advanced the field with the Efficient and Accurate Scene Text Detection (EAST) framework, capable of accurately detecting word-level and quadrilateral-shaped text, including curved text lines. Wang et al.²³ proposed the Progressive Scale Expansion Network (PSENet), which employs adaptive polygonal representations to handle complex text regions, and Liao et al.^24,25 developed the Differentiable Binarization Network (DBNet) and its improved variant DBNet++, both of which enable rapid and accurate text-region segmentation through differentiable binarization and multiscale feature fusion. More recently, Zhu et al.²⁶ presented the Fourier Contour Embedding Network (FCENet), which models arbitrary-shaped text bounding boxes with high precision. 
Building upon these foundations, recent advances have explored cross-modal paradigms: Jiang et al.²⁷ proposed T-Rex2, which synergizes textual and visual prompts through contrastive learning for open-set detection, yet its prompt dependency fundamentally limits adaptation to the topological complexity and internal structural voids of Yi characters; Ranjbarzadeh et al.²⁸ developed a robust detection framework incorporating Inception layers and Modified ReLU activation (Inception-mReLU), though its contour-oriented design systematically erases critical concave features that define Yi script morphology; Maktabdar et al.²⁹ introduced a RoBERTa-based Custom Deep Model (CDM) for AI-generated content detection, whose linguistic modeling paradigm proves intrinsically misaligned with the geometric precision required for Yi character spatial analysis. Each of these approaches addresses a specific subproblem, such as angle invariance, small-target sensitivity, polygonal shape modeling, or differentiable thresholding, but they leave distinct gaps for Yi script. First, many CNN-based pipelines use static convolutional kernels, which limit their adaptability to very subtle stroke variations. Second, contour and polygon methods can oversmooth concavities, leading to the loss of information indicating internal holes. Third, standard multiscale fusion often underweights inter-channel relationships, which are critical when many characters look visually similar.

To address the inherent challenges of Yi character detection, we propose FGRL-YiNet, which establishes a new paradigm for Yi character analysis through its systematic architectural innovations. This work presents the first comprehensive framework that establishes a synergistic integration of DConv with coordinated attention mechanisms specifically designed for Yi script analysis, creating a cohesive system in which these components mutually enhance each other's capabilities through continuous feature refinement. The DConv module preserves delicate stroke topology through input-adaptive receptive fields, while the coordinated attention mechanism maintains discriminability across channels and spatial relationships at multiple scales. This design produces properties that transcend conventional detection approaches, enabling breakthrough performance in handling the script's unique morphological characteristics. Through this tightly-coupled architecture, FGRL-YiNet not only achieves state-of-the-art performance but also establishes a reproducible blueprint for analyzing other underserved writing systems. Our main contributions can therefore be summarized as follows:

(1) We construct the Yi character detection dataset YiPrint-694 from print manuscripts, encompassing a total of 1,165 categories. This dataset is designed specifically as a benchmark for research on Yi character recognition.

(2) We propose FGRL-YiNet, a framework for detecting authentic Yi characters in printed historical documents. We address the challenge of detecting similar Yi characters by introducing two innovative modules: Dynamic Convolution (DConv) and Adaptive Multi-Scale Fusion (AMSF). DConv adapts to stroke variations, while AMSF enhances feature discrimination across scales.

(3) Our method achieves state-of-the-art detection performance on the YiPrint-694 dataset, and comprehensive experiments further demonstrate its strong generalizability to other benchmarks, including MTHv2.

The remainder of this paper is organized as follows. Section “Methods” provides a detailed description of the proposed FGRL-YiNet framework and the construction of the YiPrint-694 dataset. Section “Results” outlines the experimental setup and reports the results. Finally, Section “Discussion” summarizes the main findings and discusses potential directions for future research.

Methods

Overview of the FGRL-YiNet architecture

The architecture of FGRL-YiNet, as illustrated in Fig. 2, comprises three main components: the feature-extraction module, the feature-fusion module, and the prediction module.

**Fig. 2: Overview of FGRL-YiNet for Yi character detection: the input image is processed by a ResNet-18 backbone with DConv to extract hierarchical features.**

Figure 2 illustrates our end-to-end detection framework, which rests on a fundamental insight: the unique challenges of Yi character analysis (high inter-class similarity, complex stroke topology, and severe background degradation) collectively demand a co-design paradigm that moves beyond the sequential application of improved components. Conventional detectors fail because their static operators cannot jointly adapt to local stroke variations and global contextual confusion. Rather than merely assembling existing techniques, our framework introduces a core synergistic unit that deeply intertwines two innovations. First, we evolve the ResNet-18 backbone with DConv in its deeper stages, not simply to increase capacity, but to provide a spatially adaptive feature foundation that is responsive to subtle stroke geometry. Second, these dynamic features are processed by our Joint Attention-based Adaptive Fusion Module, a dedicated refinement system that exploits input-specific patterns to suppress background noise and resolve scale ambiguities across feature hierarchies. Crucially, this creates a virtuous cycle: the attention mechanism guides the DConv towards semantically critical regions, while the adaptive features provide a richer, more relevant input for the attention to refine. This tightly coupled workflow ensures that each component directly addresses a theoretical shortcoming of traditional methods, transcending incremental improvement to deliver a novel, coherent solution for Yi heritage document analysis.

Feature extraction module based on DConv

Our selection of ResNet-18 as the foundational backbone is grounded in a deliberate three-tiered rationale, informed by the specific requirements of Yi script analysis. First, the residual learning mechanism is well suited to modeling the complex curvilinear strokes and internal topological structures characteristic of Yi characters, where precise gradient-flow preservation is essential for capturing subtle stroke variations, addressing the fundamental challenge of inter-character discrimination in this unique writing system. Second, the constrained scale of our YiPrint-694 dataset makes ResNet's strong inductive biases and parameter efficiency particularly advantageous, ensuring robust generalization without the massive pretraining datasets required by alternative architectures. Third, ResNet's localized receptive fields and hierarchical feature-extraction paradigm offer high computational efficiency for character-level detection, providing the necessary foundation for introducing targeted enhancements specifically designed for Yi character analysis.

To translate these architectural advantages into effective feature extraction for Yi characters, our framework decomposes ResNet-18 into multiple hierarchical stages, each comprising layers with distinct structural configurations. Stacked convolutional blocks within these stages progressively extract features, capturing both local, fine-grained details and global, semantic representations. This architectural refinement directly addresses the core challenges outlined above: significant scale disparities, complex curved-stroke topologies, and frequent occlusion or interference from the multi-column layouts prevalent in historical documents. A feature extractor must therefore be highly adaptive to such pronounced spatial and morphological variations to be effective. To bolster the learning of high-level features, we integrate DConv into the deeper stages of the network. This mechanism allows the model to adaptively adjust receptive fields and recalibrate convolutional kernel weights based on spatial context, thereby enhancing its capacity to model region-specific visual complexity. This adaptive capability is important for accurately representing the Yi script, which is characterized by fine-grained stroke junctions, subtle curvature transitions, and frequently overlapping components. In contrast to conventional convolutions, which often resort to larger kernels or greater depth at high computational cost, DConv offers a more flexible, parameter-efficient approach. Specifically, we replace standard residual blocks with DConv blocks, where multiple kernels are combined via input-dependent attention weights. This design not only sharpens the model's ability to discriminate subtle structural nuances in Yi characters but also improves its robustness against scale inconsistencies and layout-induced distortions, which are pervasive obstacles in the digital preservation of cultural heritage documents. This process is summarized in Eqs. (1), (2), (3), and (4).

$$y=g\left({W}^{\top }x+b\right)$$

(1)

$$y=g\left(\widetilde{W}{(x)}^{\top }x+\widetilde{b}(x)\right)$$

(2)

$$\widetilde{W}(x)=\mathop{\sum }\limits_{k=1}^{K}{\pi }_{k}(x)\,{\widetilde{W}}_{k}$$

(3)

$$\widetilde{b}(x)=\mathop{\sum }\limits_{k=1}^{K}{\pi }_{k}(x)\,{\widetilde{b}}_{k}$$

(4)

where $x\in {{\mathbb{R}}}^{d}$ denotes the input vector and y the layer output. W and b are the static weight matrix and bias; g(⋅) is an activation function. Eq. (1) shows the static linear transformation. In contrast, Eq. (2) defines a dynamic linear operator $\widetilde{W}(x)$ and bias $\widetilde{b}(x)$ that depend on the input. Each is formed as a convex combination of K basis transformations ${\{{\widetilde{W}}_{k},{\widetilde{b}}_{k}\}}_{k=1}^{K}$ with input-dependent mixture weights π_k(x). The gating weights π_k(x) are produced by a small nonlinear gating network and normalized by a softmax so that 0 ≤ π_k(x) ≤ 1 and ${\sum }_{k=1}^{K}{\pi }_{k}(x)=1$.
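As a rough illustration of Eqs. (1) to (4), the following NumPy sketch mixes K basis transformations with softmax gating weights. It is not the authors' implementation: the single-layer linear gate (`gate_W`, `gate_b`), the choice of ReLU for g, and all dimensions are assumptions made only to keep the example small.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def dynamic_linear(x, W_basis, b_basis, gate_W, gate_b):
    """Input-dependent linear layer in the spirit of Eqs. (2)-(4).

    x:        (d,)       input vector
    W_basis:  (K, d, m)  basis weight matrices
    b_basis:  (K, m)     basis biases
    gate_W, gate_b: hypothetical one-layer gate mapping d -> K logits
    """
    pi = softmax(x @ gate_W + gate_b)          # (K,), non-negative, sums to 1
    W = np.tensordot(pi, W_basis, axes=1)      # (d, m), mixture of Eq. (3)
    b = pi @ b_basis                           # (m,),   mixture of Eq. (4)
    y = np.maximum(W.T @ x + b, 0.0)           # Eq. (2) with g = ReLU (assumed)
    return y, pi


rng = np.random.default_rng(0)
d, m, K = 8, 4, 3
x = rng.normal(size=d)
W_basis = rng.normal(size=(K, d, m))
b_basis = rng.normal(size=(K, m))
gate_W = rng.normal(size=(d, K))
gate_b = np.zeros(K)

y, pi = dynamic_linear(x, W_basis, b_basis, gate_W, gate_b)
```

With K = 1 the softmax returns a single weight of 1, so the layer reduces to the static transformation of Eq. (1).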

To further stabilize this process, we employ activation functions tailored to different submodules: ReLU in the backbone for efficient gradient propagation, Leaky ReLU in low-intensity regions to mitigate dead neurons, and GELU in the gating network for its smooth probabilistic weighting, which enhances the stability of π_k(x). This input-dependent aggregation, reinforced by carefully chosen activations, enables the layer to adaptively emphasize different kernels per feature vector, thereby increasing representational capacity compared to a single static transformation.

DConv improves feature representation by using a set of K parallel convolutional kernels instead of a single kernel at each layer (see Fig. 3). The aggregation of these kernels is input-dependent: a small gating network predicts attention scores for the K branches, and the branch outputs are fused according to those scores to produce a more expressive convolutional operator. Concretely, we first obtain a global descriptor via global average pooling, then pass it through two fully connected layers with a ReLU nonlinearity in between. The first FC reduces the channel dimension by a factor of four, and the second projects the result to K logits. Applying a softmax yields normalized attention weights ${\{{\pi }_{k}(x)\}}_{k=1}^{K}$ with 0 ≤ π_k(x) ≤ 1 and ${\sum }_{k=1}^{K}{\pi }_{k}(x)=1$. In the diagram, each π_k is used to re-weight the feature map produced by the k-th convolutional branch; the weighted feature maps are then combined (via a linear aggregation/transformation block T) to form the aggregated feature map, which is followed by batch normalization and a ReLU activation. Compared to SENet, which applies channel-wise attention to a single kernel, DConv assigns attention across multiple parallel kernels, producing richer features while adding only modest computational overhead.
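The gating path described above (global average pooling, a fully connected layer that reduces the channels by a factor of four, ReLU, a second fully connected layer to K logits, then softmax) can be sketched as follows. The sketch also checks that re-weighting the K branch outputs gives the same result as aggregating the K kernels first, which follows from the linearity of convolution. The 1 × 1 kernels, the random weights, and the omission of the aggregation block T, batch normalization, and the final ReLU are simplifying assumptions.

```python
import numpy as np


def gate_weights(feat, fc1, fc2):
    """GAP -> FC (C -> C/4) -> ReLU -> FC (C/4 -> K) -> softmax."""
    desc = feat.mean(axis=(1, 2))              # (C,) global descriptor
    hidden = np.maximum(desc @ fc1, 0.0)       # (C // 4,)
    logits = hidden @ fc2                      # (K,)
    e = np.exp(logits - logits.max())
    return e / e.sum()


rng = np.random.default_rng(0)
C, H, W, K, C_out = 16, 5, 5, 4, 8
feat = rng.normal(size=(C, H, W))
fc1 = rng.normal(size=(C, C // 4))
fc2 = rng.normal(size=(C // 4, K))
kernels = rng.normal(size=(K, C_out, C))       # K parallel 1x1 kernels (assumption)

pi = gate_weights(feat, fc1, fc2)

# Route 1: run every branch, then re-weight the K branch outputs.
branch_out = np.einsum("koc,chw->kohw", kernels, feat)    # (K, C_out, H, W)
mixed_outputs = np.einsum("k,kohw->ohw", pi, branch_out)

# Route 2: aggregate the kernels as in Eq. (3), then convolve once.
agg_kernel = np.einsum("k,koc->oc", pi, kernels)
mixed_kernel = np.einsum("oc,chw->ohw", agg_kernel, feat)
```

Route 2 needs one convolution instead of K, which is why the extra cost of the dynamic layer stays modest: the gate itself operates only on a pooled C-dimensional descriptor.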

**Fig. 3: Dynamic convolution structure.**

Feature fusion module based on AMSF

The proposed AMSF module addresses a limitation of existing methods, which focus solely on local spatial interactions and overlook inter-channel dependencies across different scales. This limitation can leave informative channel-wise features underrepresented, diminishing the discriminative power of the fused representation. To overcome the shortcomings of traditional attention mechanisms, we introduce a parallel design that combines Efficient Pyramid Squeeze Attention (EPSA) with spatial attention. Unlike Squeeze-and-Excitation (SE) attention, which suffers from over-compression of global channel context, our EPSA component preserves richer multi-scale channel information. Furthermore, while CBAM (Convolutional Block Attention Module) serializes channel and spatial attention, potentially causing information bottlenecks, our parallel processing mechanism enables simultaneous and complementary refinement of both channel and spatial dimensions. This design enables the model to better capture fine-grained structural differences, such as subtle stroke variations in Yi script, resulting in superior performance across diverse layouts and conditions.

To tackle this issue, we introduce an EPSA mechanism, which runs in parallel with spatial attention. The EPSA mechanism captures inter-channel correlations, extracting complementary channel-wise information, while spatial attention focuses on learning the importance of each spatial location in the feature maps. The outputs from both attention mechanisms are expanded to compatible dimensions and fused in parallel, effectively integrating multi-scale features. This parallel fusion of channel-wise and spatial information enhances the model's capacity to capture both local and global context, improving its performance on multi-scale tasks. The AMSF module, incorporating this joint attention mechanism, is illustrated in Fig. 4.

**Fig. 4: Adaptive multi-scale fusion structure.**

We developed an adaptive multi-scale fusion module specifically designed to integrate features that exhibit distinct receptive fields and descriptive capabilities. To effectively address these scale differences, our method employs an Adaptive Multi-Scale Fusion (AMSF) module that dynamically adjusts per-scale weights for more accurate and effective feature fusion, thereby enhancing text detection performance. The procedure begins by extracting multi-scale features from the input image using convolutional kernels of different sizes, which are then resampled to a common spatial resolution before being passed into the fusion module. Let the input tensor be $X\in {{\mathbb{R}}}^{N\times C\times H\times W}={\{{X}_{i}\}}_{i=0}^{N-1}$, where N denotes the number of scales (with N = 4 in our implementation). The scaled input feature maps are concatenated and processed by a 3 × 3 convolution to produce an intermediate feature $S\in {{\mathbb{R}}}^{C\times H\times W}$. A spatial-attention module then computes an attention tensor $A\in {{\mathbb{R}}}^{N\times H\times W}$ from the intermediate feature S, which is split into N spatial attention maps. Each map E_i is applied to the corresponding scale feature X_i by element-wise multiplication, and the reweighted per-scale tensors are concatenated to form the fused output $F\in {{\mathbb{R}}}^{N\times C\times H\times W}$. The design is theoretically grounded in addressing both core challenges. Firstly, the adaptive weighting directly tackles scale variation by allowing the model to dynamically emphasize features from the most reliable scale for each spatial location. This ensures robust character detection at any size, mitigating issues caused by noisy, distorted, or overlapping strokes. Secondly, and more critically, the joint attention mechanism is essential for disambiguating structurally similar characters. 
It enables the model to perform a synergistic selection: identifying which fine-grained spatial details (e.g., a specific stroke junction or curvature) from which scale are most diagnostic for distinguishing one character from another. By fusing features under this learned, discriminative guidance, the module constructs a representation that amplifies subtle inter-class differences while suppressing irrelevant contextual noise, thereby directly resolving the visual ambiguities inherent to Yi script. This process, summarized in Eqs. (5), (6), and (7), yields a fused feature representation that is both scale-invariant for robust detection and highly discriminative for accurate character recognition.

$$S=Conv\left(Concat\left([{X}_{0},{X}_{1},\ldots ,{X}_{N-1}]\right)\right)$$

(5)

$$A=Spatial\_Attention(S)$$

(6)

$$F=Concat\left([{E}_{0}{X}_{0},{E}_{1}{X}_{1},\ldots ,{E}_{N-1}{X}_{N-1}]\right)$$

(7)

where Concat denotes concatenation, Conv denotes a 3 × 3 convolution, Spatial_Attention denotes the spatial-attention operator, and E_i (for i = 0, …, N − 1) denotes the i-th spatial attention map extracted from A. By adaptively reweighting and fusing multi-scale features in this manner, the module enables the network to focus on spatially relevant regions at each scale, thereby improving the robustness and accuracy of the subsequent text-detection head.
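The data flow of Eqs. (5) to (7) can be traced with a small NumPy sketch. Several pieces are stand-ins rather than the authors' design: a 1 × 1 convolution (`conv_w`) replaces the 3 × 3 convolution to keep the code short, the spatial-attention operator is reduced to a hypothetical 1 × 1 projection (`attn_w`) followed by a sigmoid, and all sizes and weights are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def amsf_fuse(X, conv_w, attn_w):
    """Fuse N scale features that already share one spatial resolution.

    X:      (N, C, H, W)  per-scale features X_0 .. X_{N-1}
    conv_w: (C, N * C)    1x1 conv weights standing in for the 3x3 conv of Eq. (5)
    attn_w: (N, C)        hypothetical 1x1 projection used as the attention head
    """
    N, C, H, W = X.shape
    concat = X.reshape(N * C, H, W)                     # Concat([X_0, ..., X_{N-1}])
    S = np.einsum("oc,chw->ohw", conv_w, concat)        # Eq. (5): (C, H, W)
    A = sigmoid(np.einsum("nc,chw->nhw", attn_w, S))    # Eq. (6): (N, H, W)
    F = A[:, None, :, :] * X                            # E_i * X_i, broadcast over C
    return S, A, F                                      # F stacks the terms of Eq. (7)


rng = np.random.default_rng(0)
N, C, H, W = 4, 8, 6, 6
X = rng.normal(size=(N, C, H, W))
conv_w = 0.1 * rng.normal(size=(C, N * C))
attn_w = 0.1 * rng.normal(size=(N, C))

S, A, F = amsf_fuse(X, conv_w, attn_w)
```

Each slice `A[i]` plays the role of the map E_i: one weight per spatial location for scale i, shared across all C channels of that scale.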

To further enhance model performance, we incorporated a spatial attention module that selectively weights spatial positions and feature responses, a strategy widely utilized in image and text processing^30,31,32. For Yi characters, which have strong structural similarities, per-pixel methods often fail at precise localization. The spatial-attention mechanism helps the model focus on key cues such as edges and contours while suppressing irrelevant regions, thereby improving detection accuracy for Yi characters. This module addresses this by generating a spatial attention map that pinpoints regions with critical structural information. In the context of the Yi script, this means learning to highlight semantically rich areas such as stroke terminals, intersection junctions, and distinctive curvature points, while suppressing homogeneous background regions and noise. This capability to focus on locally discriminative cues forms the foundational step for subsequent stages to resolve character-level ambiguities.

As shown in Fig. 5, the implementation begins by applying channel-wise average pooling across the feature channels to generate a spatial descriptor of size 1 × H × W. This descriptor is processed by convolutional layers, followed by a sigmoid activation, to produce an initial spatial attention map. The map is then expanded along the channel dimension to match the shape of the input feature tensor C × H × W, and is applied via element-wise multiplication. Using channel-wise pooling along with explicit broadcasting ensures that attention consistently modulates the same spatial locations across all channels while remaining computationally efficient. A subsequent convolution and sigmoid activation refine the result, producing the final attention tensor of shape N × 1 × H × W, where N denotes the number of scales being fused.
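The first half of this procedure (channel-wise average pooling, convolution, sigmoid, broadcast multiplication) can be sketched in NumPy as below. The single 3 × 3 kernel and its random values are assumptions, and the second refinement convolution that produces the final N × 1 × H × W tensor is left out.

```python
import numpy as np


def conv2d_same(img, kernel):
    """Single-channel 2-D cross-correlation with zero padding (odd kernel sizes)."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


def spatial_attention(feat, kernel):
    """feat: (C, H, W). Returns the re-weighted features and the (1, H, W) map."""
    desc = feat.mean(axis=0)                                  # (H, W) descriptor
    attn = 1.0 / (1.0 + np.exp(-conv2d_same(desc, kernel)))   # conv + sigmoid
    attn = attn[None, :, :]                                   # (1, H, W)
    return feat * attn, attn                                  # broadcast over C


rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 6, 6))
kernel = rng.normal(size=(3, 3))

out, attn = spatial_attention(feat, kernel)
```

Because the map has a single channel, every feature channel is scaled by the same value at a given pixel, which is the consistency property the explicit broadcasting is meant to guarantee.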

**Fig. 5: Spatial attention mechanism.**

To achieve efficient multi-scale feature extraction via a pyramid-style mechanism, the EPSA channel-attention module integrates a Pyramid Squeeze Attention (PSA) unit in place of the standard 3 × 3 convolution within the ResNet bottleneck. The core of the PSA module is the Squeeze Pyramid Concat (SPC) module, which enables scale-aware processing within the bottleneck. The detailed operation of each module component is described below. In Yi character detection, this module performs channel-wise feature recalibration. It learns to amplify the filters most relevant for capturing the script's defining attributes, such as those detecting specific stroke widths, curvature radii, or the presence of internal holes, while suppressing less informative feature channels. This process ensures that the network dedicates its representational capacity to the most salient visual concepts for this specific writing system.

As illustrated in Fig. 6, the PSA module operates in four stages. First, the input channels are divided and processed with multi-scale operators via the SPC module to produce scale-aware feature responses. In the second stage, an SE-style weighting block (SEWeight) computes adaptive channel attention, which emphasizes informative channels across scales. The third stage applies a softmax to the multi-scale attention weights, ensuring their relative contributions are balanced. Finally, in the fourth stage, the normalized attention weights are applied element-wise to the original features, producing refined multi-scale representations that enhance detection performance and improve the model’s generalization ability.
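The four stages can be followed in a compact NumPy sketch. It is a simplification under stated assumptions: each multi-scale convolution branch is replaced by a per-group 1 × 1 projection (`branch_w`), the SE-style weights (`se_w1`, `se_w2`) are shared across branches, and the features re-weighted in stage four are taken to be the scale-aware responses from stage one, which is one reading of the description above.

```python
import numpy as np


def psa(feat, branch_w, se_w1, se_w2):
    """Pyramid-squeeze-style attention on a (C, H, W) feature map.

    branch_w: (S, Cg, Cg)  one 1x1 projection per channel group (stand-in for
                           convolutions with different kernel sizes)
    se_w1:    (Cg, r), se_w2: (r, Cg)  SE-style bottleneck weights
    """
    S = branch_w.shape[0]
    C, H, W = feat.shape
    groups = feat.reshape(S, C // S, H, W)
    # Stage 1 (SPC): process each channel group with its own branch.
    spc = np.einsum("soc,schw->sohw", branch_w, groups)      # (S, Cg, H, W)
    # Stage 2 (SEWeight): squeeze by pooling, excite with FC-ReLU-FC-sigmoid.
    squeezed = spc.mean(axis=(2, 3))                         # (S, Cg)
    hidden = np.maximum(squeezed @ se_w1, 0.0)               # (S, r)
    weights = 1.0 / (1.0 + np.exp(-(hidden @ se_w2)))        # (S, Cg)
    # Stage 3: softmax across the S scales for each channel position.
    e = np.exp(weights - weights.max(axis=0, keepdims=True))
    attn = e / e.sum(axis=0, keepdims=True)                  # (S, Cg)
    # Stage 4: re-weight the branch features and concatenate them again.
    out = spc * attn[:, :, None, None]
    return out.reshape(C, H, W), attn


rng = np.random.default_rng(0)
C, H, W, S, r = 16, 4, 4, 4, 2
feat = rng.normal(size=(C, H, W))
branch_w = rng.normal(size=(S, C // S, C // S))
se_w1 = rng.normal(size=(C // S, r))
se_w2 = rng.normal(size=(r, C // S))

out, attn = psa(feat, branch_w, se_w1, se_w2)
```

The softmax in stage three makes the S scales compete for each channel position, so a channel that is informative at one scale is emphasized there and attenuated at the others.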

**Fig. 6: Pyramid squeeze attention module.**

In the SPC module, as illustrated in Fig. 7, the input feature map is initially split into X channel-wise segments, where X denotes the number of channel partitions. Each segment is processed by a multi-scale convolutional branch to capture spatial information at specific scales. To make the design adaptive, the module selects the convolutional group size dynamically based on the kernel size used in each branch. The outputs of the multi-scale convolutions are then concatenated along the channel dimension to form the SPC output. This design enables the network to capture target information across multiple spatial scales, thereby improving robustness and generalization. In this context, K represents the kernel size for each branch, G denotes the group size, and Concat refers to the concatenation operation along the channel dimension.

**Fig. 7: Squeeze pyramid concat module.**

Differentiable binarization

The prediction module, along with the DB operation, employs an end-to-end learning approach to enhance both text detection and binarization, specifically tailored for Yi character recognition. The prediction module serves as the core component for detecting text and performing DB. Initially, the output features from the multi-scale fusion module are fed into the prediction module, which produces two outputs: a prediction map and a threshold map. These two outputs are then combined via a DB operation to obtain an approximate binary map. Within the prediction module, the input feature map passes through convolutional, pooling, and normalization layers to extract high-level features that facilitate the detection of text regions in YiPrint-694, where intricate structures and structural similarities between characters pose significant challenges.

DB is particularly well-suited for Yi character detection because it enables the model to handle image degradation and faint characters by predicting an adaptive per-pixel threshold that dynamically adjusts to the varying characteristics of Yi characters. This is crucial as the Yi script is often found in degraded manuscripts where characters may be faint, overlapped, or distorted. The self-adjusting threshold mechanism enables the binarization process to preserve fine-grained text details, effectively separating the foreground from the background in challenging conditions. DB maps continuous real-valued scores to near-binary values by using a smooth approximation of the step function, allowing the binarization operation to be optimized via backpropagation. This approach suppresses background interference while preserving fine-grained details of the text regions, which is crucial for accurate segmentation of Yi characters, known for their complex shapes. Since the standard step function is non-differentiable, we replace it with a differentiable approximation and control its slope near the origin using a gain factor (with an empirical value of k = 50). The approximate binary map is then passed to a box formulation module, which extracts text bounding boxes. By predicting a per-pixel threshold through the threshold map, the foreground and background are effectively separated, enabling precise localization of text regions. This end-to-end learning framework jointly optimizes text detection and binarization, improving the system's robustness and accuracy, especially in the context of the unique challenges posed by Yi character detection. The operation is given in Eq. (8).

$${\dot{B}}_{i,j}=\frac{1}{1+{e}^{-k\left({P}_{i,j}-{T}_{i,j}\right)}}$$

(8)

where $\dot{B}$ denotes the approximate binary map used to distinguish text from background; $P_{i,j}$ is the predicted probability at pixel (i, j); and $T_{i,j}$ is the predicted threshold at pixel (i, j). The DB uses a sigmoid function with gain k, producing values in the interval (0, 1), and allows the network to learn an adaptive binarization strategy.
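A scalar sketch of Eq. (8) shows how the gain k = 50 pushes outputs toward 0 or 1 while keeping a usable gradient. The function names and the sample values of P and T are illustrative only; the derivative k·B·(1 − B) is the standard derivative of a scaled sigmoid.

```python
import math


def differentiable_binarization(p, t, k=50.0):
    """Eq. (8): smooth step applied to probability p and threshold t."""
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


def db_gradient_wrt_p(p, t, k=50.0):
    """Analytic derivative dB/dP = k * B * (1 - B)."""
    b = differentiable_binarization(p, t, k)
    return k * b * (1.0 - b)


on_boundary = differentiable_binarization(0.5, 0.5)   # probability equals threshold
text_pixel = differentiable_binarization(0.6, 0.3)    # probability above threshold
background = differentiable_binarization(0.3, 0.6)    # probability below threshold
slope_at_boundary = db_gradient_wrt_p(0.5, 0.5)
```

A pixel whose probability equals its threshold maps to exactly 0.5, and the gradient there is k/4, so errors near the decision boundary produce the largest updates. A hard step function would give zero gradient everywhere except at the boundary.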

Loss function

The loss function L for the FGRL-YiNet model is defined as a weighted sum of three key components: the probability map loss L_s, the threshold map loss L_t, and the approximate binary map loss L_b. The overall loss function is expressed as follows in Equation (9). The individual loss components are defined in Equations (10) and (11):

$$L={L}_{s}+\alpha \times {L}_{b}+\beta \times {L}_{t}$$

(9)

$${L}_{s}={L}_{b}=-\mathop{\sum }\limits_{i\in {S}_{l}}\left[{y}_{i}\log {x}_{i}+(1-{y}_{i})\log (1-{x}_{i})\right]$$

(10)

$${L}_{t}=\mathop{\sum }\limits_{i\in {R}_{d}}| {y}_{i}^{* }-{x}_{i}^{* }|$$

(11)

In this loss formulation, both the probability map loss L_s and the approximate binary map loss L_b are computed using the binary cross-entropy loss, where x_i is the predicted value and y_i the ground-truth label at pixel i. The threshold map loss L_t uses the L1 loss, which sums the absolute differences between the predicted thresholds x_i^* and their ground-truth labels y_i^* over the pixel set R_d. The weight coefficients α and β are set to 1 and 10, respectively, to balance the contributions of each loss component. Additionally, S_l denotes the sampled set of positive and negative samples, with a 1:3 ratio.
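
The combination in Eqs. (9) to (11) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' training code: binary cross-entropy is written in its usual negative log-likelihood form, all function names and toy values are hypothetical, and the way negatives are chosen for the 1:3 sampling (the highest-loss negatives are kept) is an assumption, since the text only states the ratio.

```python
import math


def bce_terms(labels, preds, eps=1e-7):
    """Per-pixel binary cross-entropy (negative log-likelihood form)."""
    terms = []
    for y, x in zip(labels, preds):
        x = min(max(x, eps), 1.0 - eps)  # clamp to avoid log(0)
        terms.append(-(y * math.log(x) + (1 - y) * math.log(1 - x)))
    return terms


def sampled_bce(labels, preds, neg_ratio=3):
    """Sum BCE over all positives plus at most neg_ratio negatives per positive.

    Assumption: the retained negatives are the ones with the highest loss.
    """
    terms = bce_terms(labels, preds)
    pos = [t for t, y in zip(terms, labels) if y == 1]
    neg = sorted((t for t, y in zip(terms, labels) if y == 0), reverse=True)
    return sum(pos) + sum(neg[: neg_ratio * len(pos)])


def l1_loss(labels, preds):
    """Sum of absolute differences, as in Eq. (11)."""
    return sum(abs(y - x) for y, x in zip(labels, preds))


def total_loss(l_s, l_b, l_t, alpha=1.0, beta=10.0):
    """Weighted sum of Eq. (9) with alpha = 1 and beta = 10."""
    return l_s + alpha * l_b + beta * l_t


# Hypothetical flattened maps: one text pixel and four background pixels.
labels = [1, 0, 0, 0, 0]
prob_map = [0.80, 0.10, 0.20, 0.60, 0.05]
binary_map = [0.90, 0.05, 0.10, 0.70, 0.01]
thr_labels = [0.7, 0.3]
thr_preds = [0.6, 0.5]

l_s = sampled_bce(labels, prob_map)
l_b = sampled_bce(labels, binary_map)
l_t = l1_loss(thr_labels, thr_preds)
loss = total_loss(l_s, l_b, l_t)
```

In this toy case the single positive pixel admits three negatives, so the easiest background pixel (prediction 0.05) is dropped from L_s. The weight β = 10 compensates for the threshold term being an L1 distance on values in [0, 1], which is typically much smaller than the cross-entropy terms.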

Construction of Yi character detection dataset

We have constructed the YiPrint-694 dataset, a novel dataset designed for Yi-character detection and recognition tasks. This dataset aims to address the scarcity of publicly available annotated Yi data, providing a high-quality resource for advancing research in Yi-character detection. In this section, we will provide a detailed description of the construction, including the processes of data collection, augmentation, and preprocessing.

We constructed the YiPrint-694 dataset using a semi-automatic annotation pipeline that combines computer-assisted techniques with manual verification. This hybrid approach ensures both efficiency in data annotation and high segmentation accuracy. The dataset was compiled from the Liangshan Yi classics, which include influential works such as Mamuteyi, Le’oteyi, and Zhilujing. It contains 694 images representing 1165 distinct Yi character categories, with each image featuring approximately 500 characters, resulting in a total of approximately 347,000 characters. Annotations were performed manually using the open-source tool PPOCRLabel, which enabled high-quality, accurate labeling of the Yi characters. The specific style of YiPrint-694 is shown in Fig. 8.

**Fig. 8: Example of the original YiPrint-694 dataset.**

The construction scale of the YiPrint-694 dataset reflects the inherent challenge of digital preservation for historically underserved scripts. High-quality, publicly accessible Yi historical manuscripts are exceptionally scarce due to their specific cultural heritage status and the limited number of existing documents. The curation of these 694 images, while limited in absolute scale, represents a substantial and critical effort to create a foundational benchmark under these significant resource constraints. This low-resource scenario is not a limitation of the study but rather a precise characterization of the real-world problem this research aims to solve.

To directly address this data scarcity and enhance the model's robustness, we applied augmentation techniques to simulate various historical Yi manuscripts. Specifically, 200 images were generated to emulate ancient Yi manuscripts by modifying background colors and varying layouts. This augmentation introduces real-world variability, making the dataset more representative of diverse manuscript styles. The 200 augmented images were divided into two sets of 100 images each: one featuring a yellowish background and the other a faded, brownish background. This diversity in background colors helps simulate different historical manuscript conditions, thereby enhancing the dataset's generalizability. The augmented images, with their distinct background colors, are shown in the left column of Fig. 9, while the original Yi character images are displayed in Fig. 8. Together, the 494 original images and the 200 augmented images form a rich, varied set for both training and evaluation.

To ensure a balanced distribution and high quality for the YiPrint-694 dataset, we conducted a rigorous manual inspection. This process involved three key procedures: retaining up to 100 images per character, eliminating duplicates, and screening out substandard samples. Substandard images were defined as those lacking valid Yi characters or containing incomplete or corrupted glyphs; all such images were systematically discarded. The remaining images were then carefully annotated by a panel of three native Yi language specialists, guided by authoritative lexicons. The inter-annotator agreement was assessed and showed a strong initial consensus, which supports the reliability of the labels. All remaining discrepancies were adjudicated and resolved through a structured consensus process to establish the definitive ground truth. This protocol yielded the YiPrint-694 dataset, comprising 694 samples spanning a diverse range of Yi character shapes and layouts. It thereby serves as a robust benchmark for training and evaluating detection models on this complex, low-resource script.

**Fig. 9: Visualization results on the YiPrint-694.**

To prepare images for accurate character segmentation and ensure high-quality data for detection tasks, we implemented a rigorous image processing stage. To begin with, Gaussian filtering was applied to reduce background noise that could compromise character clarity, while preserving edges to maintain their structural integrity. This step ensures that irrelevant noise does not obscure the characters’ essential features. Following this, the Sobel operator was used for edge detection, enhancing the contours and improving the definition of character boundaries. This step is vital for accurately delineating characters, especially in complex backgrounds. Finally, global binarization was performed using Otsu’s method, converting the image to black-and-white (0 and 255), clearly separating the foreground characters from the background. These preprocessing steps ensured that the images were sufficiently cleaned and optimized for subsequent segmentation. Once the images were preprocessed, we applied a projection-based segmentation algorithm. This method analyzes the distribution of foreground pixels along each row and column of the binary image to identify potential character boundaries. After the initial segmentation, we refined the results by calculating the coefficient of determination and setting an adaptive threshold to accurately define character regions. Segments that exceeded this threshold were identified as part of a character, and adjacent segments with similar pixel values were merged. When adjacent segments showed significant pixel differences, new character boundaries were introduced.
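
The projection-based step described above can be sketched as follows. This is a simplified illustration that assumes an already binarized page (1 = foreground, 0 = background) and cuts wherever a row or column profile falls below a fixed count; it does not reproduce the authors' refinement with the coefficient of determination and adaptive threshold. The function names and the toy page are hypothetical.

```python
def projection_runs(profile, min_count=1):
    """Return (start, end) pairs, end exclusive, where profile >= min_count."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value >= min_count and start is None:
            start = i  # a foreground run begins
        elif value < min_count and start is not None:
            runs.append((start, i))  # the run ended at the previous index
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_characters(binary, min_count=1):
    """Split a binary page into boxes (top, bottom, left, right), ends exclusive.

    Rows are separated first using the horizontal projection; each text band
    is then split into characters using its vertical projection.
    """
    row_profile = [sum(row) for row in binary]
    boxes = []
    for top, bottom in projection_runs(row_profile, min_count):
        band = binary[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]
        for left, right in projection_runs(col_profile, min_count):
            boxes.append((top, bottom, left, right))
    return boxes


# Hypothetical 6x7 page with two text lines of two glyphs each.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
boxes = segment_characters(page)
```

A fixed `min_count` works only when characters are cleanly separated by empty rows and columns, which is why a refinement stage and manual verification are needed on real manuscript pages.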

Given the inherent challenges in character segmentation, strict annotation protocols were followed throughout the process. To ensure high accuracy, we employed computer-assisted techniques for segmentation, followed by thorough manual verification to correct any remaining errors. This rigorous verification process is essential, as errors in the segmentation stage can significantly degrade dataset quality and, in turn, affect subsequent detection and recognition performance. Through these efforts, the segmented images were optimized for detection tasks, ensuring the dataset was clean and accurately annotated for training and evaluation.

Results

Experimental setup

The YiPrint-694 dataset, consisting of 694 images, served as the basis for our experimental evaluation. It originates from an original set of 494 images, which was further expanded with two augmented sets of 100 images each. The entire dataset was partitioned into training, validation, and test subsets with an approximate ratio of 8:1:1. Specifically, the splits were performed as follows: the original 494 images were divided into 396 for training, 49 for validation, and 49 for testing. Similarly, each of the two 100-image augmented sets was split into 80 for training, 10 for validation, and 10 for testing. This resulted in a cumulative distribution of 556 training, 69 validation, and 69 test images. This rigorous splitting strategy ensures the independence of all subsets, preventing data leakage and facilitating a fair model evaluation. A detailed summary of the sample distribution is provided in Table 1.
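
The split arithmetic above can be checked with a few lines of Python. The dictionary keys are descriptive labels chosen for this sketch, not identifiers from the dataset.

```python
# (train, validation, test) counts per source subset, as stated in the text.
splits = {
    "original": (396, 49, 49),
    "augmented_yellowish": (80, 10, 10),
    "augmented_brownish": (80, 10, 10),
}

# Column-wise totals across the three subsets.
totals = tuple(sum(column) for column in zip(*splits.values()))
grand_total = sum(totals)
fractions = tuple(round(count / grand_total, 3) for count in totals)
```

The totals come to 556, 69, and 69 images (694 overall), which corresponds to roughly 80.1%, 9.9%, and 9.9% of the dataset, consistent with the stated 8:1:1 ratio.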

Table 1 Dataset splits

To ensure a rigorous and fair evaluation, we trained all comparative models under identical experimental conditions, strictly adhering to established fine-tuning practices. FGRL-YiNet employs a ResNet-18 backbone initialized with ImageNet-pretrained weights, while all baseline models use their officially released pre-trained weights, ensuring an equitable comparison across architectural paradigms. Experiments were conducted on a Linux workstation with an NVIDIA GeForce RTX 3090 Ti GPU using Python 3.8 and PaddlePaddle 2.7.1. We established a standardized optimization protocol using the Adam optimizer (β₁ = 0.9, β₂ = 0.999, ϵ = 10⁻⁸) with an initial learning rate of 0.0001 decaying polynomially to a minimum of 10⁻⁶, a batch size of 16, and a maximum of 800 training epochs with early stopping (patience = 50 epochs). Training was regularized with weight decay (0.0001) and dropout (0.5). This controlled experimental framework ensures that performance differences reflect architectural capabilities rather than variations in training methods.
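
To make the learning-rate schedule concrete, the sketch below implements a generic polynomial decay from 10⁻⁴ to 10⁻⁶. The text does not state the polynomial power or whether the decay is applied per epoch or per iteration, so `power=1.0` and the epoch-level stepping are assumptions, and `poly_lr` is a hypothetical helper rather than a framework API.

```python
def poly_lr(step, total_steps, base_lr=1e-4, end_lr=1e-6, power=1.0):
    """Polynomial decay from base_lr to end_lr over total_steps.

    Assumptions: power = 1.0 (linear decay) and one step per epoch.
    """
    step = min(step, total_steps)  # hold end_lr once the schedule is finished
    remaining = 1.0 - step / total_steps
    return (base_lr - end_lr) * remaining ** power + end_lr


# Learning rate at the start, midpoint, and end of an 800-epoch run.
schedule = [poly_lr(epoch, 800) for epoch in (0, 400, 800)]
```

Under these assumptions the learning rate is 1e-4 at the start, about 5.05e-5 at epoch 400, and 1e-6 at epoch 800. With early stopping (patience of 50 epochs), training may end before the minimum is reached.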

Evaluation metrics

To evaluate FGRL-YiNet’s performance, we adopt three standard metrics for character detection: Precision (P), Recall (R), and F-score (F). Together, these metrics assess the model’s ability to detect Yi characters accurately and completely.

Precision quantifies the accuracy of the model's predictions, measuring the proportion of true positive detections among all instances that the model identified as positive. It is defined as:

$$\text{Precision}=\frac{TP}{TP+FP}$$

(12)

where TP is the number of true positives (correct detections) and FP is the number of false positives (incorrect detections).

Recall measures the model's ability to identify all relevant characters, quantifying the proportion of true positive detections out of all positive instances in the ground truth. It is computed as:

$$\text{Recall}=\frac{TP}{TP+FN}$$

(13)

where FN is the number of false negatives, i.e., relevant characters that were missed by the model.

Precision assesses the accuracy of the model's predictions, whereas recall evaluates their completeness, so it is often necessary to balance the two. To this end, we use the F-score, which combines precision and recall into a single metric by taking their harmonic mean:

$$\text{F-score}=2\times \frac{\text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}}$$

(14)

The F-score provides a balanced measure of model performance when there is a trade-off between precision and recall, ensuring that both false positives and false negatives are considered. It therefore serves as the primary evaluation metric in our experiments, reflecting the model’s overall effectiveness in detecting Yi characters in terms of both accuracy and completeness.
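
Eqs. (12) to (14) translate directly into code. The sketch below uses hypothetical detection counts for the first example and then applies the harmonic mean to the precision and recall reported for FGRL-YiNet (98.4 and 91.3); the function names are illustrative.

```python
def precision(tp, fp):
    """Eq. (12): fraction of predicted detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Eq. (13): fraction of ground-truth characters that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p, r):
    """Eq. (14): harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
f = f_score(p, r)

# Consistency check against the reported precision and recall.
f_reported = round(f_score(98.4, 91.3), 1)
```

For the hypothetical counts, precision is 0.90, recall is 0.75, and the F-score is about 0.818, lower than the arithmetic mean because the harmonic mean penalizes the weaker of the two values. Applying Eq. (14) to 98.4 and 91.3 gives 94.7.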

Comparison with state-of-the-art models

Detecting Yi characters poses unique challenges due to their intricate structures, high visual similarity among characters, and complex layout. These challenges require character detection models capable of handling fine-grained details and distinguishing between visually similar characters. To evaluate the performance of our proposed method, FGRL-YiNet, we selected several state-of-the-art character detection models known for their ability to handle complex, dense text structures, which are essential for Yi character detection.

The character detection process in ancient Yi manuscripts represents a complex challenge due to the unique topological structure of Yi characters, characterized by intricate curvilinear strokes, internal topological holes, and non-uniform character density. These inherent complexities are further exacerbated by multispectral degradation patterns in historical manuscripts, including ink fading, paper warping, and background interference. To comprehensively validate the architectural advantages of our proposed framework, we establish an exhaustive comparative framework against representative models spanning multiple detection paradigms, each of which exposes fundamental limitations when confronting the core challenges of Yi script analysis.

General-purpose object detectors (e.g., Faster R-CNN³³, T-Rex2²⁷) demonstrate intrinsic limitations in specialized document analysis tasks. While T-Rex2 introduces innovative vision-language synergy for open-vocabulary detection, its generalization capability fails to capture the fine-grained geometric precision required for Yi character localization, particularly under severe degradation conditions.
Segmentation-based methodologies (e.g., PSENet²³, TextSnake³⁴) prove highly vulnerable to low-contrast stroke boundaries and subtle ink variations. Their pixel-level classification mechanisms frequently produce substantially oversized masks in complex backgrounds, while struggling to maintain topological consistency across character interiors with intricate internal structures.
Differentiable binarization networks (e.g., DBNet²⁴, DBNet++²⁵) display critical sensitivity to illumination variance and ink bleed-through artifacts. Their adaptive thresholding mechanisms tend to either omit faint strokes due to thresholding artifacts or generate false-positive detections in textured background regions. The recently proposed Inception-mReLU architecture²⁹, while incorporating advanced feature extraction modules, remains constrained by similar binarization-sensitive characteristics that limit robustness against severe document degradation.
Contour-based models (e.g., FCENet²⁶, TextBPN³⁵) systematically oversimplify complex concave regions and internal topological holes, thereby erasing morphologically critical details essential for Yi character identification. Their parametric curve representations fail to capture the nuanced stroke variations that define Yi script morphology.
Cross-domain architectures (e.g., CDM²⁹) highlight the fundamental paradigm gap between natural language processing and visual document analysis. While achieving exceptional performance in AI-generated text detection, transformer-based language models demonstrate limited applicability to geometric character localization tasks, underscoring the necessity of domain-specific solutions.

In contrast, FGRL-YiNet addresses these challenges by employing DConv, which adapts to stroke direction and strength, effectively recovering faint or broken curves while suppressing background noise. This adaptability is crucial for Yi characters, where subtle structural features distinguish correct from incorrect detections. Furthermore, FGRL-YiNet integrates multi-scale attention and boundary-aware decoding, preserving fine details such as cavities and gaps while avoiding boundary expansion into neighboring characters. As shown in Table 2, FGRL-YiNet achieves the highest f-score (94.7), precision (98.4), and recall (91.3) among all compared models. Notably, it surpasses the second-best model by 2.6 percentage points in f-score, demonstrating its superior ability to detect Yi characters and maintain structural integrity, even under challenging conditions.

Table 2 Detection results of different methods on the YiPrint-694 dataset

Figure 9 visualizes the text-detection results of FGRL-YiNet on YiPrint-694 images derived from ancient manuscripts. The left panels display sample manuscript images, while the right panels show the corresponding detection results produced by FGRL-YiNet. As seen in the figure, the Yi characters are generally evenly distributed across the page, with minimal tilt or distortion, yet they still present typical challenges of historical documents, such as faded ink and irregular spacing. Despite these challenges, the proposed model successfully identifies and localizes the characters. The detected characters are highlighted with blue and green bounding boxes, which indicate their positions and boundaries. Notably, FGRL-YiNet's dynamic convolutional attention mechanism adapts to the varying stroke directions and intensities of the Yi characters, enabling it to recover faint or broken strokes in the manuscript. The multi-scale attention and boundary-aware decoding further enhance the model's ability to preserve delicate features such as internal holes, cavities, and gaps between characters, ensuring accurate localization even in degraded manuscript conditions. This visualization illustrates the model’s precise and reliable detection of Yi characters under the conditions typical of ancient Yi manuscripts.

Ablation study

To systematically evaluate the contributions of FGRL-YiNet’s core components, we conduct hierarchical ablation studies. All experiments are performed on the YiPrint-694 dataset under consistent training protocols. The impact of each module is quantified using a controlled variable approach. Detailed ablation results and corresponding analyses for each architectural component are presented in the subsequent subsections.

To evaluate how different backbone architectures influence Yi character detection, we conducted an ablation study that includes both convolutional networks and Transformer-based models. This experiment examines how variations in architectural complexity affect generalization on the YiPrint-694 dataset, focusing on f-score, precision, and recall. Beyond the commonly used ResNet family, we incorporate three representative Vision Transformer variants, ViT-Tiny, ViT-Small, and ViT-Base, to assess whether recent Transformer architectures can provide advantages for this fine-grained detection task.

As shown in Table 3, deeper ResNet architectures such as ResNet-45, ResNet-50, and ResNet-101 do not surpass the lightweight ResNet-18. Their increased parameter counts tend to cause overfitting to the Yi script. In contrast, ResNet-18 achieves the highest f-score across all compared backbones, confirming that a compact CNN architecture is better suited to capturing the fine-grained, topologically complex structures of Yi characters. A similar trend is observed within the Vision Transformer family. Although ViT-Base contains substantially more parameters than ViT-Tiny, it does not achieve superior performance; instead, ViT-Tiny produces the second-highest f-score and precision, as well as the highest recall among the compared feature extractors. This indicates that lighter Transformer backbones can model Yi character structure more effectively than their heavier counterparts.

Table 3 Detection results with different backbone settings

Despite their strong representational capacity, the Transformer backbones incur a substantially higher computational cost than the ResNet family while failing to outperform the far more compact ResNet-18. Moreover, the fact that ViT-Tiny surpasses ViT-Base despite having far fewer parameters demonstrates that increasing model size does not inherently lead to better generalization for Yi character detection. Considering both accuracy and efficiency, we select the lightweight ResNet-18 as the backbone for FGRL-YiNet.

As detailed in Table 4, the ablation study of the DConv, DB, and AMSF modules confirms the performance advantages of the integrated FGRL-YiNet architecture. We hypothesized that the intricate strokes, spatial distortions, and multi-scale patterns endemic to Yi script necessitate a feature-learning paradigm in which complementary modules operate in concert. This is validated by our ablation sequence: beginning with a ResNet-18 backbone, integrating DConv improves the f-score from 90.3% to 91.0%, as it reshapes the model's geometric representation to align with the non-rigid structures of Yi characters. This adaptive feature foundation is then enhanced by the joint introduction of DB and AMSF. Within this framework for Yi script, DB explicitly tackles document degradation and low contrast, while AMSF coherently combines multi-scale contextual information. The key innovation is their cascaded, cooperative operation: DConv first ensures that features are geometrically robust, and DB and AMSF then jointly refine semantic clarity and scale invariance. This workflow creates a mutual enhancement effect that addresses the core challenges of geometric distortion, ambiguous boundaries, and multi-scale representation in historical Yi documents, thereby establishing a new state of the art for this culturally critical task.

Table 4 Detection results with different settings of deformable convolution, differentiable binarization (DB), and adaptive multi-scale fusion (AMSF) module

Further improvements are observed with the addition of DB, which increases the f-score to 91.8%, with precision rising to 92.1% and recall to 91.5%. DB helps the model effectively separate Yi characters from the background, preserving small gaps and internal strokes that are essential for accurate detection, particularly in historical manuscripts. The introduction of AMSF further improves performance, bringing the f-score to 92.3%, with precision of 92.7% and recall of 91.9%. AMSF is crucial for handling the multi-scale nature of Yi characters, which can appear in various sizes, especially in densely packed manuscript layouts. By fusing features across different scales, AMSF ensures reliable detection of both small and large characters. The combination of all three modules, DConv, DB, and AMSF, results in the best overall performance, with an f-score of 94.7%, precision of 98.4%, and recall of 91.3%, highlighting the synergy between these components in accurately detecting Yi characters with their unique structural complexity.

To validate the pivotal architectural innovation of our hierarchical attention-and-adaptation mechanism, we deconstructed the FGRL-YiNet to isolate the contributions of the DConv, SE, and EPSA modules. This analysis demonstrates how the architecture synergistically combines spatial deformation modeling with multi-granularity feature refinement. This design is a direct response to the quintessential challenges of Yi script, where geometric distortion, fine-grained stroke complexity, and channel-wise feature redundancy coexist and compound. As shown in Table 5, our ablation experiments demonstrate that neither spatial adaptation nor attention mechanisms alone are sufficient; their orchestrated integration is key to the observed performance breakthrough. The DConv module serves as the foundational stage, providing essential spatial adaptability to capture the non-rigid, deformed strokes of Yi characters. Building upon this geometrically robust foundation, the SE and EPSA attention modules operate in a complementary hierarchy: the SE module acts as a global channel-wise filter, efficiently suppressing less informative features, while the EPSA module performs a subsequent dense, pixel-wise calibration, crucially amplifying the local spatial details definitive for distinguishing intricate glyphs. The novelty of our work lies in this cascaded ‘Adapt-Filter-Amplify’ pipeline, which provides a cohesive and effective framework for holistically addressing the intertwined challenges of shape, scale, and detail inherent to Yi character detection, offering a promising approach for complex historical document analysis.

Table 5 Detection results with different settings of deformable convolution, squeeze-and-excitation attention, and enhanced pixel-wise squeeze-and-excitation attention modules

Starting with ResNet-18, the model achieves an f-score of 90.3%, precision of 90.2%, and recall of 90.5%. While this configuration performs reasonably well, it struggles with Yi characters due to the spatial distortions, intricate strokes, and fine details that are common in historical manuscript images. The introduction of Deformable Convolution (DConv) enhances the model’s ability to handle these spatial and shape variations, improving the f-score to 91.0%. This increase is mainly due to DConv’s ability to adapt the receptive field, which helps the model detect characters with distortions and irregular shapes, a common challenge in Yi character detection. The flexibility of DConv is particularly useful for recovering Yi characters that are spatially distorted or fragmented in ancient manuscripts. The addition of SE further improves the model’s performance, bringing the f-score to 91.9%, precision to 92.4%, and recall to 91.7%. SE refines the model’s channel-wise feature extraction, emphasizing critical features such as stroke width and fine details, which are crucial for accurately identifying Yi characters. This gain stems from channel-wise recalibration rather than from the modeling of spatial distortions, which remains the role of DConv.

The addition of EPSA provides a significant boost, bringing the f-score to 93.5% and precision to 97.2%, with recall at 90.8%. EPSA effectively reduces false positives and focuses on the most discriminative features of Yi characters, such as small internal gaps and intricate strokes that are difficult to detect in degraded manuscripts. This improvement demonstrates EPSA’s ability to handle the fine details that are characteristic of Yi characters and essential for accurate detection. Finally, the combination of DConv and EPSA achieves the best performance, with an f-score of 94.7%, precision of 98.4%, and recall of 91.3%. This model strikes the best balance between recall and precision among the tested configurations, making it highly effective in detecting Yi characters in degraded manuscript images, where both spatial distortions and fine structural details are common. The synergy between DConv, which recovers detections missed due to spatial variations, and EPSA, which improves focus on the most relevant features, demonstrates the value of combining spatial adaptation with attention mechanisms to address the unique challenges of Yi character detection.

Experimental evaluation on public benchmarks

The evaluation also includes the MTHv2 dataset³⁶, a collection comprising images from the “Han-Chosun Tripitaka” and “Han Chinese Tripitaka” designed for text localization and detection. It includes 2399 training images and 800 test images, with all text regions meticulously annotated using quadrilateral bounding boxes. These annotations enable precise text localization, making the dataset well suited to research on text detection in historical manuscript images.

To address the challenges of text detection in culturally diverse historical documents, we present a comparative evaluation of deep learning-based text detection frameworks. We benchmark our proposed method against six state-of-the-art models (PSENet, FCENet, TextPMs, TextBPN++, CBNet, DBNet++) using standardized metrics (f-score, precision, recall). The comparative results on the MTHv2 dataset are presented in Table 6. An extended comparative analysis of model structures reveals that while FGRL-YiNet achieves the second-best f-score on the MTHv2 dataset (95.7%), it is marginally outperformed by PSENet (96.3%). This performance gap can be partially explained by the inherent structural differences between the two models. PSENet is specifically designed with a progressive scale expansion mechanism, which excels at delineating densely packed and arbitrarily shaped text instances, a feature common in the MTHv2 dataset due to its inclusion of curved, rotated, and multi-font scripts. In contrast, FGRL-YiNet is architected as a specialized paradigm for fine-grained feature recovery, focusing on adaptive feature refinement through DConv and joint attention. This primary innovation, while exceptionally effective at capturing the intricate strokes of Yi characters (as shown in our ablation studies), may be less aggressive in boundary expansion in general scenes, slightly limiting recall (95.4%) in scenes with overlapping or fragmented text lines.

Table 6 Detection results of different methods on the MTHv2 dataset

When examining error sources in the context of MTHv2 characteristics, the slight gap in recall between FGRL-YiNet and PSENet (95.4% vs. 95.8%) can be attributed to dataset-specific challenges. This dataset includes multi-lingual characters, varied aspect ratios, and background noise, such as blur, low contrast, and document artifacts, which complicate boundary inference and text segmentation. While FGRL-YiNet demonstrates strong generalizability across styles, its joint attention module may prioritize more salient strokes and inadvertently suppress weaker character boundaries in noisy regions, leading to a small number of false negatives. In contrast, PSENet's iterative kernel expansion mechanism is explicitly tailored to recover full object extents, allowing it to capture even partially degraded characters more robustly. This comparison underscores that FGRL-YiNet's core innovation, a specialized architecture for fine-grained feature learning, represents a distinct paradigm for addressing the unique challenges of low-resource historical scripts, rather than an incremental improvement in general text detection. Nevertheless, FGRL-YiNet maintains a better balance between precision and recall across diverse layouts, indicating its strong potential for transferability and real-world applicability in rare-script scenarios.

Discussion

Yi script detection presents a set of challenges that differ fundamentally from mainstream scene-text tasks, owing to its strong curvilinearity, dense stroke convergence, and high structural homogeneity across characters. These characteristics significantly constrain the generalization ability of existing text detectors. Classical two-stage methods, such as Faster R-CNN, struggle to delineate irregular, highly curved glyph contours, whereas progressive contour-evolution approaches, such as PSENet and TextSnake, are sensitive to stroke fragmentation and background interference. Single-shot CNN-based detectors, including DBNet, FCENet, and TextBPN, achieve competitive boundary localization but are limited by their reliance on fixed geometric priors, thereby reducing their discriminability for characters with subtle intra-class variations. More advanced multi-scale pipelines, such as PANet and DBNet++, improve hierarchical aggregation but still struggle to capture the nuanced geometric deviations prevalent in historical Yi manuscripts. In contrast, our method demonstrates superior robustness, achieving 94.7 F-score, 98.4 Precision, and 91.3 Recall, indicating its stronger ability to model the fine structural distinctions required for reliable detection of visually similar and geometrically complex Yi characters.

The superior performance of FGRL-YiNet establishes a pivotal architectural principle: accurately detecting Yi characters under severe degradation requires a purpose-built framework that goes beyond generic object-detection paradigms. Our findings demonstrate that the synergistic integration of DConv and joint attention is not merely beneficial but essential, forging a representation space that is both geometrically invariant to spatial distortions and semantically discerning of fine-grained strokes. This core “deform-and-refine” strategy directly addresses the fundamental failure of standard CNNs, their static, rigid nature, in capturing the low-contrast and broken strokes endemic to historical scripts. More broadly, this work transcends the Yi script itself, offering a scalable blueprint for analysing other “analytically underserved” writing systems. It posits that the key to advancement in low-resource digital heritage preservation lies not in brute-force parameter scaling, but in architectural specificity and adaptive, fine-grained feature learning, thereby charting a more efficient and interpretable path forward for the field.

For future work, we outline four interconnected research directions specifically designed to address the unique challenges of Yi script analysis. First, we will investigate more expressive backbone designs and feature aggregation strategies to further enhance the modeling of fine-grained stroke variations that are critical for Yi character detection. Second, we will incorporate graph-based relational modeling to explicitly represent the complex topological structures and spatial dependencies among Yi character components, with particular focus on their distinctive curvilinear strokes and internal geometric patterns. Third, we will expand our data framework by continually collecting and annotating additional Yi documents to increase dataset diversity and address the scarcity of high-quality annotated materials. We acknowledge that the relatively small dataset may introduce robustness constraints, and enlarging YiPrint-694 is therefore essential for further improving model generalization. Finally, we aim to develop a culturally aware multimodal analysis system capable of jointly reasoning over textual and visual cues to handle the distinctive degradations of Yi manuscripts, including ink bleed-through, background interference, and the script’s characteristic spatial arrangements. Collectively, these directions form a systematic trajectory from architectural refinement to culturally grounded system integration, advancing the digital preservation of Yi cultural heritage.

Data availability

The Yi script dataset constructed in this study is intended for further comparative evaluations and extended research in our upcoming work. Therefore, it is currently available from the first author upon reasonable request for academic purposes.

Code availability

The underlying code for this study is not publicly available due to proprietary considerations and ongoing research. However, the implementation details and relevant scripts may be made available to qualified researchers upon reasonable request to the corresponding author.

References

Cheng, J. et al. Genetic polymorphism of 19 autosomal STR loci in the Yi ethnic minority of Liangshan Yi Autonomous Prefecture from Sichuan Province in China. Sci. Rep. 11, 16327 (2021).
Bi, X., Sun, Z. & Chen, Z. A novel unsupervised contrastive learning framework for ancient Yi script character dataset construction. Heritage Sci. 13, 39 (2025).
Sun, H. et al. Linguistic-visual based multimodal Yi character recognition. Sci. Rep. 15, 11874 (2025).
Sun, H., Ding, X., Sun, J., Yu, H. & Zhang, J. A method for detecting and recognizing Yi character based on deep learning. Comput. Mater. Contin. 78, 2721 (2024).
Zheng, J., Li, M., Li, X., Zhang, P. & Wu, Y. Revisiting local and global descriptor-based metric network for few-shot SAR target classification. IEEE Trans. Geosci. Remote Sens. 62, 1–14 (2024).
Zheng, J., Li, M., Zhang, P., Wu, Y. & Chen, H. Position-aware graph neural network for few-shot SAR target classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 8028–8042 (2024).
Liu, X., Han, X., Chen, S., Dai, W. & Ruan, Q. Ancient Yi script handwriting sample repository. Sci. Data. 11, 1183 (2024).
Jia, X., Gong, W. & Yuan, J. Handwritten Yi character recognition with density-based clustering algorithm and convolutional neural network. Proc. IEEE Int. Conf. Comput. Sci. Eng. 1, 337–341 (2017).
Yin, Z. et al. Yi characters online handwriting recognition models based on recurrent neural network: RnnNet-Yi and ParallelRnnNet-Yi. Proc. Int. Conf. Front. Handwriting Recognit. 13639, 375–388 (2022).
Sruthi, K. S., Sreekumar, A. & Balakrishnan, K. Text-attributed community detection in complex networks through LLMs and GNNs: a powerful fusion of language and graphs. Neurocomputing. 647, 130573 (2025).
Cui, S., Duan, K., Ma, W. & Shinnou, H. CMGN: text GNN and RWKV MLP-mixer combined with cross-feature fusion for fake news detection. Neurocomputing. 633, 129811 (2025).
Wu, J. et al. A survey on LLM-generated text detection: necessity, methods, and future directions. Comput. Linguist. 51, 275–338 (2025).
Ye, Q. & Doermann, D. Text detection and recognition in imagery: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 37, 1480–1500 (2014).
Qu, C., Zhong, Y., Guo, F. & Jin, L. Revisiting tampered scene text detection in the era of generative AI. Proc. AAAI Conf. Artif. Intell. 39, 694–702 (2025).
Zhang, J., Li, D., Zeng, Z., Zhang, R. & Wang, J. Dual-branch crack segmentation network with multi-shape kernel based on convolutional neural network and mamba. Eng. Appl. Artif. Intell. 150, 110536 (2025).
Xu, C. et al. Fusion-based graph neural networks for synergistic underwater image enhancement. Inf. Fusion. 117, 102857 (2025).
Shen, J. et al. An algorithm based on lightweight semantic features for ancient mural element object detection. npj Heritage Sci. 13, 70 (2025).
Shen, J. et al. Finger vein recognition algorithm based on lightweight deep convolutional neural network. IEEE Trans. Instrum. Meas. 71, 1–13 (2021).
Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 640–651 (2017).
Renton, G. et al. Fully convolutional network with dilated convolutions for handwritten text line segmentation. Int. J. Doc. Anal. Recognit. 21, 177–186 (2018).
Tian, Z., Huang, W., He, T., He, P. & Qiao, Y. Detecting text in natural image with connectionist text proposal network. Proc. Eur. Conf. Comput. Vis. 9912, 56–72 (2016).
Zhou, X. et al. EAST: an efficient and accurate scene text detector. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 5551–5560, https://doi.org/10.1109/CVPR.2017.283 (2017).
Wang, W. et al. Shape robust text detection with progressive scale expansion network. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 9336–9345, https://doi.org/10.1109/CVPR.2019.00956 (2019).
Liao, M., Wan, Z., Yao, C., Chen, K. & Bai, X. Real-time scene text detection with differentiable binarization. Proc. AAAI Conf. Artif. Intell. 34, 11474–11481 (2020).
Liao, M., Zou, Z., Wan, Z., Yao, C. & Bai, X. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Trans. Pattern Anal. Mach. Intell. 45, 919–931 (2022).
Zhu, Y. et al. Fourier contour embedding for arbitrary-shaped text detection. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 3123–3131, https://doi.org/10.1109/CVPR46437.2021.00314 (2021).
Jiang, Q. et al. T-Rex2: towards generic object detection via text-visual prompt synergy. Proc. Eur. Conf. Comput. Vis. 38–57, https://doi.org/10.1007/978-3-031-73414-4_3 (2024).
Ranjbarzadeh, R. et al. A deep learning approach for robust, multi-oriented, and curved text detection. Cognit. Comput. 16, 1979–1991 (2024).
Maktabdar Oghaz, M., Saheer, L. B., Dhame, K. & Singaram, G. Detection and classification of ChatGPT-generated content using deep transformer models. Front. Artif. Intell. 8, 1458707 (2025).
Wang, X. et al. On the effect of the attention mechanism for automatic welding defects detection based on deep learning. Expert Syst. Appl. 268, 126386 (2025).
Yu, Y., Zhang, Y., Cheng, Z., Song, Z. & Tang, C. Multi-scale spatial pyramid attention mechanism for image recognition: an effective approach. Eng. Appl. Artif. Intell. 133, 108261 (2024).
Guo, M. H. et al. Attention mechanisms in computer vision: a survey. Comput. Visual Media. 8, 331–368 (2022).
Girshick, R. Fast R-CNN. In Proc. IEEE International Conference on Computer Vision (ICCV) 1440–1448, https://doi.org/10.1109/ICCV.2015.169 (2015).
Long, S. et al. Textsnake: a flexible representation for detecting text of arbitrary shapes. Proc. Eur. Conf. Comput. Vis. 20–36, https://doi.org/10.48550/arXiv.1807.01544 (2018).
Zhang, S. X., Zhu, X., Yang, C., Wang, H. & Yin, X. C. Adaptive boundary proposal network for arbitrary shape text detection. In Proc. International Conference on Computer Vision (ICCV) 1305–1314, https://doi.org/10.1109/ICCV48922.2021.00134 (2021).
Shi, Y., Peng, D., Zhang, Y., Cao, J. & Jin, L. A large-scale dataset for Chinese historical document recognition and analysis. Sci. Data 12, 169 (2025).
Wang, K., Liew, J. H., Zou, Y. & Zhou, D. PANet: few-shot image semantic segmentation with prototype alignment. In Proc. International Conference on Computer Vision (ICCV) 9197–9206, https://doi.org/10.1109/ICCV.2019.00929 (2019).
Zhang, S. X. et al. Arbitrary shape text detection via segmentation with probability maps. IEEE Trans. Pattern Anal. Mach. Intell. 45, 2736–2750 (2022).
Zhang, S. X., Yang, C., Zhu, X. & Yin, X. C. Arbitrary shape text detection via boundary transformer. IEEE Trans. Multimed. 26, 1747–1760 (2023).
Liu, Y. et al. CBNet: a novel composite backbone network architecture for object detection. Proc. AAAI Conf. Artif. Intell. 34, 11653–11660 (2020).

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China Grants 61972062, 62402085, the Liaoning Basic Research Project 2023JH2/101300191, the Liaoning Doctoral Research Start-up Fund 2023-BS-078, the Dalian Youth Science and Technology Star Project 2023RQ023, and the Open Project of Key Laboratory of Ethnic Language Intelligent Analysis and Security Management of MOE Project ORP-202401.

Author information

Authors and Affiliations

School of Chinese Ethnic Minority Languages and Literatures, Minzu University of China, Beijing, China
Haipeng Sun
Key Laboratory of Ethnic Language Intelligent Analysis and Security Management of MOE, Minzu University of China, Beijing, China
Haipeng Sun
School of Computer Science and Engineering, Dalian Minzu University, Dalian, China
Haipeng Sun, Xueyan Ding & Jianxin Zhang
Yi Language Research Room, China Ethnic Languages Translation Centre, Beijing, China
Hua Yu
School of Information, University of California, Berkeley, CA, USA
Zukang Yang

Authors

Haipeng Sun
Xueyan Ding
Hua Yu
Zukang Yang
Jianxin Zhang

Contributions

The authors confirm contribution to the paper as follows: study conception and design: H.P. Sun, X.Y. Ding, and Z.K. Yang; data collection: H.P. Sun, H. Yu; analysis and interpretation of results: H.P. Sun, X.Y. Ding, Z.K. Yang, and J.X. Zhang; draft manuscript preparation: H.P. Sun, X.Y. Ding, and J.X. Zhang. All authors reviewed the results and approved the final version of the manuscript.

Corresponding authors

Correspondence to Haipeng Sun, Xueyan Ding or Jianxin Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Sun, H., Ding, X., Yu, H. et al. Fine grained representation learning for low resource Yi script detection and dataset construction. npj Herit. Sci. 14, 183 (2026). https://doi.org/10.1038/s40494-026-02418-6

Received: 10 October 2025
Accepted: 27 February 2026
Published: 26 March 2026
Version of record: 26 March 2026
DOI: https://doi.org/10.1038/s40494-026-02418-6
