Introduction

Oracle Bone Script, the earliest known form of Chinese writing, serves both as an ancient script and a crucial historical artifact for understanding the life and times of the Shang Dynasty (circa 1600–1046 BCE). These inscriptions were meticulously carved on turtle shells and animal bones, making significant contributions to the understanding of the origins of Chinese civilization1. In recent years, advancements in computer vision technology, particularly through deep convolutional neural networks, have been utilized to decipher these enigmatic scripts, providing invaluable assistance to archaeologists and experts in ancient script studies.

However, automatic recognition of Oracle Bone Script still faces three main challenges: (1) the scarcity of annotated data; (2) the difficulty of separating the intertwined texture and structural features in Oracle Bone Script images; and (3) the need for highly discriminative features to distinguish visually similar characters and to handle wear and deformation.

First, annotated data are scarce. Oracle Bone Script characters are extremely rare, and annotating them requires highly specialized knowledge. The process is expensive and labor-intensive, so little annotated data is available and traditional supervised learning methods struggle to be effective. Data augmentation is widely used to alleviate this shortage. However, traditional augmentation techniques such as rotation2, scaling3, shearing4, and flipping5 show significant limitations on Oracle Bone Script6,7: the semantic information of a character relies heavily on structural features such as the shape and relative positions of strokes, and simple geometric transformations may disrupt these features. To address this, non-rigid augmentation methods such as Free-Form Deformation (FFD)8 and Elastic Transformation (ER)9 have been proposed. These methods generate new samples by altering the shape and alignment of images without directly modifying the image content. However, FFD lacks fine control over complex local features and may fail to preserve the subtle structures of Oracle Bone Script. ER is effective at adjusting spatial alignment, but its fixed regularization parameters cannot adapt to subtle variations in Oracle Bone Script images, so some local features are lost during augmentation. In addition, both methods are computationally intensive on high-resolution Oracle Bone Script images, which makes them less suitable for large-scale applications.

Second, the intertwining of texture and structural features in Oracle Bone Script images complicates feature extraction. Oracle Bone Script rubbings (target domain) and reproductions (source domain) differ significantly in texture and structure. Reproduction data have clear glyphs and primarily contain structural information, whereas rubbing data include complex noise such as blurring and degradation and therefore carry rich texture features. If these features are aligned indiscriminately, the abundant texture information may mislead the model and make it difficult to learn domain-invariant features. Most domain adaptation methods10,11,12 cannot effectively reduce the feature distribution discrepancies between domains. Some studies13 attempt to decouple texture and structural information but often overlook the role of texture. For example, UDCN14 incorporates only structural semantic information into adaptation and neglects the impact of texture information related to wear, stains, and deformation. MixupAda15 combines mixup data augmentation with adversarial domain adaptation to improve adaptation performance. TransPar16 is a transformer-based partial domain adaptation approach that focuses on transferring feature representations shared between the source and target domains. FixBi17 is a bidirectional alignment method that aims to bridge the feature spaces of the source and target domains. PRONOUN18 uses prototypical representations and normalized output conditioners to enhance model generalization. BSP19 balances transferability and discriminability via batch spectral penalization for adversarial domain adaptation.

Lastly, because writing styles differ and the strokes and structures of Oracle Bone Script characters vary significantly, models need highly discriminative feature learning capabilities. Recognizing Oracle Bone Script characters requires not only distinguishing visually similar characters but also handling subtle differences caused by wear and deformation.

To address the aforementioned three challenges, this paper introduces OracleNet. Specifically, we designed three modules in this network: the Adaptive Deformation Module (ADM), the Texture–Structure Decoupling Module (TSDM), and the Multi-Level Structured Perceptual Attention Module (MLSPAM). The ADM enhances fine local deformation control in Oracle Bone Script images by introducing adaptive control points based on FFD, allowing for subtle adjustments that preserve the key structural features of the characters. This approach enables the model to better adapt to complex local features in the image, reducing semantic information loss due to deformation. The TSDM separates texture and structural features within the image through feature decoupling, enabling the model to better understand and utilize the different types of information present in the image. The MLSPAM enhances the model’s perception of structural features through a multi-level attention mechanism. Utilizing self-attention, this module captures crucial information in the image at both macro and micro levels. At the macro level, the attention mechanism helps the model identify key areas and features within the image; at the micro level, it refines these features further, ensuring the model can distinguish visually similar Chinese characters while dealing with subtle differences caused by wear and deformation. The design of the MLSPAM aims to enhance the model’s adaptability to complex Oracle Bone Script images and improve overall recognition performance. With these three modules, OracleNet can not only precisely handle complex deformations and noise in Oracle Bone Script images but also effectively separate and utilize structural and textural features, significantly enhancing the recognition accuracy and robustness of Oracle Bone Script characters.

The contributions of this paper are primarily as follows:

(1) The ADM in our proposed OracleNet utilizes adaptive control points for finer local deformation adjustments in Oracle Bone Script images. These adjustments not only preserve the key structural features of the Oracle Bone Script characters but also adapt to complex local features within the image, thereby reducing the loss of semantic information due to deformation.

(2) The TSDM in OracleNet effectively separates texture and structural features in Oracle Bone Script images through feature decoupling techniques. This separation allows the model to more accurately understand and utilize different information within the image, enhancing its recognition capability for Oracle Bone Script.

(3) The MLSPAM we designed employs an attention mechanism to capture crucial information from both macro and micro levels of the image. This design not only helps the model identify key areas and features within the image but also refines these features further, ensuring the model can distinguish visually similar Chinese characters and handle subtle differences caused by wear and deformation.

(4) Through extensive experimental validation on multiple Oracle Bone Script datasets, OracleNet significantly outperforms existing methods in terms of recognition accuracy and robustness. Our method shows improvements in the accuracy of Oracle Bone Script character recognition by 2.5% on the Oracle-241 dataset, 1.74% on the OBC306 dataset, and 2.07% on the Oracle-MNIST dataset compared to the previous best methods.

Unlike previous methods, this paper utilizes the structural information of the Oracle Bone Script itself (including the shape, length, and relative positions of strokes) combined with textural information (such as cracks and wear marks) for domain adaptation to reduce performance discrepancies when processing different datasets.

Methods

Problem formulation

Given M labeled handprint (reproduction) samples Xs from the source domain with corresponding labels Ys, and N unlabeled topographic (rubbing) samples Xt from the target domain, there exists a significant distribution discrepancy between the source and target domains, i.e., P(Xs) ≠ P(Xt).

The goal of this study is to train a model G that can generalize well to topographic data by training on both the labeled source domain {Xs, Ys} and the unlabeled target domain {Xt}. Specifically, the aim is to optimize the model G to minimize the prediction error across both domains, expressed as:

$$\mathop{\min }\limits_{G}\Gamma (G;{X}^{s},{Y}^{s},{X}^{t})=\lambda {\Gamma }^{s}(G;{X}^{s},{Y}^{s})+(1-\lambda ){\Gamma }^{t}(G;{X}^{t})$$
(1)

Here Γs and Γt represent the loss functions for the source and target domains, respectively, and λ is a hyperparameter that balances the importance of the two domains. Thus, the model G is designed to acquire domain-invariant features that are equally effective in both domains, thereby enhancing performance on unseen target domain topographic data.

Overview

As illustrated in Fig. 1, the proposed model, OracleNet, includes three modules: the ADM, the TSDM, and the MLSPAM, aimed at improving the processing effectiveness and generalization ability of Oracle Bone Script images. The ADM employs adaptive control point technology for precise local adjustments to Oracle Bone Script images. Compared to traditional FFD techniques, adaptive control points dynamically adjust based on the content of the image, more accurately handling complex local features and effectively preserving the intricate structures of the Oracle Bone Script while minimizing information loss during deformation.

Fig. 1: Overview of the OracleNet model.

This diagram depicts OracleNet’s three main modules: (1) the Adaptive Deformation Module, which uses adaptive control points for precise adjustments; (2) the Texture–Structure Decoupling Module, which separates textural and structural features to enhance recognition accuracy; and (3) the Multi-level Structured Perceptual Attention Module, which applies attention mechanisms to refine feature recognition at both macro and micro levels.

In the processing of Oracle Bone Script images, distinguishing between structural and textural features is particularly important. The TSDM effectively separates the textural and structural features of the images. This not only enhances the recognition accuracy of structural features but also allows the model to better handle texture noise caused by image wear and degradation. The MLSPAM utilizes a self-attention mechanism to enhance the perception and recognition of Oracle Bone Script images across multiple levels. By applying the attention mechanism at both macro and micro levels, the module not only identifies key areas and features within the images but also refines these features to differentiate visually similar Chinese characters and adapt to minor deformations and wear in the images.

Adaptive Deformation Module

Traditional FFD techniques manipulate an image’s local deformation by setting up a regular grid of control points on the image and moving these control points. Given the special structural features and detail requirements of Oracle Bone Script images, this paper introduces an adaptive control point technique that dynamically adjusts the density and orientation of control points based on the content of the image, thus achieving more precise local deformation control. Details are shown in Fig. 2. At this point, for the handprint source domain samples Xs, the improved FFD transformation function can be expressed as:

$${T}_{{{ENHANCE}}\_{{FFD}}}({X}^{s})={X}^{s}+\mathop{\sum }\limits_{i=1}^{N({X}^{s})}\Delta {P}_{i}$$
(2)

where N(Xs) is the number of control points based on the handprint source domain sample Xs, and ΔPi is the displacement vector of control point i.

Fig. 2: Illustration of Adaptive Deformation.

The blue points represent control points and the blue arrows represent the direction of displacement.

The displacement vector ΔPi for each control point i depends not only on the original position of the control point but also on the feature changes in the surrounding local area. The direction and magnitude of the displacement vector are determined by the following process:

$$\Delta {P}_{i}=f(\nabla I({X}^{s}),{\theta }_{i})$$
(3)

where ∇I(Xs) represents the image gradient around control point i, and θi is an automatically adjusted parameter that modifies the displacement direction and range based on the image content. This adjustment ensures that the movement of control points enhances the structural representation capabilities of the image. θi is computed automatically as follows:

For handprint source domain samples Xs with clear structures and distinct strokes, this paper utilizes the edge intensity and directionality of the image as the primary features to dynamically compute θi:

$$\begin{array}{ll}&{\theta }_{i}({X}^{s})={\alpha }_{1}\cdot EdgeMagnitude({X}^{s})+{\alpha }_{2}\cdot EdgeOrientation({X}^{s})\\ &EdgeMagnitude({X}^{s})=\sqrt{{I}_{x}^{2}+{I}_{y}^{2}}\\ &EdgeOrientation({X}^{s})=\arctan ({I}_{y}/{I}_{x})\\ &{I}_{x}={G}_{x}* {X}^{s}\\ &{I}_{y}={G}_{y}* {X}^{s}\\ \end{array}$$
(4)

where EdgeMagnitude(Xs) and EdgeOrientation(Xs) represent the edge strength and direction near control point i. EdgeMagnitude() quantifies the degree of significant changes in the image near the control points, which is useful for controlling deformations. EdgeOrientation() helps adjust the movement direction of the control points to align with the edge directions in the image, thus maintaining the continuity and integrity of the image structure. Ix and Iy represent the first-order derivatives of the handprint source domain sample near control point i along the x and y directions. Gx and Gy are Sobel operators, commonly used for edge detection by highlighting regions of high spatial frequency that correspond to edges. α1 and α2 are coefficients that weigh the contributions of edge magnitude and edge orientation, respectively.
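The edge-based computation of θi can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the helper names `filter2d` and `edge_theta` are hypothetical, the orientation uses the conventional gradient angle arctan2(Iy, Ix), and averaging the two maps over the patch around a control point is an assumption, since the text does not say how the per-pixel values are reduced to a single θi. The default α1 and α2 are the values reported in the implementation details.

```python
import numpy as np

# Standard 3x3 Sobel kernels G_x and G_y.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def filter2d(image, kernel):
    """Valid-mode 2-D cross-correlation of a grayscale image with a small kernel."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * image[dy:dy + out_h, dx:dx + out_w]
    return out


def edge_theta(patch, alpha1=0.53, alpha2=0.62):
    """Return (theta, magnitude map, orientation map) for a patch around a control point."""
    ix = filter2d(patch, SOBEL_X)          # derivative along x
    iy = filter2d(patch, SOBEL_Y)          # derivative along y
    magnitude = np.sqrt(ix ** 2 + iy ** 2)
    orientation = np.arctan2(iy, ix)       # conventional gradient angle (assumption)
    # Assumption: reduce each map to a scalar by averaging over the patch.
    theta = alpha1 * magnitude.mean() + alpha2 * orientation.mean()
    return theta, magnitude, orientation


# Usage: a vertical step edge has a purely horizontal gradient.
patch = np.zeros((5, 5))
patch[:, 2:] = 1.0
theta, magnitude, orientation = edge_theta(patch)
```

For the step-edge patch, the gradient is nonzero only along x, so the orientation map is zero and θi is driven entirely by the edge magnitude term.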

For the target domain topographic samples Xt that include noise and degradation, this paper utilizes local contrast and noise suppression as the primary features to dynamically compute θi:

$$\begin{array}{ll}&{\theta }_{i}({X}^{t})={\beta }_{1}\cdot LocalContrast({X}^{t})+{\beta }_{2}\cdot NoiseSuppression({X}^{t})\\ &LocalContrast({X}^{t})=\frac{1}{W\times H}\mathop{\sum}\limits _{x,y}\parallel {X}_{x,y}^{t}-{\mu }_{{{local}}}\parallel \\ &NoiseSuppression({X}^{t})=1-\frac{{\tau }_{{{local}}}}{{\tau }_{{{global}}}}\end{array}$$
(5)

where LocalContrast(Xt) represents the local contrast, emphasizing the visibility of important features in the image; NoiseSuppression(Xt) represents the degree of noise suppression, which helps reduce the impact of noise in the control point displacement; β1 and β2 are coefficients that weigh the contributions of local contrast and noise suppression, respectively. \({X}_{x,y}^{t}\) is the pixel value of the image Xt at position (x, y), μlocal is the average pixel value in the area near control point i of Xt, W and H are the width and height of the local window, τlocal is the standard deviation of pixel values in the area near control point i of Xt, τglobal is the standard deviation of pixel values across the entire image Xt.
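A small NumPy sketch of Eq. (5) is given below. It is illustrative only: the function name `rubbing_theta`, the square window of half-width `half`, and the fallback used when the global standard deviation is zero are assumptions that the text does not specify. The default β1 and β2 are the values reported in the implementation details.

```python
import numpy as np


def rubbing_theta(image, cy, cx, half=3, beta1=0.74, beta2=0.43):
    """Return (theta, local_contrast, noise_suppression) for a control point at (cy, cx).

    `image` is a 2-D grayscale array; the local window is clipped at the image border.
    """
    window = image[max(cy - half, 0):cy + half + 1,
                   max(cx - half, 0):cx + half + 1]
    mu_local = window.mean()
    # (1 / (W * H)) * sum of |X_xy - mu_local| over the local window.
    local_contrast = np.abs(window - mu_local).mean()
    tau_local = window.std()
    tau_global = image.std()
    # Assumption: a constant image (tau_global == 0) yields no suppression term.
    noise_suppression = 1.0 - tau_local / tau_global if tau_global > 0 else 0.0
    theta = beta1 * local_contrast + beta2 * noise_suppression
    return theta, local_contrast, noise_suppression


# Usage: a flat region has zero contrast and zero local deviation.
image = np.zeros((16, 16))
image[:, 8:] = 1.0
theta_flat, contrast_flat, suppression_flat = rubbing_theta(image, cy=4, cx=2)
```

In the flat region the local standard deviation is zero, so the noise suppression term equals 1 and θi reduces to β2.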

With the improvements described above, the ADM enables the creation of enhanced samples Fs from the handprint source domain samples Xs and Ft from the unlabeled target domain topographic samples Xt, as follows:

$${F}^{i}=({T}_{{{ENHANCE}}\_{{FFD}}}({X}^{i})),i\in \{s,t\}$$
(6)

This module aids in increasing data diversity and the model’s generalization ability without compromising the original semantic content of the Oracle Bone Script images. By implementing these deformations, the images retain their essential characteristics while adapting to variations that might be encountered in real-world scenarios, thus enhancing the robustness and accuracy of subsequent recognition tasks.
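To make the deformation step concrete, the sketch below warps an image from displacements defined on a coarse grid of control points. It is a simplification under stated assumptions, not the ADM itself: the control points lie on a regular grid (at least 2 × 2), their displacements are spread to every pixel by bilinear interpolation rather than the B-spline basis used in classical FFD, pixels are resampled by nearest neighbor, and the names `upsample_bilinear` and `ffd_warp` are hypothetical. In the ADM the displacements would come from Eq. (3); here they are supplied directly.

```python
import numpy as np


def upsample_bilinear(grid, out_h, out_w):
    """Bilinearly interpolate a coarse (gh, gw) grid, gh, gw >= 2, to (out_h, out_w)."""
    gh, gw = grid.shape
    ys = np.linspace(0, gh - 1, out_h)
    xs = np.linspace(0, gw - 1, out_w)
    y0 = np.floor(ys).astype(int).clip(0, gh - 2)
    x0 = np.floor(xs).astype(int).clip(0, gw - 2)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x0 + 1] * wx
    bottom = grid[y0 + 1][:, x0] * (1 - wx) + grid[y0 + 1][:, x0 + 1] * wx
    return top * (1 - wy) + bottom * wy


def ffd_warp(image, disp_y, disp_x):
    """Warp a 2-D image using control-point displacements given in pixels."""
    h, w = image.shape
    dy = upsample_bilinear(disp_y, h, w)   # dense vertical displacement field
    dx = upsample_bilinear(disp_x, h, w)   # dense horizontal displacement field
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Each output pixel reads from its displaced source location (nearest neighbor).
    src_y = np.clip(np.rint(yy + dy), 0, h - 1).astype(int)
    src_x = np.clip(np.rint(xx + dx), 0, w - 1).astype(int)
    return image[src_y, src_x]


# Usage: a 3 x 3 control grid with a uniform one-pixel horizontal displacement.
image = np.arange(36, dtype=float).reshape(6, 6)
zeros = np.zeros((3, 3))
shifted = ffd_warp(image, zeros, np.ones((3, 3)))
```

With zero displacements the image is returned unchanged, and a uniform displacement reduces to a plain shift, which is a convenient sanity check before plugging in content-dependent displacements.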

Texture–Structure Decoupling Module

To effectively process Oracle Bone Script images, especially topographic data with complex textures, it is crucial to distinguish between structural features and texture features.

Structural features refer to the intrinsic, shape-related, and semantically meaningful components that constitute the characters themselves. These features are fundamental to character identity and recognition. Specifically, structural features in Oracle Bone Script images primarily include: character shape and outline, stroke composition and arrangement, geometric information, and topological structure. These structural features are identified by focusing on the essential lines and curves that define the character’s form, while minimizing the influence of extraneous elements. Conversely, texture features in Oracle Bone Script images refer to the surface-level visual patterns and variations that are not directly related to the character’s semantic identity or structural form. These features are often domain-specific noise or artifacts introduced by the material properties of oracle bones, the carving process, the aging process, and the image acquisition process. Texture features in Oracle Bone Script images typically include: surface noise and grain, cracks and fissures, stains and discolorations, blurring and degradation, and wear and erosion marks.

Handprint images of Oracle Bone Script typically contain only structural features, while topographic data also include textural features14. The Texture–Structure Decoupling Module aims to extract structural features \({f}_{1}^{t}\) and textural features \({f}_{2}^{t}\) from the enhanced samples of the unlabeled target domain Ft. The core idea of this module is to achieve feature separation by minimizing and maximizing the differences between Ft and the enhanced samples from the handprint source domain Fs.

The structural features of Oracle Bone Script images primarily consist of their shape, outline, and other geometric information. The problem of extracting structural features \({f}_{1}^{t}\) can be expressed as:

$${f}_{1}^{t}=\arg \mathop{\min }\limits_{{f}_{1}^{t}}\parallel S({F}^{t})-S({F}^{s}){\parallel }^{2}$$
(7)

where S represents the operation for extracting structural features. The difference is minimized because Ft includes both structural and textural features, while Fs contains only structural features; the part of Ft that can be matched to Fs is therefore its structural content. By minimizing the difference described above, the structural features can be effectively extracted.

For the textural features of Oracle Bone Script images \({f}_{2}^{t}\), the problem can be expressed as:

$${f}_{2}^{t}=\arg \mathop{\max }\limits_{{f}_{2}^{t}}\parallel \Gamma ({F}^{t})-\Gamma ({F}^{s}){\parallel }^{2}$$
(8)

where Γ represents the operation for extracting textural features. Contrary to the extraction of structural features \({f}_{1}^{t}\), the textural features \({f}_{2}^{t}\) can be obtained by maximizing the difference. This approach emphasizes distinguishing the unique textural properties found in the target domain samples from those in the source domain samples.

In formulas (7) and (8), S represents an abstract operation for extracting structural features from an Oracle Bone Script image. It is not a single, fixed algorithm, but rather a conceptual representation of the process of isolating and emphasizing the shape-related, semantically meaningful components of the image, while suppressing or ignoring texture-related noise and variations. The implementation techniques for operation S (within the structure branch) in this paper are attention mechanisms (within MLSPAM, applied to structure features), guided by the structure feature loss Lossstructure.

Similarly, Γ represents an abstract operation for extracting texture features. It conceptually aims to isolate and capture the surface-level visual patterns and variations that are distinct from the character’s structural form and semantic content. Like operation S, Γ is not a single algorithm but a representation of the texture feature extraction process within the TSDM. The implementation techniques for operation Γ (within the texture branch) in this paper are network layers trained with texture-focused loss Losstexture.

In summary, operations S and Γ, as presented in formulas (7) and (8), are conceptual abstractions representing the distinct goals of structural and texture feature extraction within the TSDM. Operation S (structural feature extraction) aims to isolate and emphasize shape-related, semantically meaningful components, while Operation Γ (texture feature extraction) aims to capture surface-level visual patterns and variations distinct from structural form.

Therefore, the structural feature loss Lossstructure can be expressed as:

$$Los{s}_{structure}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\parallel {f}_{1}^{s}(i)-{f}_{1}^{t}(i){\parallel }_{2}^{2}$$
(9)

This structural feature loss function is used to minimize the structural differences between enhanced samples from the source domain and the target domain, employing mean squared error to measure the extent of the differences. Here, \(\parallel \cdot {\parallel }_{2}^{2}\) represents the squared Euclidean distance, N is the total number of samples, and \({f}_{1}^{s}(i)\) and \({f}_{1}^{t}(i)\) respectively represent the structural features of the ith sample from the source and target domains.

The texture feature loss Losstexture can be represented as:

$$Los{s}_{texture}=-\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\frac{{f}_{2}^{s}(i)\cdot {f}_{2}^{t}(i)}{\parallel {f}_{2}^{s}(i){\parallel }_{2}\parallel {f}_{2}^{t}(i){\parallel }_{2}}$$
(10)

This texture feature loss function is designed to maximize the textural differences between the source domain and the target domain. This can be achieved by minimizing a negative loss function, which essentially minimizes the similarity between textural features. Using the negative cosine similarity, it measures the directional differences between feature vectors. Here, \(\parallel \cdot {\parallel }_{2}\) represents the Euclidean norm of a vector, ⋅ denotes the dot product, and \({f}_{2}^{s}(i)\) and \({f}_{2}^{t}(i)\) respectively are the textural features of the ith sample from the source and target domains.
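The two decoupling losses can be sketched in NumPy as below. The function names and the `eps` stabilizer are assumptions made for illustration; features are taken to be arrays of shape (N, D) with one row per sample, and the structural loss compares the structural features of the two domains. The texture loss follows Eq. (10) exactly as written, including its leading minus sign. Note that, as written, a lower value corresponds to a higher cosine similarity, so a reader who wants the optimizer to push texture features apart would need to check the sign convention against the training code.

```python
import numpy as np


def structure_loss(f1_s, f1_t):
    """Mean squared Euclidean distance between paired structural features, shape (N, D)."""
    return float(np.mean(np.sum((f1_s - f1_t) ** 2, axis=1)))


def texture_loss(f2_s, f2_t, eps=1e-12):
    """Negative mean cosine similarity between paired texture features, as in Eq. (10).

    `eps` is an added stabilizer (assumption) that avoids division by zero.
    """
    dot = np.sum(f2_s * f2_t, axis=1)
    norms = np.linalg.norm(f2_s, axis=1) * np.linalg.norm(f2_t, axis=1)
    return float(-np.mean(dot / (norms + eps)))


# Usage with two toy samples of dimension 2.
features = np.array([[1.0, 0.0], [0.0, 2.0]])
identical = texture_loss(features, features)         # close to -1
orthogonal = texture_loss(features, features[::-1])  # exactly 0
offset = structure_loss(features, features + 1.0)    # 2.0
```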

Multi-Level Structured Perceptual Attention Module

This module focuses on the multi-level structural features of Oracle Bone Script images, encompassing everything from basic strokes to complex symbol combinations, followed by hierarchical attention learning and integration. Specifically, at the micro level, it captures edges and basic shapes of the Oracle Bone Script images, while at the macro level, it concentrates on the overall layout, the combination of symbols, and their interrelations. Details of this module are shown in Fig. 3.

Fig. 3

Details of Multi-Level Structured Perceptual Attention Module.

Micro-level feature extraction can be represented as:

$${\phi }_{micro}({F}^{s})=ReLU(Con{v}_{3\times 3}({F}^{s}))$$
(11)

Micro-level features are distinguished by their focus on fine-grained details and local structures within the Oracle Bone Script images. These features are extracted using a 3 × 3 convolutional kernel (Conv3 × 3), which provides a smaller receptive field, enabling the module to capture localized patterns. Specifically, micro-level features primarily capture edges and fine strokes of the characters. They also encompass basic shapes and local patterns that form the fundamental building blocks of the characters. Furthermore, these features are sensitive to details within individual strokes, such as subtle curves, junctions, and variations in stroke width. ReLU, the activation function, helps to highlight these micro-structural features.

Macro-level feature extraction can be represented as:

$${\phi }_{macro}({F}^{s})=ReLU(Con{v}_{5\times 5}(MaxPool({\phi }_{micro}({F}^{s}))))$$
(12)

Macro-level features, in contrast, are characterized by their emphasis on broader contextual information and overall structural patterns. As Eq. (12) shows, they are extracted by max-pooling the micro-level features and then applying a 5 × 5 convolutional kernel (Conv5 × 5). Max pooling reduces the spatial size of the feature maps and enlarges the receptive field, and the 5 × 5 kernel captures broader structural patterns, so the module can take in more global context. Macro-level features capture the overall layout and spatial arrangement of strokes within a character, as well as combinations of strokes and symbols that form larger semantic units. They are sensitive to the global shape of the Oracle Bone Script character and capture contextual relationships between its parts, providing a holistic view.

In essence, micro-level features focus on local details and fine structures, while macro-level features concentrate on the overall layout and broader contextual information of the Oracle Bone Script characters.

For the micro-level and macro-level features of Oracle Bone Script images, an attention module is designed for each level to learn the importance of the structural features at that level. This can be expressed as:

$$\begin{array}{l}{A}_{micro}({F}^{s})=\sigma (Conv({\phi }_{micro}({F}^{s})))\\ {A}_{macro}({F}^{s})=\sigma (Conv({\phi }_{macro}({F}^{s})))\\ \end{array}$$
(13)

where σ is the sigmoid function, and Conv is a convolution operation used to learn spatial attention from the hierarchical features.

The attention-weighted features from both the micro and macro levels are fused to obtain a comprehensive output feature Total(Fs):

$$\begin{array}{ll}Total({F}^{s})\,={\omega }_{micro}\cdot ({A}_{micro}({F}^{s})\odot {\phi }_{micro}({F}^{s}))\\\qquad\qquad\quad +\,{\omega }_{macro}\cdot ({A}_{macro}({F}^{s})\odot {\phi }_{macro}({F}^{s}))\\ \end{array}$$
(14)

where ωmicro and ωmacro are the learned weight parameters.
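The attention gating and fusion of Eqs. (13) and (14) can be illustrated with a short NumPy sketch. Several simplifications are assumed here and are not taken from the paper: the attention convolution is a 1 × 1 convolution (a weighted sum over channels), the micro- and macro-level feature maps have already been brought to the same shape (C, H, W), and the fusion weights are passed in as plain numbers rather than learned. The function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(features, weights, bias=0.0):
    """1x1 convolution over channels followed by a sigmoid: (C, H, W) -> (1, H, W)."""
    logits = np.tensordot(weights, features, axes=([0], [0])) + bias
    return sigmoid(logits)[None, :, :]


def fuse(phi_micro, phi_macro, w_micro, w_macro, omega_micro=0.5, omega_macro=0.5):
    """Weighted sum of attention-gated micro- and macro-level features."""
    a_micro = spatial_attention(phi_micro, w_micro)
    a_macro = spatial_attention(phi_macro, w_macro)
    # The attention maps broadcast over the channel axis (element-wise product).
    return omega_micro * (a_micro * phi_micro) + omega_macro * (a_macro * phi_macro)


# Usage: with zero attention weights every attention value is sigmoid(0) = 0.5.
phi_micro = np.ones((4, 3, 3))
phi_macro = np.ones((4, 3, 3))
total = fuse(phi_micro, phi_macro, np.zeros(4), np.zeros(4))
```

Because each attention value lies in (0, 1), the gate can only attenuate a feature, never amplify it; the relative emphasis between levels is carried by ωmicro and ωmacro.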

Classification via the addition of a fully connected layer can be represented as:

$${{\Phi }}\,=softmax(W\cdot Total({F}^{s})+b)$$
(15)

where W and b are the weight and bias parameters, respectively.

The classification loss function Losscategory using cross-entropy, can be represented as:

$$Los{s}_{category}=-\mathop{\sum }\limits_{c = 1}^{C}{y}_{c}\log ({{{\Phi }}}_{c})$$
(16)

where C is the total number of categories, yc indicates whether category c is the correct classification for the sample, and Φc, the cth component of Φ, represents the probability of the sample being classified as c.
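The classification head and its loss in Eqs. (15) and (16) amount to a linear layer, a softmax, and a cross-entropy term. The NumPy sketch below shows this for a single sample with a flattened feature vector; the function names, the flattening, and the `eps` term are illustrative assumptions.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def classify(total_features, W, b):
    """Fully connected layer followed by softmax: returns class probabilities."""
    return softmax(W @ total_features + b)


def cross_entropy(probs, label, eps=1e-12):
    """Cross-entropy for a one-hot target: only the true class term is nonzero."""
    return float(-np.log(probs[label] + eps))


# Usage: zero weights give a uniform distribution over the 4 classes.
features = np.ones(8)
probs = classify(features, np.zeros((4, 8)), np.zeros(4))
loss = cross_entropy(probs, label=2)
```

With a one-hot label, the sum over c in Eq. (16) collapses to the single term for the true class, which is why the function indexes `probs[label]` directly.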

The domain discrepancy loss function Lossgap is represented as:

$$Los{s}_{gap}=\parallel Total({F}^{s})-{f}_{1}^{t}{\parallel }_{2}^{2}$$
(17)

where \(\parallel\!\! \cdot {\parallel }_{2}^{2}\) denotes the squared Euclidean distance. This loss function aims to minimize the Euclidean distance between the weighted features of the source domain enhanced sample Fs and the structural features \({f}_{1}^{t}\) of the target domain, encouraging the model to find a consistent structural feature representation across the two different domains. This approach effectively reduces the discrepancies in structural features between the source and target domains, enhancing the model’s generalization capabilities on Oracle Bone Script topographic data, thus enabling more accurate classification of unseen oracle bone topographic data.

Total loss function

The total loss function LossTotal used in this paper is expressed as:

$$Los{s}_{Total}={\lambda }_{1}Los{s}_{structure}+{\lambda }_{2}Los{s}_{texture}+{\lambda }_{3}Los{s}_{gap}+{\lambda }_{4}Los{s}_{category}$$
(18)

where Lossstructure represents the loss associated with structural features within the Texture–Structure Decoupling Module, Losstexture represents the loss associated with textural features within the same module, Lossgap represents the domain discrepancy loss used for domain adaptation, and Losscategory represents the classification loss in the MLSPAM. The parameters λ1, λ2, λ3, and λ4 are weights that balance the contributions of the different losses.
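As a minimal sketch, the domain discrepancy loss of Eq. (17) and the weighted sum of Eq. (18) can be written as follows. The function names and the dictionary layout are hypothetical, and the λ values in the usage lines are placeholders chosen for illustration, since this section does not report them.

```python
import numpy as np

LOSS_NAMES = ("structure", "texture", "gap", "category")


def gap_loss(total_s, f1_t):
    """Squared Euclidean distance between fused source features and target structure."""
    return float(np.sum((total_s - f1_t) ** 2))


def total_loss(losses, lambdas):
    """Weighted sum of the four loss terms, keyed by name."""
    return sum(lambdas[name] * losses[name] for name in LOSS_NAMES)


# Usage with placeholder loss values and placeholder weights.
losses = {
    "structure": 2.0,
    "texture": -1.0,
    "gap": gap_loss(np.array([1.0, 2.0]), np.array([1.0, 0.0])),  # 4.0
    "category": 1.0,
}
lambdas = {name: 0.5 for name in LOSS_NAMES}
value = total_loss(losses, lambdas)  # 0.5 * (2 - 1 + 4 + 1) = 3.0
```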

Synergistic operation of OracleNet modules

OracleNet’s strength lies in the synergistic operation of its three core modules: the ADM, the TSDM, and the MLSPAM. These modules are not isolated units but are designed to work in concert, optimizing different aspects of the Oracle Bone Script recognition task in a coordinated manner.

Sequential data flow

The input Oracle Bone Script image first enters the ADM. The ADM adaptively deforms the image, mitigating intra-domain variations and enhancing the prominence of structural features. This deformation process ensures that subsequent modules operate on a structurally refined input. The output from the ADM then flows into the TSDM. The TSDM performs a crucial role in disentangling texture and structural information. By separating these features, the TSDM allows the model to focus on learning domain-invariant structural representations, effectively handling the texture noise inherent in Oracle Bone Script images. Both feature streams from the TSDM are then fed into the MLSPAM. The MLSPAM is designed to extract and refine multi-scale structural features hierarchically. Utilizing attention mechanisms at both macro and micro levels, MLSPAM focuses on the most salient parts of the Oracle Bone Script, further enhancing feature discrimination. Finally, the refined feature representation from the MLSPAM is passed to the classifier to produce the recognition output.

Synergistic optimization through joint training

OracleNet is trained end to end, so the modules are optimized jointly and interdependently. The total loss function, LossTotal, orchestrates this joint optimization by combining Lossstructure, Losstexture, Lossgap, and Losscategory. During backpropagation, gradients from LossTotal flow through the MLSPAM, TSDM, and ADM, guiding the learning process in each module. This joint training approach enables synergistic learning: the ADM learns to provide optimal inputs for the TSDM, the TSDM learns to extract features that best leverage the MLSPAM’s attention mechanism, and the MLSPAM learns to focus on the most discriminative features from the structure-enhanced and texture-decoupled representations.

Complementary roles

Each module plays a complementary role in the overall optimization of OracleNet. The ADM reduces data variance and standardizes input, the TSDM disentangles confounding factors of texture, allowing a focus on structure, and the MLSPAM provides refined, multi-scale feature analysis. This carefully orchestrated modular design, combined with joint end-to-end training, is what enables OracleNet to achieve superior performance in Oracle Bone Script recognition.

Results

Datasets

The Oracle-241 dataset13 contains approximately 80,000 images covering 241 categories of handprint and topographic Oracle Bone Script characters and is used for unsupervised domain adaptation tasks. The dataset is divided into training and test sets: the training set comprises 10,861 labeled handprint samples and 50,168 unlabeled topographic samples, and the test set includes 3,730 handprint samples and 13,806 topographic samples. The Oracle Bone Script images in the Oracle-241 dataset exhibit extremely severe and unique noise due to long periods of burial and careless excavation, and most categories in the dataset feature multiple writing styles, which increases the difficulty of recognition and adaptation, as shown in Fig. 4.

Fig. 4: Examples in Oracle-241 dataset (left), OBC306 dataset (right) and Oracle-MNIST dataset (bottom).

In the Oracle-241 dataset, the left columns are handprinted examples and the right columns show the corresponding topographic samples of the same classes.

The OBC306 dataset20 is currently the largest Oracle Bone Script dataset, containing 309,551 samples in 306 categories, each representing a unique Oracle Bone Script character; it is used as a pattern classification benchmark. As shown in Fig. 4, all samples in the OBC306 dataset are extracted from real Oracle Bone Script shards. The data are split into training and test sets at a ratio of 3:1.

The Oracle-MNIST dataset21 consists of 30,222 grayscale images of ancient characters, each 28 × 28 pixels, across 10 categories. It is used as a pattern classification benchmark, especially for challenges related to image noise and distortion. The training set includes 27,222 images, and the test set contains 300 images per category (3,000 in total), giving a training-to-test ratio of roughly 9:1.
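The split sizes quoted for the three datasets can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative, and the OBC306 train and test counts are derived from the stated 3:1 ratio, so they are approximations rather than reported figures.

```python
# Oracle-241: counts as stated in the text.
oracle_241 = {
    "train_handprint_labeled": 10_861,
    "train_topographic_unlabeled": 50_168,
    "test_handprint": 3_730,
    "test_topographic": 13_806,
}
oracle_241_total = sum(oracle_241.values())  # 78,565, i.e. "approximately 80,000"

# OBC306: 309,551 samples at a stated 3:1 train/test ratio (derived, approximate).
obc306_total = 309_551
obc306_train = round(obc306_total * 3 / 4)
obc306_test = obc306_total - obc306_train

# Oracle-MNIST: 27,222 training images, 300 test images in each of 10 categories.
mnist_train = 27_222
mnist_test = 300 * 10
mnist_total = mnist_train + mnist_test     # matches the stated 30,222
mnist_ratio = mnist_train / mnist_test     # training images per test image
```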

Implementation details

Model

The width and height of the handprint source domain Xs and the target domain topographic samples Xt are both 224 pixels, and each Oracle Bone Script image has 3 color channels. In the ADM, the number of enhanced samples from the handprint source domain Fs and the target domain topographic Ft is set at 20, allowing for increased sample diversity while not introducing excessive computational overhead. The displacement of control points is set to 17, with 19 adaptive control points, ensuring fine adjustments at details while maintaining the stability of the overall structure. The coefficients for edge magnitude (α1), edge direction (α2), local contrast (β1), and noise suppression (β2) are selected as 0.53, 0.62, 0.74, and 0.43. In micro-feature extraction, 2 convolutional layers are set, each with 32 filters of size 3 × 3 and a stride of 1. In macro-feature extraction, 4 convolutional layers are set with 64, 128, 256, and 512 filters, respectively, of size 5 × 5 and a stride of 1, with max pooling layers using a 2 × 2 window and a stride of 2. Hidden layers all use ReLU. The output of the attention layers uses the sigmoid function.
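To make the layer configuration concrete, the following sketch walks the spatial resolution through the micro and macro feature extractors using the standard convolution and pooling size formulas. Several details are assumptions not stated in the text: "same" padding on every convolution, one max-pooling layer after each macro convolution, no pooling in the micro branch, and a 3-channel input to both branches. The function names are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output length of a max-pooling layer along one spatial axis."""
    return (size - window) // stride + 1


def micro_branch_shapes(size=224, channels=3):
    """Two 3x3 convolutions with 32 filters each, stride 1 (padding 1 assumed)."""
    shapes = [(channels, size, size)]
    for filters in (32, 32):
        size = conv_out(size, kernel=3, stride=1, padding=1)
        shapes.append((filters, size, size))
    return shapes


def macro_branch_shapes(size=224, channels=3):
    """Four 5x5 convolutions (64..512 filters), each followed by 2x2 pooling (assumed)."""
    shapes = [(channels, size, size)]
    for filters in (64, 128, 256, 512):
        size = conv_out(size, kernel=5, stride=1, padding=2)  # assumed "same" padding
        size = pool_out(size, window=2, stride=2)             # assumed pooling placement
        shapes.append((filters, size, size))
    return shapes


micro_shapes = micro_branch_shapes()
macro_shapes = macro_branch_shapes()
```

Under these assumptions the micro branch keeps the full 224 × 224 resolution, while the macro branch halves it four times, which matches the intended split between fine stroke detail and overall character layout.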

Experiment setup

The model is optimized with Adam at an initial learning rate of 0.001, which is multiplied by 0.1 every 10,000 iterations. Training runs for 90,000 iterations with a batch size of 64. The loss weights λ1, λ2, λ3, and λ4 are set to 0.48, 0.47, 0.54, and 0.52, respectively. All experiments were conducted on a single NVIDIA GeForce RTX 3090 GPU.
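The step-decay schedule described above can be written as a small function. This is a sketch of the stated rule only (start at 0.001, multiply by 0.1 every 10,000 iterations); the function name and the choice to decay exactly at iteration boundaries are assumptions.

```python
def learning_rate(iteration, base_lr=1e-3, gamma=0.1, step=10_000):
    """Step decay: multiply the rate by `gamma` once per completed `step` iterations."""
    return base_lr * gamma ** (iteration // step)


# Learning rate at the start of each 10,000-iteration block of the 90,000-iteration run.
schedule = [learning_rate(i) for i in range(0, 90_000, 10_000)]
```

In a framework such as PyTorch, the same rule corresponds to a step scheduler with a step size of 10,000 and a decay factor of 0.1, stepped once per iteration.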

Evaluation

We follow the evaluation protocol established in ref. 22. All labeled source characters and all unlabeled target characters are used for training, and we report the mean classification accuracy and standard deviation over three runs with different random seeds.
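The "mean ± std" figures reported in the tables can be produced as sketched below. The per-run accuracies are hypothetical placeholders, not results from the paper, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies in percent."""
    return mean(accuracies), stdev(accuracies)


# Hypothetical accuracies from three random runs (placeholders only).
runs = [60.0, 61.0, 62.0]
avg, spread = summarize(runs)
report = f"{avg:.1f} ± {spread:.1f}%"
```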

Comparison of other methods

Results on Oracle-241

Table 1 reports results on the Oracle-241 dataset, which evaluate how well each method transfers recognition knowledge from handprint Oracle Bone Script characters to topographic data. NNDML (2019)15 serves as the “Source-only” baseline because it is trained solely on handprint data without any adaptation. The other models13,14,15,16,17,18,19,23,24 are trained on handprint data and additionally undergo adaptation.

Table 1 Source (Handprint, corresponding to “Handprinted Character” in STSN13) and target (Topographic, corresponding to “Scanned Data” in STSN13) accuracy (mean ± std%) on Oracle-241 dataset

From the results in Table 1, the following conclusions can be drawn: (1) Without adaptation, the “Source-only” model performs poorly on topographic data, which confirms the distribution gap between handprint data and topographic data. (2) STSN is the first work to address the domain gap in Oracle Bone Script and achieves good results owing to its joint and transformation modules; UDCN performs well in the target domain owing to its unsupervised discriminative learning design. (3) OracleNet achieves the best accuracy on both topographic data and handprint source data. We attribute this to three factors: the ADM increases the fault tolerance of the data samples; the TSDM separates and processes structural and textural features, helping the model capture more characteristics; and the MLSPAM performs hierarchical attention learning and integration at both macro and micro levels, enabling the model to localize discriminative regions precisely.

Results on OBC306

On the OBC306 dataset, this paper primarily compares the source accuracy and target accuracy among various models, including SADE25, ResLT26, PaCO27, and MixupAda28. The performances of these models on the OBC306 dataset are shown in Table 2.

Table 2 Source and target accuracy (mean ± std%) on OBC306 dataset

From the results in Table 2, the following conclusions can be drawn: (1) MixupAda shows good results because it exploits the complementarity between adversarial data augmentation and a hybrid generator. (2) OracleNet achieves the best results on the OBC306 dataset, which we attribute to the high-quality samples generated by the ADM.

Results on Oracle-MNIST

On the Oracle-MNIST dataset, this paper mainly compares the average accuracy and overall accuracy among models including VGG-1629, AlexNet30, ResNet5031, Inception-V332, and LR-Net33.

Table 3 Source and target accuracy (mean ± std%) on Oracle-MNIST dataset

From the results in Table 3, the following conclusions can be drawn: (1) LR-Net achieves good results because it classifies images with higher confidence. (2) OracleNet achieves the best performance, which we attribute to the TSDM effectively separating the extensive noise present in the Oracle-MNIST dataset from the structural features.

Ablation study

To better understand the contribution of each module within the model, we conducted an ablation study, systematically removing or isolating specific modules and observing their impact on model performance. The focus of the ablation study was on the ADM, TSDM, and MLSPAM. Table 4 summarizes the study results, highlighting the importance of each component in achieving high performance and effective feature integration.

Table 4 Ablation study on Oracle-241 dataset

(1) Effectiveness of ADM (Model-A vs. Full Model). We first created Model-A, which omits the ADM, to explore the role of this module. Compared with the Full Model, Model-A shows a significant decline in performance, with accuracy dropping by 12.4%. This result highlights the critical role of the ADM in augmenting data from the source and target domains. It adjusts source domain samples to resemble target domain samples more closely, thereby enhancing the model’s generalization ability across domains.

(2) Importance of TSDM (Model-B vs. Full Model). To evaluate the impact of the TSDM, Model-B was configured without this module. Compared to the Full Model, removing it decreased performance by 9.7%. This demonstrates that the module’s ability to separate and recombine structural and textural features is crucial for handling the complex nature of Oracle Bone Script, and it confirms the module’s necessity for accurate feature extraction and domain adaptation.

(3) Contribution of MLSPAM (Model-C vs. Full Model). In Model-C, we removed the MLSPAM to isolate its impact. The performance of this model dropped by 16.8%, highlighting the importance of this module in focusing on features of varying scales and complexities. The attention mechanism of this module enhances the model’s sensitivity to key structural details, which are crucial for the classification and interpretation of Oracle Bone Script.

(4) Integrated Analysis of All Modules (Model-D vs. Model-C vs. Full Model). We also compared Model-D (without the ADM and TSDM) and Model-C (without the MLSPAM) with the Full Model to examine the interactions between these modules. The results show that while each module plays a significant role individually, their integration synergistically enhances performance. Model-C, lacking only the attention module, performed better than Model-D, but still underperformed compared to the Full Model, highlighting the compounded advantages of multiple modules working together.

Sensitivity analysis

In this section, we will explore the sensitivity of key parameters within the model to understand their impact on performance and to determine optimal settings. The sensitivity analysis is divided into two parts: parameters of the ADM and weight parameters in the total loss function.

Sensitivity analysis of N(Xs), ΔPi, and other parameters in ADM

We analyzed the sensitivity of the model to the number of adaptive control points N(Xs), the displacement vectors ΔPi of the control points i, and the remaining ADM parameters.

(1) N(Xs): As illustrated in Fig. 5, the accuracy of the model initially increases with the number of control points N(Xs). When the number of control points is low, the model’s accuracy rapidly improves, indicating that increasing the number of control points significantly enhances the model’s ability to capture details in Oracle Bone Script images, thus improving recognition performance. However, as the number of control points continues to increase beyond a certain threshold, the increase in accuracy begins to slow down and starts to decline when it reaches about 20 control points. This trend reveals the nonlinear impact of increasing the number of control points on model performance.

Fig. 5

Sensitivity analysis of N(Xs), ΔPi and δ(X) on model accuracy.

The reason is that when there are fewer control points, each control point covers a larger area of the image, which limits the model’s ability to deform at detailed locations. In Oracle Bone Script images, many details such as the beginning and end of strokes and changes in angles are crucial for correct identification; therefore, an initial increase in control points significantly boosts recognition accuracy. As the number of control points increases, each point controls a smaller area of the image, allowing for finer deformations, but this may also lead to overfitting to local features while neglecting the overall structure, especially when there are too many control points. Overemphasis on irrelevant details (such as image noise or non-structured background parts) may affect the model’s generalization ability. Additionally, the increase in computational cost is a disadvantage of having too many control points.

(2) ΔPi: As shown in Fig. 5, the size of the displacement vectors directly determines the magnitude of deformation. Displacement vectors that are too small lead to insufficient adjustment and fail to bridge the differences between domains, while overly large displacement vectors distort image details excessively and reduce recognition accuracy. Experiments show that the model performs best when the displacement vectors are kept within an intermediate range.

(3) α1, α2, β1, β2: As shown in Fig. 6, adjustments to α1, α2, β1, β2 influence accuracy.

Fig. 6

Sensitivity analysis of α1, α2, β1, β2 on model accuracy.

For α1 (edge magnitude coefficient), the increase initially enhances classification accuracy as the model more clearly identifies image edges and contours, which are crucial for recognizing structural features of Oracle Bone Script. However, when α1 becomes too high, accuracy may decline due to the overemphasis on edges potentially increasing image noise and affecting model generalization. An optimal value of α1 helps the model retain important structural information while avoiding noise and non-structural information interference. Excessive α1 might misinterpret minor or irrelevant variations as structural features, leading to classification errors.

For α2 (edge direction coefficient), increasing α2 also improves accuracy within a certain range because it helps the model more accurately capture edge directions, crucial for analyzing the shape and orientation of characters. However, too high α2 may make the model overly sensitive to image details, especially in edge directions, potentially disrupting the correct understanding of the overall structure. Proper adjustment of α2 optimizes the model’s interpretation of Oracle Bone Script stroke directions, aiding in the accuracy of character recognition. Overemphasis on edge directions might make the model too sensitive to natural deformations or slight distortions in the image, leading to misjudgments.

For β1 (local contrast coefficient), increasing β1 initially enhances model performance by emphasizing local contrast in the image, helping the model more clearly distinguish between characters and the background. However, if β1 is too high, it may lead to overly strong contrast in local areas of the image, obscuring subtle structural details and reducing accuracy. An appropriate β1 helps the model identify key structural features in Oracle Bone Script images, especially under uneven lighting or varying image quality conditions. Uncontrolled contrast enhancement might distort image details, especially in tiny spaces between characters, which could be incorrectly filled.

For β2 (noise suppression coefficient), increasing β2 helps reduce the impact of noise in the image, initially positively affecting accuracy. However, excessive noise suppression might lead to the loss of important details, especially where minor strokes and cracks in Oracle Bone Script may contain crucial information. Over-suppressing noise can reduce the model’s classification accuracy. Noise suppression controlled by β2 must balance between reducing noise and preserving important image details. Oracle Bone Script images often contain natural noise due to aging and damage; reducing this noise moderately can clarify the image, but excessive noise suppression might erase subtle traces carrying historical information.
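The trade-off described for N(Xs) and ΔPi can be illustrated numerically. The sketch below is not the ADM itself: it replaces the adaptive placement and learned displacements with uniform random sampling, purely to show how a displacement bound and a control-point count interact on a 224 × 224 image. Treating the displacement setting of 17 as a maximum magnitude in pixels is an assumption, as are the function names.

```python
import math
import random


def sample_control_displacements(n_points=19, max_disp=17.0, seed=0):
    """Draw one bounded 2-D displacement per control point (random stand-in for the ADM)."""
    rng = random.Random(seed)
    displacements = []
    for _ in range(n_points):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        radius = rng.uniform(0.0, max_disp)  # magnitude never exceeds max_disp
        displacements.append((radius * math.cos(angle), radius * math.sin(angle)))
    return displacements


def area_per_control_point(n_points, width=224, height=224):
    """Average image area (in pixels) that a single control point is responsible for."""
    return width * height / n_points


displacements = sample_control_displacements()
coarse = area_per_control_point(4)   # few points: each covers a large region
fine = area_per_control_point(19)    # setting used in the paper: finer local control
```

With few control points each one governs a large region, so local stroke details cannot be adjusted independently; with many points the governed regions shrink, which permits finer deformation but also makes it easier to fit noise.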

Sensitivity analysis of λ1, λ2, λ3, and λ4 in total loss

The weights λ1, λ2, λ3, and λ4 balance the contributions of different components within the total loss function (Fig. 7). We methodically adjust these weights to study their impact on the training dynamics and the output of the model.

Fig. 7

Sensitivity analysis of λ on model accuracy.

(1) Weight variations: Each λ is varied independently while keeping the baseline values of the other weights constant, to isolate their impacts. Weights λ1 and λ2, which are associated with the alignment of structural and textural features respectively, exhibit a bell-shaped influence on performance as they increase from 0.1 to 1.0. An optimal range emerges where the model achieves the best balance between alignment and overfitting. In contrast, λ3 and λ4, which are related to domain adaptability and classification accuracy, show less sensitivity in the medium range, but become crucial when deviating significantly from this range.

(2) Performance impact: We found that increasing λ3 slightly improves domain adaptability when set at lower values, but setting it too high can decrease overall accuracy, indicating a need to balance between specific domain adaptability and generalization. The impact of λ4 on classification performance is the most direct; it remains stable over a broad range but shows a significant decrease when either too high or too low, reflecting its direct influence on classification loss.
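The one-at-a-time protocol described in item (1) can be sketched as a generator of training configurations. The baseline values are the λ settings from the experiment setup; the ten-point grid between 0.1 and 1.0, the key names, and the function name are assumptions made for illustration.

```python
BASELINE = {"lambda1": 0.48, "lambda2": 0.47, "lambda3": 0.54, "lambda4": 0.52}


def one_at_a_time_sweep(baseline=BASELINE, low=0.1, high=1.0, steps=10):
    """Yield (varied_name, config) pairs, changing one weight while the rest stay fixed."""
    grid = [round(low + i * (high - low) / (steps - 1), 10) for i in range(steps)]
    for name in baseline:
        for value in grid:
            config = dict(baseline)  # copy so the baseline is never mutated
            config[name] = value
            yield name, config


configs = list(one_at_a_time_sweep())
```

Each configuration would be trained and evaluated separately, so the cost grows linearly with the number of weights rather than exponentially as in a full grid search; the price is that interactions between weights are not explored.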

Through sensitivity analysis, we have identified the optimal settings for model parameters during practical deployment. These settings help the model automatically adjust its processing strategies when facing different types of Oracle Bone Script images, thus enhancing overall recognition accuracy and robustness.

Image visualization

To fully demonstrate the effectiveness and interpretability of our model, we have introduced several visualization methods to elucidate the model’s performance and its ability to handle complex transformations related to Oracle Bone Script. Here, the visualization effects of three modules are displayed.

(1) ADM: Fig. 8 shows images before and after applying the ADM. The left side shows the original handprinted example before processing, and the right side shows the result of adaptive deformation after applying adaptive control points. Through this deformation, local features of the image (such as stroke bending and symbol spacing) are fine-tuned to better mimic the natural morphological changes that may be encountered in topographic samples.

Fig. 8: Visualization of Adaptive Deformation Module on Oracle-241 dataset.

In every two columns, the left pictures are original handprinted examples and the right pictures are elastic deformation examples.

This visualization helps understand how the module adjusts the image to fit the specific deformations of the target domain while preserving key structural details. This is crucial for enhancing the model’s adaptability and recognition accuracy on real-world data, especially when there is a significant morphological difference between the target and source domains.

(2) TSDM: As shown in Fig. 9, the effects of the TSDM are displayed. In each set of three-column images, the left side shows the original image, the middle shows the extracted structural features, and the right side shows the extracted texture features. These comparative images demonstrate how the module effectively separates texture information (such as cracks and surface wear) from structural information (such as character edges and strokes).

Fig. 9: Visualization of Texture–Structure Decoupling Module on Oracle-241 dataset.

In every three columns, the left pictures are topographic examples, the middle pictures are structure examples, and the right pictures are texture examples.

The separation of these features is crucial for enhancing the model’s ability to recognize characters against complex backgrounds. Decoupling texture from structure not only aids the model in more accurately identifying and interpreting Oracle Bone Script but also makes it more robust when dealing with variously degraded or damaged artifacts.

(3) MLSPAM: As shown in Fig. 10, the impact of the attention mechanism on the feature maps is displayed. In each set of three columns, the left side shows the original image, the middle column displays the effects of micro-level attention, and the right side shows macro-level attention. These images demonstrate how the attention mechanism focuses on key features within the image at different levels, such as the fine details of strokes and the overall layout of characters. This multi-level approach ensures that both detailed and global aspects of the characters are adequately emphasized, improving the model’s ability to interpret complex images accurately.

Fig. 10: Visualization of Multi-Level Structured Perceptual Attention Module on Oracle-241 dataset.

In every three columns, the left pictures are topographic examples, the middle pictures are micro-level attention examples, and the right pictures are macro-level attention examples.

The visualization of this module vividly illustrates how focusing on different levels of detail enhances the model’s understanding of Oracle Bone Script symbols. By adjusting the attention mechanism, the model is able to recognize and emphasize details crucial for classification, thereby maintaining high accuracy in complex or blurry images. This layered attention strategy is key to enhancing the model’s ability to deeply analyze image content.

These visualizations not only validate the model’s capability to handle complex transformations specific to the unique features of Oracle Bone Script but also demonstrate how each module contributes to comprehensively improving image quality and readability. This holistic approach ensures that the model is not only effective in identifying and interpreting the scripts but is also robust against the variations and imperfections commonly found in historical artifacts.

Feature visualization

To evaluate the effectiveness of our proposed model in mitigating domain discrepancies and enhancing feature discriminability, we utilize t-SNE to visualize features extracted from the source domain and adapted target domain of the Oracle Bone Script dataset. As shown in Fig. 11, we present a comparison of the feature distributions before and after domain adaptation. This visualization helps in assessing how well the model has aligned the features from both domains, crucial for ensuring that the learning transfers effectively across different data conditions and enhances the model’s ability to generalize to new, unseen data while maintaining high accuracy and robustness.

Fig. 11: Feature visualization on Oracle-241 dataset.

The left picture shows the features before adaptation and the right picture shows the features after adaptation.

Before domain adaptation, the features extracted by the source model displayed a relatively dispersed pattern, indicating a significant domain shift between the source and target domains. However, after adjustments made using our proposed domain adaptation approach, features from different domains exhibited a more mixed and indistinguishable trend, powerfully demonstrating the effectiveness of our proposed method in reducing domain discrepancies. This blending of features from diverse domains not only validates the adaptability of the model but also its potential in handling domain-specific variations, crucial for real-world applications where domain variability is common.

Error analysis

In this section, we explore the scenarios in which our model, which integrates the ADM, TSDM, and MLSPAM, fails to accurately classify Oracle Bone Script images. Although our model generally performs better than traditional “Source-only” models, it is not without its shortcomings, particularly when dealing with complex artificial marks in images.

(1) ADM limitations. While the ADM is designed to more closely align source domain images with target domain images, it occasionally causes misalignment of features crucial for correct classification. For instance, some characters may undergo excessive deformation, resulting in the loss of key structural details needed to distinguish similar characters. As shown in Fig. 12a, this sometimes leads to misclassification of characters with subtle differences.

Fig. 12: Error analysis of our model.

a shows excessively deformed images, b shows degraded images, c shows blurry images, and d shows similar cases that are classified correctly.

(2) TSDM challenges. The TSDM effectively separates texture from structural information, which usually aids the recognition process. However, this module may struggle with images where texture features are severely degraded due to noise. As depicted in Fig. 12b, in such cases, the module may fail to accurately reconstruct fundamental structural information, leading to errors when recognizing characters that heavily rely on these details.

(3) MLSPAM errors. This module is designed to focus on relevant features across multiple scales to enhance the model’s ability to discriminate between different character categories. However, when images are heavily blurred or when samples within the same category are similarly distorted, the module can be overwhelmed. In these situations the attention mechanism may focus on incorrect regions of the image, hindering accurate classification, as shown in Fig. 12c.

(4) Comparison with similar correct cases. Figure 12d shows correctly classified examples of similar characters, where the structural features remain discernible despite the presence of noise, allowing the model to identify them correctly.

(5) General observations. Over 50% of topographic images exhibit severe deformations and noise, which still pose a challenge for our model. Although the ADM adjusts shapes and alignment, and the TSDM attempts to clarify the distinction between texture and structural features, the presence of blurred and obstructed images can lead to serious classification errors. Additionally, the similarity between certain characters and the appearance of characters as sub-components in others can also cause model confusion, thereby exacerbating the error rate.

Performance across varying training data scales

To assess OracleNet’s robustness under data scarcity, we conducted comparative experiments on the Oracle-241 dataset using varying percentages of its available training data. As shown in Table 5, recognition accuracy improved as more training data were used, demonstrating the model’s capacity to leverage increased supervision. Specifically, with only 25% of the training data, OracleNet achieved a target accuracy of 52.1 ± 0.5%. This rose to 60.3 ± 0.4% with 50% of the data and to 63.5 ± 0.3% with 75% of the data. With 100% of the training data, OracleNet reached 64.7 ± 0.3%. These results indicate that while more data naturally leads to better performance, OracleNet remains effective with limited training resources, which is crucial for real-world historical script applications where annotated data are often scarce.
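The diminishing returns implied by these figures can be made explicit with a short calculation over the four reported target accuracies. Only the mean values quoted above are used; the variable names are illustrative.

```python
# Reported target accuracy (%) by percentage of Oracle-241 training data used.
scale_to_accuracy = {25: 52.1, 50: 60.3, 75: 63.5, 100: 64.7}

fractions = sorted(scale_to_accuracy)

# Accuracy gained (percentage points) for each additional 25% of training data.
gains = [
    round(scale_to_accuracy[b] - scale_to_accuracy[a], 1)
    for a, b in zip(fractions, fractions[1:])
]

# Share of full-data accuracy retained when training on only 25% of the data.
retained_at_25 = round(100 * scale_to_accuracy[25] / scale_to_accuracy[100], 1)
```

The gains shrink from 8.2 to 3.2 to 1.2 percentage points, and a quarter of the data already recovers about 80.5% of the full-data accuracy.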

Table 5 OracleNet performance with varying training data scales on Oracle-241 dataset (target accuracy)

Robustness to varying image degradation levels

To quantify OracleNet’s resilience to image complexity, noise, and blurring, which are common issues in Oracle Bone Script images, we evaluated its performance on the Oracle-MNIST dataset under controlled synthetic degradation (Table 6). We applied increasing levels of Gaussian blur to the test set and measured the impact on target accuracy. With no degradation, OracleNet achieved 97.2 ± 0.2%. Under low, medium, and high levels of Gaussian blur, the accuracy declined to 96.5 ± 0.2%, 95.1 ± 0.3%, and 93.2 ± 0.5%, respectively. This gradual reduction indicates that OracleNet retains strong recognition capability even when images are significantly compromised by the kinds of degradation often encountered in historical artifacts, which we attribute largely to the ADM and TSDM.
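A degradation protocol of this kind can be sketched with a small, dependency-free Gaussian blur. The sigma values attached to the "low", "medium", and "high" levels are hypothetical, since the text does not specify them, and the border handling (clamping to the nearest edge pixel) is likewise an assumption. The final lines compute the accuracy drops from the reported means.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian kernel; radius defaults to about three sigma."""
    if radius is None:
        radius = max(1, int(3 * sigma + 0.5))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the image border."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable Gaussian blur: horizontal pass followed by a vertical pass."""
    kernel = gaussian_kernel(sigma)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


# Hypothetical blur strengths for the three degradation levels.
SIGMA_LEVELS = {"low": 0.5, "medium": 1.0, "high": 2.0}

# Accuracy drop (percentage points) relative to the clean test set, from the reported means.
clean_accuracy = 97.2
drops = [round(clean_accuracy - acc, 1) for acc in (96.5, 95.1, 93.2)]
```

The drops of 0.7, 2.1, and 4.0 percentage points grow with blur strength but stay small relative to the clean accuracy.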

Table 6 OracleNet robustness to synthetic image degradation on Oracle-MNIST dataset (target accuracy)

Discussion

While OracleNet demonstrates significant advancements in Oracle Bone Script recognition, it is important to acknowledge its limitations and potential areas for future improvement.

Firstly, limitations in robustness to specific noise or deformations: While OracleNet exhibits robustness to various types of noise and deformation, it may still face challenges with extremely severe or specific types of noise or distortions not well represented in the training data. As observed in the error analysis, excessive blurring or highly unusual deformation patterns can still lead to misclassification. Enhancing robustness to these extreme conditions remains a direction for future research.

Secondly, challenges in distinguishing visually similar characters: Despite the MLSPAM, discriminating between visually highly similar Oracle Bone Script characters remains a persistent challenge. Subtle visual differences between closely related character categories, especially under image degradation, can still be difficult for the model to discern. Future work could explore incorporating finer-grained feature extraction and contrastive learning approaches to address this limitation.

Addressing these limitations will be the focus of our future research, which aims to further enhance the robustness and accuracy of Oracle Bone Script recognition models.

In this study, we introduced OracleNet, a novel approach to Oracle Bone Script recognition, integrating the ADM, TSDM, and MLSPAM. OracleNet achieves significant performance improvements, demonstrating superior accuracy and robustness in Oracle Bone Script recognition. Specifically, OracleNet achieves state-of-the-art performance on the Oracle-241, OBC306 and Oracle-MNIST datasets.

These performance gains are attributed to the innovative design of OracleNet, which integrates three key modules: the ADM, enabling finer local deformation control and preserving semantic integrity; the TSDM, effectively separating texture and structural features to enhance recognition accuracy; and the MLSPAM, refining feature discrimination through macro and micro perspectives. The synergistic interaction of these modules allows OracleNet to effectively address the challenges of Oracle Bone Script recognition, including complex deformations, noise, and subtle structural variations. Extensive experimental results and ablation studies validate the effectiveness and contribution of each module in OracleNet.

Beyond improving recognition accuracy, the practical deployment of OracleNet in cultural heritage protection faces several key challenges. Technically, the model must contend with extreme and novel forms of degradation not fully captured in existing datasets, as well as the fine differentiation of highly similar characters that possess minimal structural variance. Non-technical challenges are equally critical: ethically, the application of AI must prioritize the integrity and authenticity of cultural artifacts, ensuring that automated interpretations supplement, rather than supplant, expert human scholarship, and avoid any form of misrepresentation. This necessitates transparent and explainable AI systems. Furthermore, for effective integration into archaeological and historical workflows, careful consideration must be given to the user interaction experience. Future developments should focus on creating intuitive interfaces that allow cultural heritage professionals to easily input data, review recognition results, provide feedback for refinement, and visualize the model’s decision-making process. Such user-centric design and a clear ethical framework are paramount to realizing the full practical and social value of OracleNet in aiding the preservation and study of Oracle Bone Script.

Looking ahead, future research directions include further enhancing the robustness of OracleNet to extreme levels of noise and deformation, particularly those not well represented in current datasets; exploring the application of OracleNet or its modular design to other historical text recognition tasks, such as ancient scripts from other cultures or degraded document images; and investigating model compression techniques to facilitate deployment in resource-constrained environments. These directions aim to further improve the accuracy, robustness, and applicability of Oracle Bone Script recognition technology, contributing to the advancement of archaeological and historical studies.