Introduction

Oracle bone inscriptions, as the earliest known mature writing system in China, provide irreplaceable primary evidence for studying the history, culture, and script evolution of the Shang and Zhou dynasties1. These inscriptions have largely been preserved in the form of ink rubbings, which constitute the main corpus for contemporary research. However, due to the considerable age of these artefacts, rubbing images often suffer from significant degradation, such as ink bleeding, surface damage, and character blurring, which severely limits their usability in digital archiving, intelligent recognition, and in-depth scholarly analysis. Enhancing their clarity is therefore crucial for restoring structural details and stylistic features, improving recognition accuracy, and advancing the digital preservation and intelligent study of oracle bone inscriptions.

Manual restoration methods, such as hand-tracing, depend heavily on the subjective judgment of experts, are time-consuming, and often struggle to balance fidelity with stylistic consistency. In recent years, super-resolution (SR) techniques, which aim to reconstruct high-resolution (HR) images from low-resolution (LR) inputs, have demonstrated substantial potential in domains such as remote sensing2,3,4, medical imaging5,6,7, and cultural heritage conservation8,9,10. Yet, in practice, oracle bone rubbing images often contain substantial noise, which can cause models to learn noise patterns rather than meaningful features. Moreover, oracle bone datasets typically contain only LR images, with no paired HR counterparts, which restricts the applicability of conventional supervised SR approaches11,12,13,14,15.

To overcome the paired-data limitation, research on real-world SR for unpaired data can be broadly divided into two categories: unpaired SR based on degradation modelling and complex-degradation SR using only HR data. The first category leverages independently collected HR and LR images to explicitly model the degradation process. Inspired by unpaired image-to-image translation methods such as CycleGAN16, early works used adversarial learning with cycle consistency to bridge the domain gap. Building on this, SCGAN17 introduced a semi-cyclic architecture with a dedicated degradation network to generate pseudo-LR images, thereby improving reconstruction consistency18,19,20,21. Wang et al.22 further developed a method leveraging a compact proxy space for unpaired image matching in SR. Other approaches23,24,25 trained stand-alone degradation models to synthesize realistic LR data for supervised training in the absence of genuine HR-LR pairs. The second category bypasses the need for LR data entirely by synthesizing degraded images directly from HR inputs using handcrafted or statistically defined degradation pipelines9,26,27,28,29. For example, HiFaceGAN26 employs a multi-stage adaptive SR architecture for perceptual enhancement; Real-ESRGAN27 uses a tailored degradation pipeline with GAN training for realistic artifacts; MM-RealSR28 introduces interactive modulation for controllable restoration; WFEN29 applies wavelet-based frequency fusion to reduce distortion; ConvSRGAN9 cascades enhanced residual modules to capture multi-scale textures. While these approaches offer valuable strategies for handling unpaired data and complex degradations, they remain suboptimal for oracle bone rubbings, particularly in suppressing noise and preserving the fine-grained texture of ancient characters. They also often introduce artifacts or structural distortions in character regions, thereby degrading recognition and analytical performance. These limitations motivate the development of domain-specific solutions.

To address the aforementioned challenges, we propose a super-resolution method specifically designed for oracle bone rubbing images, called OBISR, which effectively tackles the practical limitation that paired HR and LR samples are unavailable for training. We introduce a two-stage reconstruction generator DTG based on the Transformer architecture30, and further develop an enhanced variant DTGEMA to improve the stability of the generation process by incorporating the EMA technique. To more accurately restore the finer structural details of oracle bone inscriptions, a custom local discriminator is devised, specifically tailored to the varying sizes and textural characteristics of rubbing images. In addition, an artifact loss function is introduced31 to measure the discrepancy between the outputs of DTG and DTGEMA and the HR reference images. Comparison with state-of-the-art super-resolution methods demonstrates that our OBISR achieves significantly superior performance in oracle bone image restoration. It not only outperforms existing approaches on quantitative evaluation metrics, including PSNR, SSIM32 and LPIPS33, but also delivers clearer and more structurally accurate visual results.

The main contributions of this work are as follows:

  • We propose OBISR, a super-resolution approach specifically designed for oracle bone rubbings, which effectively circumvents the need for paired training data while preserving intricate inscription features.

  • A two-stage Transformer-based generator is introduced, along with an enhanced variant developed using the EMA technique. Additionally, a custom local discriminator is designed to more accurately capture the structural and textural variations of rubbing images.

  • An artifact-aware loss function is further introduced, which suppresses character artifacts and distortions by comparing the outputs of the two generators with HR reference images.

  • Extensive experiments demonstrate that OBISR significantly outperforms existing state-of-the-art methods in oracle bone image restoration, achieving superior performance in both quantitative evaluation metrics and visual quality.

Methods

Model architecture

Inspired by the SCGAN17 framework, we propose a novel super-resolution model for oracle bone rubbing images, named OBISR. As illustrated in Fig. 1, the model adopts a dual half-cycle architecture, where two LR degradation generators (GHL and GSL), two super-resolution reconstruction generators (DTG and DTGEMA), and four discriminators (DP1, DP2, DP3, DP4) are jointly integrated into a multi-branch adversarial optimization framework.

Fig. 1: Network architecture of our proposed method.
figure 1

The dual half-cycle architecture employs generators GHL/GSL to degrade images and DTG/DTGEMA to reconstruct images, under the adversarial supervision of a set of discriminators.

In the first half-cycle, OBISR takes an HR oracle bone rubbing image IHR as the input to the degradation generator GHL. By incorporating a random noise vector z, GHL simulates real-world degradation patterns and generates a 4× downsampled LR image ISL. This degraded image ISL is then fed into both reconstruction generators DTG and DTGEMA, resulting in two reconstructed HR images ISS and ISSEMA, respectively. To suppress artifacts and distortions around character regions, an artifact loss \({{\mathcal{L}}}_{{\rm{M}}}\) is computed based on the differences among IHR, ISS, and ISSEMA.

In the second half-cycle, a real LR oracle bone image ILR is input into the DTG generator, producing a reconstructed HR image ISR. This image is subsequently passed through the degradation generator GSL to yield a synthetic LR image ISSL.

These two half-cycle reconstruction sub-networks are connected via the DTG reconstruction generator, which helps mitigate the adverse effects caused by the domain gap between synthetic LR rubbing images and real LR rubbing images.
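For concreteness, the two half-cycles can be sketched as the following PyTorch-style pseudo-training step. This is a minimal sketch of the data flow only: the module internals, the noise dimension, and whether GSL also receives a noise vector are assumptions, since they are not specified above.

```python
import torch

def first_half_cycle(G_HL, DTG, DTG_EMA, I_HR):
    # HR -> synthetic LR (conditioned on random noise z) -> two HR reconstructions
    z = torch.randn(I_HR.size(0), 128)        # noise dimension is an assumption
    I_SL = G_HL(I_HR, z)                      # 4x-downsampled pseudo-LR image
    I_SS = DTG(I_SL)                          # reconstruction by DTG
    I_SS_EMA = DTG_EMA(I_SL)                  # reconstruction by the EMA variant
    return I_SL, I_SS, I_SS_EMA               # feed cycle, adversarial and artifact losses

def second_half_cycle(DTG, G_SL, I_LR):
    # real LR -> reconstructed HR -> re-degraded synthetic LR
    I_SR = DTG(I_LR)                          # super-resolved output
    I_SSL = G_SL(I_SR)                        # whether G_SL also takes noise is unspecified
    return I_SR, I_SSL                        # I_SSL is compared against I_LR
```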

Low-resolution degradation generator

Both our degradation generators GHL and GSL adopt the same LR generator architecture. As shown in Fig. 2, the generator consists of an encoder-decoder structure. The encoder begins with a spectral normalization34 (SN) layer, followed by a 3 × 3 convolutional layer, a global average pooling35 (GAP) layer and six residual blocks. These residual blocks are responsible for extracting meaningful and representative features from the HR input. The decoder also includes six residual blocks. After the 2nd and 4th residual blocks, pixel unshuffling operations are applied, each reducing the spatial resolution of the feature maps by a factor of 2 to achieve the overall 4× downsampling. Finally, the output is activated by either a ReLU36 or Tanh function, resulting in a degraded LR oracle bone rubbing image.

Fig. 2: Architecture of the low-resolution degradation generator.
figure 2

This structure is shared by both GHL and GSL in our proposed model.

This design enables the generator to simulate realistic degradation patterns observed in ancient oracle bone images.
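As an illustration, a condensed PyTorch sketch of this encoder-decoder layout follows. Only the overall topology is specified above, so the channel widths, the interpretation of the GAP layer as a channel gate, the omission of the noise input, and the residual-block internals are all assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class ResBlock(nn.Module):
    """Plain residual block; the internal layout is an assumption."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class DegradationGenerator(nn.Module):
    """Shared G_HL / G_SL layout: SN conv + GAP-based gate + residual encoder,
    residual decoder with two PixelUnshuffle stages (4x total downsampling)."""
    def __init__(self, ch=64):
        super().__init__()
        self.head = spectral_norm(nn.Conv2d(3, ch, 3, padding=1))
        # GAP interpreted here as a squeeze-and-excitation style channel gate;
        # its exact role in the original network is an assumption
        self.gap_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.encoder = nn.Sequential(*[ResBlock(ch) for _ in range(6)])
        # decoder: PixelUnshuffle after the 2nd and 4th residual blocks
        self.dec = nn.ModuleList([ResBlock(ch), ResBlock(ch),
                                  ResBlock(ch * 4), ResBlock(ch * 4),
                                  ResBlock(ch * 16), ResBlock(ch * 16)])
        self.unshuffle = nn.PixelUnshuffle(2)   # halves H and W, x4 channels
        self.tail = nn.Sequential(nn.Conv2d(ch * 16, 3, 3, padding=1), nn.Tanh())

    def forward(self, x):
        f = self.head(x)
        f = f * self.gap_gate(f)
        f = self.encoder(f)
        for i, blk in enumerate(self.dec):
            f = blk(f)
            if i in (1, 3):                     # after the 2nd and 4th blocks
                f = self.unshuffle(f)
        return self.tail(f)

print(DegradationGenerator()(torch.randn(1, 3, 256, 256)).shape)  # [1, 3, 64, 64]
```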

Super-resolution reconstruction generator

As shown in Fig. 3, the proposed reconstruction generator is specifically designed for the oracle rubbing image super-resolution task. It consists of two modules, DTG and DTGEMA, which share an identical architecture; the main difference is that the EMA strategy is applied to the DTGEMA generator’s parameters, resulting in more consistent and stable performance.

Fig. 3: Architecture of the super-resolution reconstruction generator.
figure 3

This shared network structure is used by both DTG and DTGEMA, with the latter employing an EMA update mechanism for its parameters for stabilized training.

EMA assigns greater weight to recent observations in order to smooth time-series data, effectively reducing short-term fluctuations and highlighting long-term trends. During training, the parameters of the DTG generator are dynamically updated using EMA to obtain a smoothed version, denoted as DTGEMA. Specifically, EMA performs an exponentially weighted average over the parameter trajectory of the generator, which significantly suppresses noise and parameter oscillations, thereby improving training stability. The update rule of EMA is defined as:

$${\theta }_{t}^{DT{G}_{EMA}}=\alpha {\theta }_{t}^{DTG}+\left(1-\alpha \right){\theta }_{t-1}^{DT{G}_{EMA}},$$
(1)

Here, α ∈ (0, 1) is the smoothing coefficient and \({\theta }_{t}\) denotes the generator parameters at training step t. Compared to the DTG generator, DTGEMA demonstrates greater robustness in reducing the occurrence of random artifacts and improving image fidelity. Since DTGEMA is a smoothed variant of DTG’s parameters, it shares the same loss function. The outputs of both generators are fed into the artifact map module Mrefine31 to compute the artifact loss \({{\mathcal{L}}}_{{\rm{M}}}\), which further alleviates the loss of fine details during optimization. In the inference phase, both generators are executed separately, producing two distinct outputs. This dual-generator inference strategy allows for a comprehensive evaluation of each generator’s performance.
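A minimal sketch of the EMA update in Eq. (1) is given below; the value of α and the initialization of DTGEMA as a frozen copy of DTG are assumptions.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(dtg, dtg_ema, alpha=0.001):
    """Eq. (1): theta_ema_t = alpha * theta_dtg_t + (1 - alpha) * theta_ema_{t-1}.
    alpha = 0.001 is an assumed value; the text does not specify it."""
    for p_ema, p in zip(dtg_ema.parameters(), dtg.parameters()):
        p_ema.mul_(1.0 - alpha).add_(p, alpha=alpha)

# usage: the EMA generator starts as a frozen copy of DTG and is
# refreshed after every optimizer step of DTG
dtg = nn.Linear(8, 8)                           # stand-in for the real DTG generator
dtg_ema = copy.deepcopy(dtg).requires_grad_(False)
ema_update(dtg, dtg_ema)
```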

To effectively extract both global structures and local textures, the input ILR is first upsampled to 128 × 128. Then, a Laplacian decomposition module is applied to perform frequency separation, resulting in a low-frequency component ILR_LF of size 64 × 64 and a high-frequency component ILR_HF of size 128 × 128. As a near-lossless transformation, the Laplacian decomposition preserves structural integrity during this process. Specifically, the low-frequency component captures global contours and stroke layouts, while the high-frequency component retains fine textures and edges. This separation allows the network to process complementary information in a targeted manner. As shown in Fig. 4, both components clearly preserve their respective characteristics, which are later fused to produce structurally coherent and detail-rich reconstructions.

Fig. 4: Input and outputs of the Laplacian decomposition.
figure 4

Shows the pre-decomposition input (ILR*) alongside the resulting low-frequency (ILR_LF) and high-frequency (ILR_HF) components. The images have been resized for optimal visualization.
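A one-level decomposition matching the stated sizes (128 × 128 input, 64 × 64 low-frequency band, 128 × 128 high-frequency residual) can be sketched as follows; the bilinear resampling kernel is an assumption. By construction, upsampling ILR_LF and adding ILR_HF recovers the input exactly.

```python
import torch
import torch.nn.functional as F

def laplacian_decompose(i_lr_star, scale=2):
    """Split a 128x128 input into a 64x64 low-frequency band (global contours,
    stroke layout) and a 128x128 high-frequency residual (fine textures, edges)."""
    lf = F.interpolate(i_lr_star, scale_factor=1 / scale,
                       mode='bilinear', align_corners=False)
    hf = i_lr_star - F.interpolate(lf, scale_factor=scale,
                                   mode='bilinear', align_corners=False)
    return lf, hf

x = torch.randn(1, 3, 128, 128)
lf, hf = laplacian_decompose(x)                 # lf: 64x64, hf: 128x128
```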

Subsequently, the low-frequency component is processed by a structure enhancement block to enhance the global representation and produce an initial super-resolved image \({I}_{{\text{SR}}^{* }}\). Then, the high-frequency component ILR_HF and \({I}_{{\text{SR}}^{* }}\) are jointly fed into a recurrent detail supplement module, which progressively injects high-frequency details in a recursive manner.

Finally, a convolutional fusion module combines and refines the dual-path features to generate the final HR output ISR. By leveraging frequency separation and dual-branch design, this generator effectively enhances both structural restoration and detail fidelity, achieving high-quality super-resolution reconstruction for oracle bone rubbing images.

Discriminator

We design a discriminator architecture based on PatchGAN37, which judges the authenticity of local regions within the image instead of evaluating the entire image globally. Specifically, the discriminator divides the input image into multiple N × N receptive field patches and outputs the real/fake probability of each patch, enabling more precise perception of artifacts and local details. To accommodate different resolution inputs, as shown in Fig. 5, we design different network depths for 64 × 64 and 16 × 16 images, respectively.

Fig. 5: Architecture of the PatchD discriminator.
figure 5

The design features distinct network depths tailored for processing 64×64 and 16×16 input images, respectively.

For 64 × 64 input images, the discriminator consists of five feature extraction modules and one output layer: The first layer is a convolutional layer with a stride of 2, followed by a LeakyReLU activation function; The second to fourth layers adopt a structure of convolution → Batch Normalization → LeakyReLU, with the number of channels set to 128, 256 and 512, respectively; The fifth layer maintains 512 channels and uses a stride-1 convolution to preserve spatial resolution; The output layer uses a 4 × 4 convolutional kernel to produce a one-channel discrimination map, and a Sigmoid function is applied to obtain the realness probability of each local region.

For 16 × 16 input images, the network is relatively shallower and contains four feature extraction modules and one output layer: The first four layers share the same structure as in the 64 × 64 case, with the number of channels set to 64, 128, 256, and 512, respectively; The final layer uses a 2 × 2 convolutional kernel to output the discrimination map, followed by a Sigmoid function to get the probability values.

In implementation, we use 4 × 4 convolutional kernels as basic building blocks, apply stride-2 convolutions for spatial downsampling, and use stride-1 in the last two layers to maintain receptive field precision. LeakyReLU is used as the activation function, and Batch Normalization is adopted in all layers except the first.

The entire discriminator generates a “discrimination map” composed of outputs from multiple receptive field areas, thereby enhancing its ability to recognize local structures and artifacts in the image and improving both discrimination performance and stability.
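Following these specifications, the 64 × 64 branch can be sketched as below; the LeakyReLU slope and the use of padding 1 in every convolution are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

def _block(cin, cout, stride, norm=True):
    # 4x4 conv -> (BatchNorm) -> LeakyReLU; slope 0.2 is an assumption
    layers = [nn.Conv2d(cin, cout, kernel_size=4, stride=stride, padding=1)]
    if norm:
        layers.append(nn.BatchNorm2d(cout))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return layers

class PatchD64(nn.Module):
    """PatchD branch for 64x64 inputs: five feature modules plus an output layer."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            *_block(3, 64, 2, norm=False),   # layer 1: no BatchNorm, per the text
            *_block(64, 128, 2),
            *_block(128, 256, 2),
            *_block(256, 512, 2),
            *_block(512, 512, 1),            # stride-1 layer preserving resolution
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),
            nn.Sigmoid())                    # per-patch realness probabilities

    def forward(self, x):
        return self.net(x)                   # a small "discrimination map"

print(PatchD64()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 1, 2, 2])
```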

GHL generator loss

This module takes HR images and a random noise vector z as input, aiming to generate LR oracle rubbing images with realistic degradation characteristics. To approximate the degradation found in real LR oracle rubbing images, we combine adversarial loss38 and pixel loss to formulate the loss function for the GHL degradation generator. The loss is defined as:

$${{\mathcal{L}}}_{{G}_{HL}}=\alpha {{\mathcal{L}}}_{pix}^{{I}_{SL}}+\beta {{\mathcal{L}}}_{adv}^{{D}_{P1}},$$
(2)

where α and β are the weighting coefficients. The adversarial loss38 \({{\mathcal{L}}}_{adv}^{{D}_{P1}}\) is defined as:

$$\begin{array}{lll}{{\mathcal{L}}}_{adv}^{{D}_{P1}}\;=\;{\mathbb{E}}\left[min\left(0,{D}_{P1}({I}_{LR})-1\right)\right]\\\qquad\quad\;\;\;+\,{\mathbb{E}}\left[min\left(0,-1-{D}_{P1}({I}_{SL})\right)\right],\end{array}$$
(3)

where the discriminator DP1 is trained to classify the real LR rubbing image ILR as 1 and the synthesized LR image ISL as 0. Although the training data is unpaired, the proposed dual half-cycle architecture constructs pseudo-paired paths, thereby establishing an internally consistent learning framework. The pixel loss \({{\mathcal{L}}}_{pix}^{{I}_{SL}}\) is defined as the \({\ell }_{1}\) distance between the synthesized LR image ISL and the downsampled version of its corresponding input IHR, where average pooling is used to match the resolution of ISL. This design ensures internal consistency, as both ISL and its optimization target are derived from the same HR image.
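The two terms of Eq. (2), with the hinge-style adversarial objective of Eq. (3) rewritten as a quantity to minimize, can be sketched as follows; the weighting coefficients are placeholders, not the values used in the paper.

```python
import torch.nn.functional as F

def d_p1_hinge_loss(d_p1, i_lr, i_sl):
    # Discriminator side of Eq. (3): push D(I_LR) above 1 and D(I_SL) below -1
    return (F.relu(1.0 - d_p1(i_lr)).mean()
            + F.relu(1.0 + d_p1(i_sl.detach())).mean())

def g_hl_loss(d_p1, i_hr, i_sl, alpha=1.0, beta=0.05):
    # Pixel term of Eq. (2): l1 between I_SL and the average-pooled I_HR
    target = F.avg_pool2d(i_hr, kernel_size=4)     # match the 4x-downsampled size
    l_pix = F.l1_loss(i_sl, target)
    # Generator side of the adversarial term: make D_P1 rate I_SL as real
    l_adv = -d_p1(i_sl).mean()
    return alpha * l_pix + beta * l_adv            # alpha, beta are placeholders
```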

DTG generator loss

To distinguish between GAN-generated pseudo-textures and real image details, we introduce the artifact mapping Mrefine31 to compute the artifact loss, which helps regulate and stabilize the training process. The method seeks to learn an artifact probability map \(M\in {[0,1]}^{H\times W\times 1}\), indicating the likelihood of each pixel ISR(i, j) being an artifact.

First, we compute the residual between the real HR image IHR and the generated image ISS from the DTG generator to extract high-frequency components:

$$R={I}_{HR}-{I}_{SS}.$$
(4)

Then, the local variance of the residual R is used to obtain the preliminary artifact map M:

$$\begin{array}{c}M\left(i,j\right)=var\left(R\left(i-\frac{n-1}{2}:i+\frac{n-1}{2},\right.\right.\\ \left.\left.j-\frac{n-1}{2}:j+\frac{n-1}{2}\right)\right),\end{array}$$
(5)

where var denotes the variance operator and n is the local window size. To prevent misclassification of local pixels, we further compute a stabilization coefficient σ from the residual R:

$$\sigma ={\left(var\left(R\right)\right)}^{\frac{1}{a}},$$
(6)

where a = 5. Residual maps R1 and R2 are then generated from the outputs of the DTG and DTGEMA generators respectively:

$${R}_{1}={I}_{HR}-DTG\left({I}_{SL}\right),$$
(7)
$${R}_{2}={I}_{HR}-DT{G}_{EMA}\left({I}_{SL}\right).$$
(8)

Finally, the preliminary artifact map M is refined using both residuals R1 and R2 to enable more accurate localization of artifact-prone regions. This refinement does not assume that the EMA-based generator consistently produces fewer artifacts. Instead, it employs a pixel-wise comparison: penalization is applied only when |R1(i, j)| ≥ |R2(i, j)|, suggesting that ISS (generated by DTG) yields a higher reconstruction error at that location. When |R1(i, j)| < |R2(i, j)|, it implies that ISS is already close to the target value at that pixel, and therefore no additional penalty is applied. This selective strategy helps suppress artifacts while preserving high-frequency details.

$$\begin{array}{c}{M}_{{\rm{refine}}}(i,j)=\\ \left\{\begin{array}{l}0,\quad \,\text{if}\,| {R}_{1}(i,j)| < | {R}_{2}(i,j)| \quad \\ \sigma \cdot M(i,j),\quad \,\text{if}\,| {R}_{1}(i,j)| \ge | {R}_{2}(i,j)| .\quad \end{array}\right.\end{array}$$
(9)

As shown in Fig. 6, the last two columns visualize the mask of |R1| < |R2| and the refined artifact map Mrefine, where well-preserved textures and sharp edges are effectively excluded from penalization.

Fig. 6: Visualization of the Artifact Map Generation Process.
figure 6

From left to right: Ground Truth (GT), DTG output (ISS), DTGEMA output (ISSEMA), residual R1 (Eq. 7), residual R2 (Eq. 8), the mask of |R1| < |R2|, and the final refined artifact map Mrefine (Eq. 9).
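Equations (4)-(9) can be sketched compactly as below; the window size n and the reduction of the refined map to a scalar loss are assumptions (a = 5 follows the text).

```python
import torch.nn.functional as F

def local_variance(x, n=7):
    # Eq. (5): per-pixel variance over an n x n window (n = 7 is an assumption)
    mean = F.avg_pool2d(x, n, stride=1, padding=n // 2)
    mean_sq = F.avg_pool2d(x * x, n, stride=1, padding=n // 2)
    return (mean_sq - mean * mean).clamp_min(0.0)

def refined_artifact_map(i_hr, i_ss, i_ss_ema, n=7, a=5.0):
    r = i_hr - i_ss                               # Eq. (4): high-frequency residual
    m = local_variance(r, n)                      # Eq. (5): preliminary artifact map
    sigma = r.var().pow(1.0 / a)                  # Eq. (6): stabilization coefficient
    r1 = (i_hr - i_ss).abs()                      # Eq. (7)
    r2 = (i_hr - i_ss_ema).abs()                  # Eq. (8)
    keep = (r1 >= r2).float()                     # Eq. (9): penalize only where DTG is worse
    return sigma * m * keep

# Artifact loss: reducing M_refine to its mean is an assumption (cf. Eq. 15)
# l_m = refined_artifact_map(i_hr, i_ss, i_ss_ema).mean()
```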

The DTG generator is shared across the forward and backward cycle processes to reconstruct HR images from LR rubbings. In the forward cycle, adversarial loss and cycle-consistency loss are used; in the backward cycle, adversarial loss and pixel loss are adopted. The total loss for the DTG generator is defined as:

$${{\mathcal{L}}}_{DTG}=\theta {{\mathcal{L}}}_{DTG}^{{I}_{SS}}+\gamma {{\mathcal{L}}}_{DTG}^{{I}_{SR}},$$
(10)

where θ and γ are the corresponding weights, and the expressions of \({{\mathcal{L}}}_{DTG}^{{I}_{SS}}\) and \({{\mathcal{L}}}_{DTG}^{{I}_{SR}}\) are defined as follows:

$${{\mathcal{L}}}_{DTG}^{{I}_{SS}}=\alpha \left({{\mathcal{L}}}_{cyc}^{{I}_{SS}}+{{\mathcal{L}}}_{M}\right)+\beta {{\mathcal{L}}}_{adv}^{{D}_{P3}},$$
(11)
$${{\mathcal{L}}}_{DTG}^{{I}_{SR}}=\alpha {{\mathcal{L}}}_{pix}^{{I}_{SR}}+\beta {{\mathcal{L}}}_{adv}^{{D}_{P4}}.$$
(12)

Here, the adversarial losses38 \({{\mathcal{L}}}_{adv}^{{D}_{P3}}\) and \({{\mathcal{L}}}_{adv}^{{D}_{P4}}\) are defined as:

$$\begin{array}{lll}{{\mathcal{L}}}_{adv}^{{D}_{P3}}\;=\;{\mathbb{E}}\left[min\left(0,{D}_{P3}({I}_{HR})-1\right)\right]\\\qquad\quad\;\;\;+\,{\mathbb{E}}\left[min\left(0,-1-{D}_{P3}({I}_{SS})\right)\right],\end{array}$$
(13)
$$\begin{array}{lll}{{\mathcal{L}}}_{adv}^{{D}_{P4}}\;=\;{\mathbb{E}}\left[min\left(0,{D}_{P4}({I}_{HR})-1\right)\right]\\\qquad\quad\;\;\;+\,{\mathbb{E}}\left[min\left(0,-1-{D}_{P4}({I}_{SR})\right)\right].\end{array}$$
(14)

The cycle-consistency loss \({{\mathcal{L}}}_{cyc}^{{I}_{SS}}\)16 ensures the generator can preserve semantic information and accurately restore character details. The pixel loss \({{\mathcal{L}}}_{pix}^{{I}_{SR}}\) is computed between the recovered SR image ISR and the upsampled version of its corresponding input ILR, where upsampling is performed via bicubic interpolation. Although the training data is unpaired, both ISR and the optimization target originate from the same LR image, ensuring consistency within the LR → HR reconstruction path.
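The pixel term of Eq. (12) then reduces to a single comparison, sketched below; the 4× scale factor follows the experimental setting.

```python
import torch.nn.functional as F

def pix_loss_sr(i_sr, i_lr, scale=4):
    # l1 distance between I_SR and the bicubic-upsampled I_LR (Eq. 12, first term)
    target = F.interpolate(i_lr, scale_factor=scale, mode='bicubic',
                           align_corners=False)
    return F.l1_loss(i_sr, target)
```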

The artifact loss \({{\mathcal{L}}}_{{\rm{M}}}\) is derived from the artifact map Mrefine31, which is expressed as:

$${{\mathcal{L}}}_{M}={M}_{refine}.$$
(15)

DTGEMA generator loss

The EMA technique performs temporal smoothing on the parameters of the DTG generator to produce a more stable generator, DTGEMA. Compared with the DTG generator, the DTGEMA generator is more robust in suppressing stochastic artifacts, thereby improving the quality of generated images.

Since DTGEMA is optimized from DTG via parameter smoothing, it adopts the same loss function as the DTG generator. The output images from the generator are passed into the artifact mapping Mrefine31 to compute the artifact loss, which further reduces the loss of real details during optimization. For a detailed description of the loss design, please refer to the “DTG generator loss” subsection above.

GSL generator loss

The reconstructed rubbing image ISR produced by DTG is degraded by GSL into an LR rubbing image ISSL. The loss function of GSL is defined as:

$${{\mathcal{L}}}_{{G}_{SL}}=\alpha {{\mathcal{L}}}_{cyc}^{{I}_{SSL}}+\beta {{\mathcal{L}}}_{adv}^{{D}_{P2}}.$$
(16)

The adversarial loss38 \({{\mathcal{L}}}_{adv}^{{D}_{P2}}\) is defined as:

$$\begin{array}{lll}{{\mathcal{L}}}_{adv}^{{D}_{P2}}\;=\;{\mathbb{E}}\left[min\left(0,{D}_{P2}({I}_{LR})-1\right)\right]\\\qquad\quad\;\;\;+\,{\mathbb{E}}\left[min\left(0,-1-{D}_{P2}({I}_{SSL})\right)\right],\end{array}$$
(17)

where the discriminator DP2 is trained to predict 1 for the real LR rubbing image ILR and 0 for the synthesized LR rubbing image ISSL.

The cycle-consistency loss16 \({{\mathcal{L}}}_{cyc}^{{I}_{SSL}}\) is an \({\ell }_{1}\) loss function that penalizes the difference between the degraded image ISSL and the real LR image ILR.

Results

Experimental settings

Our model is implemented in PyTorch39 version 1.11.0, using Python 3.7 and an NVIDIA L40S GPU. The optimizer is Adam40 with parameters β1 = 0.9 and β2 = 0.999. The initial learning rate is set to 1 × 10−4 and is decayed to 1 × 10−5 by a cosine annealing schedule with a 10-epoch period. The batch size is set to 64.
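The corresponding optimizer setup can be written as the sketch below; stepping the scheduler once per epoch and the stand-in module are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)   # stand-in for the OBISR generators and discriminators
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=10, eta_min=1e-5)   # anneal 1e-4 -> 1e-5 over a 10-epoch period

for epoch in range(10):
    # ... one epoch of adversarial training with batch size 64 ...
    optimizer.step()                      # placeholder for the real update loop
    scheduler.step()
```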

The training and testing data used in our study are respectively sourced from two oracle bone image datasets: the OBC306 dataset41, released by Yinque Wenyuan, and the OBI41957 dataset, provided by the research team of the Key Laboratory of Oracle Bone Inscriptions Information Processing, Ministry of Education, Henan Province, China.

For training, we randomly select 3000 clean images from the OBI41957 dataset as HR reference images, and 3000 low-noise images from the OBC306 dataset as LR inputs.

To comprehensively assess model generalization, we construct three representative test sets:

  • OBI_BI (with noise): 100 clear images are randomly selected from the OBI41957 dataset, and LR images are generated via bicubic downsampling.

  • OBI_AR (without noise): 1000 hand-drawn oracle character images are selected from OBI41957, and LR images are produced through black-white inversion followed by region-based interpolation.

  • OBITrue (with real degradation): 1440 real oracle bone rubbing images are randomly sampled from the OBC306 dataset and used as low-resolution images with authentic degradation characteristics.

We employ both pixel-level and perceptual-level metrics to evaluate the quality of generated results. Specifically, we use PSNR, SSIM32, and VIF as fidelity-oriented metrics. Additionally, to approximate human visual perception, we incorporate two deep-feature-based perceptual similarity metrics: LPIPS33 and DISTS.

Based on these five metrics, we construct a normalized radar chart and additionally propose a “radar area” metric as an intuitive indicator for the overall performance of each method. This metric reflects both the balance and effectiveness of the model across multiple dimensions, facilitating a holistic comparison among different approaches. Since LPIPS and DISTS are inverse metrics, we apply a transformation of 1 − normalized value during normalization to ensure that all metrics are mapped to the [0, 1] range, and follow the principle that higher values indicate better performance. Finally, the area of the pentagonal radar chart is calculated using the vector cross product method.
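The radar-area metric can be reproduced with the shoelace (cross product) formula, assuming the five normalized, higher-is-better scores are placed at equal angular spacing.

```python
import math

def radar_area(scores):
    """Pentagon area from five normalized, higher-is-better scores (shoelace)."""
    n = len(scores)
    pts = [(r * math.cos(2 * math.pi * k / n), r * math.sin(2 * math.pi * k / n))
           for k, r in enumerate(scores)]
    area = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        area += x1 * y2 - x2 * y1            # cross product of consecutive vertices
    return abs(area) / 2.0

print(radar_area([1.0] * 5))                 # upper bound: (5/2)*sin(72 deg) ~ 2.378
```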

To evaluate the perceptual quality of super-resolved images on the OBITrue dataset, which lacks paired high-resolution ground truth, we conducted a Mean Opinion Score (MOS) evaluation. Given the poor suitability of general no-reference quality metrics (e.g., NIQE, BRISQUE) for assessing the visual characteristics of ancient script imagery, we performed a blind subjective assessment involving 16 participants, all with specialized knowledge of oracle bone epigraphy, comprising domain experts and graduate students. They were asked to rate the visual outputs of different methods based on three key criteria: image clarity, structural consistency, and semantic interpretability. Ratings were collected using a five-point Likert scale (1 = Very Poor, 5 = Excellent). The final MOS for each method was obtained by averaging scores across all participants and test samples.

Comparative study

To comprehensively evaluate the performance of the proposed OBISR model, we conduct comparative experiments against several representative state-of-the-art methods, including Bicubic interpolation, CycleGAN16, LRGAN23, HiFaceGAN26, MM-RealSR28, SCGAN17, WFEN29 and NeXtSRGAN14. For a fair comparison, we retrain all comparative methods rigorously to ensure their optimal performance on the target datasets.

Table 1 presents a quantitative comparison between our proposed OBISR model and various state-of-the-art methods on the OBI_BI and OBI_AR datasets, while Fig. 7 presents a comprehensive performance comparison of different methods on these two datasets using radar charts. On both OBI_BI and OBI_AR datasets, OBISR-DTGEMA achieved the largest radar chart area, demonstrating its superior comprehensive performance in structural reconstruction, detail restoration, and perceptual quality. Particularly on the OBI_AR dataset, OBISR-DTGEMA attained a radar area of 2.1888, outperforming OBISR-DTG across all metrics. Furthermore, it demonstrated significant advantages in key metrics such as VIF and PSNR over mainstream methods including NeXtSRGAN14 and WFEN29. These results demonstrate that our method exhibits enhanced generalization capability and robustness in the oracle bone rubbing super-resolution task.

Table 1 Quantitative results of comparative experiments on OBI_BI and OBI_AR datasets
Fig. 7: Normalized radar charts with integrated “radar area” metric.
figure 7

The charts on (a) OBI_BI and (b) OBI_AR datasets compare overall model performance, quantified by the pentagon area, after unifying all metrics (including transformed LPIPS/DISTS) to a “higher-is-better” scale.

Figure 8 shows the qualitative visual results of different methods on the OBI_BI dataset. It can be observed that methods such as Bicubic, CycleGAN16, LRGAN23, HiFaceGAN26, and WFEN29 exhibit significant defects in reconstructing stroke continuity and structural outlines. LRGAN23, MM-RealSR28, and SCGAN17 produce varying degrees of character distortion and even introduce non-existent noise. NeXtSRGAN14 is capable of generating generally clear character structures; however, it still exhibits certain shortcomings in stroke-level details, such as subtle gaps or unnatural protrusions, which compromise the overall structural fidelity. In contrast, the images generated by the DTG generator in our proposed OBISR method exhibit superior structural consistency and detail preservation, resulting in visual outputs that more closely resemble the ground truth high-resolution images.

Fig. 8: Qualitative results of different methods on the OBI_BI dataset.
figure 8

In the figure, OBISR-DTG and OBISR-DTGEMA denote the outputs generated by the DTG and DTGEMA generators within the OBISR model.

Figure 9 further presents the qualitative results on the OBI_AR and OBITrue datasets. Since the OBITrue dataset consists of real-world LR oracle rubbings and is an unpaired dataset, the corresponding ground truth images are not provided. We observe that both Bicubic and HiFaceGAN26 generate overly smooth and blurry results. CycleGAN16 suffers from stroke disconnection on OBI_AR and severe character deformation on OBITrue. Moreover, LRGAN23 introduces varying levels of noise and character artifacts. MM-RealSR28 tends to produce unnatural character structures, and its outputs on the OBI_AR dataset are only partially clear and largely blurred. Both SCGAN17 and NeXtSRGAN14 demonstrate substantial character shape distortion across both datasets. On the OBI_AR dataset, WFEN29 produces visually competitive results. However, evaluations on the OBITrue dataset indicate that its generated images exhibit slight blurring and white noise artifacts, which negatively affect the overall visual perception. In contrast, the proposed OBISR model demonstrates more stable reconstruction performance across both datasets, achieving superior overall quality in terms of image clarity, stroke continuity, and structural integrity.

Fig. 9: Qualitative results of different methods on the OBI_AR and OBITrue datasets.
figure 9

In the figure, OBISR-DTG and OBISR-DTGEMA denote the outputs generated by the DTG and DTGEMA generators within the OBISR model.

To further validate visual quality, Table 2 reports the Mean Opinion Scores (MOS) on the OBITrue dataset. OBISR achieves superior mean MOS scores of 4.53 and 4.36 with the DTG and DTGEMA generators, respectively, exceeding those of all other methods. This evidence verifies that OBISR successfully enhances image sharpness while maintaining the inscriptions’ structural and semantic integrity, demonstrating high effectiveness and robustness for reconstructing real-world rubbing images.

Table 2 Mean Opinion Scores (MOS) of different methods on the OBITrue dataset

Beyond comparative experiments, a series of ablation studies are conducted to further analyze the contribution of individual components in the proposed OBISR model, as described in the following subsections.

Impact of PatchD discriminator count on OBISR performance

To ensure the validity and isolation of each proposed module, this section and the subsequent ablation studies follow a unified experimental principle. The primary objective is to evaluate the individual and joint contributions of the key architectural components, including the DTG generator. To maintain consistency and ensure a fair comparison, all experimental variants are evaluated using a unified DTG generator, with the DTG module either activated or deactivated according to the specific configuration. When the DTG generator is disabled, the generator from SCGAN is employed to produce the inference results. The DTGEMA generator is excluded from this analysis to avoid introducing confounding variables, since its performance has already been validated in the comparative study.

As clearly shown in Table 3, our proposed OBISR model, configured with four PatchD discriminators (DP1, DP2, DP3 and DP4), achieves the top performance in PSNR, SSIM, and VIF. Compared to any other variant, OBISR demonstrates global superiority, attaining comprehensive advantages in all three dimensions. This validates the strong synergy and generalization ability of our designed PatchD discriminator architecture.

Table 3 The quantitative evaluation results of OBISR and its 15 variants with different discriminator configurations on the OBI_BI test set

Although certain variants come close to optimal performance in individual metrics, they generally fail to maintain consistent superiority across all metrics. In contrast, OBISR achieves top scores in all three metrics, indicating stable and overall optimal performance. This further corroborates the significant benefits of employing four PatchD discriminators in restoring image structures and details.

It is worth noting that, from the perspective of sub-module combinations, using individual PatchD discriminators or partial combinations may yield some improvements. However, their overall performance still lags behind that of OBISR. This suggests that localized improvements are insufficient for achieving comprehensive performance gains, and only the full integration of all four discriminators can fully unleash the model’s potential.

In summary, OBISR demonstrates outstanding performance in terms of image quality, structural consistency and visual fidelity, making it the optimal solution among all configurations.

Effect of discriminator depth on reconstruction performance

The use of 16 × 16 and 64 × 64 resolutions is based on the classical 4 × super-resolution setting, which aligns well with the characteristics of oracle bone rubbings, where character regions are generally small and localized. Accordingly, the discriminator depth in our framework is determined based on the spatial characteristics of the input resolution.

For 64 × 64 images, employing a deeper discriminator with six convolutional layers enables the network to capture finer local details and high-frequency artifacts more effectively. Increasing the depth beyond this point, however, would cause the feature maps to shrink below 1 × 1, making it infeasible to retain spatially localized discriminative capability.

For 16 × 16 images, further increasing depth beyond four would cause the feature maps to shrink excessively, leading to unstable training or ineffective convolution operations. To strike a balance between discriminative capacity and architectural feasibility, we thus adopt a six-layer PatchD discriminator for 64 × 64 inputs and a five-layer version for 16 × 16 inputs.

To evaluate the impact of discriminator depth on image quality, we conduct ablation studies across multiple depth configurations. As shown in Table 4, increasing the number of convolutional layers consistently improves PSNR, SSIM, and VIF scores across both the OBI_BI and OBI_AR datasets. These results validate the effectiveness of our resolution-aware depth design in enhancing reconstruction performance.

Table 4 Reconstruction performance comparison of different discriminator depth configurations in OBISR variants

Comparison of upsampling and downsampling strategies in the reconstruction generator

To systematically evaluate the individual contributions of each component in our model, we conduct ablation studies by progressively removing specific modules and analyzing their impact. Given the significant role of the PatchD discriminators in enhancing discriminative capability, which may obscure the effects of other design factors, we choose to base our analysis on the simplified version OBISR-K, where PatchD discriminators are not included. This allows us to focus on the impact of different upsampling and downsampling strategies on reconstruction performance.

Specifically, we design three structural variants based on OBISR-K by adopting different combinations of bicubic and bilinear interpolation for upsampling and downsampling, resulting in four configurations in total. We then conduct quantitative evaluations on two datasets: OBI_BI and OBI_AR. The results are presented in Table 5.

Table 5 Reconstruction performance comparison of different upsampling and downsampling strategies in OBISR-K variants (without PatchD discriminators)

Experimental results show that, on the OBI_BI dataset, the OBISR-K variant using bilinear interpolation for both upsampling and downsampling achieves the best performance in terms of PSNR, SSIM and VIF, indicating that this configuration is more effective in enhancing reconstruction quality under complex degradation with noise. On the OBI_AR dataset, this variant also leads in PSNR (17.5109) and VIF (0.2421), demonstrating consistent and stable superiority.

In summary, although bicubic interpolation may retain structural information more effectively in certain scenarios, bilinear exhibits better overall reconstruction accuracy and visual fidelity, especially in severely degraded images affected by noise.

Performance comparison of structural module designs

Table 6 presents a comprehensive ablation study evaluating the individual and joint contributions of four core modules: EMA, \({{\mathcal{L}}}_{{\rm{M}}}\), DTG, and PatchD. Results on the OBI_BI dataset demonstrate that each module provides distinct improvements to image reconstruction quality. Specifically, EMA alone leads to consistent performance gains, while the artifact loss proves effective even in the absence of EMA, highlighting its independent ability to suppress visual artifacts. The DTG module enhances the recovery of structural details, and PatchD contributes positively when used individually.

Table 6 Ablation study of OBISR with extensive combinations of EMA, LM, DTG, and PatchD

Furthermore, combining these modules yields stronger performance than using them in isolation. In particular, PatchD demonstrates more significant improvements when paired with DTG, suggesting a strong synergy between structural guidance and local discrimination. The full OBISR model, which integrates all four components, achieves the best overall results on both OBI_BI and OBI_AR datasets. These findings confirm not only the effectiveness of each individual module but also the complementarity among them in improving visual fidelity and structural consistency.

Discussion

In this work, we propose OBISR, a generative adversarial network designed for super-resolution reconstruction of oracle rubbing images. Without relying on paired low- and high-resolution data, OBISR enhances structural and detail recovery through a Transformer-based two-stage generator DTG and its improved variant DTGEMA. By incorporating the artifact loss \({{\mathcal{L}}}_{{\rm{M}}}\) and the PatchD discriminator, our model effectively suppresses artifacts and character distortions. Experimental results demonstrate that OBISR outperforms existing methods in both quantitative metrics and qualitative visual assessments, showcasing its superior performance in oracle rubbing image super-resolution tasks. Future research could further optimize the model architecture to enhance its adaptability to real-world degradation patterns, thereby better supporting the digital preservation and scholarly study of oracle bone inscriptions.