Introduction

Dunhuang murals, preserved within the Mogao Caves of China, a UNESCO World Heritage Site, represent an extraordinary achievement of ancient Buddhist art. The costume imagery depicted in these murals not only conveys artistic expression but also serves as vital historical evidence of cultural identity, social hierarchy, and ethnic interaction in ancient China. Costumes, as socio-cultural symbols, carry rich cultural and historical significance. Within the murals, these costume images reflect prevailing aesthetic trends and religious influences, highlighting the multicultural exchanges of the era. However, Dunhuang murals face severe preservation challenges, as shown in Fig. 1. Prolonged exposure to environmental factors such as sandstorms, temperature and humidity fluctuations, and light degradation has led to varying degrees of weathering, fading, and flaking. To better preserve this cultural heritage, digital inpainting technologies have emerged as a viable alternative1,2,3. These methods allow for high-fidelity visual reconstructions of mural details, offering valuable resources for researchers studying the evolution and cultural context of ancient costumes.

Fig. 1
figure 1

Display of damaged Dunhuang murals.

Currently, digital image inpainting approaches fall into three main categories: partial differential equation (PDE)-based methods, patch-based methods, and deep learning-based techniques. Traditional image inpainting methods typically rely on mathematical models to simulate the diffusion of pixel information. Bertalmio et al.4 introduced a pioneering approach using PDEs to propagate neighboring pixel values into the missing region. These diffusion-based algorithms perform well in repairing small-scale damage, such as lines or narrow gaps, but struggle to reconstruct semantically meaningful content in larger missing areas. Another classical category is patch-based inpainting. Criminisi et al.5 proposed a method that uses priority-based filling of missing regions by searching for similar patches in the undamaged portions of the image. This technique considers both texture and structure continuity during patch matching and synthesis. While more effective than diffusion methods for slightly larger missing regions, patch-based inpainting still falls short when the missing content cannot be found elsewhere in the image. Furthermore, the absence of high-level semantic understanding often leads to visual artifacts and unnatural transitions in complex scenes.

Recent advances in deep learning, especially convolutional neural networks (CNNs), have significantly enhanced image inpainting capabilities. CNN-based inpainting models learn to predict missing content using contextual information from the entire image. Early approaches like ContextEncoder6 and Partial Convolution (PartialConv)7 introduced encoder-decoder frameworks that infer missing regions through feature learning. Morotti et al.8 proposed the DIP-ST method, which combines Deep Image Prior (DIP) with Style Transfer (ST) techniques, offering an innovative solution for restoring large damaged areas in artworks. The key innovation lies in its style loss function based on Gram matrices, where a pre-trained VGG network extracts stylistic features from reference images to enforce stylistic coherence between inpainted regions and original artworks. Unlike traditional data-driven methods requiring large training sets, DIP-ST achieves semantic-level inpainting with only a single style reference image, making it particularly suitable for cultural heritage scenarios with scarce training data. The recently proposed LaMa9 introduces Fast Fourier Convolution (FFC)10 into image inpainting. FFC gives the fully convolutional network a global receptive field in its early layers and the ability to reproduce robust repeating textures. The UFFC module proposed by Chu et al.11 marks a pivotal advancement in frequency-domain learning for image inpainting. That work provides deep insights into the inherent limitations of conventional FFC in low-level vision tasks, establishing a novel “unbiased frequency learning” paradigm. The architecture integrates learnable range transformation, absolute positional encoding, dynamic skip connections, and adaptive clipping to resolve fundamental issues such as spectrum shifting and aberrant spatial activation. It also reveals the intrinsic relationship between frequency-domain activation functions and inpainting quality, offering new theoretical perspectives for frequency-domain neural network design.

Generative Adversarial Networks (GANs) further advanced this field by training a generator-discriminator pair, enabling the generation of photo-realistic restorations12. Models such as EdgeConnect13, StructureFlow14, and Foreground-aware inpainting use multi-stage architectures to first recover structural outlines or edges and then fill textures guided by this structure15. STNet16 innovatively constructs a dual-reconstruction network architecture for texture and structure by simulating professional mural restoration workflows. It computationally translates the “contour-first then coloring” manual process into deep learning frameworks through synergistic optimization between the Texture Reconstruction Network (TR) and the Structure Reconstruction Network (SR). The core innovation lies in the joint design of bidirectional gated fusion mechanisms and dual attention modules, which simultaneously maintain structural coherence and restore fine-grained textures, demonstrating particular efficacy in digital restoration of complex artifacts like traditional Chinese murals. Tu et al.17 proposed RGTGAN to address three core challenges in reference-based super-resolution (Ref-SR) for remote sensing through three corresponding innovations. Ma et al.18 proposed a Structure-Preserving Super-Resolution (SPSR) method that revolutionizes GAN-based image super-resolution through gradient guidance. The study innovatively constructs a dual-branch architecture: the gradient branch transforms LR gradient maps into HR versions via multi-level feature fusion to provide explicit structural priors, while the image branch integrates gradient information with enhanced RRDB modules for joint optimization of texture and geometry. Its breakthrough lies in the gradient-space constraint that enforces second-order pixel relationships through differential loss functions, effectively addressing structural distortions common in GAN-based methods.

More recently, transformer-based models19,20,21,22,23 and diffusion models (e.g., RePaint24 using Denoising Diffusion Probabilistic Models (DDPMs)25) have been applied to image restoration. DDPM-based inpainting adds noise to the image texture during the forward process and recovers the masked regions from the unmasked texture via the reverse denoising process. Although such models generate meaningful semantics, existing methods suffer from a semantic discrepancy between the masked and unmasked regions: the semantically dense unmasked texture is not fully degraded, while the masked regions become pure noise during diffusion, leading to a large discrepancy between them. Liu et al.26 investigate how the unmasked semantics guide the texture denoising process and how to address this semantic discrepancy, so as to facilitate consistent and meaningful semantics generation. To this end, they propose StrDiffusion, a structure-guided diffusion model for image inpainting that reformulates the conventional texture denoising process under structure guidance to derive a simplified denoising objective. Diffusion models have been widely applied to image generation in recent years27,28,29 and achieve good generation quality.

At present, digital mural restoration methods are likewise based on the above three categories, with the latter two being the most commonly used. The pixel information diffusion-based method uses a partial differential diffusion model to transfer information from the non-missing area of the Dunhuang mural to the missing area. For example, Huang et al.30 used a total variation model to restore the structural image corresponding to the Dunhuang mural image. The pixel information diffusion-based method can repair small missing areas of a mural, but for missing areas with more complex content, such as structural regions containing different color blocks, it may generate blurred content. The image block-based method mainly searches for one or more image blocks related to the missing mural content in the non-missing area and, after processing, fills them into the missing area. This type of block-matching method can fill reasonable content into missing areas with a single, uniform color or a simple structure, but it cannot fill missing areas with content that does not exist in the non-missing area. However, the missing areas of Dunhuang murals often contain content that is difficult to obtain from the non-missing areas.

Image restoration methods based on deep learning can achieve relatively good results in restoring murals. CNNs have a strong ability to represent image content: they can extract features from damaged mural images and generate the content of the missing area to obtain a complete mural image. Yu et al.31 applied the partial convolutional network PartialConv, pre-trained on the Places2 dataset, to the restoration of Dunhuang murals. Li and Lv et al.32 proposed a two-stage network based on line-drawing color and contour content for the restoration of Dunhuang murals. In the Dunhuang cultural heritage protection project, Yu et al. experimented with PartialConv and EdgeConnect for the restoration of Dunhuang mural images and achieved certain restoration effects. The TSBGNet proposed by Lian et al.33 establishes a texture-structure bidirectional generation framework that transcends the limitations of conventional inpainting methods in global consistency preservation and fine-grained texture recovery. The study integrates Pulse-Coupled Neural Networks (PCNN) with deep learning, developing the TE-FCMSPCNN module for texture optimization, while constructing a global interaction mechanism between structural and textural features through the Bidirectional Information Flow (BIF) module; the approach yields promising results on the Dunhuang Mogao Grottoes Mural dataset. Hou et al.34 proposed a Dunhuang mural inpainting network integrating texture and geometric features, which achieves refined restoration through a dual-subnetwork architecture (PIN+REN). Its dual-encoder structure extracts geometric features from line drawings and texture features from damaged images, where a Mamba-enhanced encoding module significantly improves long-range dependency modeling via bidirectional scanning. The DMSFM module employs multi-scale feature fusion with dynamic gated attention, effectively addressing structural discontinuities in complex textures. This approach demonstrates strong detail restoration while preserving artistic style consistency, with visual superiority in caisson patterns and drapery details. Ren and Zhao et al. selected GAN and diffusion models as the basis for improved designs suited to the content characteristics of Dunhuang mural images35,36,37. Ren proposed a GAN model that combines a parallel dual-convolution feature-extraction generator with a ternary heterogeneous joint discriminator to achieve effective restoration of mural images. Zhao exploited the diffusion model and proposed a new mural image restoration architecture that combines a heterogeneous UNet structure with two key modules to address the specific challenges of restoring ancient murals. Traditional convolution operations, limited by their local receptive field, struggle to model long-range structural dependencies, which sometimes leads to texture breaks or semantic inconsistencies in mural restoration results. Although non-local attention mechanisms such as Transformers and diffusion models attempt to alleviate this problem, their high computational complexity and overhead make them difficult to deploy in practice.

Based on the challenges in restoring Dunhuang mural costumes—such as complex textures, large-area damage, and long-range structural dependencies—this paper proposes an improved deep learning framework for mural image inpainting. The main contributions of this work can be summarized as follows. First, we develop a hybrid framework that integrates the EdgeConnect model with FFC, enabling effective restoration of large missing areas with complex textures in Dunhuang mural costume images. Second, frequency-domain information is incorporated into the edge prediction stage through FFC modules, which significantly enhances the model’s ability to capture global structures while ensuring better edge consistency and texture realism. Third, we construct a high-quality Dunhuang mural costume dataset with both regular and irregular masks for training and testing, and conduct comprehensive experimental evaluations, including ablation studies, which demonstrate superior performance over existing state-of-the-art methods. This work provides both a practical and theoretical basis for advancing the digital restoration of heritage artwork, particularly in culturally significant but structurally complex mural images.

Methods

Overview of the proposed framework

This paper proposes a key improvement to the downsampling module of the first-stage generator (G1) in the EdgeConnect model, as shown in Fig. 2. In the original EdgeConnect architecture, the G1 stage uses two ordinary stride-2 convolutional layers to downsample the input features, gradually reducing the resolution of the feature map from H × W to \(\frac{H}{4}\times \frac{W}{4}\) through the local receptive field of 3 × 3 convolution kernels. Although this traditional spatial-domain convolution is computationally efficient, its inherently local nature limits the model’s ability to capture the global structure and frequency-domain characteristics of the image, especially when processing images such as Dunhuang murals that exhibit complex long-range dependencies (such as crack textures spanning dozens of pixels) and distinctive frequency-domain features (such as high-frequency grain from mineral pigments).

Fig. 2
figure 2

The improved model proposed in this paper.

To overcome this limitation, we replace these two ordinary convolutional layers with Fast Fourier Convolution (FFC) modules, as shown in Table 1.

Table 1 Structural parameters of the improved G1 generator downsampling stage

The core innovation of the FFC module is to decompose the traditional spatial convolution into parallel local and global paths: the local path retains 1 × 1 convolution for feature transformation in the channel dimension, while the global path maps the features to the frequency domain for processing through two-dimensional Fast Fourier transform (FFT), as shown in Table 2.

Table 2 Global branch contains Spectral Transform parameter table

Specifically, for the input feature map \(x\in {{\mathbb{R}}}^{H\times W\times C}\), the FFC module first adjusts the channel dimension through 1 × 1 convolution, then performs a 2D FFT with zero padding on the adjusted features to convert them into a complex spectrum \(X\in {{\mathbb{C}}}^{H\times W\times C{\prime} }\). In the frequency domain, the learnable complex weight matrix \(W\in {{\mathbb{C}}}^{H\times W\times C{\prime} \times C{\prime\prime} }\) implements global filtering Y = X · W through element-wise complex multiplication. Finally, the processed spectrum is converted back to the spatial domain through the inverse FFT and cropped to the target resolution (\(\frac{H}{2}\times \frac{W}{2}\)). The entire process maintains strict frequency-space domain symmetry, and the spectral norm of the complex weights is constrained to \(\parallel W{\parallel }_{2}\le 1\) via spectral normalization to ensure training stability.
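For illustration, the following minimal PyTorch sketch shows how the two stride-2 convolutions in G1’s downsampling stage could be replaced by FFC-style blocks with a local (spatial) branch and a global (spectral) branch. The class name `FFCDownsample`, the channel widths, and the simple additive fusion are our assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class FFCDownsample(nn.Module):
    """Hypothetical stride-2 FFC block: a local (spatial) branch plus a
    global (frequency-domain) branch, fused by addition."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Local branch: ordinary spatial convolution with stride 2.
        self.local = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        # Global branch: 1x1 channel mixing applied to the real/imaginary spectrum.
        self.freq_conv = nn.Conv2d(2 * in_ch, 2 * out_ch, kernel_size=1)
        self.down = nn.AvgPool2d(kernel_size=2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        local = self.local(x)
        # Real 2D FFT -> stack real/imag as channels -> 1x1 conv -> inverse FFT.
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = torch.cat([spec.real, spec.imag], dim=1)
        spec = self.freq_conv(spec)
        real, imag = spec.chunk(2, dim=1)
        glob = torch.fft.irfft2(torch.complex(real, imag),
                                s=x.shape[-2:], norm="ortho")
        glob = self.down(glob)          # match the stride-2 local branch
        return self.act(local + glob)   # fuse spatial and spectral paths

# Two FFC downsampling stages: H x W -> H/2 x W/2 -> H/4 x W/4
# (channel widths are assumed, following a typical EdgeConnect configuration).
g1_down = nn.Sequential(FFCDownsample(64, 128), FFCDownsample(128, 256))
```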

Network architecture and key components

EdgeConnect is a two-stage image inpainting framework designed to improve restoration quality by explicitly modeling structure. The model decomposes the restoration process into two sub-tasks, edge prediction and image filling, as shown in Fig. 3.

Fig. 3
figure 3

The EdgeConnect Framework adopted in this paper.

The first stage uses an edge generation network to predict the edge map of the missing regions in the image. A U-Net generator is used for this task, where the visible parts of the original image and the mask are utilized to predict the structural boundaries of the damaged areas. The primary goal of this stage is to infer reasonable structural contours from the incomplete context, ensuring that the generated edges align with the geometric features of the intact regions. The loss function in this stage combines L1 loss and adversarial loss, which helps ensure that the generated edges are both geometrically consistent and visually plausible. The network is trained using a GAN structure, with a discriminator used to distinguish between real and synthetic edge images, further enhanced by feature matching loss to improve training stability and edge coherence.

The second stage uses the edge map generated in the first stage to guide the completion of the image content. This is done using another U-Net generator. To achieve more natural and realistic restoration results, the loss function in this stage is more complex, including pixel-level L1 loss, perceptual loss based on a pre-trained VGG network, style loss, and image-level adversarial loss. These losses ensure that the restored region seamlessly integrates with the surrounding area in terms of texture, color, and semantics, thus completing the image’s texture and color information.
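As a rough illustration of this two-stage pipeline, the hedged sketch below composes an edge generator and an image generator at inference time; the function name and the exact input channel layout are assumptions rather than the official EdgeConnect code.

```python
import torch

def inpaint_two_stage(edge_generator, image_generator, image, mask, gray, edges):
    """Hypothetical EdgeConnect-style inference: predict edges for the hole,
    then fill RGB content guided by the composited edge map.
    `mask` is 1 inside missing regions and 0 elsewhere."""
    # Stage 1: hallucinate edges inside the hole from the visible context.
    masked_gray = gray * (1 - mask)
    masked_edges = edges * (1 - mask)
    pred_edges = edge_generator(torch.cat([masked_gray, masked_edges, mask], dim=1))
    comp_edges = edges * (1 - mask) + pred_edges * mask   # keep known edges

    # Stage 2: fill color and texture conditioned on the completed edge map.
    masked_image = image * (1 - mask)
    pred_image = image_generator(torch.cat([masked_image, comp_edges], dim=1))
    return image * (1 - mask) + pred_image * mask         # paste result into the hole
```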

To address the global dependency issue in mural restoration, this study introduces FFC into the edge-guided framework. Specifically, in the first stage of the EdgeConnect model, the standard convolutions of the downsampling stage are replaced by FFCs. The FFC module divides features into two branches, Local (spatial domain) and Global (frequency domain), as shown in Fig. 4. The Global branch incorporates a Fourier transform (Real FFT2d + Convolution + IFFT) to achieve frequency-domain modeling, which enhances the model’s ability to capture global structural information.

Fig. 4
figure 4

Structure of the Fourier convolution module.

Its core process can be divided into three key stages: frequency domain transformation, frequency domain filtering, and inverse transformation. Specifically, for the input feature map \({\bf{x}}\in {{\mathbb{R}}}^{H\times W\times C}\), a two-dimensional Fast Fourier Transform (2D FFT) is first performed on it to convert the spatial domain information into the frequency domain. The mathematical expression of Fourier transform is shown in Eq. (1).

$$F(x)(u,v,c)=\mathop{\sum }\limits_{h=0}^{H-1}\mathop{\sum }\limits_{w=0}^{W-1}x(h,w,c){e}^{-2\pi i\left(\frac{hu}{H}+\frac{wv}{W}\right)}$$
(1)

where u and v represent the horizontal and vertical frequency indices, respectively, and c indexes the input channels. After the transformation, the input feature map is encoded as a complex spectrum \(F(x)\in {{\mathbb{C}}}^{H\times W\times C}\), whose real and imaginary parts correspond to the cosine and sine components of the frequency-domain amplitude, respectively.

To ensure effective frequency-domain filtering, the input is zero-padded to match the target output size of the network layer, usually to the same resolution as the input, to avoid spectral aliasing. In the frequency-domain filtering stage, a learnable complex weight matrix \(W\in {{\mathbb{C}}}^{H\times W\times C\times {C}^{{\prime} }}\) acts as the frequency-domain filter, with an independent complex parameter for each spatial position (u, v) and channel combination \((c,{c}^{{\prime} })\). The filtering operation is implemented by element-wise complex multiplication, as shown in Eq. (2).

$$Y\left(u,v,{c}^{{\prime} }\right)=\mathop{\sum }\limits_{c=1}^{C}F(x)(u,v,c)\cdot W\left(u,v,c,{c}^{{\prime} }\right)$$
(2)

To optimize training stability, the complex weights W are split into a real part Wreal and an imaginary part Wimag, which are initialized using the He normal distribution and updated independently by gradient descent. After the frequency-domain multiplication, an Inverse Fast Fourier Transform (IFFT) is applied to convert the result back to the spatial domain, as shown in Eq. (3).

$$y=\mathrm{Re}\,\left({F}^{-1}(Y)\right)\in {{\mathbb{R}}}^{H\times W\times {C}^{{\prime} }}$$
(3)

Only the real part of the inverse transform is retained, since the input feature map is real-valued and the imaginary part of the result is therefore (numerically) zero. This paired FFT and IFFT operation ensures that the output feature map’s spatial size matches the input, so the module integrates seamlessly into the existing network architecture.
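The following sketch illustrates Eqs. (1)-(3) directly: a real FFT, element-wise multiplication with a learnable complex weight W(u, v, c, c′), and an inverse FFT whose real-valued output is kept. The shapes and the small random initialization are simplified assumptions made for readability.

```python
import torch
import torch.nn as nn

class SpectralFilter(nn.Module):
    """Illustrative implementation of Eqs. (1)-(3): per-frequency, learnable
    complex filtering of a real feature map (shape conventions are assumed)."""
    def __init__(self, height, width, in_ch, out_ch):
        super().__init__()
        w_freq = width // 2 + 1                    # rfft2 keeps half the spectrum
        # Real and imaginary parts of W(u, v, c, c'), trained independently.
        self.w_real = nn.Parameter(torch.randn(height, w_freq, in_ch, out_ch) * 0.02)
        self.w_imag = nn.Parameter(torch.randn(height, w_freq, in_ch, out_ch) * 0.02)

    def forward(self, x):                          # x: (B, C, H, W), real-valued
        spec = torch.fft.rfft2(x, norm="ortho")    # Eq. (1): F(x), complex spectrum
        spec = spec.permute(0, 2, 3, 1)            # (B, H, W', C) for channel mixing
        weight = torch.complex(self.w_real, self.w_imag)
        # Eq. (2): Y(u, v, c') = sum_c F(x)(u, v, c) * W(u, v, c, c')
        filtered = torch.einsum("bhwc,hwcd->bhwd", spec, weight)
        filtered = filtered.permute(0, 3, 1, 2)    # back to (B, C', H, W')
        # Eq. (3): inverse FFT; only the real-valued result is kept.
        return torch.fft.irfft2(filtered, s=x.shape[-2:], norm="ortho")
```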

Dataset construction

To ensure the success of deep learning-based restoration, the Dunhuang mural costume dataset must be comprehensive, accurate, consistent, and reliable.

The dataset is derived from the “Complete Collection of Chinese Dunhuang Murals,” which includes a wide variety of mural images from different historical periods and artistic styles. This diversity provides a rich sample pool that comprehensively covers the characteristics of Dunhuang murals. The murals were digitized using high-resolution scanning equipment capable of capturing fine visual details, which enhances the authenticity of the data and provides accurate representations of damaged areas, essential for model training and evaluation. All images are standardized to a resolution of 256 × 256 pixels, and filenames are assigned in a continuous 8-digit format starting from 00000000.png. This normalization ensures uniform training input and facilitates reproducibility, as shown in Fig. 5. The image sources are from the Dunhuang Academy’s officially published collection, making the dataset trustworthy and credible for academic and practical applications. The final dataset contains 49973 training images and 5589 test images.

Fig. 5
figure 5

Example of mural dataset construction.
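As a sketch of the preprocessing described above (resizing to 256 × 256 and assigning sequential 8-digit filenames), one might use something like the following; the paths and the bicubic resampling choice are placeholders.

```python
from pathlib import Path
from PIL import Image

def build_dataset(src_dir: str, dst_dir: str, size: int = 256) -> None:
    """Resize scanned mural crops to size x size pixels and assign sequential
    8-digit filenames (00000000.png, 00000001.png, ...)."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    sources = sorted(Path(src_dir).glob("*.*"))
    for idx, path in enumerate(sources):
        img = Image.open(path).convert("RGB")
        img = img.resize((size, size), Image.BICUBIC)
        img.save(dst / f"{idx:08d}.png")

# Example with placeholder paths:
# build_dataset("raw_scans/", "dunhuang_costume_256/")
```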

Two types of mask datasets are commonly used in image inpainting: regular and irregular masks. While regular masks often take the form of centered rectangles or geometric shapes, irregular masks are more representative of real-world damage scenarios.

Given the irregular nature of damage in Dunhuang murals, this study employs non-uniform, free-form masks to simulate real-world deterioration. We use two widely adopted public datasets: the irregular mask set provided by Nvidia and a free-hand drawing mask dataset. These resources offer diverse shapes and distributions, as shown in Fig. 6, closely approximating authentic mural damage and ensuring the robustness of our model during training.

Fig. 6
figure 6

Examples of irregular masks used for training.
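A minimal sketch of how such irregular masks might be composited with mural crops during training is given below; the white-fill convention and the file paths are assumptions.

```python
import numpy as np
from PIL import Image

def apply_mask(image_path: str, mask_path: str):
    """Composite an irregular mask over a mural crop: returns the masked
    image and the binary mask (1 = missing region). Paths are placeholders."""
    img = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32) / 255.0
    mask = np.asarray(Image.open(mask_path).convert("L").resize(img.shape[1::-1]))
    mask = (mask > 127).astype(np.float32)[..., None]   # H x W x 1, 1 inside holes
    masked = img * (1.0 - mask) + 1.0 * mask            # fill holes with white
    return masked, mask
```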

Loss function

The loss function of the model consists of two parts, which come from the G1 edge generation stage and the G2 image restoration stage.

G1 is trained adversarially to encourage the generated edge map to be indistinguishable from the real edge map. The adversarial loss is shown in Eq. (4), where C is the ground-truth edge map, \(\hat{C}\) is the edge map output by the generator, and D is the discriminator.

$${{\mathcal{L}}}_{{\rm{adv}}}^{{\rm{edge}}}={{\mathbb{E}}}_{(I,C)}[\log D(C)]+{{\mathbb{E}}}_{(I,\hat{C})}[\log (1-D(\hat{C}))]$$
(4)

The feature matching loss is used to stabilize training by comparing the feature representations of the generated and real edge maps in the intermediate layers of the discriminator, as shown in Eq. (5).

$${{\mathcal{L}}}_{FM}^{edge}={\mathbb{E}}\left[\mathop{\sum }\limits_{i=1}^{L}\frac{1}{{N}_{i}}{\left\Vert {D}^{(i)}(C)-{D}^{(i)}(\hat{C})\right\Vert }_{1}\right]$$
(5)
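A hedged sketch of how the generator-side adversarial loss of Eq. (4) and the feature matching loss of Eq. (5) could be computed from discriminator outputs and intermediate features is shown below; the input format (lists of feature tensors plus a logit map) is assumed.

```python
import torch
import torch.nn.functional as F

def edge_stage_losses(disc_real_feats, disc_fake_feats, disc_fake_out):
    """Sketch of the generator-side losses in Eqs. (4)-(5): a non-saturating
    adversarial term plus L1 feature matching over the discriminator's
    intermediate activations (given as lists of tensors)."""
    # Eq. (4), generator side: push D(C_hat) towards the "real" label.
    adv = F.binary_cross_entropy_with_logits(
        disc_fake_out, torch.ones_like(disc_fake_out))
    # Eq. (5): layer-wise L1 distance between real and generated features.
    fm = sum(F.l1_loss(ff, fr.detach())
             for fr, ff in zip(disc_real_feats, disc_fake_feats))
    return adv, fm / max(len(disc_fake_feats), 1)
```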

The image restoration stage uses the following combination of loss functions. The reconstruction loss (L1 loss) measures the pixel difference between the generated image and the original image in the restored area, as shown in Eq. (6), where M is the mask, I is the original image, and \(\hat{I}\) is the generated image.

$${{\mathcal{L}}}_{L1}^{{\rm{image}}}=\parallel M\odot (I-\hat{I}){\parallel }_{1}$$
(6)

In addition, perceptual loss is used to compare high-level features extracted from a pre-trained VGG network. This ensures that the generated image retains perceptual similarity to the original image, as shown in Eq. (7).

$${{\mathcal{L}}}_{{\rm{perc}}}^{{\rm{image}}}=\sum _{i}{\left\Vert {\phi }_{i}(I)-{\phi }_{i}(\hat{I})\right\Vert }_{1}$$
(7)

The style loss is shown in Eq. (8); it enforces consistency of image style via Gram matrices.

$${{\mathcal{L}}}_{{\rm{style}}}^{{\rm{image}}}=\sum _{i}{\left\Vert G\left({\phi }_{i}(I)\right)-G\left({\phi }_{i}(\hat{I})\right)\right\Vert }_{1}$$
(8)
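For concreteness, the following sketch computes the perceptual loss of Eq. (7) and the Gram-matrix style loss of Eq. (8) from lists of VGG activations \({\phi }_{i}(I)\) and \({\phi }_{i}(\hat{I})\); how those activations are extracted is omitted, and the helper names are ours.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix of a feature map (B, C, H, W), as used in the style loss."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def perceptual_and_style_loss(vgg_feats_real, vgg_feats_fake):
    """Eqs. (7)-(8) over lists of VGG activations phi_i(I) and phi_i(I_hat)."""
    perc = sum(F.l1_loss(ff, fr) for fr, ff in zip(vgg_feats_real, vgg_feats_fake))
    style = sum(F.l1_loss(gram_matrix(ff), gram_matrix(fr))
                for fr, ff in zip(vgg_feats_real, vgg_feats_fake))
    return perc, style
```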

This stage also uses adversarial training to encourage the generated images to be more realistic. The adversarial loss is shown in Eq. (9), where C is the original image, \(\hat{C}\) is the restored image output by the generator, and D is the discriminator.

$${{\mathcal{L}}}_{{\rm{adv}}}^{{\rm{image}}}={{\mathbb{E}}}_{(I,C)}[\log D(C)]+{{\mathbb{E}}}_{(I,\hat{C})}[\log (1-D(\hat{C}))]$$
(9)

The final loss function is a weighted sum of the individual losses, as shown in Eq. (10).

$${{\mathcal{L}}}_{{\rm{total}}}={\lambda }_{1}{{\mathcal{L}}}_{L1}^{{\rm{image}}}+{\lambda }_{2}{{\mathcal{L}}}_{{\rm{perc}}}^{{\rm{image}}}+{\lambda }_{3}{{\mathcal{L}}}_{{\rm{style}}}^{{\rm{image}}}+{\lambda }_{4}{{\mathcal{L}}}_{{\rm{adv}}}^{{\rm{image}}}$$
(10)
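A minimal sketch of the weighted combination in Eq. (10) follows; the default lambda values shown are illustrative placeholders, not the weights used in our experiments.

```python
def total_loss(l1, perc, style, adv, lambdas=(1.0, 0.1, 250.0, 0.1)):
    """Eq. (10): weighted sum of the second-stage losses. The default lambdas
    are illustrative placeholders only."""
    lam1, lam2, lam3, lam4 = lambdas
    return lam1 * l1 + lam2 * perc + lam3 * style + lam4 * adv
```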

Training details

All experiments in this article were conducted on a Lenovo Legion laptop (2023), with an NVIDIA GeForce RTX 4060 GPU and a 13th Gen Intel(R) Core(TM) i7-13650HX 2.60 GHz processor.

For model training, the input image size is 256 × 256 pixels with a batch size of 16. The Adam optimizer is used, with beta parameters set to 0.1 and 0.9. The learning rates for the generator and the discriminator are both set to 0.0001, and the network was trained for approximately 10,000 iterations.
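For reference, an optimizer setup consistent with these reported hyper-parameters might look as follows; the placeholder networks merely stand in for the actual generators and discriminators.

```python
import torch
import torch.nn as nn

# Placeholder networks standing in for the actual generators/discriminators.
generator = nn.Conv2d(3, 3, kernel_size=3, padding=1)
discriminator = nn.Conv2d(3, 1, kernel_size=3, padding=1)

# Adam with the reported betas (0.1, 0.9) and learning rate 1e-4 for both networks.
gen_opt = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.1, 0.9))
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.1, 0.9))

batch_size, image_size, num_iterations = 16, 256, 10_000
```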

Results

To validate the effectiveness of the proposed method and demonstrate the indispensable role of the Fourier Convolution module in improving mural restoration performance, random masks were added to the 5589 mural test images for restoration tests. These tests were compared with several classical image inpainting models, including DeepFillv138, DeepFillv239, GLNet40, and PCNet41. Additionally, ablation experiments were conducted using the baseline model with standard convolutions for comparison. The results were analyzed both subjectively and objectively from the perspective of image restoration quality.

Comparison experiment

In this experiment, different types of masks were selected to simulate various forms of damage, including localized small-area loss, stripe-like crack patterns, and large-scale damage affecting nearly the entire image. The restoration results produced by each model under these conditions are shown in Fig. 7. From the comparison results, it can be observed that DeepFillv1 and DeepFillv2 tend to produce noticeable artifacts or edge distortions in areas with complex textures, such as floral borders and garment folds. GLNet exhibits severe blurring and incomplete texture reconstruction when faced with large-area damage. PCNet shows some improvements but still suffers from limitations in color blending and fine detail preservation.

Fig. 7
figure 7

Comparison of results of different repair models for small area damage.

In the first row, which contains intricate floral patterns along the edges, DeepFill-based methods produce blurry boundaries or unnatural color patches, while PCNet fails to reconstruct the petal contours. In contrast, our method successfully restores the sharp outlines of the black petals and the surrounding green texture, resulting in a reconstruction closer to the original appearance. The second row features a carpet-like border pattern composed of straight geometric shapes and texture variations. GLNet and DeepFillv2 both produce boundary breaks or structural misalignments, while our method demonstrates clear advantages in maintaining texture continuity and edge sharpness. The fourth row presents a particularly challenging case involving not only the garment but also the facial area of a Buddhist figure. While our method performs relatively well in reconstructing the facial contour and avoids obvious artifacts or ghosting, it displays noticeable color distortion in the skin tones. The restored facial region appears colder in hue compared to the original, which compromises the realism of the facial expression. This suggests a current limitation of the model in semantically rich regions like human faces, where the global texture-guided mechanism may fall short. Future improvements are needed to enhance its semantic awareness and perceptual fidelity. In the last two rows, the damage is extensive. Both the lower garment and the red architectural columns in the background involve complex color gradients. All baseline models suffer from texture blurring or inconsistent color reconstruction, making it difficult to restore the hierarchical layering and color structures of Dunhuang mural costumes. Our method, by contrast, demonstrates superior capability in preserving such visual details even under large-area occlusions.

Structural Similarity Index (SSIM) measures the similarity between two images in terms of structure, texture, and brightness, with higher values indicating more similarity. Peak Signal-to-Noise Ratio (PSNR) quantifies the pixel difference between the restored and original images, with higher values indicating better restoration quality. The Mean Absolute Error (MAE) is a commonly used metric for measuring the difference between predicted values and ground truth. In the context of image restoration tasks, a smaller MAE indicates that the restored result more closely approximates the original image.
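As a reference, simple NumPy implementations of PSNR and MAE (with SSIM delegated to an existing library) are sketched below; pixel values are assumed to lie in [0, 1].

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = float(np.mean((a - b) ** 2))
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def mae(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute error between the restored and original images."""
    return float(np.mean(np.abs(a - b)))

# SSIM is typically taken from an existing implementation, e.g.
# skimage.metrics.structural_similarity(a, b, channel_axis=-1, data_range=1.0)
```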

As shown in Table 3, DeepFillv1 and DeepFillv2 both achieve PSNR values around 21; however, their SSIM scores are relatively low. Notably, although DeepFillv2 exhibits slightly better performance in terms of MAE, its SSIM is only 0.7352, which indicates a lack of structural consistency. Meanwhile, GLNet and PCNet demonstrate moderate performance across all three metrics, with PCNet achieving the lowest PSNR and SSIM values of 20.18 and 0.7051, respectively. By contrast, after incorporating the FFC module into EdgeConnect, the restoration performance is significantly enhanced. Specifically, PSNR increases to 23.51, SSIM rises to 0.9651, and MAE decreases to 0.0663, thereby achieving the best results among all compared methods.

Table 3 Comparison of repair indicators of different models

In addition to its advantages in quantitative metrics, our method exhibits significantly better subjective visual performance in restoring the details of Dunhuang mural costume images. Complex ornamental patterns and facial features of the Buddhist figures are more accurately preserved, retaining critical visual elements. This fidelity is especially valuable for subsequent art historical research, such as costume typology studies and iconographic interpretation. Moreover, the high-quality reconstructions generated by our model offer substantial support for digital presentation, cultural heritage preservation, and public dissemination. They enable viewers to more fully appreciate the visual richness and aesthetic depth of Dunhuang art, bridging the gap between damaged originals and modern digital experiences.

Ablation experiment

Figure 8 presents visual comparisons of six representative samples used in the ablation study, evaluating the effectiveness of integrating the FFC module. Each row shows, from left to right: the ground truth, masked image, baseline model (without FFC), and our full model. While the baseline model demonstrates basic structure recovery, it often fails to restore fine textures, precise contours, and large-area consistency. In contrast, the complete model with FFC consistently achieves superior visual quality. Notable observations include:

Fig. 8
figure 8

Comparison of the repair effects of this method and the baseline.

In the first row, the repetitive geometric pattern near the edge is only coarsely reconstructed by the baseline, with unclear borders and faded colors. Our method restores sharper edges and more natural color transitions. The second row features thin black lines on a green background. The baseline blurs these boundaries and flattens the overall texture, whereas the full model guided by frequency-domain information maintains clearer separation. The fourth row contains facial features of a Buddhist figure. The baseline suffers from blurry eyes, asymmetric mouth lines, and a stiff expression. Our model better reconstructs facial structure and expression, with natural color layering and edge flow. The sixth row involves large-scale color transitions. The baseline shows jagged boundaries and color inconsistency, while the FFC-enhanced model yields smoother gradients and more coherent texture integration.

In summary, the incorporation of FFC significantly improves the model’s capacity to capture long-range dependencies, semantic consistency, and global structural coherence, especially in challenging regions such as facial features, repeating patterns, and color-rich backgrounds.

From the perspective of restoration metrics, our improved method also performed slightly better, as shown in Table 4.

Table 4 Comparison table of evaluation indicators of the repair effect of this method and baseline

From the perspective of cultural heritage restoration, the improvements offered by the full model are not limited to image quality metrics but are fundamentally significant in preserving key artistic and semantic elements. For instance, in the fourth row, the restored facial features—such as the eye contours, mouth lines, and overall expression—are critical to the recognizability and sacredness of the Buddhist figure. In the fifth row, the smooth color transitions reflect the aesthetic principles of Dunhuang mural painting, where color layering and symbolic tones play essential narrative and stylistic roles. By contrast, the baseline model, although capable of reconstructing coarse structures, fails to capture culturally important details in regions like faces, ornaments, and garments. Such deficiencies may lead to misinterpretation in subsequent iconographic or art historical analysis. The full model, enhanced with FFC, demonstrates stronger semantic coherence and structural awareness, reducing the risk of technical distortion in digital restoration. Therefore, it is better suited for high-fidelity digital preservation, academic research, and virtual museum exhibition of cultural heritage artifacts.

Discussion

Incorporating frequency-domain information, FFC modules provide a global receptive field even at shallow network levels, enabling the model to restore fine details while maintaining edge naturalness and color consistency. The global modeling capability of FFC makes it more robust to irregular masks, especially when dealing with complex mural structures and large-area damage. Experimental results demonstrate that the improved method outperforms traditional approaches in restoring fine textures and complex regions, leading to significantly enhanced restoration quality.

This paper presents an improved image inpainting method for Dunhuang mural costume restoration based on the EdgeConnect model, enhanced with FFC. By replacing the standard downsampling convolutions in the first-stage generator (G1) with FFC modules, the model achieves significant advances in mural restoration. Embedding Fourier convolution into the edge-guided framework offers multiple advantages. First, the FFC module, by performing frequency-domain operations, extends the receptive field across the entire image and directly models long-range spatial dependencies. Second, the learnable frequency-domain filters adaptively focus on important frequency components, reinforcing high-frequency features such as cracks and smoothing low-frequency regions such as flat color areas. These characteristics greatly enhance the model’s practicality and effectiveness for mural inpainting.

Despite the promising results, some limitations remain. One direction for future work is handling extreme damage scenarios, with the aim of improving precision and structural continuity for large missing regions with complex textures. Another is developing unsupervised and adaptive learning strategies to reduce dependency on high-quality training datasets and to enhance generalization under varying conditions through self-optimization capabilities. With the continuous advancement of computational power and deep learning techniques, these challenges are expected to be addressed more effectively, contributing to more accurate and efficient solutions for image restoration tasks, especially in cultural heritage preservation.