Digital restoration of ancient Jiangnan murals via proxy learning and structural guidance

Yang, Cheng; Liu, Yuxuan; Cai, Yi

doi:10.1038/s40494-026-02369-y

Download PDF

Article
Open access
Published: 25 March 2026

Digital restoration of ancient Jiangnan murals via proxy learning and structural guidance

Cheng Yang^1,2,
Yuxuan Liu¹ &
Yi Cai²

npj Heritage Science volume 14, Article number: 182 (2026) Cite this article

623 Accesses
Metrics details

Abstract

This article proposes a structure guided proxy restoration (SGPR) workflow to address the problem of the lack of real paired data in the digital restoration of ancient murals in Jiangnan. This method adopts a dual track strategy, training a generative model (ArtBooth) on a Chinese classical painting dataset through proxy learning to master artistic styles, and combining it with a non training selective feature extraction (SFE) algorithm to directly extract structural information from damaged murals. The two are integrated through an optimized multi-condition control network (OptCtrl) to achieve high fidelity repair. Experiments on simulated and real murals show that this method outperforms existing mainstream methods in both objective indicators and expert blind evaluations. This study provides a data-efficient new paradigm for intelligent restoration of cultural heritage, breaking through the limitations of traditional reliance on professional training data.

Ancient mural inpainting via structure information guided two-branch model

Article Open access 20 June 2023

Mural inpainting via two-stage generative adversarial network

Article Open access 19 May 2025

Progressive generative mural image restoration based on adversarial structure learning

Article Open access 30 June 2025

Introduction

Ancient murals are irreplaceable records of human civilization, offering profound insights into the artistic achievements, social life, religious beliefs, and historical events of bygone eras. While the magnificent mural sites such as the Dunhuang Grottoes and the Yongle Palace have received considerable attention, the mural heritage scattered throughout the Jiangnan region of China, particularly within the ancestral halls, temples, and traditional residences in Zhejiang Province, possesses unique artistic and historical value that has not been fully explored and faces urgent preservation threats.

Unlike the arid regions of northern China, the perennially hot and humid climate of the Jiangnan region poses a continuous and severe threat to murals primarily composed of earth, wood, and lime. Field surveys reveal that a vast number of surviving murals exhibit alarming deterioration, with a high coverage of damage, and many are on the verge of disappearing^1,2. As shown in Fig. 1, multiple types of damage, such as cracks, pigment layer flaking, fading, mold, and water stains, often intertwine on the same mural. This complex, composite pattern of deterioration, coupled with the inherent irreversibility, high cost, and technical difficulty of physical restoration, makes the conservation of Jiangnan murals a challenging and urgent task^3,4,5,6.

**Fig. 1: Common composite deteriorations in a Jiangnan mural.**

In the face of these difficulties, digital restoration, as a crucial non-invasive preservation method, shows immense potential^7,8,9,10,11. However, when applying advanced deep learning techniques to the actual restoration of Jiangnan murals, we must confront two intertwined and severe core challenges.

The first challenge is the contamination of structural information caused by dirty input. In damaged mural images, deteriorations like cracks, mold spots, and stains are often visually indistinguishable from the original artistic lines. Standard feature extraction algorithms cannot intelligently differentiate between the two, resulting in severely polluted structural maps. Using such a map, which is mixed with a large amount of non-artistic features, as guidance for restoration will cause the model to either reproduce the damages or generate distorted, illogical content due to incorrect guidance, ultimately leading to restoration failure.

The second, and more fundamental, challenge is comprehensive data scarcity. Firstly, unlike the large-scale murals in northern China, which are concentrated within extensive cave complexes, the murals of the Jiangnan region are primarily dispersed among private collectors and local communities, and high-quality scanned data is scarce. These murals were mostly created by folk artists from different dynasties with varied styles, lacking a unified artistic paradigm, which makes it impractical to build a generalizable training set from the mural data itself. The most crucial point is that the original state of these artworks cannot be verified and lacks authoritative restoration evidence. This comprehensive lack of source data in terms of quality, quantity, stylistic consistency, and supervisory signals completely blocks the application path for traditional deep learning methods, compelling us to seek a solution that does not rely on training data from the murals themselves.

The digital restoration of cultural heritage such as ancient murals is an interdisciplinary field of computer vision and heritage conservation science, with the goal of carefully preserving the style and structural integrity of the original artwork while restoring degradation^12,13,14. This field is constantly advancing with advanced technologies such as image restoration, generative modeling, and style aware synthesis^15,16,17. However, when applying general algorithms directly to such cultural heritage, there are still two fundamental challenges: achieving high style fidelity when real training data is scarce, and ensuring structural accuracy when the input image itself is damaged due to complex degradation.

Early approaches to image inpainting relied on mathematical models, such as Bertalmio et al.¹⁸ proposing a diffusion-based approach and Criminisi et al.¹⁹ developing an exemplar-based method. Although these technologies are effective for small defects, they are usually unable to generate semantically reasonable repair content for the large-scale and unstructured defects common in murals.

The emergence of deep learning, especially the use of Generative Adversarial Networks by Yu et al.²⁰ to learn complex prior knowledge of images from data, has significantly improved image restoration techniques. To enhance structural coherence, subsequent works introduced structure-guided pipelines that first predict a structural map of the missing region and then use it to guide content generation²¹. However, when this paradigm is applied to degraded works of art, features such as cracks and stains in the source image are often mistaken for artistic structural lines, leading to unintentional replication damage to the model. This vulnerability highlights the need for a mechanism to distinguish artistic features from degradation.

More recently, Denoising Diffusion Probabilistic Models²² and their latent-space variants like Stable Diffusion²³ have become the state-of-the-art for high-fidelity image generation. Its potential application in image restoration has also received widespread attention^24,25. Despite their powerful generative capabilities, these models, pre-trained on vast general-purpose datasets, typically lack familiarity with the specific artistic conventions of a niche domain like Jiangnan murals. This domain gap can prevent them from presenting repair results with the required style fidelity, often resulting in perceived generic or inconsistent outcomes.

Adapting large-scale generative models to specific artistic styles with limited data is a core challenge. Although the Dreambooth method proposed by Ruiz et al.²⁶ is good at model personalization for specific objects, it has limited effect in dealing with abstract artistic styles. Abstract artistic styles are more suitable to be defined by the statistical feature set proposed by Gatys et al.²⁷. The parameter efficient fine tuning technology such as LoRA developed by Hu et al.²⁸ provides a lightweight solution, but these methods still have shortcomings in dealing with style content entanglement²⁹.

For restoration to be structurally accurate, the generative process must be precisely guided. The ControlNet framework proposed by Zhang et al.³⁰ provides a powerful spatial condition control capability for the pre-training diffusion model, but its performance is highly dependent on the quality of the input condition graph. In the context of mural restoration, conditional maps are extracted from degradation sources, and ControlNet will faithfully reproduce any degradation features in these maps.

This paper proposes an innovative workflow named Structurally Guided Proxy Restoration (SGPR). Its core idea is to completely decouple the style learning and target restoration processes through a paradigm shift, thereby bypassing the dependency on mural data. Our main contributions can be summarized in three points.

(1)
A data-efficient restoration path based on proxy learning is proposed. Since mural data is unavailable, we turn to proxy data that shares a common style but is more accessible. By applying transfer learning on a massive dataset of classical Chinese paintings, we train the ArtBooth model to deeply understand the artistic style of traditional Chinese painting, thereby acquiring the necessary stylistic prior for restoration.
(2)
A robust, deterioration-resistant Selective Feature Extraction (SFE) algorithm is designed. To solve the dirty input problem, we designed the training-free SFE algorithm. Based on the engineering principles of multi-scale analysis and cross-modal validation, this algorithm can intelligently identify and separate the structural lines of the original artistic content directly from the damaged mural, effectively filtering out deterioration interference.
(3)
A controllable restoration engine with high fidelity in both style and structure is built and optimized. We integrate the ArtBooth model, which possesses a stylistic prior, with an optimized dual-condition control network (OptCtrl). This engine utilizes the clean structural maps provided by the SFE algorithm to precisely constrain the generation process, ensuring that the restoration results achieve a high level of fidelity in both structural and stylistic dimensions.

Methods

To address the severe interference from composite deteriorations in ancient Jiangnan murals and the problem of scarce training data, we have designed and implemented a systematic restoration method named SGPR. As shown in Fig. 2, this method aims to achieve two core objectives: first, to robustly extract the original, cleaner structural information of the mural from a degraded image full of interference; and second, to use this structural information to accurately guide a generative model, equipped with specific artistic style priors, to restore the details and appearance of the mural with high fidelity.

**Fig. 2: The overall architecture of the SGPR workflow.**

Proxy-learning-based stylistic prior model (ArtBooth)

To circumvent the lack of mural training data, we propose a transfer strategy based on proxy learning. The theoretical foundation of this strategy is that Jiangnan murals and traditional Chinese painting share significant commonalities and a lineage in their artistic language. Therefore, our goal is not to learn the content of specific painting subjects, but to capture and learn the underlying, transferable artistic features they share, such as lines, brushwork, coloration, and composition. By learning these essential features, the model can acquire a powerful stylistic prior, laying a solid foundation for the subsequent restoration of murals that are different in content but stylistically consistent.

To this end, we designed the ArtBooth algorithm, which fine-tunes the Stable Diffusion XL (SDXL) model to achieve deep learning of the aforementioned artistic features. The core innovation of ArtBooth lies in its introduction of a style-aware regularization loss based on the Sliced Wasserstein Distance (SWD) within the personalized fine-tuning framework of Dreambooth. Compared to traditional methods based on the Gram matrix, which only focus on the second-order statistical correlations between feature channels (primarily reflecting texture), SWD can more comprehensively measure the similarity in the geometric structure of the overall multi-scale feature distributions between the generated image and the target style. This makes it highly suitable for capturing more complex artistic essentials such as the dynamic forms of brushwork, the harmonious distribution of colors, and the intrinsic structure of composition.

To support the training of the ArtBooth model, we have specifically constructed a proxy dataset of classical Chinese painting. This dataset system is collected from multiple authoritative and copyrighted public resources such as the Palace Museum, and contains a total of 6403 representative works covering the Five Dynasties to the Qing Dynasty, covering six major painting disciplines: landscape, flowers and birds, figures, architecture, animals, and calligraphy. After undergoing pre-processing processes such as CLIP (Contrastive Language-Image Pre-training) and ResNet joint classification, image normalization, and quality control, a total of 76,450 high-quality image blocks (including 38,225 1024 $\times$ 1024 and 38,225 512 $\times$ 512 image blocks) were generated through rolling block partitioning, which together constitute the basic training set. In addition to its role in data filtering, the CLIP model also provides semantic understanding of artistic content. To interpret these semantic features, we employ differential attention techniques to visualize CLIP’s patch-text alignment, generating heatmaps that highlight regions associated with key concepts such as “mural structure” or “degradation.” These visualizations (see Supplementary Fig. 1) confirm CLIP’s ability to distinguish artistic content from damage, thereby providing semantic priors that support subsequent restoration stages. On this basis, we further constructed two key subsets: one is the style regularization set, which includes 600 representative works selected by experts as style references for ArtBooth style loss; The second is to evaluate the reference set, which includes 5000 images sampled from high-quality patches to support objective evaluation of generated quality.

As a high-quality style proxy, this dataset enables the model to learn transferable artistic features from rich traditional Chinese painting, providing necessary style priors for the core method. Based on this design, the total loss function of ArtBooth consists of a weighted sum of the instance reconstruction loss and the style-aware regularization loss, as shown in Eq. (1):

$${{\mathscr{L}}}_{\text{ArtBooth}}={{\mathscr{L}}}_{\text{instance}}+{\rm{\lambda }}\cdot {{\mathscr{L}}}_{\text{style}},$$

(1)

where ${{\mathscr{L}}}_{\text{instance}}$ follows the standard training objective of diffusion models, responsible for ensuring basic consistency between the generated content and the training data. ${{\mathscr{L}}}_{\text{style}}$ is the key component of ArtBooth, designed to minimize the SWD between the feature distributions of the generated image ${{\rm{I}}}_{\text{gen}}$ and a reference image ${{\rm{I}}}_{\text{ref}}$ randomly sampled from a style regularization set (a set of high-quality Chinese painting reference samples) across multiple U-Net layers. As shown in Eq. (2):

$${{\mathscr{L}}}_{\text{style}}\left({{\rm{I}}}_{\mathrm{gen}},{{\rm{I}}}_{\mathrm{ref}}\right)={\sum }_{{\rm{l}}\in {\rm{L}}}{{\rm{w}}}_{{\rm{l}}}\cdot {{\mathbb{E}}}_{{\rm{\theta }}\sim {{\rm{S}}}^{{{\rm{c}}}_{{\rm{l}}}-1}}\left[{{\rm{W}}}_{{\rm{p}}}\left({{\rm{\theta }}}^{{\rm{T}}}{{\rm{\phi }}}_{{\rm{l}}}({{\rm{I}}}_{\mathrm{gen}}),{{\rm{\theta }}}^{{\rm{T}}}{{\rm{\phi }}}_{{\rm{l}}}({{\rm{I}}}_{\mathrm{ref}})\right)\right],$$

(2)

Here, ${{\rm{\phi }}}_{{\rm{l}}}(\cdot )$ denotes the feature map extracted at layer ${\rm{l}}$ of the U-Net, ${\rm{\theta }}$ is a projection direction randomly sampled from the unit hyper sphere ${{\rm{S}}}^{{{\rm{c}}}_{{\rm{l}}}-1}$, and ${{\rm{W}}}_{{\rm{p}}}$ represents the computation of the 1D Wasserstein distance. This design drives the model beyond static texture imitation to deeply learn the dynamic and structural essence that constitutes the artistic style of Chinese painting, ultimately providing a robust and high-quality stylized generative engine for the restoration task.

Deterioration-resistant structural feature extraction (SFE)

The SFE algorithm is designed to effectively handle the specific deterioration patterns observed in Jiangnan murals. It is an engineering method based on image processing principles and requires no model training.

The mural data we process is commonly characterized by relatively well-preserved core artistic structures (e.g., contours, clothing folds), but the overall picture is contaminated by large areas of non-structural damage such as mold, water stains, and fading. Based on this observation, the primary task of SFE is set to separate and extract the original artistic lines hidden beneath the surface contamination. This task differs in nature from restoring structural losses caused by large-scale physical flaking (common in northern murals).

The implementation of the algorithm is based on a core assumption: under multi-scale analysis, genuine artistic structural features exhibit higher consistency and predictability compared to random deterioration marks. As shown in Fig. 3, the algorithm’s workflow is as follows, while Fig. 4, provides a schematic illustration of how SFE selectively distinguishes and enhances the underlying artistic features from noisy and contaminated regions, visually supporting the principle of feature consistency across scales.

1.
Multi-resolution feature extraction: The input damaged mural image ${\rm{I}}$ is processed into both high-resolution (HR, e.g., 1024 pixels) and low-resolution (LR, e.g., 512 pixels) versions. Then, Canny edge features (${{\rm{E}}}_{{\rm{LR}}},{{\rm{E}}}_{{\rm{HR}}}$) and Lineart features (${{\rm{L}}}_{{\rm{LR}}},{{\rm{L}}}_{{\rm{HR}}}$) are computed on these two resolutions, respectively. Low-resolution processing helps suppress fine deterioration features and focus on the overall structure, while high-resolution processing preserves more detail.
2.
High-confidence envelope mask generation: This step aims to fuse the consensus information from two different feature extractors at low resolution. First, the low-resolution edge map ${{\rm{E}}}_{{\rm{LR}}}$ and lineart map ${{\rm{L}}}_{{\rm{LR}}}$ are upsampled to high resolution, denoted as ${{\rm{E}}}_{{\rm{LR}}}^{{\rm{up}}}$ and ${{\rm{L}}}_{{\rm{LR}}}^{{\rm{up}}}$. The upsampling process from low resolution to high resolution uses bilinear interpolation. This method achieves a good balance between computational efficiency and visual quality, effectively preserving structural information while minimizing artifacts. Our SFE algorithm uses a fixed threshold for Canny edge detection, with a low threshold of 100 and a high threshold of 200. This parameter configuration has been empirically studied and can effectively extract structural features from Jiangnan mural images while suppressing noise interference.

Next, a slight morphological dilation is applied to both, and their intersection is computed. In the morphological dilation operation, we use 3 × 3 rectangular structural elements to enhance edge connectivity. Finally, a morphological closing operation is performed on the intersection result, this closing operation also uses a 3 × 3 rectangular structural element, and small connected noise regions are removed to obtain the final envelope mask ${{\rm{M}}}_{{\rm{envelope}}}$. The regions marked by this mask represent features identified by both detectors at low resolution, and are therefore considered to have a high probability of belonging to the original artistic structure.
3.
Mask-guided feature refinement: This ${{\rm{M}}}_{{\rm{envelope}}}$ is used to purify the feature maps extracted from the high-resolution image. For the Lineart map, we use ${{\rm{M}}}_{{\rm{envelope}}}$ to weight-modulate the high-resolution ${{\rm{L}}}_{{\rm{HR}}}$. Features within the envelope are preserved or slightly enhanced, while features outside are significantly suppressed. The final modulated line draft L is shown in Eq. (3):

$${{\rm{L}}}_{\mathrm{final}}=({\rm{\gamma }}\cdot {{\rm{L}}}_{\mathrm{HR}}\odot {{\rm{M}}}_{\mathrm{envelope}})+({\rm{\delta }}\cdot {{\rm{L}}}_{\mathrm{HR}}\odot (1-{{\rm{M}}}_{\mathrm{envelope}})),$$

(3)

where $\odot$ denotes element-wise multiplication. The modulation weights were set to ${\rm{\gamma }}=1.2$ and ${\rm{\delta }}=0.3$. We evaluated the values of these two parameters from 0 to 2 in a separate validation set with a step size of 0.1. The final value was chosen because they collectively produce the best performance, as measured by the structural coherence and noise suppression in the output linear part ${{\rm{L}}}_{{\rm{final}}}$. For the Canny edge map, the process is more direct: ${{\rm{M}}}_{{\rm{envelope}}}$ is used to filter the high-resolution ${{\rm{E}}}_{{\rm{HR}}}$, retaining only the edges within the mask. The final filtered edge map ${{\rm{E}}}_{{\rm{canny}}}$ is given by the following Eq. (4):

$${{\rm{E}}}_{\mathrm{canny}}={{\rm{E}}}_{\mathrm{HR}}\odot {{\rm{M}}}_{\mathrm{envelope}}.$$

(4)

**Fig. 3: Flowchart of the selective feature extraction (SFE) algorithm.**

**Fig. 4: Selective feature extraction (SFE) illustration.**

Multi-condition controlled restoration via optimized ControlNet (OptCtrl)

After obtaining the optimized structural maps from the SFE algorithm, we precisely guide the ArtBooth restoration model using a multi-condition control mechanism based on ControlNet, as proposed by Zhang et al.³⁰. We further optimized the training of ControlNet to adapt it to the ArtBooth model and the output of the SFE algorithm, referring to this optimized version as OptCtrl.

We propose a dual-condition control strategy that aims to combine two complementary types of structural information to jointly guide the restoration. The SFE-optimized Canny edge map, ${{\rm{E}}}_{{\rm{canny}}}$, provides comprehensive micro-structural constraints, ensuring the accuracy of object boundaries and spatial layout. The optimized Lineart map, ${{\rm{L}}}_{{\rm{final}}}$, is better at capturing artistic line expressions, which helps in restoring the brushwork texture and stylistic features of the mural.

The optimization training of OptCtrl adopted a separate strategy, where independent adapters were trained for Canny and Lineart conditions. During training, we froze the weights of the pre-trained ArtBooth model and initialized with publicly available pre-trained ControlNet weights. The training data came from our constructed proxy dataset of Chinese paintings. For each training image ${{\rm{x}}}_{0}$, we computed its corresponding optimized Canny map ${{\rm{E}}}_{{\rm{canny}}}$ and Lineart map ${{\rm{L}}}_{{\rm{final}}}$ using the SFE algorithm to form a training pair. The training objective is to minimize the noise prediction error under a given text condition ${{\rm{c}}}_{{\rm{text}}}$ and structural condition (using Canny as an example). As shown in the following Eq. (5):

$${{\mathscr{L}}}_{\text{OptCtrl}-\text{Canny}}={{\mathbb{E}}}_{{{\rm{x}}}_{0},{\rm{\epsilon }},{\rm{t}},{{\rm{c}}}_{{\rm{text}}},{{\rm{E}}}_{{\rm{canny}}}}\left[{\left\Vert{\rm{\epsilon }}-{{\rm{\epsilon }}}_{{\rm{\theta }}}\left({{\rm{z}}}_{{\rm{t}}},{\rm{t}},{{\rm{c}}}_{\text{text}},{{\rm{E}}}_{{\rm{canny}}}\right)\right\Vert}_{2}^{2}\right],$$

(5)

${{\mathscr{L}}}_{\text{OptCtrl}-\text{Canny}}$ is the loss function of the model used by OptCtrl to reconstruct the original image from Canny-processed images. Among them,${{\rm{z}}}_{{\rm{t}}}$ is the noise image at time step t,${\rm{\epsilon }}$ is the added noise, ${{\rm{c}}}_{\text{text}}$ is the text condition, and ${{\rm{E}}}_{{\rm{canny}}}$ is the optimized Canny edge image, used to reconstruct the original image. During inference, the independently trained OptCtrl-Canny and OptCtrl-Lineart models are loaded in parallel, acting simultaneously on ArtBooth’s U-Net. In each denoising step, the modulation signals from the two OptCtrl adapters are fused to jointly guide the generation process.

Results

Dataset

To comprehensively evaluate the effectiveness of the proposed SGPR method, we conducted systematic experiments using two types of datasets. The first type is a real mural dataset, which selected 11 representative mural samples from the Songxi residential mural group in Jinhua City for testing. These murals are quintessential examples from the Ming and Qing dynasties, ensuring that our validation is grounded in historically and artistically significant cases. These high-precision digital images are sourced from the Zhejiang Provincial Cultural Relics Bureau and cover typical diseases such as mold, alkali, fading, and cracks in the humid environment of Jiangnan, ensuring the authenticity and representativeness of the verification data. These samples were jointly composed of an expert team from the Digital Culture Innovation Research Institute and the Handicraft College of Zhejiang Vocational College of Arts, and the restoration process strictly followed the principles of historical authenticity, aesthetic integrity, and operational traceability; The second type is a simulated degradation dataset, which is constructed by randomly selecting 100 images from a specialized dataset of traditional Chinese paintings and applying controlled degradation operations (e.g., such as synthetic stains, structural aging, and multilevel noise) to emulate typical deterioration patterns of Jiangnan murals, thereby producing accurately paired data for rigorous quantitative validation.

Evaluation metrics

We employed a combination of quantitative and qualitative evaluation methods. Quantitative metrics included Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) as proposed by Wang et al., and Learned Perceptual Image Patch Similarity (LPIPS) developed by Zhang et al.³¹. The qualitative assessment was completed through blind review by five experts, consisting of three senior cultural relic protection experts from the Zhejiang Provincial Cultural Relics Bureau and two art history professors from the China Academy of Art.

Comparison methods

As there are currently no open-source restoration algorithms specifically designed for the deterioration characteristics of Jiangnan murals, we selected representative open-source general-purpose restoration methods and different ablation versions of our method as baselines. This choice is based on the principle of reproducibility and also considers that the deterioration patterns of murals from different regions (e.g., Dunhuang) differ significantly from the target of this study. The standard ControlNet (StdCtrl) used in the benchmark method is its publicly released, unadjusted original version, in order to accurately evaluate the independent contribution of our proposed OptCtrl component. The comparison methods are as follows:

(1)
Img2Img (ArtBooth): Using the ArtBooth model for light image re-painting.
(2)
For inpainting, we employed the general-purpose inpainting model PowerPaint developed by Zhuang et al.³².
(3)
ArtBooth+StdCtrl: The ArtBooth model combined with the standard, unoptimized ControlNet.
(4)
SGPR (Our Method): The complete ArtBooth + OptCtrl + SFE solution.

All comparative experiments were conducted in a unified experimental environment using the same inference parameters (30 sampling steps, CFG ratio 8.0, DPM++2M sampler) to ensure fairness in the comparison.

Validation on simulated data

On the simulated degradation dataset, we conducted ablation studies to verify the contribution of each component. The quantitative results are shown in Table 1.

Table 1 Quantitative results of the restoration workflow ablation study on simulated data

Full size table

The quantitative results clearly show that SGPR significantly leads on all metrics. The step-by-step improvements in the ablation study validate the rationale of our method’s design: ArtBooth provides the necessary stylistic prior, while the combination of SFE and OptCtrl achieves robust suppression of deteriorations and precise control over the structure. The visual comparison (see Fig. 5) further confirms that our method comes closest to the original image in removing simulated deteriorations, reconstructing structure, and restoring artistic style.

**Fig. 5: Visual comparison of different methods on the simulated degradation dataset.**

Application on real murals

We applied the SGPR method to real mural samples from Songxi village. The quantitative results, shown in Table 2, compare the quantitative differences between the restoration results of different methods and the expert’s manual digital restoration reference.

Table 2 Quantitative comparison of real Songxi mural restoration results with the expert restoration reference

Full size table

The quantitative metrics again show that SGPR’s results are closest to the expert restoration reference. Visual comparisons (see Figs. 6 and 7) reveal that baseline methods suffer from structural distortions (Img2Img), uncontrollable content (Inpaint), or residual deterioration marks (ArtBooth+StdCtrl). In contrast, SGPR not only effectively removes complex deteriorations but also generates natural details consistent with the original style.

**Fig. 6: Qualitative comparison of restoration results on a real mural from Songxi village.**

**Fig. 7: Zoomed-in detail view from the restoration results shown in Fig. 6.**

In addition, in order to comprehensively evaluate the repair results, this study adopted a qualitative evaluation method. The five experts conducting blind review have over 10 years of professional experience in mural preservation or traditional art history. The evaluation follows the following strict procedures:

The evaluation adopts the following rigorous process:

(1)
Select 11 samples from the Songxi Residential Mural Group in Jinhua City, covering different periods, styles, and typical degradation types, as the test set.
(2)
Each sample generates three sets of restoration results, namely: the original degraded image, the traditional expert manual restoration result, and the output result of our proposed SGPR method.
(3)
After randomization, anonymization, and standardized processing of all repair results, each expert independently reviewed all 33 repair results (11 images × 3 versions/image).

Experts rated the restoration results based on the Likert five-point scale from four dimensions: degradation removal, structural coherence, style consistency, and overall visual quality. The final score is obtained using the comprehensive scoring method, which calculates the arithmetic mean of each method on all experts and all indicators, with the average score rounded to one decimal place, and the specific results are shown in Table 3.

Table 3 Results of expert blind review evaluation

Full size table

The evaluation results indicate that our proposed SGPR method has shown significant effectiveness in removing damage, and its overall repair quality is close to expert level. The effectiveness of this method in digital restoration of Jiangnan murals has been demonstrated.

Discussion

Despite the promising results, our study has several limitations that open clear avenues for future research. First, while the training-free SFE algorithm proved effective for the deteriorations studied, its reliance on handcrafted rules means its generalization capability to entirely new or unseen patterns of damage warrants further investigation. Second, our proposed method is primarily designed to restore degraded content rather than to inpaint large areas of complete information loss, such as those caused by significant physical flaking. As SFE cannot extract structural guidance from missing regions, restoration in these areas defaults to the generative prior of the ArtBooth model, which may not align with the lost content. This defines the method’s scope and aligns with the principle of conservative restoration, avoiding unsubstantiated reconstruction.

Although the method proposed in this study demonstrates excellent performance in most mural restoration scenarios, it still faces certain limitations. As shown in Fig. 8, “Resting Among the Rocks” presents an extreme case of degradation, where large areas of paint have flaked off, resulting in severe loss of the original structural content, representing a typical scenario of scarce prior information. In situations where structural information is almost completely missing, the structured edge extraction module used in this study cannot provide effective semantic guidance, causing the model to rely excessively on its internal priors during restoration. As a result, the final outcome falls short in terms of structural coherence and stylistic consistency. Such extreme cases highlight the inherent challenges of mural restoration when sufficient structural priors are absent and point to an important direction for future research: how to achieve more robust edge inference and content reconstruction under conditions of severe information loss, thereby expanding the applicable boundaries of digital restoration methods. Finally, the current study was conducted on monochrome murals; adapting the SGPR framework to colored murals, which involves the complex task of restoring both structure and chrominance in a stylistically consistent manner, remains a significant open challenge.

Future work will proceed in several key directions to address these limitations. A primary goal is to move beyond the handcrafted SFE by developing a learning-based, end-to-end feature extraction network. Such a model could potentially learn to distinguish artistic strokes from deterioration marks more robustly and adaptively across a wider range of mural types. To better address large missing regions, we plan to investigate the integration of multi-modal data, such as information from infrared reflectography or multispectral imaging, which can often reveal underlying sketches or pigment traces invisible to the naked eye. Furthermore, extending the framework to accommodate colored murals is a key priority. This will likely involve exploring methods for decoupled structure-and-color control to manage the added complexity. Ultimately, we envision integrating these advancements into a human-in-the-loop interactive restoration system. Such a platform would empower conservation experts to guide, refine, and validate the AI-driven restoration process, creating a powerful synergy between the efficiency of automated algorithms and the indispensable knowledge of human specialists.

This paper addresses the critical challenges of digital restoration for ancient Jiangnan murals, which suffer from complex composite deterioration and a severe lack of domain-specific training data. We proposed SGPR, which is an integrated workflow. By synergistically combining a style-aware generative model trained on proxy data (ArtBooth), a deterioration-resistant structural feature extractor (SFE), and an optimized control mechanism (OptCtrl), our method successfully resolves the core issues of structural information contamination and insufficient style preservation. Extensive experiments on both simulated and authentic murals have validated the superior performance of our method in deterioration removal, structure recovery, and artistic style preservation.

The significance of this research extends beyond the specific application. The proxy learning + structure purification paradigm presented here offers a robust and data-efficient blueprint for the intelligent restoration of other types of cultural heritage that face similar data scarcity constraints. By decoupling style acquisition from structurally-guided synthesis, our work provides a viable pathway for applying state-of-the-art generative AI in heritage science, where the cost or impossibility of acquiring large-scale paired data has long been a major impediment. Ultimately, this study contributes an effective computational solution for the preservation of endangered mural heritage and makes a valuable methodological contribution to the broader field of digital humanities.

Data availability

The mural image data that support the findings of this study are the property of the Cultural Heritage Bureau of Zhejiang Province and are subject to copyright restrictions. Due to these restrictions, the data are not publicly available. Access to the data can be requested by contacting the corresponding author, who will facilitate the application process with the Bureau.

Code availability

The code developed for this study is available from the corresponding author upon reasonable request.

References

Huang, M. & Yang, J. A study on the conservation and restoration of murals in traditional Jiangnan architecture: Taking the practice of rectifying the screen wall and restoring murals at Chen Wangdao’s former residence as an example. Oriental Miscellany 112–120 (2022).
Xu, T. Evaluation of Surface Reinforcement And Crack Grouting Materials for Murals in Humid Southern Regions (Zhejiang University, 2019).
Sheng, X. A Preliminary Study on Novel Adhesive Agents for Murals—Targeting Humid Southern Regions (China Academy of Art, 2017).
Deng, X. & Yu, Y. Ancient mural inpainting via structure information guided two-branch model. Herit. Sci. 11, 131 (2023).
Article Google Scholar
Lv, C., Li, Z., Shen, Y., Li, J. & Zheng, J. Separafill: two generators connected mural image restoration based on generative adversarial network with skip connect. Herit. Sci. 10, 135 (2022).
Article Google Scholar
Lyu, Q., Zhao, N., Song, J., Yang, Y. & Gong, Y. Mural inpainting via two-stage generative adversarial network. npj Herit. Sci. 13, 188 (2025).
Article Google Scholar
Li, Y. et al. Efficient and explicit modelling of image hierarchies for image restoration. In Proc. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada. 18278–18289 (IEEE, 2023).
Zeng, Y., Fu, J., Chao, H. & Guo, B. Aggregated contextual transformations for high-resolution image inpainting. IEEE Trans. Visualiz. Comput. Graph. 29, 3266–3280 (2023).
Article Google Scholar
Yang, J., Ruhaiyem, N. I. R. & Zhou, C. A 3M-Hybrid model for the restoration of unique giant murals: a case study on the murals of Yongle Palace. IEEE Access 13, 38809–38824 (2025).
Article Google Scholar
Zheng, C., Cham, T. J., Cai, J. & Phung, D. Bridging global context interactions for high-fidelity image completion. In Proc. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11502–11512 (IEEE, 2022).
Quan, W. et al. Image Inpainting with Local and Global Refinement. IEEE Trans. Image Process. 31, 2405–2420 (2022).
Article PubMed Google Scholar
Li, Y., Zhang, C., Li, Y., Sui, D. & Guo, M. An improved mural image restoration method based on diffusion model. Herit. Sci. 13, 347 (2025).
Article CAS Google Scholar
Tian, B. et al. HASM: restoration of Dunhuang murals guided by spatial structure prediction combined with line drawing assistance. J. Univ. Electron. Sci. Technol. China 54, 582–591 (2025).
Google Scholar
Xu, H., Kang, J. & Zhang, J. Feature-aware based digital mural restoration method. Comput. Sci. 49, 217–223 (2022).
Google Scholar
Cao, J. et al. Two-stage mural image restoration using an edge-constrained attention mechanism. PLoS ONE 19 (2024).
Shi, J. et al. Image inpainting guided by image smoothing structure. J. Front. Comput. Sci. Technol. 19, 2149–2160 (2025).
Google Scholar
Qiang, Z. et al. A review of deep learning-based image inpainting methods. J. Image Graph. 24, 447–463 (2019).
Google Scholar
Bertalmio, M., Sapiro, G., Caselles, V. & Ballester, C. Image inpainting. In Proc. 27th Annual Conference on Computer Graphics and Interactive Techniques (417–424) (ACM Press/Addison-Wesley Publishing Co., USA, 2000).
Criminisi, A., Pérez, P. & Toyama, K. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 13, 1200–1212 (2004).
Article PubMed Google Scholar
Yu, J. et al. Generative image inpainting with contextual attention. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, (5505–5514) (Computer Vision Foundation/IEEE Computer Society, Salt Lake City, UT, USA, 2018).
Nazeri, K., Ng, E., Joseph, T., Qureshi, F. Z. & Ebrahimi, M. EdgeConnect: structure guided image inpainting using edge prediction. In Proc. 2019 IEEE/CVF International Conference on Computer Vision Workshops (3265–3274) (IEEE, Seoul, Korea (South), 2019).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (10674–10685) (IEEE, New Orleans, LA, USA, 2022).
Wang, Z., Zhang, J., Ji, Z., Bai, J. & Shan, S. CCLAP: controllable Chinese landscape painting generation via latent diffusion model. In Proc. IEEE International Conference on Multimedia and Expo (2117–2122) (IEEE, Brisbane, Australia, 2023).
Xu, Z. et al. MuralDiff: diffusion for ancient murals restoration on large-scale pre-training. IEEE Trans. Emerg. Top. Comput. Intell. 8, 2169–2181 (2024).
Article Google Scholar
Ruiz, N. et al. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (22500–22510) (IEEE, Vancouver, BC, Canada, 2023).
Gatys, L. A., Ecker, A. S. & Bethge, M. Image style transfer using convolutional neural networks. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition (2414–2423) (IEEE Computer Society, Las Vegas, NV, USA, 2016).
Hu, E. J. et al. LoRA: low-rank adaptation of large language models. In Proc. Tenth International Conference on Learning Representations (OpenReview.net, Virtual Event, 2022).
Gu, Y. et al. Mix-of-Show: decentralized low-rank adaptation for multi-concept customization of diffusion models. in Advances in Neural Information Processing Systems (eds Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M. & Levine, S.) Vol. 36, pp. 1–12 (NeurIPS, New Orleans, LA, USA, 2023).
Zhang, L., Rao, A. & Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proc. IEEE/CVF International Conference on Computer Vision (3813–3824) (IEEE, Paris, France, 2023).
Zhang, R., Isola, P., Efros, A. A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (586-595) (Computer Vision Foundation/IEEE Computer Society, Salt Lake City, UT, USA, 2018).
Zhuang, J., Zeng, Y., Liu, W., Yuan, C. & Chen, K. A task is worth one word: learning with task prompts for high-quality versatile image inpainting. in Computer Vision - ECCV 2024 (eds Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T. & Varol, G.) Vol. 15116, 195–211 (Lecture Notes in Computer Science, Springer, Milan, Italy, 2024).

Download references

Acknowledgements

This work was supported by the Scientific Research Foundation of Hangzhou City University under Grant No.X-202502.

Author information

Authors and Affiliations

School of Art and Archaeology, Hangzhou City University, Hangzhou, Zhejiang Province, China
Cheng Yang & Yuxuan Liu
College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang Province, China
Cheng Yang & Yi Cai

Authors

Cheng Yang
View author publications
Search author on:PubMed Google Scholar
Yuxuan Liu
View author publications
Search author on:PubMed Google Scholar
Yi Cai
View author publications
Search author on:PubMed Google Scholar

Contributions

C.Y. and Y.C. conceived the research idea and conceptualized the study. C.Y. and Y.C. designed the experimental research methodology. Y.L. and Y.C. completed the drafting and preparation of the manuscript. C.Y., Y.L., and Y.C. supervised the manuscript preparation process and provided critical feedback. All authors discussed the results, contributed to manuscript revisions, and approved the final version.

Corresponding author

Correspondence to Yuxuan Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Figure 1 (download DOCX )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Yang, C., Liu, Y. & Cai, Y. Digital restoration of ancient Jiangnan murals via proxy learning and structural guidance. npj Herit. Sci. 14, 182 (2026). https://doi.org/10.1038/s40494-026-02369-y

Download citation

Received: 19 September 2025
Accepted: 04 February 2026
Published: 25 March 2026
Version of record: 25 March 2026
DOI: https://doi.org/10.1038/s40494-026-02369-y

Digital restoration of ancient Jiangnan murals via proxy learning and structural guidance

Abstract

Similar content being viewed by others

Ancient mural inpainting via structure information guided two-branch model

Mural inpainting via two-stage generative adversarial network

Progressive generative mural image restoration based on adversarial structure learning

Introduction

Methods

Proxy-learning-based stylistic prior model (ArtBooth)

Deterioration-resistant structural feature extraction (SFE)

Multi-condition controlled restoration via optimized ControlNet (OptCtrl)